Distill Math Reasoning Data from DeepSeek R1 with CAMEL

You can also check out this cookbook in Colab here


⭐ Star us on GitHub, join our Discord, or follow us on X

This notebook provides a comprehensive guide on configuring and utilizing CAMEL’s data distillation pipeline to generate high-quality mathematical reasoning datasets featuring detailed thought processes (Long Chain-of-Thought data).

In this notebook, you’ll explore:

  • CAMEL: A powerful multi-agent framework that supports synthetic data generation and multi-agent role-playing scenarios, enabling advanced AI-driven applications.

  • Data distillation pipeline: A systematic approach to extracting and refining high-quality reasoning datasets with detailed thought processes from models like DeepSeek R1 (a minimal sketch of the idea follows this list).

  • Hugging Face Integration: A streamlined process for uploading and sharing distilled datasets on the Hugging Face platform.
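
To make the distillation idea concrete, here is a minimal illustrative sketch, not the CAMEL pipeline itself: a single call to DeepSeek R1 through its OpenAI-compatible API that captures both the long chain of thought and the final answer for one problem. It assumes the `openai` package is installed and a `DEEPSEEK_API_KEY` environment variable is set; the CAMEL pipeline wraps this pattern with batching, verification, and iterative refinement.

```python
# Illustrative sketch of the core distillation step: ask a reasoning model
# (DeepSeek R1 via its OpenAI-compatible endpoint) for a solution and keep
# both the reasoning trace and the final answer as one training record.
# Assumption: DEEPSEEK_API_KEY is set and the `openai` package is available.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)

problem = "What is the sum of the first 100 positive integers?"

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": problem}],
)

record = {
    "problem": problem,
    # DeepSeek R1 exposes its chain of thought in `reasoning_content`
    "long_cot": response.choices[0].message.reasoning_content,
    "solution": response.choices[0].message.content,
}
print(record["solution"])
```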

Using this synthetic data generation pipeline, CAMEL-AI has crafted three comprehensive datasets for mathematical reasoning and problem-solving. They are hosted on Hugging Face for easy access (a loading example follows the list):

  • πŸ“š AMC AIME STaR Dataset

    A dataset of 4K advanced mathematical problems and solutions, distilled with improvement history showing how each solution was iteratively refined. πŸ”— Explore the Dataset

  • πŸ“š AMC AIME Distilled Dataset

    A dataset of 4K advanced mathematical problems and solutions, distilled with clear step-by-step solutions. πŸ”— Explore the Dataset

  • πŸ“š GSM8K Distilled Dataset

    A dataset of 7K high-quality, linguistically diverse grade-school math word problems and solutions, distilled with clear step-by-step solutions. πŸ”— Explore the Dataset
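
As a quick way to inspect one of these datasets, the snippet below loads the GSM8K distilled data with the `datasets` library. The repository ID is an assumption inferred from the dataset names above; check the linked dataset cards on Hugging Face for the exact identifiers.

```python
# Load one of the distilled datasets for a quick look.
# Assumption: the repo ID "camel-ai/gsm8k_distilled" matches the dataset card
# linked above — substitute the exact ID from Hugging Face if it differs.
from datasets import load_dataset

ds = load_dataset("camel-ai/gsm8k_distilled", split="train")
print(ds[0])
```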

Perfect for those eager to explore AI-driven problem-solving or dive deep into mathematical reasoning! πŸš€βœ¨