Distill Math Reasoning Data from DeepSeek R1 with CAMEL
You can also check this cookbook in Colab here.
This notebook provides a comprehensive guide on configuring and utilizing CAMEL's data distillation pipeline to generate high-quality mathematical reasoning datasets featuring detailed thought processes (Long Chain-of-Thought data).
In this notebook, you'll explore:
CAMEL: A powerful multi-agent framework for synthetic data generation and multi-agent role-playing scenarios, supporting advanced AI-driven applications.
Data distillation pipeline: A systematic approach for extracting and refining high-quality reasoning datasets with detailed thought processes from models like DeepSeek R1.
Hugging Face Integration: A streamlined process for uploading and sharing distilled datasets on the Hugging Face platform.
Using our synthetic data generation pipeline, CAMEL-AI has crafted three comprehensive datasets that are now available to enhance your mathematical reasoning and problem-solving skills. These datasets are hosted on Hugging Face for easy access:
AMC AIME STaR Dataset
A dataset of 4K advanced mathematical problems and solutions, distilled with improvement history showing how the solution was iteratively refined. Explore the Dataset
AMC AIME Distilled Dataset
A dataset of 4K advanced mathematical problems and solutions, distilled with clear step-by-step solutions. Explore the Dataset
GSM8K Distilled Dataset
A dataset of 7K high-quality, linguistically diverse grade school math word problems and solutions, distilled with clear step-by-step solutions. Explore the Dataset
Perfect for those eager to explore AI-driven problem-solving or dive deep into mathematical reasoning!
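To give a rough idea of what a record in datasets like these might contain, here is a minimal sketch in plain Python. The `DistilledRecord` class, its field names, and the `to_jsonl` helper are all hypothetical and chosen for illustration; they are not the published schema of the datasets above, nor part of CAMEL's API. The sketch only shows how a problem, its solution, a long chain-of-thought trace, and an improvement history could be held together and serialized as JSON Lines.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class DistilledRecord:
    # Hypothetical field names, not the schema of the published datasets.
    problem: str
    solution: str
    reasoning: str  # long chain-of-thought text produced by the model
    improvement_history: List[str] = field(default_factory=list)


def to_jsonl(records: List[DistilledRecord]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


record = DistilledRecord(
    problem="What is 12 * 7?",
    solution="84",
    reasoning="12 * 7 = 12 * (5 + 2) = 60 + 24 = 84.",
    improvement_history=[
        "First draft: 12 * 7 = 82 (arithmetic slip).",
        "Revised: 60 + 24 = 84.",
    ],
)

jsonl = to_jsonl([record])
print(jsonl)
```

Keeping one JSON object per line makes it straightforward to stream or append records as they are generated, and an empty `improvement_history` would represent a record that was distilled without iterative refinement.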