Datagen
CAMEL’s data generation modules for high-quality, instruction-tuned, and reasoning-rich datasets.
This page introduces CAMEL’s data generation modules for creating high-quality training data with explicit reasoning, diverse instructions, and advanced automated refinement.
- Chain of Thought (CoT): Generates explicit reasoning paths
- Self-Instruct: Produces instruction-following data from human-written seed tasks and machine-generated instructions
- Source2Synth: Synthesizes multi-hop QA from source text or code
- Self-Improving CoT: Iteratively improves reasoning through agent self-critique
Chain of Thought (CoT) Data Generation
Chain of Thought (CoT) data generation creates step-by-step reasoning paths for problem solving, leveraging dual agents and advanced search/verification logic.
Key Features
- Monte Carlo Tree Search (MCTS) for solution exploration
- Binary Search Error Detection for precise error localization
- Dual-Agent Verification System for quality assurance
- Solution Tree Management for tracking reasoning paths
Core Components
CoTDataGenerator Class
The main class that implements the CoT generation system with the following capabilities:
- Dual-Agent Architecture: Supports both single-agent (legacy) and dual-agent modes
- Answer Generation: Sophisticated answer generation with MCTS
- Answer Verification: Robust verification system using golden answers
- Error Detection: Binary search-based error detection in solutions
- Solution Management: Comprehensive solution tree management and export
Quick Start: CoT Data Generation
Spin up chain-of-thought data generation with dual agents, golden answers, and CoT solution generation:
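A minimal sketch, assuming CoTDataGenerator is importable from camel.datagen and exposes a solve method matching the capabilities described above (model configuration is omitted for brevity):

```python
from camel.agents import ChatAgent
from camel.datagen import CoTDataGenerator

# Two specialized agents: one proposes solutions, one checks them.
generator_agent = ChatAgent("You are a careful problem solver.")
verifier_agent = ChatAgent("You verify answers against the golden answer.")

# Golden answers keyed by question, used during verification.
golden_answers = {
    "What is 2 + 2?": "4",
}

cot_generator = CoTDataGenerator(
    generator_agent=generator_agent,
    verifier_agent=verifier_agent,
    golden_answers=golden_answers,
    search_limit=100,  # maximum number of search iterations
)

# Generate a step-by-step CoT solution for one question.
solution = cot_generator.solve("What is 2 + 2?")
print(solution)
```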
Data Import/Export for CoT
Easily import question-answer pairs or export generated solutions for further use:
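A sketch of round-tripping data; the method names import_qa_from_json and export_solutions are assumptions based on this page's description of the import/export capabilities:

```python
# Import additional question-answer pairs from a JSON file
# (a mapping of {question: golden_answer} is assumed here).
cot_generator.import_qa_from_json("qa_pairs.json")

# Export the solution tree, including intermediate steps, golden
# answers, scores, and an export timestamp, to JSON.
cot_generator.export_solutions("solutions.json")
```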
Solution Generation Process
Direct Solution Attempt
First, the agent attempts to solve the problem directly and checks the result against the golden answer for correctness.
MCTS-Based Exploration
If the direct attempt fails, a Monte Carlo Tree Search (MCTS) explores alternative reasoning paths, building a solution tree from previous attempts.
Error Detection & Correction
Binary search is used to efficiently pinpoint and isolate errors in the solution. New solutions are then generated, reusing verified-correct parts.
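The binary-search idea can be sketched independently of CAMEL's implementation. Given an ordered list of solution steps and a verifier that checks whether a prefix of steps is still on track, the first incorrect step can be found in O(log n) verifier calls; the verify_prefix callable below is a hypothetical stand-in for the dual-agent check:

```python
from typing import Callable, List

def locate_first_error(steps: List[str],
                       verify_prefix: Callable[[List[str]], bool]) -> int:
    """Return the index of the first failing step, or -1 if all pass.

    Assumes monotonicity: once a step is wrong, every longer prefix
    also fails verification.
    """
    lo, hi = 0, len(steps) - 1
    first_bad = -1
    while lo <= hi:
        mid = (lo + hi) // 2
        if verify_prefix(steps[: mid + 1]):
            lo = mid + 1      # prefix is correct; the error lies later
        else:
            first_bad = mid   # error at or before mid; keep searching left
            hi = mid - 1
    return first_bad
```

Steps before the returned index are verified correct and can be reused when generating a new candidate solution.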
Solution Verification
All candidate solutions are strictly verified using a dual-agent system or comparison against golden answers to ensure high quality and accuracy.
Configuration Options
- search_limit: Maximum number of search iterations (default: 100)
- generator_agent: Specialized agent for answer generation
- verifier_agent: Specialized agent for answer verification
- golden_answers: Pre-defined correct answers for validation
Output Format
The solution tree is exported in JSON format containing:
- Solutions with intermediate steps
- Golden answers used for verification
- Export timestamp
- Solution scores and verification results
Self-Instruct: Instruction Generation
Self-Instruct is a pipeline for generating high-quality, diverse instructions by combining human-written seed tasks and machine-generated prompts, all filtered for quality and diversity.
Key Features
- Combines human-written and machine-generated instructions using configurable ratios
- Supports both classification and non-classification task types
- Built-in instruction filtering and validation
- Automatic instance generation for tasks
- JSON-based data input/output
Core Components
SelfInstructPipeline – Orchestrates the end-to-end instruction generation, mixing seeds and machine prompts, filtering, and outputting results.
InstructionFilter – Handles validation and filtering of all generated instructions:
- Length-based, keyword, and punctuation checks
- Non-English text detection
- ROUGE similarity filtering for deduplication
- Extensible registry for custom filters
Quick Start: Self-Instruct Generation
Quickly set up an instruction generation pipeline with both human and machine prompts:
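A minimal sketch using the parameters listed under Pipeline Parameters below; the generate method name is an assumption:

```python
from camel.agents import ChatAgent
from camel.datagen.self_instruct import SelfInstructPipeline

agent = ChatAgent()

pipeline = SelfInstructPipeline(
    agent=agent,
    seed="seed_tasks.jsonl",         # human-written seed tasks (JSONL)
    num_machine_instructions=5,      # machine-generated instructions
    data_output_path="./data_output.json",
    human_to_machine_ratio=(6, 2),   # sample 6 human tasks per 2 machine tasks
)

pipeline.generate()
```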
Custom Filtering Example
Use custom filters to refine and deduplicate instructions as needed:
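A sketch, assuming InstructionFilter is importable from camel.datagen.self_instruct and accepts a configuration dictionary; the nested threshold key is a hypothetical option name for illustration:

```python
from camel.agents import ChatAgent
from camel.datagen.self_instruct import InstructionFilter, SelfInstructPipeline

# Emphasize language filtering and ROUGE-based deduplication.
custom_filter = InstructionFilter({
    "non_english": {},                       # drop non-English instructions
    "rouge_similarity": {"threshold": 0.6},  # hypothetical threshold key
})

pipeline = SelfInstructPipeline(
    agent=ChatAgent(),
    seed="seed_tasks.jsonl",
    instruction_filter=custom_filter,  # replaces the default filter chain
)
pipeline.generate()
```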
Pipeline Stages
Seed Loading
Load and validate human-written instructions from a JSONL file; initialize task storage.
Instruction Generation
Sample both human and machine tasks based on your chosen ratio, then generate new instructions with ChatAgent and apply filters.
Task Classification
Automatically determine whether each task is a classification task, then generate the appropriate prompts for each type.
Instance Generation
Generate input-output pairs, format and parse instances, and apply quality filters.
Data Output
Save all generated instructions and their instances to JSON, with metadata and configuration details.
Pipeline Parameters
- agent: ChatAgent instance for generating instructions
- seed: Path to human-written seed tasks in JSONL format
- num_machine_instructions: Number of machine-generated instructions (default: 5)
- data_output_path: Path for saving generated data (default: ./data_output.json)
- human_to_machine_ratio: Ratio of human to machine tasks (default: (6, 2))
- instruction_filter: Custom InstructionFilter instance (optional)
- filter_config: Configuration dictionary for default filters (optional)
Filter Configuration
The default filter configuration supports:
- length: Configure length constraints for instructions
- keyword: Set up keyword-based filtering rules
- punctuation: Define punctuation validation rules
- non_english: Non-English text detection
- rouge_similarity: Set ROUGE similarity thresholds for deduplication
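For illustration, a hedged sketch of a full filter_config; the nested option names (min_len, keyword_list, threshold) are hypothetical placeholders, not confirmed option names:

```python
# Hypothetical option names for illustration; consult the filter
# implementations for the exact keys each filter accepts.
filter_config = {
    "length": {"min_len": 5, "max_len": 200},
    "keyword": {"keyword_list": ["image", "video", "graph"]},
    "punctuation": {},       # use default punctuation rules
    "non_english": {},       # use default language detection
    "rouge_similarity": {"threshold": 0.7},
}
```

Passing this dictionary as filter_config to SelfInstructPipeline is expected to build the default InstructionFilter with these settings.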
Input/Output Format
Seed tasks are read from a JSONL file of human-written instructions; generated instructions and their instances are written to a JSON file together with metadata and configuration details.
Source2Synth: Multi-hop Question-Answer Generation
Source2Synth generates complex multi-hop QA pairs from source text (or code) via an orchestrated pipeline of AI-driven and rule-based steps, with curation and complexity control.
Core Components
UserDataProcessor: Orchestrates the full pipeline, from raw text through QA generation and curation.
ExampleConstructor: Builds multi-hop QA examples, extracting premise, intermediate steps, and conclusions.
DataCurator: Filters, deduplicates, and samples the final dataset to match quality and complexity requirements.
Key Features
- Batch or single text processing
- Switchable AI or rule-based question generation
- Multi-hop QA and complexity scoring
- Integrated curation, deduplication, and reproducible sampling
- Seamless MultiHopGeneratorAgent integration
Quick Start: Source2Synth Pipeline
Rapidly generate a multi-hop QA dataset from your own text or source files:
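A minimal sketch; the ProcessorConfig fields follow the parameter list below, while the process_text method name and its source argument are assumptions:

```python
from camel.datagen.source2synth import ProcessorConfig, UserDataProcessor

config = ProcessorConfig(
    seed=42,                   # reproducible sampling
    min_length=50,             # skip texts shorter than this
    max_length=1000,           # skip texts longer than this
    complexity_threshold=0.5,  # drop QA pairs below this complexity score
    dataset_size=10,           # target size of the curated dataset
    use_ai_model=True,         # AI-driven (vs. rule-based) question generation
)

processor = UserDataProcessor(config)

text = (
    "Alice studied chemistry in Paris. She later founded a lab in Berlin "
    "that developed a widely used catalyst."
)
# Single-text processing; a batch variant is described above as well.
dataset = processor.process_text(text, source="example")
```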
ProcessorConfig Parameters
- seed: Random seed for reproducibility
- min_length: Minimum text length for processing
- max_length: Maximum text length for processing
- complexity_threshold: Minimum complexity score (0.0–1.0)
- dataset_size: Target size for the final dataset
- use_ai_model: Toggle between AI model and rule-based generation
- hop_generating_agent: Custom MultiHopGeneratorAgent (optional)
Pipeline Stages
Text Preprocessing
Validate text length and quality; standardize for processing.
Information Extraction
Identify premises, extract intermediate facts, and form conclusions.
QA Generation
Generate multi-hop questions, validate answers, and score for complexity.
Dataset Curation
Filter for quality, enforce complexity thresholds, deduplicate, and sample to target size.
Self-Improving CoT Data Generation
This pipeline implements self-taught reasoning—an iterative process where an AI agent refines its own reasoning traces via self-evaluation, feedback, and reward models for continual improvement.
Key Components
SelfImprovingCoTPipeline: Implements the STaR (Self-Taught Reasoning) methodology, supporting both agent-based and external reward model evaluation, iterative feedback loops, and flexible output formats.
- Customizable reasoning and evaluation agents
- Support for reward models and custom thresholds
- Few-shot learning and rich output options
Architecture Stages
Initial Reasoning Trace Generation
The pipeline generates an initial reasoning path for each problem using the designated agent.
Self-Evaluation
An evaluator agent (or reward model) critically reviews each reasoning trace for quality, clarity, and correctness.
Feedback-Based Improvement
The system refines and re-generates reasoning steps using the evaluation feedback.
Iterative Refinement
This evaluation-feedback loop is repeated for a configurable number of iterations to reach optimal performance.
Quick Start: Self-Improving CoT Pipeline
Launch a self-improving reasoning workflow with just a few lines:
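A minimal sketch; the problems schema and the generate method name are assumptions based on the stages described above:

```python
from camel.agents import ChatAgent
from camel.datagen import SelfImprovingCoTPipeline

reason_agent = ChatAgent("You solve problems with step-by-step reasoning.")
evaluate_agent = ChatAgent("You evaluate reasoning traces for quality.")

# Each problem is assumed to be a dict with at least a "problem" field.
problems = [{"problem": "If x + 2 = 5, what is x?"}]

pipeline = SelfImprovingCoTPipeline(
    reason_agent=reason_agent,
    evaluate_agent=evaluate_agent,
    problems=problems,
    max_iterations=3,              # evaluation-feedback loops per problem
    output_path="star_output.json",
)

results = pipeline.generate()
```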
Advanced: External Reward Model Integration
Evaluate and guide reasoning traces with an external reward model, such as Nemotron:
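A sketch assuming CAMEL's Nemotron reward model wrapper; the import path, enum name, and the dict form of score_threshold are assumptions:

```python
from camel.agents import ChatAgent
from camel.datagen import SelfImprovingCoTPipeline
from camel.models.reward import NemotronRewardModel
from camel.types import ModelType

# Nemotron reward model served via NVIDIA's API endpoint.
reward_model = NemotronRewardModel(
    model_type=ModelType.NVIDIA_NEMOTRON_340B_REWARD,
    url="https://integrate.api.nvidia.com/v1",
    api_key="nvapi-...",  # your NVIDIA API key
)

pipeline = SelfImprovingCoTPipeline(
    reason_agent=ChatAgent(),
    evaluate_agent=ChatAgent(),
    problems=[{"problem": "If x + 2 = 5, what is x?"}],
    reward_model=reward_model,  # external scoring of reasoning traces
    # Per-dimension thresholds; the exact dimension names are assumptions.
    score_threshold={"correctness": 0.8, "coherence": 0.7},
)
results = pipeline.generate()
```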
Input/Output Format
Each generated record in the output JSON includes:
- Original problem
- Final reasoning trace
- Improvement history with iterations
- Evaluation scores and feedback per iteration
Configuration Options
- max_iterations: Maximum number of improvement iterations (default: 3)
- score_threshold: Minimum quality threshold for evaluation dimensions (default: 0.7)
- few_shot_examples: (Optional) Examples for few-shot learning
- output_path: (Optional) Path for saving generated results