Datagen
This document describes CAMEL’s key data generation modules that enable high-quality training data creation through advanced reasoning and instruction tuning techniques. The modules include:
- Chain of Thought (CoT): Generates explicit reasoning paths
- Self-Instruct: Creates diverse instruction-following data
- Source2Synth: Generates multi-hop question-answer pairs from source text
- Self-Improving CoT: Iteratively refines reasoning chains through self-critique
Chain of Thought (CoT) Data Generation
Overview
The Chain of Thought (CoT) data generation module implements a sophisticated system for generating high-quality reasoning paths through chat agent interactions. It combines several advanced algorithms to produce and validate reasoning chains.
Key Features
- Monte Carlo Tree Search (MCTS) for solution exploration
- Binary Search Error Detection for precise error localization
- Dual-Agent Verification System for quality assurance
- Solution Tree Management for tracking reasoning paths
Core Components
CoTDataGenerator Class
The main class that implements the CoT generation system with the following capabilities:
- Dual-Agent Architecture: Supports both single-agent (legacy) and dual-agent modes
- Answer Generation: Sophisticated answer generation with MCTS
- Answer Verification: Robust verification system using golden answers
- Error Detection: Binary search-based error detection in solutions
- Solution Management: Comprehensive solution tree management and export
Usage
Basic Example
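The sketch below shows a typical dual-agent setup. The import paths and method names (`CoTDataGenerator`, `solve`) follow the CAMEL API as of this writing; verify them against your installed version.

```python
from camel.agents import ChatAgent
from camel.datagen import CoTDataGenerator

# Specialized agents for generation and verification (dual-agent mode)
generator_agent = ChatAgent("You are a careful problem solver.")
verifier_agent = ChatAgent("You verify candidate answers against references.")

# Golden answers used to validate generated solutions
golden_answers = {
    "What is 2 + 2?": "4",
}

cot_generator = CoTDataGenerator(
    generator_agent=generator_agent,
    verifier_agent=verifier_agent,
    golden_answers=golden_answers,
    search_limit=100,  # maximum number of search iterations
)

# Generate a verified reasoning path for a question
solution = cot_generator.solve("What is 2 + 2?")
print(solution)
```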
Data Import/Export
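A sketch of round-tripping data through the generator; the helper names `import_qa_from_json` and `export_solutions` are based on the current CAMEL API and may differ in your version.

```python
# Load question-answer pairs from a JSON file into the generator
cot_generator.import_qa_from_json("qa_data.json")

# Export the solution tree, including scores and verification results
cot_generator.export_solutions("solutions.json")
```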
Solution Generation Process
1. Direct Solution Attempt
   - First tries to solve the problem directly
   - Verifies the result against the golden answer
2. MCTS-based Exploration
   - If the direct solution fails, uses MCTS to explore the solution space
   - Builds a solution tree based on previous attempts
3. Error Detection and Correction
   - Uses binary search to locate errors in solutions (see the sketch below)
   - Generates new solutions based on the correct parts
4. Solution Verification
   - Verifies solutions using the dual-agent system or golden answers
   - Maintains solution quality through strict verification
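To make the error-detection step concrete, here is an illustrative, self-contained sketch (not the library's internal code) of binary-search error localization. It assumes a hypothetical `verify_prefix` callable that checks whether the first k reasoning steps are still consistent with the golden answer, and that correctness is prefix-monotonic: once a step is wrong, every longer prefix fails verification too.

```python
from typing import Callable, List

def locate_first_error(
    steps: List[str],
    verify_prefix: Callable[[List[str]], bool],
) -> int:
    """Return the number of leading steps that verify correctly.

    Assumes prefix-monotonic correctness: if step i is wrong, every
    prefix containing it also fails verification.
    """
    lo, hi = 0, len(steps)  # invariant: steps[:lo] verified correct
    while lo < hi:
        mid = (lo + hi + 1) // 2  # bias upward so the loop terminates
        if verify_prefix(steps[:mid]):
            lo = mid      # first `mid` steps are fine; the error is later
        else:
            hi = mid - 1  # the first error is at or before index mid-1
    return lo             # steps[lo] (if it exists) is the first faulty step
```

With this invariant, the first faulty step is found in O(log n) verification calls rather than one per step, and generation can keep the verified prefix and regenerate only the remainder.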
Configuration Options
- `search_limit`: Maximum number of search iterations (default: 100)
- `generator_agent`: Specialized agent for answer generation
- `verifier_agent`: Specialized agent for answer verification
- `golden_answers`: Pre-defined correct answers for validation
Output Format
The solution tree is exported in JSON format containing:
- Solutions with intermediate steps
- Golden answers used for verification
- Export timestamp
- Solution scores and verification results
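An abridged, illustrative shape of the exported file (field names are indicative, not exact):

```json
{
  "export_time": "2024-01-01T12:00:00",
  "golden_answers": {"What is 2 + 2?": "4"},
  "solutions": {
    "What is 2 + 2?": {
      "solution": "Step 1: ...\nStep 2: ...\nAnswer: 4",
      "score": 0.95,
      "verified": true
    }
  }
}
```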
Self-Instruct: Instruction Generation
Overview
The Self-Instruct module implements a pipeline for generating and managing machine-generated instructions for tasks. It combines human-written seed instructions with machine-generated ones to create diverse, high-quality task instructions, while ensuring quality through configurable filtering mechanisms.
Core Components
Self Instruct Pipeline
The main pipeline class that orchestrates the instruction generation process.
Key Features:
- Combines human-written and machine-generated instructions using configurable ratios
- Supports classification and non-classification task types
- Built-in instruction filtering and validation
- Automatic instance generation for tasks
- JSON-based data input/output
Instruction Filter
A comprehensive filtering system for validating and filtering generated instructions.
Features:
- Length-based filtering
- Keyword filtering
- Punctuation checks
- Non-English text detection
- ROUGE similarity filtering for deduplication
- Extensible filter registry for custom filters
Usage
Basic Example
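A minimal sketch of running the pipeline end to end; the parameter names mirror the pipeline options listed below, and the import path is based on the current CAMEL API.

```python
from camel.agents import ChatAgent
from camel.datagen.self_instruct import SelfInstructPipeline

# Agent that proposes new instructions
agent = ChatAgent()

pipeline = SelfInstructPipeline(
    agent=agent,
    seed="seed_tasks.jsonl",              # human-written seed tasks
    num_machine_instructions=5,           # machine instructions to generate
    data_output_path="./data_output.json",
    human_to_machine_ratio=(6, 2),        # 6 human + 2 machine tasks per prompt
)

# Run the full pipeline: generation, filtering, instance creation, output
pipeline.generate()
```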
Custom Filtering
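A sketch of plugging in a custom filter configuration. The `InstructionFilter` import path is an assumption, and the keys mirror the default filter configuration described below.

```python
from camel.datagen.self_instruct import InstructionFilter, SelfInstructPipeline

filter_config = {
    "length": {},           # default length constraints
    "keyword": {},          # keyword-based rejection rules
    "punctuation": {},      # punctuation validation
    "non_english": {},      # reject non-English instructions
    "rouge_similarity": {   # deduplicate near-identical instructions
        "rouge_type": "rouge1",
        "threshold": 0.7,
    },
}

pipeline = SelfInstructPipeline(
    agent=agent,
    seed="seed_tasks.jsonl",
    instruction_filter=InstructionFilter(filter_config),
)
pipeline.generate()
```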
Configuration Options
Pipeline Parameters
- `agent`: ChatAgent instance for generating instructions
- `seed`: Path to human-written seed tasks in JSONL format
- `num_machine_instructions`: Number of machine-generated instructions to generate (default: 5)
- `data_output_path`: Path for saving generated data (default: `./data_output.json`)
- `human_to_machine_ratio`: Ratio of human to machine tasks for generation (default: (6, 2))
- `instruction_filter`: Custom InstructionFilter instance (optional)
- `filter_config`: Configuration dictionary for default filters (optional)
Filter Configuration
The default filter configuration includes:
- `length`: Configure length constraints for instructions
- `keyword`: Set up keyword-based filtering rules
- `punctuation`: Define punctuation validation rules
- `non_english`: Configure non-English text detection
- `rouge_similarity`: Set ROUGE similarity thresholds for deduplication
Pipeline Stages
1. Seed Loading
   - Load human-written instructions from the JSONL file
   - Validate seed format
   - Initialize task storage
2. Instruction Generation
   - Sample human and machine tasks based on the configured ratio
   - Generate new instructions using the ChatAgent
   - Apply instruction filters
3. Task Classification
   - Identify whether tasks are classification or non-classification
   - Generate appropriate prompts based on task type
4. Instance Generation
   - Generate input-output pairs for each task
   - Parse and format instances based on task type
   - Apply quality filters
5. Data Output
   - Save generated tasks and instances to JSON
   - Include metadata and configuration details
   - Maintain structured output format
Input/Output Format
Seed Tasks (Input)
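Seed tasks are provided one JSON object per line (JSONL). A minimal illustrative file, assuming each entry carries at least an `instruction` field (check your CAMEL version for the full expected schema):

```json
{"instruction": "Rewrite the given sentence in the passive voice."}
{"instruction": "Classify the sentiment of this review as positive or negative."}
```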
Generated Data (Output)
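The output JSON bundles generated instructions with their instances; the field names below are indicative rather than exact:

```json
{
  "machine_instructions": [
    {
      "instruction": "List three renewable energy sources.",
      "is_classification": false,
      "instances": [
        {"input": "", "output": "Solar, wind, and hydroelectric power."}
      ]
    }
  ]
}
```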
Source2Synth: Multi-hop Question-Answer Generation
Overview
Source2Synth is a sophisticated data generation system designed to create multi-hop question-answer pairs from source text data. It implements a pipeline that processes raw text, extracts information pairs, and generates complex, multi-hop reasoning questions with configurable complexity thresholds.
Core Components
UserDataProcessor
The main orchestrator class that manages the entire pipeline from text processing to dataset generation.
Features:
- Single text and batch processing capabilities
- Configurable AI model or rule-based processing
- Integration with MultiHopGeneratorAgent for QA generation
- Random seed control for reproducibility
ExampleConstructor
Handles the construction of training examples from raw text data.
Features:
- Text preprocessing and quality validation
- Information pair extraction with premise-intermediate-conclusion relationships
- Multi-hop QA pair generation using AI or rule-based approaches
- Complexity scoring for generated examples
DataCurator
Manages and curates datasets of multi-hop question-answer pairs.
Features:
- Quality filtering based on configurable criteria
- Complexity threshold filtering
- Deduplication of similar examples
- Dataset sampling to achieve target size
- Random seed control for reproducible sampling
Usage
Basic Example
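A sketch of configuring and running the processor. `ProcessorConfig` fields match the options listed below; the `process_text`/`process_batch` method names follow the current CAMEL API and should be checked against your installed version.

```python
from camel.datagen.source2synth import (
    ProcessorConfig,
    UserDataProcessor,
)

# Configure the pipeline (see ProcessorConfig options below)
config = ProcessorConfig(
    seed=42,
    min_length=50,
    max_length=1000,
    complexity_threshold=0.5,
    dataset_size=10,
    use_ai_model=True,
)

processor = UserDataProcessor(config)

# Process a single text into multi-hop QA examples
text = (
    "Alice studied chemistry in Cambridge. After graduating, she moved to "
    "Berlin, where she founded a battery-materials startup."
)
results = processor.process_text(text, source="user_input")

# Or process a batch of texts with matching source labels
batch_results = processor.process_batch([text, text], ["doc_1", "doc_2"])
```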
Configuration Options
ProcessorConfig
Key parameters:
- `seed`: Random seed for reproducibility
- `min_length`: Minimum text length for processing
- `max_length`: Maximum text length for processing
- `complexity_threshold`: Minimum complexity score (0.0-1.0)
- `dataset_size`: Target size for the final dataset
- `use_ai_model`: Toggle between AI model and rule-based processing
- `hop_generating_agent`: Custom MultiHopGeneratorAgent instance (optional)
Pipeline Stages
1. Text Preprocessing
   - Length validation
   - Quality checks
   - Text standardization
2. Information Extraction
   - Premise identification
   - Intermediate relationship extraction
   - Conclusion formation
3. QA Generation
   - Multi-hop question generation
   - Answer validation
   - Complexity scoring
4. Dataset Curation
   - Quality filtering
   - Complexity thresholding
   - Deduplication
   - Target size sampling
Self-Improving CoT Data Generation
Overview
The Self-Improving CoT Data Generation pipeline implements an iterative approach to generating and improving reasoning traces for problem-solving tasks. The implementation is based on the self-taught reasoning (STaR) methodology, in which an AI agent learns to improve its reasoning process through self-evaluation and feedback.
Architecture
The pipeline consists of four main stages:
- Initial reasoning trace generation
- Self-evaluation
- Feedback-based improvement
- Iterative refinement
Key Components
SelfImprovingCoTPipeline Class
The core class that implements the STaR methodology with the following features:
- Customizable reasoning and evaluation agents
- Support for both agent-based evaluation and external reward models
- Configurable quality thresholds for different evaluation dimensions
- Iterative improvement process with customizable maximum iterations
- Optional few-shot examples for better reasoning generation
- Flexible output formats and file saving options
Usage
Basic Example
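A minimal sketch of running the pipeline with two agents; the class and parameter names (`SelfImprovingCoTPipeline`, `reason_agent`, `evaluate_agent`) follow the current CAMEL API and may differ in your version.

```python
from camel.agents import ChatAgent
from camel.datagen import SelfImprovingCoTPipeline

# Agents for reasoning-trace generation and self-evaluation
reason_agent = ChatAgent("You solve problems step by step, showing your reasoning.")
evaluate_agent = ChatAgent("You evaluate reasoning traces for correctness and clarity.")

# Problems to solve (see the input format below)
problems = [{"problem": "If 3x + 5 = 20, what is x?"}]

pipeline = SelfImprovingCoTPipeline(
    reason_agent=reason_agent,
    evaluate_agent=evaluate_agent,
    problems=problems,
    max_iterations=3,
    output_path="star_output.json",
)

results = pipeline.generate()
```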
Advanced Usage with External Reward Models
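Continuing from the basic example above, an external reward model can replace agent-based scoring. The reward-model class and `ModelType` member below are assumptions based on CAMEL's reward-model integration; substitute whichever reward model your installation provides.

```python
from camel.models.reward import NemotronRewardModel
from camel.types import ModelType

# An external reward model scores traces instead of the evaluator agent
reward_model = NemotronRewardModel(
    model_type=ModelType.NVIDIA_NEMOTRON_340B_REWARD,
)

pipeline = SelfImprovingCoTPipeline(
    reason_agent=reason_agent,
    evaluate_agent=evaluate_agent,
    problems=problems,
    reward_model=reward_model,
    # Per-dimension thresholds a trace must reach to stop iterating
    score_threshold={
        "correctness": 0.8,
        "clarity": 0.7,
    },
)
results = pipeline.generate()
```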
Input/Output Formats
Input Format
The pipeline expects problems in a JSON format:
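For example (the `solution` field is optional):

```json
{
  "problems": [
    {
      "problem": "If 3x + 5 = 20, what is x?",
      "solution": "x = 5"
    }
  ]
}
```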
Output Format
The pipeline generates output in JSON format containing:
- Original problem
- Final reasoning trace
- Improvement history with iterations
- Evaluation scores and feedback for each iteration
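An illustrative (abridged) record for one problem; field names are indicative, not exact:

```json
{
  "problem": "If 3x + 5 = 20, what is x?",
  "final_trace": "Subtract 5 from both sides: 3x = 15. Divide by 3: x = 5.",
  "improvement_history": [
    {
      "iteration": 1,
      "trace": "3x = 15, so x = 5.",
      "evaluation": {"correctness": 0.9, "clarity": 0.6, "feedback": "Show the subtraction step."}
    }
  ]
}
```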
Configuration Options
- `max_iterations`: Maximum number of improvement iterations (default: 3)
- `score_threshold`: Quality thresholds for evaluation dimensions (default: 0.7)
- `few_shot_examples`: Optional examples for few-shot learning
- `output_path`: Path for saving results (optional)