Datagen
This document describes CAMEL’s key data generation modules that enable high-quality training data creation through advanced reasoning and instruction tuning techniques. The modules include:
- Chain of Thought (CoT): Generates explicit reasoning paths
- Self-Instruct: Creates diverse instruction-following data
- Source2Synth: Generates multi-hop question-answer pairs from source text
- Self-Improving CoT: Iteratively refines reasoning chains through self-critique
Chain of Thought (CoT) Data Generation
Overview
The Chain of Thought (CoT) data generation module implements a sophisticated system for generating high-quality reasoning paths through chat agent interactions. It combines several advanced algorithms to produce and validate reasoning chains.
Key Features
- Monte Carlo Tree Search (MCTS) for solution exploration
- Binary Search Error Detection for precise error localization
- Dual-Agent Verification System for quality assurance
- Solution Tree Management for tracking reasoning paths
Core Components
CoTDataGenerator Class
The main class that implements the CoT generation system with the following capabilities:
- Dual-Agent Architecture: Supports both single-agent (legacy) and dual-agent modes
- Answer Generation: Sophisticated answer generation with MCTS
- Answer Verification: Robust verification system using golden answers
- Error Detection: Binary search-based error detection in solutions
- Solution Management: Comprehensive solution tree management and export
Usage
Basic Example
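The sketch below shows a typical dual-agent setup. The import paths and method names (`CoTDataGenerator`, `solve`) follow the CAMEL API as of this writing; verify them against your installed version.

```python
from camel.agents import ChatAgent
from camel.datagen import CoTDataGenerator

# Specialized agents for generation and verification (dual-agent mode)
generator_agent = ChatAgent("You are a careful problem solver.")
verifier_agent = ChatAgent("You verify candidate answers against references.")

# Golden answers used to validate generated solutions
golden_answers = {
    "What is 2 + 2?": "4",
}

cot_generator = CoTDataGenerator(
    generator_agent=generator_agent,
    verifier_agent=verifier_agent,
    golden_answers=golden_answers,
    search_limit=100,  # maximum number of search iterations
)

# Generate a verified reasoning path for a question
solution = cot_generator.solve("What is 2 + 2?")
print(solution)
```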
Data Import/Export
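A sketch of round-tripping data through the generator; the helper names `import_qa_from_json` and `export_solutions` are based on the current CAMEL API and may differ in your version.

```python
# Load question-answer pairs from a JSON file into the generator
cot_generator.import_qa_from_json("qa_data.json")

# Export the solution tree, including scores and verification results
cot_generator.export_solutions("solutions.json")
```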
Solution Generation Process
1. Direct Solution Attempt
   - First tries to solve the problem directly
   - Verifies the result against the golden answer
2. MCTS-based Exploration
   - If the direct solution fails, uses MCTS to explore the solution space
   - Builds a solution tree based on previous attempts
3. Error Detection and Correction
   - Uses binary search to locate errors in solutions (see the sketch below)
   - Generates new solutions based on the correct parts
4. Solution Verification
   - Verifies solutions using the dual-agent system or golden answers
   - Maintains solution quality through strict verification
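To make the error-detection step concrete, here is an illustrative, self-contained sketch (not the library's internal code) of binary-search error localization. It assumes a hypothetical `verify_prefix` callable that checks whether the first k reasoning steps are still consistent with the golden answer, and that correctness is prefix-monotonic: once a step is wrong, every longer prefix fails verification too.

```python
from typing import Callable, List

def locate_first_error(
    steps: List[str],
    verify_prefix: Callable[[List[str]], bool],
) -> int:
    """Return the number of leading steps that verify correctly.

    Assumes prefix-monotonic correctness: if step i is wrong, every
    prefix containing it also fails verification.
    """
    lo, hi = 0, len(steps)  # invariant: steps[:lo] verified correct
    while lo < hi:
        mid = (lo + hi + 1) // 2  # bias upward so the loop terminates
        if verify_prefix(steps[:mid]):
            lo = mid      # first `mid` steps are fine; the error is later
        else:
            hi = mid - 1  # the first error is at or before index mid-1
    return lo             # steps[lo] (if it exists) is the first faulty step
```

With this invariant, the first faulty step is found in O(log n) verification calls rather than one per step, and generation can keep the verified prefix and regenerate only the remainder.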
Configuration Options
- `search_limit`: Maximum number of search iterations (default: 100)
- `generator_agent`: Specialized agent for answer generation
- `verifier_agent`: Specialized agent for answer verification
- `golden_answers`: Pre-defined correct answers for validation
Output Format
The solution tree is exported in JSON format containing:
- Solutions with intermediate steps
- Golden answers used for verification
- Export timestamp
- Solution scores and verification results
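An abridged, illustrative shape of the exported file (field names are indicative, not exact):

```json
{
  "export_time": "2024-01-01T12:00:00",
  "golden_answers": {"What is 2 + 2?": "4"},
  "solutions": {
    "What is 2 + 2?": {
      "solution": "Step 1: ...\nStep 2: ...\nAnswer: 4",
      "score": 0.95,
      "verified": true
    }
  }
}
```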
Self-Instruct: Instruction Generation
Overview
The Self-Instruct module implements a pipeline for generating and managing machine-generated instructions for tasks. It combines human-written seed instructions with machine-generated ones to create diverse, high-quality task instructions, while ensuring quality through configurable filtering mechanisms.
Core Components
Self Instruct Pipeline
The main pipeline class that orchestrates the instruction generation process.
Key Features:
- Combines human-written and machine-generated instructions using configurable ratios
- Supports classification and non-classification task types
- Built-in instruction filtering and validation
- Automatic instance generation for tasks
- JSON-based data input/output
Instruction Filter
A comprehensive filtering system for validating and filtering generated instructions.
Features:
- Length-based filtering
- Keyword filtering
- Punctuation checks
- Non-English text detection
- ROUGE similarity filtering for deduplication
- Extensible filter registry for custom filters
Usage
Basic Example
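A minimal sketch of running the pipeline end to end; the parameter names mirror the pipeline options listed below, and the import path is based on the current CAMEL API.

```python
from camel.agents import ChatAgent
from camel.datagen.self_instruct import SelfInstructPipeline

# Agent that proposes new instructions
agent = ChatAgent()

pipeline = SelfInstructPipeline(
    agent=agent,
    seed="seed_tasks.jsonl",              # human-written seed tasks
    num_machine_instructions=5,           # machine instructions to generate
    data_output_path="./data_output.json",
    human_to_machine_ratio=(6, 2),        # 6 human + 2 machine tasks per prompt
)

# Run the full pipeline: generation, filtering, instance creation, output
pipeline.generate()
```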
Custom Filtering
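A sketch of plugging in a custom filter configuration. The `InstructionFilter` import path is an assumption, and the keys mirror the default filter configuration described below.

```python
from camel.datagen.self_instruct import InstructionFilter, SelfInstructPipeline

filter_config = {
    "length": {},           # default length constraints
    "keyword": {},          # keyword-based rejection rules
    "punctuation": {},      # punctuation validation
    "non_english": {},      # reject non-English instructions
    "rouge_similarity": {   # deduplicate near-identical instructions
        "rouge_type": "rouge1",
        "threshold": 0.7,
    },
}

pipeline = SelfInstructPipeline(
    agent=agent,
    seed="seed_tasks.jsonl",
    instruction_filter=InstructionFilter(filter_config),
)
pipeline.generate()
```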
Configuration Options
Pipeline Parameters
- `agent`: ChatAgent instance for generating instructions
- `seed`: Path to human-written seed tasks in JSONL format
- `num_machine_instructions`: Number of machine-generated instructions to generate (default: 5)
- `data_output_path`: Path for saving generated data (default: `./data_output.json`)
- `human_to_machine_ratio`: Ratio of human to machine tasks for generation (default: (6, 2))
- `instruction_filter`: Custom InstructionFilter instance (optional)
- `filter_config`: Configuration dictionary for default filters (optional)
Filter Configuration
The default filter configuration includes:
- `length`: Configure length constraints for instructions
- `keyword`: Set up keyword-based filtering rules
- `punctuation`: Define punctuation validation rules
- `non_english`: Configure non-English text detection
- `rouge_similarity`: Set ROUGE similarity thresholds for deduplication
Pipeline Stages
1. Seed Loading
   - Load human-written instructions from the JSONL file
   - Validate seed format
   - Initialize task storage
2. Instruction Generation
   - Sample human and machine tasks based on the configured ratio
   - Generate new instructions using the ChatAgent
   - Apply instruction filters
3. Task Classification
   - Identify whether tasks are classification or non-classification
   - Generate appropriate prompts based on task type
4. Instance Generation
   - Generate input-output pairs for each task
   - Parse and format instances based on task type
   - Apply quality filters
5. Data Output
   - Save generated tasks and instances to JSON
   - Include metadata and configuration details
   - Maintain structured output format
Input/Output Format
Seed Tasks (Input)
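Seed tasks are provided one JSON object per line (JSONL). A minimal illustrative file, assuming each entry carries at least an `instruction` field (check your CAMEL version for the full expected schema):

```json
{"instruction": "Rewrite the given sentence in the passive voice."}
{"instruction": "Classify the sentiment of this review as positive or negative."}
```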
Generated Data (Output)
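The output JSON bundles generated instructions with their instances; the field names below are indicative rather than exact:

```json
{
  "machine_instructions": [
    {
      "instruction": "List three renewable energy sources.",
      "is_classification": false,
      "instances": [
        {"input": "", "output": "Solar, wind, and hydroelectric power."}
      ]
    }
  ]
}
```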
Source2Synth: Multi-hop Question-Answer Generation
Overview
Source2Synth is a sophisticated data generation system designed to create multi-hop question-answer pairs from source text data. It implements a pipeline that processes raw text, extracts information pairs, and generates complex, multi-hop reasoning questions with configurable complexity thresholds.
Core Components
UserDataProcessor
The main orchestrator class that manages the entire pipeline from text processing to dataset generation.
Features:
- Single text and batch processing capabilities
- Configurable AI model or rule-based processing
- Integration with MultiHopGeneratorAgent for QA generation
- Random seed control for reproducibility
ExampleConstructor
Handles the construction of training examples from raw text data.
Features:
- Text preprocessing and quality validation
- Information pair extraction with premise-intermediate-conclusion relationships
- Multi-hop QA pair generation using AI or rule-based approaches
- Complexity scoring for generated examples
DataCurator
Manages and curates datasets of multi-hop question-answer pairs.
Features:
- Quality filtering based on configurable criteria
- Complexity threshold filtering
- Deduplication of similar examples
- Dataset sampling to achieve target size
- Random seed control for reproducible sampling
Usage
Basic Example
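A sketch of configuring and running the processor. `ProcessorConfig` fields match the options listed below; the `process_text`/`process_batch` method names follow the current CAMEL API and should be checked against your installed version.

```python
from camel.datagen.source2synth import (
    ProcessorConfig,
    UserDataProcessor,
)

# Configure the pipeline (see ProcessorConfig options below)
config = ProcessorConfig(
    seed=42,
    min_length=50,
    max_length=1000,
    complexity_threshold=0.5,
    dataset_size=10,
    use_ai_model=True,
)

processor = UserDataProcessor(config)

# Process a single text into multi-hop QA examples
text = (
    "Alice studied chemistry in Cambridge. After graduating, she moved to "
    "Berlin, where she founded a battery-materials startup."
)
results = processor.process_text(text, source="user_input")

# Or process a batch of texts with matching source labels
batch_results = processor.process_batch([text, text], ["doc_1", "doc_2"])
```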
Configuration Options
ProcessorConfig
Key parameters:
- `seed`: Random seed for reproducibility
- `min_length`: Minimum text length for processing
- `max_length`: Maximum text length for processing
- `complexity_threshold`: Minimum complexity score (0.0-1.0)
- `dataset_size`: Target size for the final dataset
- `use_ai_model`: Toggle between AI model and rule-based processing
- `hop_generating_agent`: Custom MultiHopGeneratorAgent instance (optional)
Pipeline Stages
1. Text Preprocessing
   - Length validation
   - Quality checks
   - Text standardization
2. Information Extraction
   - Premise identification
   - Intermediate relationship extraction
   - Conclusion formation
3. QA Generation
   - Multi-hop question generation
   - Answer validation
   - Complexity scoring
4. Dataset Curation
   - Quality filtering
   - Complexity thresholding
   - Deduplication
   - Target size sampling
Self-Improving CoT Data Generation
Overview
The Self-Improving CoT Data Generation pipeline implements an iterative approach to generating and improving reasoning traces for problem-solving tasks. The implementation is based on the self-taught reasoning (STaR) methodology, in which an AI agent learns to improve its reasoning process through self-evaluation and feedback.
Architecture
The pipeline consists of four main stages:
- Initial reasoning trace generation
- Self-evaluation
- Feedback-based improvement
- Iterative refinement
Key Components
SelfImprovingCoTPipeline Class
The core class that implements the STaR methodology with the following features:
- Customizable reasoning and evaluation agents
- Support for both agent-based evaluation and external reward models
- Configurable quality thresholds for different evaluation dimensions
- Iterative improvement process with customizable maximum iterations
- Optional few-shot examples for better reasoning generation
- Flexible output formats and file saving options
Usage
Basic Example
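A minimal sketch of running the pipeline with two agents; the class and parameter names (`SelfImprovingCoTPipeline`, `reason_agent`, `evaluate_agent`) follow the current CAMEL API and may differ in your version.

```python
from camel.agents import ChatAgent
from camel.datagen import SelfImprovingCoTPipeline

# Agents for reasoning-trace generation and self-evaluation
reason_agent = ChatAgent("You solve problems step by step, showing your reasoning.")
evaluate_agent = ChatAgent("You evaluate reasoning traces for correctness and clarity.")

# Problems to solve (see the input format below)
problems = [{"problem": "If 3x + 5 = 20, what is x?"}]

pipeline = SelfImprovingCoTPipeline(
    reason_agent=reason_agent,
    evaluate_agent=evaluate_agent,
    problems=problems,
    max_iterations=3,
    output_path="star_output.json",
)

results = pipeline.generate()
```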
Advanced Usage with External Reward Models
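Continuing from the basic example above, an external reward model can replace agent-based scoring. The reward-model class and `ModelType` member below are assumptions based on CAMEL's reward-model integration; substitute whichever reward model your installation provides.

```python
from camel.models.reward import NemotronRewardModel
from camel.types import ModelType

# An external reward model scores traces instead of the evaluator agent
reward_model = NemotronRewardModel(
    model_type=ModelType.NVIDIA_NEMOTRON_340B_REWARD,
)

pipeline = SelfImprovingCoTPipeline(
    reason_agent=reason_agent,
    evaluate_agent=evaluate_agent,
    problems=problems,
    reward_model=reward_model,
    # Per-dimension thresholds a trace must reach to stop iterating
    score_threshold={
        "correctness": 0.8,
        "clarity": 0.7,
    },
)
results = pipeline.generate()
```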
Input/Output Formats
Input Format
The pipeline expects problems in a JSON format:
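For example (the `solution` field is optional):

```json
{
  "problems": [
    {
      "problem": "If 3x + 5 = 20, what is x?",
      "solution": "x = 5"
    }
  ]
}
```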
Output Format
The pipeline generates output in JSON format containing:
- Original problem
- Final reasoning trace
- Improvement history with iterations
- Evaluation scores and feedback for each iteration
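An illustrative (abridged) record for one problem; field names are indicative, not exact:

```json
{
  "problem": "If 3x + 5 = 20, what is x?",
  "final_trace": "Subtract 5 from both sides: 3x = 15. Divide by 3: x = 5.",
  "improvement_history": [
    {
      "iteration": 1,
      "trace": "3x = 15, so x = 5.",
      "evaluation": {"correctness": 0.9, "clarity": 0.6, "feedback": "Show the subtraction step."}
    }
  ]
}
```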
Configuration Options
- `max_iterations`: Maximum number of improvement iterations (default: 3)
- `score_threshold`: Quality thresholds for evaluation dimensions (default: 0.7)
- `few_shot_examples`: Optional examples for few-shot learning
- `output_path`: Path for saving results (optional)