This page introduces CAMEL’s data generation modules for creating high-quality training data with explicit reasoning, diverse instructions, and advanced automated refinement.
  • Chain of Thought (CoT): Generates explicit reasoning paths
  • Self-Instruct: Produces instruction-following data from human-written seeds and machine-generated prompts
  • Source2Synth: Synthesizes multi-hop QA from source text or code
  • Self-Improving CoT: Iteratively improves reasoning through agent self-critique

Chain of Thought (CoT) Data Generation

Chain of Thought (CoT) data generation creates step-by-step reasoning paths for problem solving, leveraging dual agents and advanced search/verification logic.
  • Monte Carlo Tree Search (MCTS) for solution exploration
  • Binary Search Error Detection for precise error localization
  • Dual-Agent Verification System for quality assurance
  • Solution Tree Management for tracking reasoning paths
CoTDataGenerator Class

The main class implementing the CoT generation system, with the following capabilities:
  • Dual-Agent Architecture: Supports both single-agent (legacy) and dual-agent modes
  • Answer Generation: Sophisticated answer generation with MCTS
  • Answer Verification: Robust verification system using golden answers
  • Error Detection: Binary search-based error detection in solutions
  • Solution Management: Comprehensive solution tree management and export

Quick Start: CoT Data Generation

Spin up chain-of-thought data generation with dual agents, golden answers, and CoT solution generation:
from camel.agents import ChatAgent
from camel.datagen import CoTDataGenerator

# Initialize agents
generator_agent = ChatAgent("System message for generator")
verifier_agent = ChatAgent("System message for verifier")

# Define golden answers
golden_answers = {
    "question1": "answer1",
    "question2": "answer2"
}

# Create generator
cot_generator = CoTDataGenerator(
    generator_agent=generator_agent,
    verifier_agent=verifier_agent,
    golden_answers=golden_answers,
    search_limit=100
)

# Generate solution
solution = cot_generator.solve("question1")

Data Import/Export for CoT

Easily import question-answer pairs or export generated solutions for further use:
# Import QA pairs from JSON
cot_generator.import_qa_from_json("qa_pairs.json")

# Export solutions
cot_generator.export_solutions("solutions.json")
The CoT generation workflow runs in four stages:

1. Direct Solution Attempt

First, the agent attempts to solve the problem directly and checks the result against the golden answer for correctness.
2. MCTS-Based Exploration

If the direct attempt fails, a Monte Carlo Tree Search (MCTS) explores alternative reasoning paths, building a solution tree from previous attempts.
3. Error Detection & Correction

Binary search is used to efficiently pinpoint and isolate errors in the solution. New solutions are then generated, reusing verified-correct parts.
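The binary-search step can be sketched as follows. This is an illustrative, standalone sketch rather than CAMEL's implementation: `verify_prefix` is a hypothetical stand-in for the verifier agent's judgment on a prefix of reasoning steps.

```python
# Illustrative sketch (not the CAMEL implementation): locate the first
# faulty step in a reasoning chain using binary search over prefixes.
# `verify_prefix` is a hypothetical stand-in for the verifier agent.
def first_error_step(steps, verify_prefix):
    """Return the index of the first incorrect step, or None if all pass."""
    if verify_prefix(steps):              # whole chain verifies -> no error
        return None
    lo, hi = 0, len(steps) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if verify_prefix(steps[: mid + 1]):
            lo = mid + 1                  # prefix ok: error lies later
        else:
            hi = mid                      # prefix fails: error at or before mid
    return lo

# Example: steps 0-2 are correct, step 3 introduces the error.
steps = ["ok", "ok", "ok", "bad", "ok"]
checker = lambda prefix: "bad" not in prefix
print(first_error_step(steps, checker))  # -> 3
```

Because each probe verifies only a prefix, the error is localized in O(log n) verifier calls, and the verified-correct prefix can be reused when regenerating the solution.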
4. Solution Verification

All candidate solutions are strictly verified using a dual-agent system or comparison against golden answers to ensure high quality and accuracy.
Key CoTDataGenerator parameters:
  • search_limit: Maximum number of search iterations (default: 100)
  • generator_agent: Specialized agent for answer generation
  • verifier_agent: Specialized agent for answer verification
  • golden_answers: Pre-defined correct answers for validation
The solution tree is exported in JSON format containing:
  • Solutions with intermediate steps
  • Golden answers used for verification
  • Export timestamp
  • Solution scores and verification results
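A minimal sketch of what an exported solution tree might look like; the field names here are illustrative, not CAMEL's exact export schema.

```python
# Illustrative shape of an exported solution tree; exact field names in
# CAMEL's export may differ.
import json
from datetime import datetime, timezone

export = {
    "solutions": {
        "question1": {
            "solution": "Step 1: ... Step 2: ... Final answer: answer1",
            "score": 0.92,          # verification score (illustrative)
            "verified": True,
        }
    },
    "golden_answers": {"question1": "answer1"},
    "export_time": datetime.now(timezone.utc).isoformat(),
}
json.dumps(export, indent=2)  # serializes cleanly to JSON
```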

Self-Instruct: Instruction Generation

Self-Instruct is a pipeline for generating high-quality, diverse instructions by combining human-written seed tasks and machine-generated prompts, all filtered for quality and diversity.
  • Combines human-written and machine-generated instructions using configurable ratios
  • Supports both classification and non-classification task types
  • Built-in instruction filtering and validation
  • Automatic instance generation for tasks
  • JSON-based data input/output
SelfInstructPipeline – Orchestrates the end-to-end instruction generation, mixing seeds and machine prompts, filtering, and outputting results.

InstructionFilter – Handles validation and filtering of all generated instructions:
  • Length-based, keyword, and punctuation checks
  • Non-English text detection
  • ROUGE similarity filtering for deduplication
  • Extensible registry for custom filters

Quick Start: Self-Instruct Generation

Quickly set up an instruction generation pipeline with both human and machine prompts:
from camel.agents import ChatAgent
from camel.datagen.self_instruct import SelfInstructPipeline

# Initialize agent
agent = ChatAgent()

# Create pipeline with default settings
pipeline = SelfInstructPipeline(
    agent=agent,
    seed='seed_tasks.jsonl',  # Path to human-written seed tasks
    num_machine_instructions=5,
    data_output_path='./data_output.json',
    human_to_machine_ratio=(6, 2)
)

# Generate instructions
pipeline.generate()

Custom Filtering Example

Use custom filters to refine and deduplicate instructions as needed:
from camel.datagen.self_instruct import SelfInstructPipeline
from camel.datagen.self_instruct.filter import InstructionFilter

# Configure filters
filter_config = {
    "length": {},
    "keyword": {},
    "punctuation": {},
    "non_english": {},
    "rouge_similarity": {
        "threshold": 0.7,
        "metric": "rouge-l"
    }
}

pipeline = SelfInstructPipeline(
    agent=agent,
    seed='seed_tasks.jsonl',
    instruction_filter=InstructionFilter(filter_config),
    num_machine_instructions=5
)
The pipeline runs in five stages:

1. Seed Loading

Load and validate human-written instructions from JSONL file; initialize task storage.
2. Instruction Generation

Sample both human and machine tasks based on your chosen ratio, then generate new instructions with ChatAgent and apply filters.
3. Task Classification

Automatically determine if tasks are classification or not, and generate the right prompts for each type.
4. Instance Generation

Generate input-output pairs, format and parse instances, and apply quality filters.
5. Data Output

Save all generated instructions and their instances to JSON, with metadata and configuration details.
Key SelfInstructPipeline parameters:
  • agent: ChatAgent instance for generating instructions
  • seed: Path to human-written seed tasks in JSONL format
  • num_machine_instructions: Number of machine-generated instructions (default: 5)
  • data_output_path: Path for saving generated data (default: ./data_output.json)
  • human_to_machine_ratio: Ratio of human to machine tasks (default: (6, 2))
  • instruction_filter: Custom InstructionFilter instance (optional)
  • filter_config: Configuration dictionary for default filters (optional)
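The human_to_machine_ratio can be sketched as follows. Assuming the (6, 2) default means each generation prompt mixes 6 human seed tasks with 2 machine-generated ones (as in the original Self-Instruct recipe), the sampling step looks roughly like this; `sample_prompt_tasks` is a hypothetical helper, not CAMEL's API.

```python
import random

# Hypothetical sketch of applying a (6, 2) human_to_machine_ratio:
# each prompt samples 6 human seed tasks and 2 machine-generated ones.
def sample_prompt_tasks(human_tasks, machine_tasks, ratio=(6, 2), rng=None):
    rng = rng or random.Random(0)         # fixed seed for reproducibility
    n_human, n_machine = ratio
    picked = rng.sample(human_tasks, min(n_human, len(human_tasks)))
    picked += rng.sample(machine_tasks, min(n_machine, len(machine_tasks)))
    return picked

humans = [f"human task {i}" for i in range(10)]
machines = ["machine task 0", "machine task 1", "machine task 2"]
tasks = sample_prompt_tasks(humans, machines)
print(len(tasks))  # -> 8
```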
The default filter configuration supports:
  • length: Configure length constraints for instructions
  • keyword: Set up keyword-based filtering rules
  • punctuation: Define punctuation validation rules
  • non_english: Non-English text detection
  • rouge_similarity: Set ROUGE similarity thresholds for deduplication
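Conceptually, the rouge_similarity filter rejects a candidate instruction if it is too similar to any instruction already kept. The sketch below approximates ROUGE-L with a plain longest-common-subsequence F-score to show the idea; the real filter uses a proper ROUGE implementation, and the 0.7 threshold mirrors the example config above.

```python
# Conceptual sketch of ROUGE-L-style deduplication (illustrative only;
# the real filter uses a proper ROUGE library).
def lcs_len(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(cand, ref):
    c, r = cand.split(), ref.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    p, rec = lcs / len(c), lcs / len(r)
    return 2 * p * rec / (p + rec)        # LCS-based F-score

def passes_dedup(candidate, existing, threshold=0.7):
    return all(rouge_l(candidate, e) < threshold for e in existing)

existing = ["Generate a summary of the given paragraph."]
print(passes_dedup("Generate a summary of the given paragraph.", existing))  # -> False
print(passes_dedup("Translate this sentence into French.", existing))        # -> True
```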
Seed Tasks (Input):
{"instruction": "Classify the sentiment of this text as positive or negative."}
{"instruction": "Generate a summary of the given paragraph."}
Generated Data (Output):
{
  "machine_instructions": [
    {
      "instruction": "...",
      "is_classification": true,
      "instances": [
        {
          "input": "...",
          "output": "..."
        }
      ]
    }
  ]
}

Source2Synth: Multi-hop Question-Answer Generation

Source2Synth generates complex multi-hop QA pairs from source text (or code) via an orchestrated pipeline of AI-driven and rule-based steps, with curation and complexity control.
UserDataProcessor: Orchestrates the full pipeline, from raw text through QA generation and curation.

ExampleConstructor: Builds multi-hop QA examples, extracting premise, intermediate steps, and conclusions.

DataCurator: Filters, deduplicates, and samples the final dataset to match quality and complexity requirements.
  • Batch or single text processing
  • Switchable AI or rule-based question generation
  • Multi-hop QA and complexity scoring
  • Integrated curation, deduplication, and reproducible sampling
  • Seamless MultiHopGeneratorAgent integration

Quick Start: Source2Synth Pipeline

Rapidly generate a multi-hop QA dataset from your own text or source files:
from camel.datagen.source2synth import (
    UserDataProcessor,
    ProcessorConfig
)

# Create configuration
config = ProcessorConfig(
    seed=42,
    min_length=50,
    max_length=1000,
    complexity_threshold=0.5,
    dataset_size=10,
    use_ai_model=True,
)

# Initialize processor
processor = UserDataProcessor(config)

# Process a single text
result = processor.process_text(
    "Your source text here",
    source="example_source"
)

# Process multiple texts
texts = ["Text 1", "Text 2", "Text 3"]
sources = ["source1", "source2", "source3"]
batch_results = processor.process_batch(texts, sources)
ProcessorConfig parameters:
  • seed: Random seed for reproducibility
  • min_length: Minimum text length for processing
  • max_length: Maximum text length for processing
  • complexity_threshold: Minimum complexity score (0.0–1.0)
  • dataset_size: Target size for the final dataset
  • use_ai_model: Toggle between AI model and rule-based generation
  • hop_generating_agent: Custom MultiHopGeneratorAgent (optional)
The processing pipeline runs in four stages:

1. Text Preprocessing

Validate text length and quality; standardize for processing.
2. Information Extraction

Identify premises, extract intermediate facts, and form conclusions.
3. QA Generation

Generate multi-hop questions, validate answers, and score for complexity.
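To make the extraction and QA steps concrete, here is what a single multi-hop example might look like: answering the question requires chaining the premise with the intermediate fact. The field names are illustrative, not CAMEL's exact schema.

```python
# Illustrative multi-hop QA example: answering requires chaining two
# facts. Field names are illustrative, not CAMEL's exact schema.
example = {
    "premise": "Marie Curie was born in Warsaw.",
    "intermediate": "Warsaw is the capital of Poland.",
    "conclusion": "Marie Curie was born in the capital of Poland.",
    "question": "Was Marie Curie born in the capital of Poland?",
    "answer": "Yes",
    "num_hops": 2,   # two reasoning hops link premise to conclusion
}
```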
4. Dataset Curation

Filter for quality, enforce complexity thresholds, deduplicate, and sample to target size.
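The curation stage above can be sketched in a few lines. This is a minimal standalone sketch, assuming a per-example complexity score and the ProcessorConfig fields shown earlier (complexity_threshold, dataset_size, seed); it is not CAMEL's DataCurator implementation.

```python
import random

# Minimal sketch of the curation stage: filter by complexity, deduplicate
# questions, then sample down to the target size with a fixed seed.
def curate(examples, complexity_threshold=0.5, dataset_size=10, seed=42):
    kept, seen = [], set()
    for ex in examples:
        if ex["complexity"] < complexity_threshold:
            continue                      # enforce complexity threshold
        if ex["question"] in seen:
            continue                      # deduplicate by question text
        seen.add(ex["question"])
        kept.append(ex)
    rng = random.Random(seed)             # reproducible sampling
    if len(kept) > dataset_size:
        kept = rng.sample(kept, dataset_size)
    return kept

pool = [
    {"question": "Q1", "complexity": 0.8},
    {"question": "Q1", "complexity": 0.8},   # duplicate, dropped
    {"question": "Q2", "complexity": 0.3},   # too simple, dropped
    {"question": "Q3", "complexity": 0.6},
]
print([ex["question"] for ex in curate(pool)])  # -> ['Q1', 'Q3']
```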

Self-Improving CoT Data Generation

This pipeline implements self-taught reasoning—an iterative process where an AI agent refines its own reasoning traces via self-evaluation, feedback, and reward models for continual improvement.
SelfImprovingCoTPipeline: Implements the STaR (Self-Taught Reasoning) methodology, supporting both agent-based and external reward model evaluation, iterative feedback loops, and flexible output formats.

  • Customizable reasoning and evaluation agents
  • Support for reward models and custom thresholds
  • Few-shot learning and rich output options
Each run moves through four phases:

1. Initial Reasoning Trace Generation

The pipeline generates an initial reasoning path for each problem using the designated agent.
2. Self-Evaluation

An evaluator agent (or reward model) critically reviews each reasoning trace for quality, clarity, and correctness.
3. Feedback-Based Improvement

The system refines and re-generates reasoning steps using the evaluation feedback.
4. Iterative Refinement

This evaluation-feedback loop is repeated for a configurable number of iterations to reach optimal performance.
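The four phases above reduce to a simple control loop. In this schematic sketch, `generate` and `evaluate` are hypothetical stand-ins for the reasoning and evaluator agents; only the control flow mirrors the pipeline, not its actual API.

```python
# Schematic of the STaR-style evaluate-and-refine loop. `generate` and
# `evaluate` are hypothetical stand-ins for the reasoning and evaluator
# agents; only the control flow mirrors the pipeline described above.
def refine(problem, generate, evaluate, max_iterations=3, threshold=0.7):
    trace = generate(problem, feedback=None)        # initial reasoning trace
    history = []
    for _ in range(max_iterations):
        score, feedback = evaluate(problem, trace)  # self-evaluation
        history.append({"trace": trace, "score": score, "feedback": feedback})
        if score >= threshold:                      # quality reached: stop
            break
        trace = generate(problem, feedback=feedback)  # feedback-based retry
    return trace, history

# Toy agents: each feedback round lifts trace quality by 0.3.
def toy_generate(problem, feedback):
    n = 0 if feedback is None else feedback + 1
    return {"round": n, "quality": 0.2 + 0.3 * n}

def toy_evaluate(problem, trace):
    return trace["quality"], trace["round"]

final, hist = refine("example problem", toy_generate, toy_evaluate)
print(len(hist), round(final["quality"], 1))  # -> 3 0.8
```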

Quick Start: Self-Improving CoT Pipeline

Launch a self-improving reasoning workflow with just a few lines:
from camel.agents import ChatAgent
from camel.datagen import SelfImprovingCoTPipeline

# Initialize agents
reason_agent = ChatAgent(
    "Answer my question and give your final answer within \\boxed{}."
)

evaluate_agent = ChatAgent(
    "You are a highly critical teacher who evaluates the student's answers "
    "with a meticulous and demanding approach."
)

# Prepare your problems
problems = [
    {"problem": "Your problem text here"},
    # Add more problems...
]

# Create and run the pipeline
pipeline = SelfImprovingCoTPipeline(
    reason_agent=reason_agent,
    evaluate_agent=evaluate_agent,
    problems=problems,
    max_iterations=3,
    output_path="star_output.json"
)

results = pipeline.generate()

Advanced: External Reward Model Integration

Evaluate and guide reasoning traces with an external reward model, such as Nemotron:
from camel.models.reward import NemotronRewardModel
from camel.types import ModelType

# Initialize reward model
reward_model = NemotronRewardModel(
    model_type=ModelType.NVIDIA_NEMOTRON_340B_REWARD,
    url="https://integrate.api.nvidia.com/v1",
    api_key="your_api_key"
)

# Create pipeline with reward model
pipeline = SelfImprovingCoTPipeline(
    reason_agent=reason_agent,
    evaluate_agent=evaluate_agent,
    problems=problems,
    reward_model=reward_model,
    score_threshold={
        "correctness": 0.8,
        "clarity": 0.7,
        "completeness": 0.7
    }
)
Input Format (JSON):
{
  "problems": [
    {
      "problem": "Problem text here",
      "solution": "Optional solution text"
    }
  ]
}
Output Format (JSON): each result record contains:
  • Original problem
  • Final reasoning trace
  • Improvement history with iterations
  • Evaluation scores and feedback per iteration
Key pipeline parameters:
  • max_iterations: Maximum number of improvement iterations (default: 3)
  • score_threshold: Minimum quality thresholds for evaluation dimensions (default: 0.7)
  • few_shot_examples: (Optional) Examples for few-shot learning
  • output_path: (Optional) Path for saving generated results