# Datagen

> CAMEL’s data generation modules for high-quality, instruction-tuned, and reasoning-rich datasets.

This page introduces CAMEL's **data generation modules** for creating high-quality training data with explicit reasoning, diverse instructions, and automated refinement.

* **Chain of Thought (CoT):** Generates explicit reasoning paths
* **Self-Instruct:** Produces instruction-following data from human-written seeds and machine-generated instructions
* **Source2Synth:** Synthesizes multi-hop QA from source text or code
* **Self-Improving CoT:** Iteratively improves reasoning through agent self-critique

## Chain of Thought (CoT) Data Generation

Chain of Thought (CoT) data generation creates step-by-step reasoning paths for problem solving, using dual agents together with search and verification logic.

* Monte Carlo Tree Search (MCTS) for solution exploration
* Binary Search Error Detection for precise error localization
* Dual-Agent Verification System for quality assurance
* Solution Tree Management for tracking reasoning paths

**CoTDataGenerator Class**

The main class that implements the CoT generation system, with the following capabilities:

* **Dual-Agent Architecture**: Supports both single-agent (legacy) and dual-agent modes
* **Answer Generation**: Answer generation guided by MCTS
* **Answer Verification**: Verification against golden answers
* **Error Detection**: Binary search-based error detection in solutions
* **Solution Management**: Solution tree management and export

Set up chain-of-thought data generation with dual agents and golden answers, then generate a CoT solution:

```python theme={"system"}
from camel.agents import ChatAgent
from camel.datagen import CoTDataGenerator

# Initialize agents
generator_agent = ChatAgent("Generator agent for simple math computation.")
verifier_agent = ChatAgent("Verifier agent for simple math computation.")

# Define golden answers
question = "What's the answer of 1 + 2?"
golden_answers = {
    question: "3",
}

# Create generator
cot_generator = CoTDataGenerator(
    generator_agent=generator_agent,
    verifier_agent=verifier_agent,
    golden_answers=golden_answers,
    search_limit=3,
)

# Generate solution
solution = cot_generator.solve(question)
```

Import question-answer pairs or export generated solutions for further use:

```python theme={"system"}
# Import QA pairs from JSON
cot_generator.import_qa_from_json("qa_pairs.json")

# Export solutions
cot_generator.export_solutions("solutions.json")
```

The generation workflow proceeds in five stages:

1. The agent first attempts to solve the problem directly and checks the result against the golden answer.
2. If the direct attempt fails, a Monte Carlo Tree Search (MCTS) explores alternative reasoning paths, building a solution tree from previous attempts.
3. Binary search pinpoints and isolates errors in the solution.
4. New solutions are generated, reusing the parts already verified as correct.
5. All candidate solutions are verified, either by the dual-agent system or by comparison against golden answers.
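The binary-search error localization mentioned above can be illustrated with a small, library-independent sketch. It is not CAMEL's implementation: `first_bad_step` and `prefix_ok` are hypothetical names, and the sketch assumes that once a prefix of the solution contains an error, every longer prefix is also judged incorrect.

```python
from typing import Callable, List


def first_bad_step(
    steps: List[str],
    prefix_ok: Callable[[List[str]], bool],
) -> int:
    """Return the index of the first incorrect step, or -1 if all steps pass.

    Assumption: prefix_ok is monotone, i.e. once a prefix fails,
    every longer prefix fails too. The empty prefix counts as correct.
    """
    if prefix_ok(steps):
        return -1
    # Invariant: the prefix of length lo is correct, the prefix of length hi is not.
    lo, hi = 0, len(steps)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if prefix_ok(steps[:mid]):
            lo = mid
        else:
            hi = mid
    return hi - 1


# Toy verifier standing in for a verifier agent: it rejects any prefix
# that contains the known-wrong step.
steps = ["1 + 2 = 3", "3 * 4 = 12", "12 - 5 = 8", "8 / 2 = 4"]
bad_index = first_bad_step(steps, lambda prefix: "12 - 5 = 8" not in prefix)
print(bad_index, steps[:bad_index])
```

The steps before `bad_index` are the part that can be reused when a new solution is generated; the verifier is called a logarithmic number of times rather than once per step.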

Configuration options for `CoTDataGenerator`:

* `search_limit`: Maximum number of search iterations (default: 100)
* `generator_agent`: Specialized agent for answer generation
* `verifier_agent`: Specialized agent for answer verification
* `golden_answers`: Pre-defined correct answers for validation

The solution tree is exported in JSON format containing:

* Solutions with intermediate steps
* Golden answers used for verification
* Export timestamp
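Since the exact key names of the export are not listed here, one cautious way to work with an exported file is to inspect its top-level structure first. The sketch below assumes only that the export is a JSON object; `summarize_export` is a hypothetical helper, and the stand-in file uses made-up key names purely for illustration.

```python
import json
import tempfile
from pathlib import Path


def summarize_export(path):
    """Map each top-level key of a JSON export to the type name of its value."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return {key: type(value).__name__ for key, value in data.items()}


# Stand-in file with HYPOTHETICAL keys; a real export may name them differently.
sample = {
    "solutions": {"What's the answer of 1 + 2?": ["1 + 2 = 3"]},
    "golden_answers": {"What's the answer of 1 + 2?": "3"},
    "export_time": "2025-01-01T00:00:00",
}
with tempfile.TemporaryDirectory() as tmp:
    export_path = Path(tmp) / "solutions.json"
    export_path.write_text(json.dumps(sample, indent=2), encoding="utf-8")
    summary = summarize_export(export_path)

print(summary)
```

Pointing `summarize_export` at a file written by `export_solutions` shows which keys are actually present before any downstream code depends on them.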

***

## Self-Instruct: Instruction Generation

Self-Instruct is a pipeline for generating high-quality, diverse instructions by combining human-written seed tasks and machine-generated prompts, all filtered for quality and diversity.

* Combines human-written and machine-generated instructions using configurable ratios
* Supports both classification and non-classification task types
* Built-in instruction filtering and validation
* Automatic instance generation for tasks
* JSON-based data input/output

* **SelfInstructPipeline**: Orchestrates end-to-end instruction generation: mixing seed and machine prompts, filtering, and writing results.
* **InstructionFilter**: Validates and filters all generated instructions:
  * Length-based, keyword, and punctuation checks
  * Non-English text detection
  * ROUGE similarity filtering for deduplication
  * Extensible registry for custom filters
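To make the idea of ROUGE-based deduplication concrete, here is a self-contained sketch that computes a ROUGE-L F1 score from the longest common subsequence of whitespace tokens and drops any instruction that is too similar to one already kept. It is an illustration under stated assumptions, not the `InstructionFilter` implementation: the function names, the tokenization, and the choice of F1 are all assumptions.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercase whitespace tokens (a simplifying assumption)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def dedupe(instructions: List[str], threshold: float = 0.7) -> List[str]:
    """Keep an instruction only if it scores below threshold against all kept ones."""
    kept: List[str] = []
    for text in instructions:
        if all(rouge_l_f1(text, k) < threshold for k in kept):
            kept.append(text)
    return kept


candidates = [
    "Summarize the given paragraph",
    "Summarize the given paragraph briefly",
    "Translate the sentence into French",
]
unique = dedupe(candidates, threshold=0.7)
print(unique)
```

A lower threshold removes more near-duplicates at the cost of also discarding instructions that merely share common phrasing.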

Set up an instruction generation pipeline with both human and machine prompts:

```python theme={"system"}
from camel.agents import ChatAgent
from camel.datagen.self_instruct import SelfInstructPipeline

# Initialize agent
agent = ChatAgent()

# Create pipeline with default settings
pipeline = SelfInstructPipeline(
    agent=agent,
    seed='seed_tasks.jsonl',  # Path to human-written seed tasks
    num_machine_instructions=5,
    data_output_path='./data_output.json',
    human_to_machine_ratio=(6, 2)
)

# Generate instructions
pipeline.generate()
```

Use custom filters to refine and deduplicate instructions as needed:

```python theme={"system"}
from camel.datagen.self_instruct import SelfInstructPipeline
from camel.datagen.self_instruct.filter import InstructionFilter

# Configure filters
filter_config = {
    "length": {},
    "keyword": {},
    "punctuation": {},
    "non_english": {},
    "rouge_similarity": {
        "threshold": 0.7,
        "metric": "rouge-l"
    }
}

# Reuses `agent` from the previous example
pipeline = SelfInstructPipeline(
    agent=agent,
    seed='seed_tasks.jsonl',
    instruction_filter=InstructionFilter(filter_config),
    num_machine_instructions=5
)
```

The pipeline runs in five stages:

1. Load and validate human-written instructions from a JSONL file; initialize task storage.
2. Sample human and machine tasks according to the chosen ratio, then generate new instructions with `ChatAgent` and apply filters.
3. Determine automatically whether each task is a classification task, and generate the matching prompts for each type.
4. Generate input-output pairs, format and parse instances, and apply quality filters.
5. Save all generated instructions and their instances to JSON, with metadata and configuration details.
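The ratio-based sampling step can be pictured with a short sketch. This is not the pipeline's own code: `sample_prompt_tasks` is a hypothetical helper, and how the real pipeline behaves when too few machine instructions exist yet is not specified here, so the sketch simply takes as many as are available.

```python
import random
from typing import List, Optional, Tuple


def sample_prompt_tasks(
    human_tasks: List[str],
    machine_tasks: List[str],
    ratio: Tuple[int, int] = (6, 2),
    seed: Optional[int] = None,
) -> List[str]:
    """Pick up to ratio[0] human tasks and ratio[1] machine tasks as prompt examples."""
    rng = random.Random(seed)
    n_human = min(ratio[0], len(human_tasks))
    # Assumption: early on the machine pool may be smaller than requested.
    n_machine = min(ratio[1], len(machine_tasks))
    return rng.sample(human_tasks, n_human) + rng.sample(machine_tasks, n_machine)


human = [f"human task {i}" for i in range(10)]
machine = ["machine task 0"]
picked = sample_prompt_tasks(human, machine, ratio=(6, 2), seed=0)
print(picked)
```

With the default `(6, 2)` ratio, each generation prompt would contain six human-written examples and up to two machine-generated ones.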

`SelfInstructPipeline` parameters:

* `agent`: `ChatAgent` instance for generating instructions
* `seed`: Path to human-written seed tasks in JSONL format
* `num_machine_instructions`: Number of machine-generated instructions (default: 5)
* `data_output_path`: Path for saving generated data (default: `./data_output.json`)
* `human_to_machine_ratio`: Ratio of human to machine tasks (default: `(6, 2)`)
* `instruction_filter`: Custom `InstructionFilter` instance (optional)
* `filter_config`: Configuration dictionary for default filters (optional)

The default filter configuration supports:

* `length`: Configure length constraints for instructions
* `keyword`: Set up keyword-based filtering rules
* `punctuation`: Define punctuation validation rules
* `non_english`: Non-English text detection
* `rouge_similarity`: Set ROUGE similarity thresholds for deduplication

Seed Tasks (Input), one JSON object per line:

```json theme={"system"}
{"instruction": "Classify the sentiment of this text as positive or negative."}
{"instruction": "Generate a summary of the given paragraph."}
```

Generated Data (Output):

```json theme={"system"}
{
  "machine_instructions": [
    {
      "instruction": "...",
      "is_classification": true,
      "instances": [
        {
          "input": "...",
          "output": "..."
        }
      ]
    }
  ]
}
```

***

## Source2Synth: Multi-hop Question-Answer Generation

Source2Synth generates complex multi-hop QA pairs from source text (or code) via an orchestrated pipeline of AI-driven and rule-based steps, with curation and complexity control.

* **UserDataProcessor**: Orchestrates the full pipeline, from raw text through QA generation and curation.
* **ExampleConstructor**: Builds multi-hop QA examples, extracting premises, intermediate steps, and conclusions.
* **DataCurator**: Filters, deduplicates, and samples the final dataset to match quality and complexity requirements.

* Batch or single text processing
* Switchable AI or rule-based question generation
* Multi-hop QA and complexity scoring
* Integrated curation, deduplication, and reproducible sampling
* Integration with `MultiHopGeneratorAgent`

Generate a multi-hop QA dataset from your own text or source files:

```python theme={"system"}
from camel.datagen.source2synth import (
    UserDataProcessor,
    ProcessorConfig
)

# Create configuration
config = ProcessorConfig(
    seed=42,
    min_length=50,
    max_length=1000,
    complexity_threshold=0.5,
    dataset_size=10,
    use_ai_model=True,
)

# Initialize processor
processor = UserDataProcessor(config)

# Process a single text
result = processor.process_text(
    "Your source text here",
    source="example_source"
)

# Process multiple texts
texts = ["Text 1", "Text 2", "Text 3"]
sources = ["source1", "source2", "source3"]
batch_results = processor.process_batch(texts, sources)
```

`ProcessorConfig` options:

* `seed`: Random seed for reproducibility
* `min_length`: Minimum text length for processing
* `max_length`: Maximum text length for processing
* `complexity_threshold`: Minimum complexity score (0.0–1.0)
* `dataset_size`: Target size for the final dataset
* `use_ai_model`: Toggle between AI model and rule-based generation
* `hop_generating_agent`: Custom `MultiHopGeneratorAgent` (optional)
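How `complexity_threshold`, `dataset_size`, and `seed` might interact during curation can be sketched in plain Python. This is an illustration only, not the `DataCurator` implementation: `QAExample`, `curate`, and the choice to deduplicate on the normalized question text are all assumptions.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QAExample:
    """Hypothetical record for one multi-hop QA pair."""
    question: str
    answer: str
    complexity: float


def curate(
    examples: List[QAExample],
    complexity_threshold: float = 0.5,
    dataset_size: int = 10,
    seed: int = 42,
) -> List[QAExample]:
    """Filter by complexity, drop duplicate questions, then sample reproducibly."""
    kept = [e for e in examples if e.complexity >= complexity_threshold]
    seen = set()
    unique = []
    for e in kept:
        key = e.question.strip().lower()  # assumption: dedupe on normalized question
        if key not in seen:
            seen.add(key)
            unique.append(e)
    if len(unique) <= dataset_size:
        return unique
    # A dedicated Random instance keeps sampling reproducible for a given seed.
    return random.Random(seed).sample(unique, dataset_size)


pool = [QAExample(f"Question {i}?", f"Answer {i}", i / 10) for i in range(10)]
pool.append(QAExample("question 9?", "Duplicate", 0.95))
curated = curate(pool, complexity_threshold=0.5, dataset_size=3, seed=42)
print(curated)
```

Running `curate` twice with the same seed returns the same sample, which is what makes a generated dataset reproducible.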

The pipeline runs in four stages:

1. Validate text length and quality; standardize for processing.
2. Identify premises, extract intermediate facts, and form conclusions.
3. Generate multi-hop questions, validate answers, and score for complexity.
4. Filter for quality, enforce complexity thresholds, deduplicate, and sample to target size.

***

## Self-Improving CoT Data Generation

This pipeline implements self-taught reasoning: an iterative process in which an AI agent refines its own reasoning traces using self-evaluation, feedback, and reward models.

* **SelfImprovingCoTPipeline**: Implements the STaR (Self-Taught Reasoner) methodology, supporting both agent-based and external reward model evaluation, iterative feedback loops, and flexible output formats.

* Customizable reasoning and evaluation agents
* Support for reward models and custom thresholds
* Few-shot learning and rich output options

The pipeline runs in four stages:

1. The pipeline generates an initial reasoning path for each problem using the designated agent.
2. An evaluator agent (or reward model) reviews each reasoning trace for quality, clarity, and correctness.
3. The system refines and regenerates reasoning steps using the evaluation feedback.
4. This evaluation-feedback loop repeats up to a configurable number of iterations.

Launch a self-improving reasoning workflow:

```python theme={"system"}
from camel.agents import ChatAgent
from camel.datagen import SelfImprovingCoTPipeline

# Initialize agents
reason_agent = ChatAgent(
    """Answer my question and give your
    final answer within \\boxed{}."""
)

evaluate_agent = ChatAgent(
    "You are a highly critical teacher who evaluates the student's answers "
    "with a meticulous and demanding approach."
)

# Prepare your problems
problems = [
    {"problem": "Your problem text here"},
    # Add more problems...
]

# Create and run the pipeline
pipeline = SelfImprovingCoTPipeline(
    reason_agent=reason_agent,
    evaluate_agent=evaluate_agent,
    problems=problems,
    max_iterations=3,
    output_path="star_output.json"
)

results = pipeline.generate()
```

Evaluate and guide reasoning traces with an external reward model, such as Nemotron:

```python theme={"system"}
from camel.models.reward import NemotronRewardModel
from camel.types import ModelType

# Initialize reward model
reward_model = NemotronRewardModel(
    model_type=ModelType.NVIDIA_NEMOTRON_340B_REWARD,
    url="https://integrate.api.nvidia.com/v1",
    api_key="your_api_key"
)

# Create pipeline with reward model
pipeline = SelfImprovingCoTPipeline(
    reason_agent=reason_agent,
    evaluate_agent=evaluate_agent,
    problems=problems,
    reward_model=reward_model,
    score_threshold={
        "correctness": 0.8,
        "clarity": 0.7,
        "completeness": 0.7
    }
)
```

Input Format (JSON):

```json theme={"system"}
{
  "problems": [
    {
      "problem": "Problem text here",
      "solution": "Optional solution text"
    }
  ]
}
```

Output Format (JSON):

* Original problem
* Final reasoning trace
* Improvement history with iterations
* Evaluation scores and feedback per iteration

`SelfImprovingCoTPipeline` options:

* `max_iterations`: Maximum number of improvement iterations (default: 3)
* `score_threshold`: Minimum quality thresholds for evaluation dimensions (default: 0.7)
* `few_shot_examples`: (Optional) Examples for few-shot learning
* `output_path`: (Optional) Path for saving generated results
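The way `max_iterations` and `score_threshold` bound the evaluate-and-refine loop can be sketched without any model calls. This is a simplified illustration, not the pipeline's implementation: `improve`, `meets_threshold`, the stub callables, and the keys of the returned record are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple, Union

Scores = Dict[str, float]
Threshold = Union[float, Dict[str, float]]


def meets_threshold(scores: Scores, threshold: Threshold) -> bool:
    """Accept a single float for all dimensions or a per-dimension dict."""
    if isinstance(threshold, dict):
        return all(scores.get(dim, 0.0) >= t for dim, t in threshold.items())
    return all(s >= threshold for s in scores.values())


def improve(
    problem: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], Tuple[Scores, str]],
    refine: Callable[[str, str, str], str],
    max_iterations: int = 3,
    score_threshold: Threshold = 0.7,
) -> dict:
    """Evaluate and refine a reasoning trace until it passes or iterations run out."""
    trace = generate(problem)
    history: List[dict] = []
    for i in range(max_iterations):
        scores, feedback = evaluate(problem, trace)
        history.append(
            {"iteration": i, "trace": trace, "scores": scores, "feedback": feedback}
        )
        # Stop on success, or on the last round so the final trace is always scored.
        if meets_threshold(scores, score_threshold) or i == max_iterations - 1:
            break
        trace = refine(problem, trace, feedback)
    return {"problem": problem, "final_trace": trace, "improvement_history": history}


# Stubs standing in for the reasoning and evaluation agents.
def stub_evaluate(problem: str, trace: str) -> Tuple[Scores, str]:
    score = [0.5, 0.75, 0.9][min(trace.count("+"), 2)]
    return {"correctness": score}, "add more detail"


record = improve(
    "Your problem text here",
    generate=lambda p: "draft",
    evaluate=stub_evaluate,
    refine=lambda p, t, fb: t + "+",
    max_iterations=3,
    score_threshold={"correctness": 0.8},
)
print(record["final_trace"], len(record["improvement_history"]))
```

Raising a threshold or lowering `max_iterations` trades trace quality against the number of agent calls per problem.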