camel.datagen.source2synth package#
Submodules#
camel.datagen.source2synth.data_processor module#
- class camel.datagen.source2synth.data_processor.DataCurator(config: ProcessorConfig, rng: Random)[source]#
Bases:
object
Manages and curates datasets of multi-hop question-answer pairs.
This class handles dataset management tasks including quality filtering, complexity filtering, deduplication, and dataset sampling.
- config#
Configuration for data curation parameters.
- Type:
ProcessorConfig
- rng#
Random number generator for reproducible sampling.
- Type:
random.Random
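The deduplication and reproducible sampling described above can be illustrated with the standard library alone. This is a minimal sketch, not DataCurator's actual implementation: the helper names `deduplicate` and `sample_dataset`, and the rule of treating questions as duplicates after lowercasing and stripping whitespace, are assumptions made for illustration.

```python
import random
from typing import Any, Dict, List


def deduplicate(examples: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first example for each distinct question (hypothetical rule)."""
    seen = set()
    unique = []
    for example in examples:
        # Assumption: two questions are duplicates if they match after
        # stripping surrounding whitespace and lowercasing.
        key = example["question"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique


def sample_dataset(
    examples: List[Dict[str, Any]], size: int, rng: random.Random
) -> List[Dict[str, Any]]:
    """Draw at most `size` examples without replacement using the given rng."""
    if len(examples) <= size:
        return list(examples)
    return rng.sample(examples, size)


examples = [
    {"question": "Who wrote Hamlet?", "answer": "Shakespeare"},
    {"question": "who wrote hamlet? ", "answer": "Shakespeare"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "What is 2 + 2?", "answer": "4"},
]
unique = deduplicate(examples)

# Two generators seeded identically produce the same sample.
first = sample_dataset(unique, 2, random.Random(42))
second = sample_dataset(unique, 2, random.Random(42))
```

Passing a `random.Random` instance, rather than calling the module-level `random` functions, keeps the sampling independent of any other code that touches the global generator.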
- class camel.datagen.source2synth.data_processor.ExampleConstructor(config: ProcessorConfig, multi_hop_agent: MultiHopGeneratorAgent | None = None)[source]#
Bases:
object
Constructs training examples from raw text data.
This class handles the construction of training examples by preprocessing text, extracting information pairs, and generating question-answer pairs.
- config#
Configuration for example construction.
- Type:
ProcessorConfig
- multi_hop_agent#
Agent for QA generation.
- Type:
Optional[MultiHopGeneratorAgent]
- construct_examples(raw_data: List[Dict[str, Any]]) List[Dict[str, Any]] [source]#
Construct training examples from raw data.
- Parameters:
raw_data (List[Dict[str, Any]]) – List of raw data dictionaries containing text and metadata.
- Returns:
- List of constructed examples with QA pairs and metadata.
- Return type:
List[Dict[str, Any]]
- class camel.datagen.source2synth.data_processor.UserDataProcessor(config: ProcessorConfig | None = None)[source]#
Bases:
object
A processor for generating multi-hop question-answer pairs from user data.
This class handles the processing of text data to generate multi-hop question-answer pairs using either an AI model or rule-based approaches. It manages the entire pipeline from text preprocessing to dataset curation.
- config#
Configuration for data processing parameters.
- Type:
ProcessorConfig
- rng#
Random number generator for reproducibility.
- Type:
random.Random
- multi_hop_agent#
Agent for generating QA pairs.
- Type:
Optional[MultiHopGeneratorAgent]
- process_batch(texts: List[str], sources: List[str] | None = None) List[Dict[str, Any]] [source]#
Process multiple texts in batch to generate multi-hop QA pairs.
- Parameters:
texts (List[str]) – List of input texts to process.
sources (Optional[List[str]], optional) – List of source identifiers. (default: None)
- Returns:
- List of processed examples with QA pairs and metadata.
- Return type:
List[Dict[str, Any]]
- Raises:
ValueError – If the length of sources does not match the length of texts.
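The length check behind the documented ValueError can be sketched as follows. The helper `pair_texts_with_sources` is hypothetical and is not part of the library. Falling back to `"user_input"` when `sources` is omitted is also an assumption, borrowed from the documented default of `process_text`.

```python
from typing import List, Optional, Tuple


def pair_texts_with_sources(
    texts: List[str], sources: Optional[List[str]] = None
) -> List[Tuple[str, str]]:
    """Pair each text with a source label, validating lengths first."""
    if sources is None:
        # Assumption: reuse the documented default source of process_text.
        sources = ["user_input"] * len(texts)
    if len(sources) != len(texts):
        raise ValueError("Length of sources must match length of texts")
    return list(zip(texts, sources))


pairs = pair_texts_with_sources(["first text", "second text"], ["doc_a", "doc_b"])
```

Validating before any processing starts means a mismatched call fails immediately instead of after part of the batch has been handled.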
- process_text(text: str, source: str = 'user_input') List[Dict[str, Any]] [source]#
Process a single text to generate multi-hop QA pairs.
- Parameters:
text (str) – The input text to process.
source (str, optional) – Source identifier for the text. (default: "user_input")
- Returns:
- List of processed examples with QA pairs and metadata.
- Return type:
List[Dict[str, Any]]
camel.datagen.source2synth.models module#
- class camel.datagen.source2synth.models.ContextPrompt(*, main_context: str, related_contexts: List[str] | None = None)[source]#
Bases:
BaseModel
A context prompt for generating multi-hop question-answer pairs.
- main_context#
The primary context for generating QA pairs.
- Type:
str
- related_contexts#
Additional related contexts.
- Type:
Optional[List[str]]
- main_context: str#
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'main_context': FieldInfo(annotation=str, required=True, description='The main context for generating the question-answer pair.'), 'related_contexts': FieldInfo(annotation=Union[List[str], NoneType], required=False, default=None, description='Additional contexts related to the main context.')}#
Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo objects.
This replaces Model.__fields__ from Pydantic V1.
- related_contexts: List[str] | None#
- class camel.datagen.source2synth.models.MultiHopQA(*, question: str, reasoning_steps: List[ReasoningStep], answer: str, supporting_facts: List[str], type: str)[source]#
Bases:
BaseModel
A multi-hop question-answer pair with reasoning steps and supporting facts.
- question#
The question requiring multi-hop reasoning.
- Type:
str
- reasoning_steps#
List of reasoning steps leading to the answer.
- Type:
List[ReasoningStep]
- answer#
The final answer to the question.
- Type:
str
- supporting_facts#
List of facts supporting the reasoning.
- Type:
List[str]
- type#
The type of question-answer pair.
- Type:
str
- class Config[source]#
Bases:
object
- json_schema_extra: ClassVar[Dict[str, Any]] = {'example': {'answer': 'Paris', 'question': 'What is the capital of France?', 'reasoning_steps': [{'step': 'Identify the country France.'}, {'step': 'Find the capital city of France.'}], 'supporting_facts': ['France is a country in Europe.', 'Paris is the capital city of France.'], 'type': 'multi_hop_qa'}}#
- answer: str#
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'json_schema_extra': {'example': {'answer': 'Paris', 'question': 'What is the capital of France?', 'reasoning_steps': [{'step': 'Identify the country France.'}, {'step': 'Find the capital city of France.'}], 'supporting_facts': ['France is a country in Europe.', 'Paris is the capital city of France.'], 'type': 'multi_hop_qa'}}}#
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'answer': FieldInfo(annotation=str, required=True, description='The answer to the multi-hop question.'), 'question': FieldInfo(annotation=str, required=True, description='The question that requires multi-hop reasoning.'), 'reasoning_steps': FieldInfo(annotation=List[ReasoningStep], required=True, description='The steps involved in reasoning to answer the question.'), 'supporting_facts': FieldInfo(annotation=List[str], required=True, description='Facts that support the reasoning and answer.'), 'type': FieldInfo(annotation=str, required=True, description='The type of question-answer pair.')}#
Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo objects.
This replaces Model.__fields__ from Pydantic V1.
- question: str#
- reasoning_steps: List[ReasoningStep]#
- supporting_facts: List[str]#
- type: str#
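The shape of a MultiHopQA record can be seen by rebuilding the documented `json_schema_extra` example. The sketch below uses standard-library dataclasses as stand-ins so that it runs without dependencies. `ReasoningStepSketch` and `MultiHopQASketch` are hypothetical names, and unlike the real Pydantic models they perform no validation.

```python
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class ReasoningStepSketch:
    """Stand-in for ReasoningStep: one textual reasoning step."""
    step: str


@dataclass
class MultiHopQASketch:
    """Stand-in for MultiHopQA with the same five fields."""
    question: str
    reasoning_steps: List[ReasoningStepSketch]
    answer: str
    supporting_facts: List[str]
    type: str


# Values copied from the documented json_schema_extra example.
qa = MultiHopQASketch(
    question="What is the capital of France?",
    reasoning_steps=[
        ReasoningStepSketch(step="Identify the country France."),
        ReasoningStepSketch(step="Find the capital city of France."),
    ],
    answer="Paris",
    supporting_facts=[
        "France is a country in Europe.",
        "Paris is the capital city of France.",
    ],
    type="multi_hop_qa",
)

# asdict recurses into nested dataclasses, yielding plain dicts and lists.
record = asdict(qa)
```

All five fields are required, so omitting any of them when constructing the stand-in raises a TypeError; the real model reports missing fields through Pydantic validation instead.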
- class camel.datagen.source2synth.models.ReasoningStep(*, step: str)[source]#
Bases:
BaseModel
A single step in a multi-hop reasoning process.
- step#
The textual description of the reasoning step.
- Type:
str
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'step': FieldInfo(annotation=str, required=True, description='A single step in the reasoning process.')}#
Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo objects.
This replaces Model.__fields__ from Pydantic V1.
- step: str#
camel.datagen.source2synth.user_data_processor_config module#
- class camel.datagen.source2synth.user_data_processor_config.ProcessorConfig(*, seed: int = None, min_length: int = 50, max_length: int = 512, complexity_threshold: float = 0.5, dataset_size: int = 1000, use_ai_model: bool = True, hop_generating_agent: MultiHopGeneratorAgent = None)[source]#
Bases:
BaseModel
Configuration class for data processing.
- complexity_threshold: float#
- dataset_size: int#
- hop_generating_agent: MultiHopGeneratorAgent#
- max_length: int#
- min_length: int#
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'frozen': False, 'protected_namespaces': (), 'validate_assignment': True}#
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'complexity_threshold': FieldInfo(annotation=float, required=False, default=0.5, description='Complexity threshold for processing', metadata=[Ge(ge=0.0), Le(le=1.0)]), 'dataset_size': FieldInfo(annotation=int, required=False, default=1000, description='Target size of the dataset', metadata=[Gt(gt=0)]), 'hop_generating_agent': FieldInfo(annotation=MultiHopGeneratorAgent, required=False, default_factory=<lambda>, description='Agent for generating multi-hop text'), 'max_length': FieldInfo(annotation=int, required=False, default=512, description='Maximum text length', metadata=[Gt(gt=0)]), 'min_length': FieldInfo(annotation=int, required=False, default=50, description='Minimum text length', metadata=[Ge(ge=0)]), 'seed': FieldInfo(annotation=int, required=False, default_factory=<lambda>, description='Random seed for reproducibility'), 'use_ai_model': FieldInfo(annotation=bool, required=False, default=True, description='Whether to use AI model in processing')}#
Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo objects.
This replaces Model.__fields__ from Pydantic V1.
- seed: int#
- use_ai_model: bool#
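The numeric constraints listed in `model_fields` (`min_length` ≥ 0, `max_length` > 0, `complexity_threshold` between 0.0 and 1.0, `dataset_size` > 0) can be restated as explicit checks. The dataclass below is a hypothetical stand-in, not the library's class. It omits `seed` and `hop_generating_agent`, whose default factories are not shown in this reference, and it only validates at construction, whereas the real model also sets `validate_assignment`.

```python
from dataclasses import dataclass


@dataclass
class ProcessorConfigSketch:
    """Stand-in mirroring the documented defaults and bounds."""
    min_length: int = 50
    max_length: int = 512
    complexity_threshold: float = 0.5
    dataset_size: int = 1000
    use_ai_model: bool = True

    def __post_init__(self) -> None:
        # Each check corresponds to one metadata entry in model_fields.
        if self.min_length < 0:
            raise ValueError("min_length must be >= 0")
        if self.max_length <= 0:
            raise ValueError("max_length must be > 0")
        if not 0.0 <= self.complexity_threshold <= 1.0:
            raise ValueError("complexity_threshold must be in [0.0, 1.0]")
        if self.dataset_size <= 0:
            raise ValueError("dataset_size must be > 0")


default_config = ProcessorConfigSketch()
small_config = ProcessorConfigSketch(dataset_size=10, complexity_threshold=1.0)
```

Both bounds of `complexity_threshold` are inclusive, so 0.0 and 1.0 are accepted, while `dataset_size` and `max_length` must be strictly positive.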
Module contents#
- class camel.datagen.source2synth.DataCurator(config: ProcessorConfig, rng: Random)[source]#
Bases:
object
Manages and curates datasets of multi-hop question-answer pairs.
This class handles dataset management tasks including quality filtering, complexity filtering, deduplication, and dataset sampling.
- config#
Configuration for data curation parameters.
- Type:
ProcessorConfig
- rng#
Random number generator for reproducible sampling.
- Type:
random.Random
- class camel.datagen.source2synth.ExampleConstructor(config: ProcessorConfig, multi_hop_agent: MultiHopGeneratorAgent | None = None)[source]#
Bases:
object
Constructs training examples from raw text data.
This class handles the construction of training examples by preprocessing text, extracting information pairs, and generating question-answer pairs.
- config#
Configuration for example construction.
- Type:
ProcessorConfig
- multi_hop_agent#
Agent for QA generation.
- Type:
Optional[MultiHopGeneratorAgent]
- construct_examples(raw_data: List[Dict[str, Any]]) List[Dict[str, Any]] [source]#
Construct training examples from raw data.
- Parameters:
raw_data (List[Dict[str, Any]]) – List of raw data dictionaries containing text and metadata.
- Returns:
- List of constructed examples with QA pairs and metadata.
- Return type:
List[Dict[str, Any]]
- class camel.datagen.source2synth.MultiHopQA(*, question: str, reasoning_steps: List[ReasoningStep], answer: str, supporting_facts: List[str], type: str)[source]#
Bases:
BaseModel
A multi-hop question-answer pair with reasoning steps and supporting facts.
- question#
The question requiring multi-hop reasoning.
- Type:
str
- reasoning_steps#
List of reasoning steps leading to the answer.
- Type:
List[ReasoningStep]
- answer#
The final answer to the question.
- Type:
str
- supporting_facts#
List of facts supporting the reasoning.
- Type:
List[str]
- type#
The type of question-answer pair.
- Type:
str
- class Config[source]#
Bases:
object
- json_schema_extra: ClassVar[Dict[str, Any]] = {'example': {'answer': 'Paris', 'question': 'What is the capital of France?', 'reasoning_steps': [{'step': 'Identify the country France.'}, {'step': 'Find the capital city of France.'}], 'supporting_facts': ['France is a country in Europe.', 'Paris is the capital city of France.'], 'type': 'multi_hop_qa'}}#
- answer: str#
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'json_schema_extra': {'example': {'answer': 'Paris', 'question': 'What is the capital of France?', 'reasoning_steps': [{'step': 'Identify the country France.'}, {'step': 'Find the capital city of France.'}], 'supporting_facts': ['France is a country in Europe.', 'Paris is the capital city of France.'], 'type': 'multi_hop_qa'}}}#
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'answer': FieldInfo(annotation=str, required=True, description='The answer to the multi-hop question.'), 'question': FieldInfo(annotation=str, required=True, description='The question that requires multi-hop reasoning.'), 'reasoning_steps': FieldInfo(annotation=List[ReasoningStep], required=True, description='The steps involved in reasoning to answer the question.'), 'supporting_facts': FieldInfo(annotation=List[str], required=True, description='Facts that support the reasoning and answer.'), 'type': FieldInfo(annotation=str, required=True, description='The type of question-answer pair.')}#
Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo objects.
This replaces Model.__fields__ from Pydantic V1.
- question: str#
- reasoning_steps: List[ReasoningStep]#
- supporting_facts: List[str]#
- type: str#
- class camel.datagen.source2synth.ProcessorConfig(*, seed: int = None, min_length: int = 50, max_length: int = 512, complexity_threshold: float = 0.5, dataset_size: int = 1000, use_ai_model: bool = True, hop_generating_agent: MultiHopGeneratorAgent = None)[source]#
Bases:
BaseModel
Configuration class for data processing.
- complexity_threshold: float#
- dataset_size: int#
- hop_generating_agent: MultiHopGeneratorAgent#
- max_length: int#
- min_length: int#
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'frozen': False, 'protected_namespaces': (), 'validate_assignment': True}#
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'complexity_threshold': FieldInfo(annotation=float, required=False, default=0.5, description='Complexity threshold for processing', metadata=[Ge(ge=0.0), Le(le=1.0)]), 'dataset_size': FieldInfo(annotation=int, required=False, default=1000, description='Target size of the dataset', metadata=[Gt(gt=0)]), 'hop_generating_agent': FieldInfo(annotation=MultiHopGeneratorAgent, required=False, default_factory=<lambda>, description='Agent for generating multi-hop text'), 'max_length': FieldInfo(annotation=int, required=False, default=512, description='Maximum text length', metadata=[Gt(gt=0)]), 'min_length': FieldInfo(annotation=int, required=False, default=50, description='Minimum text length', metadata=[Ge(ge=0)]), 'seed': FieldInfo(annotation=int, required=False, default_factory=<lambda>, description='Random seed for reproducibility'), 'use_ai_model': FieldInfo(annotation=bool, required=False, default=True, description='Whether to use AI model in processing')}#
Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo objects.
This replaces Model.__fields__ from Pydantic V1.
- seed: int#
- use_ai_model: bool#
- class camel.datagen.source2synth.ReasoningStep(*, step: str)[source]#
Bases:
BaseModel
A single step in a multi-hop reasoning process.
- step#
The textual description of the reasoning step.
- Type:
str
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'step': FieldInfo(annotation=str, required=True, description='A single step in the reasoning process.')}#
Metadata about the fields defined on the model: a mapping of field names to pydantic.fields.FieldInfo objects.
This replaces Model.__fields__ from Pydantic V1.
- step: str#
- class camel.datagen.source2synth.UserDataProcessor(config: ProcessorConfig | None = None)[source]#
Bases:
object
A processor for generating multi-hop question-answer pairs from user data.
This class handles the processing of text data to generate multi-hop question-answer pairs using either an AI model or rule-based approaches. It manages the entire pipeline from text preprocessing to dataset curation.
- config#
Configuration for data processing parameters.
- Type:
ProcessorConfig
- rng#
Random number generator for reproducibility.
- Type:
random.Random
- multi_hop_agent#
Agent for generating QA pairs.
- Type:
Optional[MultiHopGeneratorAgent]
- process_batch(texts: List[str], sources: List[str] | None = None) List[Dict[str, Any]] [source]#
Process multiple texts in batch to generate multi-hop QA pairs.
- Parameters:
texts (List[str]) – List of input texts to process.
sources (Optional[List[str]], optional) – List of source identifiers. (default: None)
- Returns:
- List of processed examples with QA pairs and metadata.
- Return type:
List[Dict[str, Any]]
- Raises:
ValueError – If the length of sources does not match the length of texts.
- process_text(text: str, source: str = 'user_input') List[Dict[str, Any]] [source]#
Process a single text to generate multi-hop QA pairs.
- Parameters:
text (str) – The input text to process.
source (str, optional) – Source identifier for the text. (default: "user_input")
- Returns:
- List of processed examples with QA pairs and metadata.
- Return type:
List[Dict[str, Any]]