camel.datagen.source2synth package#

Submodules#

camel.datagen.source2synth.data_processor module#

class camel.datagen.source2synth.data_processor.DataCurator(config: ProcessorConfig, rng: Random)[source]#

Bases: object

Manages and curates datasets of multi-hop question-answer pairs.

This class handles dataset management tasks including quality filtering, complexity filtering, deduplication, and dataset sampling.

config#

Configuration for data curation parameters.

Type:

ProcessorConfig

rng#

Random number generator for reproducible sampling.

Type:

random.Random

curate_dataset(examples: List[Dict[str, Any]]) List[Dict[str, Any]][source]#

Manage and curate a dataset through multiple filtering stages.

Parameters:

examples (List[Dict[str, Any]]) – List of examples to curate.

Returns:

Curated dataset meeting quality criteria.

Return type:

List[Dict[str, Any]]
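
A minimal usage sketch for DataCurator. The exact schema of each example dict is not documented here, so the 'question'/'answer' keys below are illustrative assumptions; in practice the examples would come from ExampleConstructor or UserDataProcessor.

import random

from camel.datagen.source2synth import DataCurator, ProcessorConfig

config = ProcessorConfig(seed=42, dataset_size=100)
curator = DataCurator(config=config, rng=random.Random(42))

examples = [
    # Assumed keys for illustration only; not a documented schema.
    {"question": "What is the capital of France?", "answer": "Paris"},
]
curated = curator.curate_dataset(examples)
print(f"Kept {len(curated)} of {len(examples)} examples")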

class camel.datagen.source2synth.data_processor.ExampleConstructor(config: ProcessorConfig, multi_hop_agent: MultiHopGeneratorAgent | None = None)[source]#

Bases: object

Constructs training examples from raw text data.

This class handles the construction of training examples by preprocessing text, extracting information pairs, and generating question-answer pairs.

config#

Configuration for example construction.

Type:

ProcessorConfig

multi_hop_agent#

Agent for QA generation.

Type:

Optional[MultiHopGeneratorAgent]

construct_examples(raw_data: List[Dict[str, Any]]) List[Dict[str, Any]][source]#

Construct training examples from raw data.

Parameters:

raw_data (List[Dict[str, Any]]) – List of raw data dictionaries containing text and metadata.

Returns:

List of constructed examples with QA pairs and metadata.

Return type:

List[Dict[str, Any]]
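
A minimal sketch of construct_examples. The 'text' and 'metadata' keys of each raw-data dict are assumptions inferred from the description above, not a documented schema, and QA generation with a MultiHopGeneratorAgent requires a configured model backend.

from camel.datagen.source2synth import ExampleConstructor, ProcessorConfig

constructor = ExampleConstructor(config=ProcessorConfig(seed=0))
raw_data = [
    {
        # Assumed keys: raw text plus accompanying metadata.
        "text": "France is a country in Europe. Paris is its capital.",
        "metadata": {"source": "user_input"},
    },
]
examples = constructor.construct_examples(raw_data)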

class camel.datagen.source2synth.data_processor.UserDataProcessor(config: ProcessorConfig | None = None)[source]#

Bases: object

A processor for generating multi-hop question-answer pairs from user data.

This class handles the processing of text data to generate multi-hop question-answer pairs using either an AI model or rule-based approaches. It manages the entire pipeline from text preprocessing to dataset curation.

config#

Configuration for data processing parameters.

Type:

ProcessorConfig

rng#

Random number generator for reproducibility.

Type:

random.Random

multi_hop_agent#

Agent for generating QA pairs.

Type:

Optional[MultiHopGeneratorAgent]

process_batch(texts: List[str], sources: List[str] | None = None) List[Dict[str, Any]][source]#

Process multiple texts in batch to generate multi-hop QA pairs.

Parameters:
  • texts (List[str]) – List of input texts to process.

  • sources (Optional[List[str]], optional) – List of source identifiers. (default: None)

Returns:

List of processed examples with QA pairs and metadata.

Return type:

List[Dict[str, Any]]

Raises:

ValueError – If the length of sources does not match the length of texts.

process_text(text: str, source: str = 'user_input') List[Dict[str, Any]][source]#

Process a single text to generate multi-hop QA pairs.

Parameters:
  • text (str) – The input text to process.

  • source (str, optional) – Source identifier for the text. (default: "user_input")

Returns:

List of processed examples with QA pairs and metadata.

Return type:

List[Dict[str, Any]]
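
A minimal usage sketch for both entry points. With use_ai_model=True (the default) a configured model backend is assumed; the source strings are arbitrary identifiers.

from camel.datagen.source2synth import ProcessorConfig, UserDataProcessor

processor = UserDataProcessor(config=ProcessorConfig(seed=42))

# Process a single text.
results = processor.process_text(
    "France is a country in Europe. Paris is its capital.",
    source="wiki_snippet",
)

# Process a batch: sources must match texts in length, otherwise ValueError is raised.
batch_results = processor.process_batch(
    texts=[
        "France is a country in Europe. Paris is its capital.",
        "The Eiffel Tower is located in Paris.",
    ],
    sources=["doc_1", "doc_2"],
)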

camel.datagen.source2synth.models module#

class camel.datagen.source2synth.models.ContextPrompt(*, main_context: str, related_contexts: List[str] | None = None)[source]#

Bases: BaseModel

A context prompt for generating multi-hop question-answer pairs.

main_context#

The primary context for generating QA pairs.

Type:

str

related_contexts#

Additional related contexts.

Type:

Optional[List[str]]

main_context: str#
model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'main_context': FieldInfo(annotation=str, required=True, description='The main context for generating the question-answer pair.'), 'related_contexts': FieldInfo(annotation=Union[List[str], NoneType], required=False, default=None, description='Additional contexts related to the main context.')}#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

related_contexts: List[str] | None#
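
Constructing a ContextPrompt is a plain Pydantic instantiation; only main_context is required:

from camel.datagen.source2synth.models import ContextPrompt

prompt = ContextPrompt(
    main_context="Paris is the capital city of France.",
    related_contexts=["France is a country in Europe."],
)
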
class camel.datagen.source2synth.models.MultiHopQA(*, question: str, reasoning_steps: List[ReasoningStep], answer: str, supporting_facts: List[str], type: str)[source]#

Bases: BaseModel

A multi-hop question-answer pair with reasoning steps and supporting facts.

question#

The question requiring multi-hop reasoning.

Type:

str

reasoning_steps#

List of reasoning steps leading to the answer.

Type:

List[ReasoningStep]

answer#

The final answer to the question.

Type:

str

supporting_facts#

List of facts supporting the reasoning.

Type:

List[str]

type#

The type of question-answer pair.

Type:

str

class Config[source]#

Bases: object

json_schema_extra: ClassVar[Dict[str, Any]] = {'example': {'answer': 'Paris', 'question': 'What is the capital of France?', 'reasoning_steps': [{'step': 'Identify the country France.'}, {'step': 'Find the capital city of France.'}], 'supporting_facts': ['France is a country in Europe.', 'Paris is the capital city of France.'], 'type': 'multi_hop_qa'}}#
answer: str#
model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'json_schema_extra': {'example': {'answer': 'Paris', 'question': 'What is the capital of France?', 'reasoning_steps': [{'step': 'Identify the country France.'}, {'step': 'Find the capital city of France.'}], 'supporting_facts': ['France is a country in Europe.', 'Paris is the capital city of France.'], 'type': 'multi_hop_qa'}}}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'answer': FieldInfo(annotation=str, required=True, description='The answer to the multi-hop question.'), 'question': FieldInfo(annotation=str, required=True, description='The question that requires multi-hop reasoning.'), 'reasoning_steps': FieldInfo(annotation=List[ReasoningStep], required=True, description='The steps involved in reasoning to answer the question.'), 'supporting_facts': FieldInfo(annotation=List[str], required=True, description='Facts that support the reasoning and answer.'), 'type': FieldInfo(annotation=str, required=True, description='The type of question-answer pair.')}#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

question: str#
reasoning_steps: List[ReasoningStep]#
supporting_facts: List[str]#
type: str#
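
A MultiHopQA instance mirroring the json_schema_extra example above, built from ReasoningStep objects:

from camel.datagen.source2synth.models import MultiHopQA, ReasoningStep

qa = MultiHopQA(
    question="What is the capital of France?",
    reasoning_steps=[
        ReasoningStep(step="Identify the country France."),
        ReasoningStep(step="Find the capital city of France."),
    ],
    answer="Paris",
    supporting_facts=[
        "France is a country in Europe.",
        "Paris is the capital city of France.",
    ],
    type="multi_hop_qa",
)
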
class camel.datagen.source2synth.models.ReasoningStep(*, step: str)[source]#

Bases: BaseModel

A single step in a multi-hop reasoning process.

step#

The textual description of the reasoning step.

Type:

str

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'step': FieldInfo(annotation=str, required=True, description='A single step in the reasoning process.')}#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

step: str#

camel.datagen.source2synth.user_data_processor_config module#

class camel.datagen.source2synth.user_data_processor_config.ProcessorConfig(*, seed: int = None, min_length: int = 50, max_length: int = 512, complexity_threshold: float = 0.5, dataset_size: int = 1000, use_ai_model: bool = True, hop_generating_agent: MultiHopGeneratorAgent = None)[source]#

Bases: BaseModel

Data processing configuration class.

complexity_threshold: float#
dataset_size: int#
hop_generating_agent: MultiHopGeneratorAgent#
max_length: int#
min_length: int#
model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'frozen': False, 'protected_namespaces': (), 'validate_assignment': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'complexity_threshold': FieldInfo(annotation=float, required=False, default=0.5, description='Complexity threshold for processing', metadata=[Ge(ge=0.0), Le(le=1.0)]), 'dataset_size': FieldInfo(annotation=int, required=False, default=1000, description='Target size of the dataset', metadata=[Gt(gt=0)]), 'hop_generating_agent': FieldInfo(annotation=MultiHopGeneratorAgent, required=False, default_factory=<lambda>, description='Agent for generating multi-hop text'), 'max_length': FieldInfo(annotation=int, required=False, default=512, description='Maximum text length', metadata=[Gt(gt=0)]), 'min_length': FieldInfo(annotation=int, required=False, default=50, description='Minimum text length', metadata=[Ge(ge=0)]), 'seed': FieldInfo(annotation=int, required=False, default_factory=<lambda>, description='Random seed for reproducibility'), 'use_ai_model': FieldInfo(annotation=bool, required=False, default=True, description='Whether to use AI model in processing')}#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

seed: int#
use_ai_model: bool#
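
A minimal configuration sketch. All fields have defaults (seed and hop_generating_agent fall back to default factories when omitted), and because validate_assignment is enabled, later assignments are re-validated against the field constraints shown above.

from camel.datagen.source2synth import ProcessorConfig

config = ProcessorConfig(
    seed=42,
    min_length=50,
    max_length=512,
    complexity_threshold=0.5,  # must lie in [0.0, 1.0]
    dataset_size=1000,         # must be > 0
    use_ai_model=True,
)
config.complexity_threshold = 0.7  # re-validated on assignment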

Module contents#

class camel.datagen.source2synth.DataCurator(config: ProcessorConfig, rng: Random)[source]#

Bases: object

Manages and curates datasets of multi-hop question-answer pairs.

This class handles dataset management tasks including quality filtering, complexity filtering, deduplication, and dataset sampling.

config#

Configuration for data curation parameters.

Type:

ProcessorConfig

rng#

Random number generator for reproducible sampling.

Type:

random.Random

curate_dataset(examples: List[Dict[str, Any]]) List[Dict[str, Any]][source]#

Manage and curate a dataset through multiple filtering stages.

Parameters:

examples (List[Dict[str, Any]]) – List of examples to curate.

Returns:

Curated dataset meeting quality criteria.

Return type:

List[Dict[str, Any]]

class camel.datagen.source2synth.ExampleConstructor(config: ProcessorConfig, multi_hop_agent: MultiHopGeneratorAgent | None = None)[source]#

Bases: object

Constructs training examples from raw text data.

This class handles the construction of training examples by preprocessing text, extracting information pairs, and generating question-answer pairs.

config#

Configuration for example construction.

Type:

ProcessorConfig

multi_hop_agent#

Agent for QA generation.

Type:

Optional[MultiHopGeneratorAgent]

construct_examples(raw_data: List[Dict[str, Any]]) List[Dict[str, Any]][source]#

Construct training examples from raw data.

Parameters:

raw_data (List[Dict[str, Any]]) – List of raw data dictionaries containing text and metadata.

Returns:

List of constructed examples with QA pairs and metadata.

Return type:

List[Dict[str, Any]]

class camel.datagen.source2synth.MultiHopQA(*, question: str, reasoning_steps: List[ReasoningStep], answer: str, supporting_facts: List[str], type: str)[source]#

Bases: BaseModel

A multi-hop question-answer pair with reasoning steps and supporting facts.

question#

The question requiring multi-hop reasoning.

Type:

str

reasoning_steps#

List of reasoning steps leading to the answer.

Type:

List[ReasoningStep]

answer#

The final answer to the question.

Type:

str

supporting_facts#

List of facts supporting the reasoning.

Type:

List[str]

type#

The type of question-answer pair.

Type:

str

class Config[source]#

Bases: object

json_schema_extra: ClassVar[Dict[str, Any]] = {'example': {'answer': 'Paris', 'question': 'What is the capital of France?', 'reasoning_steps': [{'step': 'Identify the country France.'}, {'step': 'Find the capital city of France.'}], 'supporting_facts': ['France is a country in Europe.', 'Paris is the capital city of France.'], 'type': 'multi_hop_qa'}}#
answer: str#
model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'json_schema_extra': {'example': {'answer': 'Paris', 'question': 'What is the capital of France?', 'reasoning_steps': [{'step': 'Identify the country France.'}, {'step': 'Find the capital city of France.'}], 'supporting_facts': ['France is a country in Europe.', 'Paris is the capital city of France.'], 'type': 'multi_hop_qa'}}}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'answer': FieldInfo(annotation=str, required=True, description='The answer to the multi-hop question.'), 'question': FieldInfo(annotation=str, required=True, description='The question that requires multi-hop reasoning.'), 'reasoning_steps': FieldInfo(annotation=List[ReasoningStep], required=True, description='The steps involved in reasoning to answer the question.'), 'supporting_facts': FieldInfo(annotation=List[str], required=True, description='Facts that support the reasoning and answer.'), 'type': FieldInfo(annotation=str, required=True, description='The type of question-answer pair.')}#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

question: str#
reasoning_steps: List[ReasoningStep]#
supporting_facts: List[str]#
type: str#
class camel.datagen.source2synth.ProcessorConfig(*, seed: int = None, min_length: int = 50, max_length: int = 512, complexity_threshold: float = 0.5, dataset_size: int = 1000, use_ai_model: bool = True, hop_generating_agent: MultiHopGeneratorAgent = None)[source]#

Bases: BaseModel

Data processing configuration class.

complexity_threshold: float#
dataset_size: int#
hop_generating_agent: MultiHopGeneratorAgent#
max_length: int#
min_length: int#
model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'frozen': False, 'protected_namespaces': (), 'validate_assignment': True}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'complexity_threshold': FieldInfo(annotation=float, required=False, default=0.5, description='Complexity threshold for processing', metadata=[Ge(ge=0.0), Le(le=1.0)]), 'dataset_size': FieldInfo(annotation=int, required=False, default=1000, description='Target size of the dataset', metadata=[Gt(gt=0)]), 'hop_generating_agent': FieldInfo(annotation=MultiHopGeneratorAgent, required=False, default_factory=<lambda>, description='Agent for generating multi-hop text'), 'max_length': FieldInfo(annotation=int, required=False, default=512, description='Maximum text length', metadata=[Gt(gt=0)]), 'min_length': FieldInfo(annotation=int, required=False, default=50, description='Minimum text length', metadata=[Ge(ge=0)]), 'seed': FieldInfo(annotation=int, required=False, default_factory=<lambda>, description='Random seed for reproducibility'), 'use_ai_model': FieldInfo(annotation=bool, required=False, default=True, description='Whether to use AI model in processing')}#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

seed: int#
use_ai_model: bool#
class camel.datagen.source2synth.ReasoningStep(*, step: str)[source]#

Bases: BaseModel

A single step in a multi-hop reasoning process.

step#

The textual description of the reasoning step.

Type:

str

model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#

A dictionary of computed field names and their corresponding ComputedFieldInfo objects.

model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

model_fields: ClassVar[Dict[str, FieldInfo]] = {'step': FieldInfo(annotation=str, required=True, description='A single step in the reasoning process.')}#

Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.

This replaces Model.__fields__ from Pydantic V1.

step: str#
class camel.datagen.source2synth.UserDataProcessor(config: ProcessorConfig | None = None)[source]#

Bases: object

A processor for generating multi-hop question-answer pairs from user data.

This class handles the processing of text data to generate multi-hop question-answer pairs using either an AI model or rule-based approaches. It manages the entire pipeline from text preprocessing to dataset curation.

config#

Configuration for data processing parameters.

Type:

ProcessorConfig

rng#

Random number generator for reproducibility.

Type:

random.Random

multi_hop_agent#

Agent for generating QA pairs.

Type:

Optional[MultiHopGeneratorAgent]

process_batch(texts: List[str], sources: List[str] | None = None) List[Dict[str, Any]][source]#

Process multiple texts in batch to generate multi-hop QA pairs.

Parameters:
  • texts (List[str]) – List of input texts to process.

  • sources (Optional[List[str]], optional) – List of source identifiers. (default: None)

Returns:

List of processed examples with QA pairs and metadata.

Return type:

List[Dict[str, Any]]

Raises:

ValueError – If the length of sources does not match the length of texts.

process_text(text: str, source: str = 'user_input') List[Dict[str, Any]][source]#

Process a single text to generate multi-hop QA pairs.

Parameters:
  • text (str) – The input text to process.

  • source (str, optional) – Source identifier for the text. (default: "user_input")

Returns:

List of processed examples with QA pairs and metadata.

Return type:

List[Dict[str, Any]]
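
A hedged end-to-end sketch combining the package-level classes: texts are processed into multi-hop QA examples and then curated. Since UserDataProcessor is described as managing the full pipeline through dataset curation, the explicit DataCurator step may be redundant in practice; it is shown here only to illustrate the standalone API. Running this requires a configured model backend when use_ai_model is True.

import random

from camel.datagen.source2synth import (
    DataCurator,
    ProcessorConfig,
    UserDataProcessor,
)

config = ProcessorConfig(seed=42, dataset_size=100)

# Generate multi-hop QA examples from raw texts.
processor = UserDataProcessor(config=config)
examples = processor.process_batch(
    texts=[
        "France is a country in Europe. Paris is its capital.",
        "The Eiffel Tower is located in Paris.",
    ],
    sources=["doc_1", "doc_2"],
)

# Apply quality and complexity filtering, deduplication, and sampling.
curator = DataCurator(config=config, rng=random.Random(config.seed))
curated = curator.curate_dataset(examples)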