camel.datagen.source2synth.data_processor
UserDataProcessor
A processor for generating multi-hop question-answer pairs from user data.
This class handles the processing of text data to generate multi-hop question-answer pairs using either an AI model or rule-based approaches. It manages the entire pipeline from text preprocessing to dataset curation.
Attributes:
- config (ProcessorConfig): Configuration for data processing parameters.
- rng (random.Random): Random number generator for reproducibility.
- multi_hop_agent (Optional[MultiHopGeneratorAgent]): Agent for generating QA pairs.
__init__
Initialize the UserDataProcessor.
Parameters:
- config (Optional[ProcessorConfig], optional): Configuration for data processing. (default: :obj:`None`)
process_text
Process a single text to generate multi-hop QA pairs.
Parameters:
- text (str): The input text to process.
- source (str, optional): Source identifier for the text. (default: :obj:`"user_input"`)
Returns:
List[Dict[str, Any]]: List of processed examples with QA pairs and metadata.
process_batch
Process multiple texts in batch to generate multi-hop QA pairs.
Parameters:
- texts (List[str]): List of input texts to process.
- sources (Optional[List[str]], optional): List of source identifiers. (default: :obj:`None`)
Returns:
List[Dict[str, Any]]: List of processed examples with QA pairs and metadata.
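How `sources` lines up with `texts` when it is omitted is not spelled out above. The following is a minimal sketch, assuming each text falls back to the `process_text` default of "user_input" and that a length mismatch is treated as an error. `pair_texts_with_sources` is a hypothetical helper for illustration, not part of the library.

```python
from typing import List, Optional, Tuple


def pair_texts_with_sources(
    texts: List[str],
    sources: Optional[List[str]] = None,
    default_source: str = "user_input",
) -> List[Tuple[str, str]]:
    """Pair each text with a source identifier (hypothetical helper)."""
    if sources is None:
        # Assumption: reuse the process_text default for every text.
        sources = [default_source] * len(texts)
    if len(sources) != len(texts):
        # Assumption: mismatched lengths are rejected rather than truncated.
        raise ValueError("texts and sources must have the same length")
    return list(zip(texts, sources))
```

Each resulting pair could then be handed to a single-text call such as `process_text(text, source)`.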
ExampleConstructor
Constructs training examples from raw text data.
This class handles the construction of training examples by preprocessing text, extracting information pairs, and generating question-answer pairs.
Attributes:
- config (ProcessorConfig): Configuration for example construction.
- multi_hop_agent (Optional[MultiHopGeneratorAgent]): Agent for QA generation.
__init__
Initialize the ExampleConstructor.
Parameters:
- config (ProcessorConfig): Configuration for example construction.
- multi_hop_agent (Optional[MultiHopGeneratorAgent], optional): Agent for generating multi-hop QA pairs. (default: :obj:`None`)
construct_examples
Construct training examples from raw data.
Parameters:
- raw_data (List[Dict[str, Any]]): List of raw data dictionaries containing text and metadata.
Returns:
List[Dict[str, Any]]: List of constructed examples with QA pairs and metadata.
_preprocess_text
Preprocess input text for example construction.
Parameters:
- text (str): Input text to preprocess.
Returns:
str: Preprocessed text, or empty string if text fails quality checks.
_check_text_quality
Check the quality of input text.
Parameters:
- text (str): Text to check quality for.
Returns:
bool: True if text passes quality checks, False otherwise.
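The actual quality criteria are not listed here. As a rough illustration of what such a check might look like, the sketch below requires a minimum length and at least two sentences, since a multi-hop question needs more than one fact to connect. The function name and both thresholds are assumptions, not the library's values.

```python
def check_text_quality(
    text: str, min_length: int = 20, min_sentences: int = 2
) -> bool:
    """Return True if the text looks long enough to support multi-hop QA."""
    stripped = text.strip()
    if len(stripped) < min_length:
        return False
    # Count sentence-ending punctuation as a rough sentence count.
    sentence_count = sum(stripped.count(mark) for mark in ".!?")
    return sentence_count >= min_sentences
```

A preprocessing step built on this could return the stripped text when the check passes and an empty string otherwise, matching the contract described for `_preprocess_text`.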
_extract_info_pairs
Extract information pairs and relationships from text.
Parameters:
- text (str): Input text to extract information from.
Returns:
List[Dict[str, Sequence[str]]]: List of dictionaries containing premise, intermediate, conclusion, and related contexts.
_generate_qa_pairs
Generate multi-hop question-answer pairs from information pairs.
Parameters:
- info_pairs (List[Dict[str, Sequence[str]]]): List of information pairs extracted from text.
Returns:
List[Dict[str, str]]: List of generated QA pairs.
_calculate_complexity
Calculate the complexity score for a set of QA pairs.
Parameters:
- qa_pairs (List[Dict[str, Any]]): List of QA pairs to calculate complexity for.
Returns:
float: Complexity score between 0.0 and 1.0.
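The scoring formula is not given here. One simple way to keep a score within 0.0 and 1.0 is to average per-pair scores and clamp the result. The sketch assumes each QA pair dictionary carries a numeric "complexity" key; that key name and `average_complexity` are illustrative only.

```python
from typing import Any, Dict, List


def average_complexity(qa_pairs: List[Dict[str, Any]]) -> float:
    """Average per-pair complexity, clamped to the range [0.0, 1.0]."""
    if not qa_pairs:
        # Assumption: an empty set of pairs scores 0.0.
        return 0.0
    total = sum(float(pair.get("complexity", 0.0)) for pair in qa_pairs)
    mean = total / len(qa_pairs)
    return max(0.0, min(1.0, mean))
```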
DataCurator
Manages and curates datasets of multi-hop question-answer pairs.
This class handles dataset management tasks including quality filtering, complexity filtering, deduplication, and dataset sampling.
Attributes:
- config (ProcessorConfig): Configuration for data curation parameters.
- rng (random.Random): Random number generator for reproducible sampling.
__init__
Initialize the DataCurator.
Parameters:
- config (ProcessorConfig): Configuration for data curation.
- rng (random.Random): Random number generator for reproducibility.
curate_dataset
Manage and curate a dataset through multiple filtering stages.
Parameters:
- examples (List[Dict[str, Any]]): List of examples to curate.
Returns:
List[Dict[str, Any]]: Curated dataset meeting quality criteria.
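A multi-stage curation pass can be pictured as a chain of list-to-list filters, each consuming the previous stage's output. The sketch below shows only that chaining pattern; `run_stages` and the two toy stages are hypothetical stand-ins, and the real stage order and criteria may differ.

```python
from typing import Any, Callable, Dict, List

Example = Dict[str, Any]
Stage = Callable[[List[Example]], List[Example]]


def run_stages(examples: List[Example], stages: List[Stage]) -> List[Example]:
    """Apply each stage in order, passing its output to the next one."""
    for stage in stages:
        examples = stage(examples)
    return examples


def drop_empty(examples: List[Example]) -> List[Example]:
    # Toy quality stage: keep examples that have at least one QA pair.
    return [ex for ex in examples if ex.get("qa_pairs")]


def cap_at_two(examples: List[Example]) -> List[Example]:
    # Toy sampling stage: keep at most the first two examples.
    return examples[:2]


raw = [
    {"qa_pairs": ["q1"]},
    {"qa_pairs": []},
    {"qa_pairs": ["q2"]},
    {"qa_pairs": ["q3"]},
]
curated = run_stages(raw, [drop_empty, cap_at_two])
```

Keeping every stage to the same signature makes it easy to reorder, skip, or test stages in isolation.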
_quality_filter
Filter examples based on quality criteria.
Parameters:
- examples (List[Dict[str, Any]]): List of examples to filter.
Returns:
List[Dict[str, Any]]: Examples that pass quality checks.
_check_qa_quality
Check the quality of question-answer pairs.
Parameters:
- qa_pairs (List[Dict[str, str]]): List of QA pairs to check.
Returns:
bool: True if QA pairs meet quality criteria, False otherwise.
_complexity_filter
Filter examples based on complexity threshold.
Removes examples with complexity scores below the configured threshold.
Parameters:
- examples (List[Dict[str, Any]]): List of examples to filter.
Returns:
List[Dict[str, Any]]: Examples meeting complexity threshold.
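The threshold rule described above amounts to a single comparison per example. A minimal sketch, assuming each example stores its score under a "complexity" key and that a score equal to the threshold is kept; both the key name and `filter_by_complexity` are illustrative.

```python
from typing import Any, Dict, List


def filter_by_complexity(
    examples: List[Dict[str, Any]], threshold: float
) -> List[Dict[str, Any]]:
    """Keep examples whose complexity is at or above the threshold."""
    # Assumption: a missing score is treated as 0.0 and therefore dropped
    # for any positive threshold.
    return [ex for ex in examples if ex.get("complexity", 0.0) >= threshold]
```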
_remove_duplicates
Remove duplicate examples from the dataset.
Parameters:
- examples (List[Dict[str, Any]]): List of examples to deduplicate.
Returns:
List[Dict[str, Any]]: Deduplicated examples.
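What counts as a duplicate is not defined here. One conservative interpretation is exact-content matching, sketched below by fingerprinting each example as canonical JSON and keeping the first occurrence. `remove_exact_duplicates` is a hypothetical helper; the library may compare examples differently, for instance by question text only.

```python
import json
from typing import Any, Dict, List


def remove_exact_duplicates(
    examples: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Drop examples whose content exactly matches an earlier one."""
    seen = set()
    unique = []
    for example in examples:
        # sort_keys makes the fingerprint independent of key order;
        # default=str covers values that JSON cannot serialize directly.
        fingerprint = json.dumps(example, sort_keys=True, default=str)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(example)
    return unique
```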
_sample_dataset
Sample examples to match target dataset size.
Parameters:
- examples (List[Dict[str, Any]]): List of examples to sample from.
Returns:
List[Dict[str, Any]]: Sampled dataset of target size or smaller.
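The "target size or smaller" behavior can be sketched with `random.Random.sample`, which draws without replacement. Passing a seeded generator keeps the draw reproducible. `sample_to_size` and its `target_size` parameter are assumptions for illustration; in the class itself the target presumably comes from the configuration.

```python
import random
from typing import Any, Dict, List


def sample_to_size(
    examples: List[Dict[str, Any]], target_size: int, rng: random.Random
) -> List[Dict[str, Any]]:
    """Return at most target_size examples, sampled without replacement."""
    if len(examples) <= target_size:
        # Nothing to trim: return a copy of everything available.
        return list(examples)
    return rng.sample(examples, target_size)
```

Two generators created with the same seed produce the same sample, which is what makes a curated dataset repeatable across runs.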