camel.datagen.source2synth.data_processor

UserDataProcessor

class UserDataProcessor:

A processor for generating multi-hop question-answer pairs from user data.

This class handles the processing of text data to generate multi-hop question-answer pairs using either an AI model or rule-based approaches. It manages the entire pipeline from text preprocessing to dataset curation.

Attributes:

config (ProcessorConfig): Configuration for data processing parameters.
rng (random.Random): Random number generator for reproducibility.
multi_hop_agent (Optional[MultiHopGeneratorAgent]): Agent for generating QA pairs.

__init__

def __init__(self, config: Optional[ProcessorConfig] = None):

Initialize the UserDataProcessor.

Parameters:

config (Optional[ProcessorConfig], optional): Configuration for data processing. (default: None)

process_text

def process_text(self, text: str, source: str = 'user_input'):

Process a single text to generate multi-hop QA pairs.

Parameters:

text (str): The input text to process.
source (str, optional): Source identifier for the text. (default: "user_input")

Returns:

List[Dict[str, Any]]: List of processed examples with QA pairs and metadata.

process_batch

def process_batch(self, texts: List[str], sources: Optional[List[str]] = None):

Process multiple texts in batch to generate multi-hop QA pairs.

Parameters:

texts (List[str]): List of input texts to process.
sources (Optional[List[str]], optional): List of source identifiers, one per input text. (default: None)

Returns:

List[Dict[str, Any]]: List of processed examples with QA pairs and metadata.
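A minimal sketch of how a batch call might align texts with source identifiers. This page does not state what happens when `sources` is omitted; the sketch assumes each text falls back to the `"user_input"` label that `process_text` uses by default. The helper name `pair_texts_with_sources` and the length check are hypothetical, not part of the documented API.

```python
from typing import Dict, List, Optional


def pair_texts_with_sources(
    texts: List[str], sources: Optional[List[str]] = None
) -> List[Dict[str, str]]:
    """Pair each text with a source label (hypothetical helper)."""
    if sources is None:
        # Assumption: reuse the process_text default for every text.
        sources = ["user_input"] * len(texts)
    if len(sources) != len(texts):
        raise ValueError("texts and sources must have the same length")
    return [{"text": t, "source": s} for t, s in zip(texts, sources)]
```

Validating the lengths up front avoids silently dropping texts, which `zip` would otherwise do when the two lists differ in size.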

ExampleConstructor

class ExampleConstructor:

Constructs training examples from raw text data.

This class handles the construction of training examples by preprocessing text, extracting information pairs, and generating question-answer pairs.

Attributes:

config (ProcessorConfig): Configuration for example construction.
multi_hop_agent (Optional[MultiHopGeneratorAgent]): Agent for QA generation.

__init__

def __init__(
    self,
    config: ProcessorConfig,
    multi_hop_agent: Optional[MultiHopGeneratorAgent] = None
):

Initialize the ExampleConstructor.

Parameters:

config (ProcessorConfig): Configuration for example construction.
multi_hop_agent (Optional[MultiHopGeneratorAgent], optional): Agent for generating multi-hop QA pairs. (default: None)

construct_examples

def construct_examples(self, raw_data: List[Dict[str, Any]]):

Construct training examples from raw data.

Parameters:

raw_data (List[Dict[str, Any]]): List of raw data dictionaries containing text and metadata.

Returns:

List[Dict[str, Any]]: List of constructed examples with QA pairs and metadata.

_preprocess_text

def _preprocess_text(self, text: str):

Preprocess input text for example construction.

Parameters:

text (str): Input text to preprocess.

Returns:

str: Preprocessed text, or empty string if text fails quality checks.

_check_text_quality

def _check_text_quality(self, text: str):

Check the quality of input text.

Parameters:

text (str): Text to check quality for.

Returns:

bool: True if text passes quality checks, False otherwise.
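The relationship between `_preprocess_text` and `_check_text_quality` can be sketched as follows: the text is normalized, then returned only if it passes the quality check, otherwise an empty string is returned. The actual criteria are not listed on this page; the thresholds `MIN_CHARS` and `MIN_SENTENCES`, the whitespace normalization, and both function names are assumptions made for illustration.

```python
MIN_CHARS = 20      # hypothetical minimum length
MIN_SENTENCES = 2   # hypothetical minimum sentence count


def check_text_quality_sketch(text: str) -> bool:
    """Return True if the text is long enough and has enough sentences."""
    if len(text) < MIN_CHARS:
        return False
    # Rough sentence count: number of sentence-ending punctuation marks.
    sentence_count = sum(text.count(mark) for mark in ".!?")
    return sentence_count >= MIN_SENTENCES


def preprocess_text_sketch(text: str) -> str:
    """Collapse whitespace, then return the text or "" if it fails checks."""
    cleaned = " ".join(text.split())
    return cleaned if check_text_quality_sketch(cleaned) else ""
```

Returning an empty string rather than raising lets a caller skip low-quality inputs with a simple truthiness test.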

_extract_info_pairs

def _extract_info_pairs(self, text: str):

Extract information pairs and relationships from text.

Parameters:

text (str): Input text to extract information from.

Returns:

List[Dict[str, Sequence[str]]]: List of dictionaries containing premise, intermediate, conclusion, and related contexts.
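To illustrate the shape of the returned value, here is a naive sketch that slides a three-sentence window over the text and labels the sentences as premise, intermediate, and conclusion. This is not the library's extraction logic; the function name, the sentence splitter, and the exact key `related_contexts` are assumptions based only on the return description above.

```python
import re
from typing import Dict, List, Sequence


def extract_info_pairs_sketch(text: str) -> List[Dict[str, Sequence[str]]]:
    """Build premise/intermediate/conclusion triples from adjacent sentences."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [
        s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()
    ]
    pairs: List[Dict[str, Sequence[str]]] = []
    for i in range(len(sentences) - 2):
        pairs.append(
            {
                "premise": [sentences[i]],
                "intermediate": [sentences[i + 1]],
                "conclusion": [sentences[i + 2]],
                # Hypothetical key name for the supporting context.
                "related_contexts": sentences[i : i + 3],
            }
        )
    return pairs
```

Each value is a sequence of strings, matching the documented `Dict[str, Sequence[str]]` element type.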

_generate_qa_pairs

def _generate_qa_pairs(self, info_pairs: List[Dict[str, Sequence[str]]]):

Generate multi-hop question-answer pairs from information pairs.

Parameters:

info_pairs (List[Dict[str, Sequence[str]]]): List of information pairs extracted from text.

Returns:

List[Dict[str, str]]: List of generated QA pairs.

_calculate_complexity

def _calculate_complexity(self, qa_pairs: List[Dict[str, Any]]):

Calculate the complexity score for a set of QA pairs.

Parameters:

qa_pairs (List[Dict[str, Any]]): List of QA pairs to calculate complexity for.

Returns:

float: Complexity score between 0.0 and 1.0.
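The scoring formula is not given on this page. As one possible way to produce a score bounded to the documented 0.0 to 1.0 range, the sketch below averages a capped per-pair value. The `reasoning_steps` key, the `MAX_STEPS` constant, and the function name are hypothetical.

```python
from typing import Any, Dict, List

MAX_STEPS = 5  # hypothetical number of steps that counts as "fully complex"


def calculate_complexity_sketch(qa_pairs: List[Dict[str, Any]]) -> float:
    """Average a capped per-pair score so the result stays in [0.0, 1.0]."""
    if not qa_pairs:
        return 0.0
    scores = []
    for pair in qa_pairs:
        # Assumption: more reasoning steps means a more complex pair.
        steps = len(pair.get("reasoning_steps", []))
        scores.append(min(steps / MAX_STEPS, 1.0))
    return sum(scores) / len(scores)
```

Capping each term before averaging guarantees the bound regardless of how many steps a single pair contains.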

DataCurator

class DataCurator:

Manages and curates datasets of multi-hop question-answer pairs.

This class handles dataset management tasks including quality filtering, complexity filtering, deduplication, and dataset sampling.

Attributes:

config (ProcessorConfig): Configuration for data curation parameters.
rng (random.Random): Random number generator for reproducible sampling.

__init__

def __init__(self, config: ProcessorConfig, rng: random.Random):

Initialize the DataCurator.

Parameters:

config (ProcessorConfig): Configuration for data curation.
rng (random.Random): Random number generator for reproducibility.

curate_dataset

def curate_dataset(self, examples: List[Dict[str, Any]]):

Manage and curate a dataset through multiple filtering stages.

Parameters:

examples (List[Dict[str, Any]]): List of examples to curate.

Returns:

List[Dict[str, Any]]: Curated dataset meeting quality criteria.
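The multi-stage flow can be pictured as a chain of functions, each taking a list of examples and returning a smaller list. The sketch below shows only that chaining pattern; `run_stages` and the two stand-in stages are hypothetical and do not reproduce the library's filters.

```python
from typing import Any, Callable, Dict, List

Example = Dict[str, Any]
Stage = Callable[[List[Example]], List[Example]]


def run_stages(examples: List[Example], stages: List[Stage]) -> List[Example]:
    """Apply each stage in order, stopping early if nothing is left."""
    for stage in stages:
        examples = stage(examples)
        if not examples:
            break
    return examples


def drop_empty(examples: List[Example]) -> List[Example]:
    """Stand-in stage: keep examples that have at least one QA pair."""
    return [e for e in examples if e.get("qa_pairs")]


def cap_at_two(examples: List[Example]) -> List[Example]:
    """Stand-in stage: keep at most the first two examples."""
    return examples[:2]
```

With this shape, the stages named in the class description (quality filtering, complexity filtering, deduplication, sampling) would each be one entry in `stages`.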

_quality_filter

def _quality_filter(self, examples: List[Dict[str, Any]]):

Filter examples based on quality criteria.

Parameters:

examples (List[Dict[str, Any]]): List of examples to filter.

Returns:

List[Dict[str, Any]]: Examples that pass quality checks.

_check_qa_quality

def _check_qa_quality(self, qa_pairs: List[Dict[str, str]]):

Check the quality of question-answer pairs.

Parameters:

qa_pairs (List[Dict[str, str]]): List of QA pairs to check.

Returns:

bool: True if QA pairs meet quality criteria, False otherwise.
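The quality criteria are not spelled out here. A minimal sketch, assuming each pair is a dictionary with `question` and `answer` keys (hypothetical names) and that both must be non-empty:

```python
from typing import Dict, List


def check_qa_quality_sketch(qa_pairs: List[Dict[str, str]]) -> bool:
    """Return True only if there is at least one pair and none is blank."""
    if not qa_pairs:
        return False
    return all(
        bool(pair.get("question", "").strip())
        and bool(pair.get("answer", "").strip())
        for pair in qa_pairs
    )
```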

_complexity_filter

def _complexity_filter(self, examples: List[Dict[str, Any]]):

Filter examples based on complexity threshold.

Removes examples with complexity scores below the configured threshold.

Parameters:

examples (List[Dict[str, Any]]): List of examples to filter.

Returns:

List[Dict[str, Any]]: Examples meeting complexity threshold.
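A threshold filter of this kind can be sketched as a single list comprehension. The `complexity` key and the explicit `threshold` argument are assumptions; in the class itself the threshold comes from the configuration, whose field name is not shown on this page.

```python
from typing import Any, Dict, List


def complexity_filter_sketch(
    examples: List[Dict[str, Any]], threshold: float
) -> List[Dict[str, Any]]:
    """Keep examples whose complexity score is at or above the threshold."""
    # Assumption: a missing score is treated as 0.0 and filtered out.
    return [e for e in examples if e.get("complexity", 0.0) >= threshold]
```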

_remove_duplicates

def _remove_duplicates(self, examples: List[Dict[str, Any]]):

Remove duplicate examples from the dataset.

Parameters:

examples (List[Dict[str, Any]]): List of examples to deduplicate.

Returns:

List[Dict[str, Any]]: Deduplicated examples.
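How duplicates are detected is not described on this page. One common approach, sketched below under that caveat, is to build a canonical key from each example's QA pairs and keep only the first example seen for each key. The `qa_pairs` key and the function name are hypothetical.

```python
import json
from typing import Any, Dict, List


def remove_duplicates_sketch(
    examples: List[Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Keep the first example for each distinct set of QA pairs."""
    seen = set()
    unique = []
    for example in examples:
        # Serialize with sorted keys so dict ordering does not affect the key.
        key = json.dumps(example.get("qa_pairs", []), sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique
```

Iterating in order and appending to a list preserves the original ordering, which a plain `set` of examples would not.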

_sample_dataset

def _sample_dataset(self, examples: List[Dict[str, Any]]):

Sample examples to match target dataset size.

Parameters:

examples (List[Dict[str, Any]]): List of examples to sample from.

Returns:

List[Dict[str, Any]]: Sampled dataset of target size or smaller.
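The "target size or smaller" behavior can be sketched with `random.Random.sample`: if fewer examples are available than requested, all of them are returned. The `target_size` argument stands in for a configuration value whose field name is not shown on this page, and the function name is hypothetical.

```python
import random
from typing import Any, Dict, List


def sample_dataset_sketch(
    examples: List[Dict[str, Any]], target_size: int, rng: random.Random
) -> List[Dict[str, Any]]:
    """Return at most target_size examples, sampled without replacement."""
    if len(examples) <= target_size:
        # Nothing to sample: return a copy of everything available.
        return list(examples)
    return rng.sample(examples, target_size)
```

Passing in a seeded `random.Random` instance, as the class constructor does, makes the selection reproducible across runs.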
