> ## Documentation Index
> Fetch the complete documentation index at: https://docs.camel-ai.org/llms.txt
> Use this file to discover all available pages before exploring further.

# Camel.datagen.source2synth.data processor

<a id="camel.datagen.source2synth.data_processor" />

<a id="camel.datagen.source2synth.data_processor.UserDataProcessor" />

## UserDataProcessor

```python theme={"system"}
class UserDataProcessor:
```

A processor for generating multi-hop question-answer pairs from user
data.

This class handles the processing of text data to generate multi-hop
question-answer pairs using either an AI model or rule-based approaches.
It manages the entire pipeline from text preprocessing to dataset curation.

**Parameters:**

* **config** (ProcessorConfig): Configuration for data processing parameters.
* **rng** (random.Random): Random number generator for reproducibility.
* **multi\_hop\_agent** (Optional\[MultiHopGeneratorAgent]): Agent for generating QA pairs.

<a id="camel.datagen.source2synth.data_processor.UserDataProcessor.__init__" />

### **init**

```python theme={"system"}
def __init__(self, config: Optional[ProcessorConfig] = None):
```

Initialize the UserDataProcessor.

**Parameters:**

* **config** (Optional\[ProcessorConfig], optional): Configuration for data processing. (default: :obj:`None`)

<a id="camel.datagen.source2synth.data_processor.UserDataProcessor.process_text" />

### process\_text

```python theme={"system"}
def process_text(self, text: str, source: str = 'user_input'):
```

Process a single text to generate multi-hop QA pairs.

**Parameters:**

* **text** (str): The input text to process.
* **source** (str, optional): Source identifier for the text. (default: :obj:`"user_input"`)

**Returns:**

List\[Dict\[str, Any]]: List of processed examples with QA pairs and
metadata.

<a id="camel.datagen.source2synth.data_processor.UserDataProcessor.process_batch" />

### process\_batch

```python theme={"system"}
def process_batch(self, texts: List[str], sources: Optional[List[str]] = None):
```

Process multiple texts in batch to generate multi-hop QA pairs.

**Parameters:**

* **texts** (List\[str]): List of input texts to process.
* **sources** (Optional\[List\[str]], optional): List of source identifiers. (default: :obj:`None`)

**Returns:**

List\[Dict\[str, Any]]: List of processed examples with QA pairs and
metadata.

<a id="camel.datagen.source2synth.data_processor.ExampleConstructor" />

## ExampleConstructor

```python theme={"system"}
class ExampleConstructor:
```

Constructs training examples from raw text data.

This class handles the construction of training examples by preprocessing
text, extracting information pairs, and generating question-answer pairs.

**Parameters:**

* **config** (ProcessorConfig): Configuration for example construction.
* **multi\_hop\_agent** (Optional\[MultiHopGeneratorAgent]): Agent for QA generation.

<a id="camel.datagen.source2synth.data_processor.ExampleConstructor.__init__" />

### **init**

```python theme={"system"}
def __init__(
    self,
    config: ProcessorConfig,
    multi_hop_agent: Optional[MultiHopGeneratorAgent] = None
):
```

Initialize the ExampleConstructor.

**Parameters:**

* **config** (ProcessorConfig): Configuration for example construction.
* **multi\_hop\_agent** (Optional\[MultiHopGeneratorAgent], optional): Agent for generating multi-hop QA pairs. (default: :obj:`None`)

<a id="camel.datagen.source2synth.data_processor.ExampleConstructor.construct_examples" />

### construct\_examples

```python theme={"system"}
def construct_examples(self, raw_data: List[Dict[str, Any]]):
```

Construct training examples from raw data.

**Parameters:**

* **raw\_data** (List\[Dict\[str, Any]]): List of raw data dictionaries containing text and metadata.

**Returns:**

List\[Dict\[str, Any]]: List of constructed examples with QA pairs
and metadata.

<a id="camel.datagen.source2synth.data_processor.ExampleConstructor._preprocess_text" />

### \_preprocess\_text

```python theme={"system"}
def _preprocess_text(self, text: str):
```

Preprocess input text for example construction.

**Parameters:**

* **text** (str): Input text to preprocess.

**Returns:**

str: Preprocessed text, or empty string if text fails quality
checks.

<a id="camel.datagen.source2synth.data_processor.ExampleConstructor._check_text_quality" />

### \_check\_text\_quality

```python theme={"system"}
def _check_text_quality(self, text: str):
```

Check the quality of input text.

**Parameters:**

* **text** (str): Text to check quality for.

**Returns:**

bool: True if text passes quality checks, False otherwise.

<a id="camel.datagen.source2synth.data_processor.ExampleConstructor._extract_info_pairs" />

### \_extract\_info\_pairs

```python theme={"system"}
def _extract_info_pairs(self, text: str):
```

Extract information pairs and relationships from text.

**Parameters:**

* **text** (str): Input text to extract information from.

**Returns:**

List\[Dict\[str, Sequence\[str]]]: List of dictionaries containing
premise, intermediate, conclusion, and related contexts.

<a id="camel.datagen.source2synth.data_processor.ExampleConstructor._generate_qa_pairs" />

### \_generate\_qa\_pairs

```python theme={"system"}
def _generate_qa_pairs(self, info_pairs: List[Dict[str, Sequence[str]]]):
```

Generate multi-hop question-answer pairs from information pairs.

**Parameters:**

* **info\_pairs** (List\[Dict\[str, Sequence\[str]]]): List of information pairs extracted from text.

**Returns:**

List\[Dict\[str, str]]: List of generated QA pairs.

<a id="camel.datagen.source2synth.data_processor.ExampleConstructor._calculate_complexity" />

### \_calculate\_complexity

```python theme={"system"}
def _calculate_complexity(self, qa_pairs: List[Dict[str, Any]]):
```

Calculate the complexity score for a set of QA pairs.

**Parameters:**

* **qa\_pairs** (List\[Dict\[str, Any]]): List of QA pairs to calculate complexity for.

**Returns:**

float: Complexity score between 0.0 and 1.0.

<a id="camel.datagen.source2synth.data_processor.DataCurator" />

## DataCurator

```python theme={"system"}
class DataCurator:
```

Manages and curates datasets of multi-hop question-answer pairs.

This class handles dataset management tasks including quality filtering,
complexity filtering, deduplication, and dataset sampling.

**Parameters:**

* **config** (ProcessorConfig): Configuration for data curation parameters.
* **rng** (random.Random): Random number generator for reproducible sampling.

<a id="camel.datagen.source2synth.data_processor.DataCurator.__init__" />

### **init**

```python theme={"system"}
def __init__(self, config: ProcessorConfig, rng: random.Random):
```

Initialize the DataCurator.

**Parameters:**

* **config** (ProcessorConfig): Configuration for data curation.
* **rng** (random.Random): Random number generator for reproducibility.

<a id="camel.datagen.source2synth.data_processor.DataCurator.curate_dataset" />

### curate\_dataset

```python theme={"system"}
def curate_dataset(self, examples: List[Dict[str, Any]]):
```

Manage and curate a dataset through multiple filtering stages.

**Parameters:**

* **examples** (List\[Dict\[str, Any]]): List of examples to curate.

**Returns:**

List\[Dict\[str, Any]]: Curated dataset meeting quality criteria.

<a id="camel.datagen.source2synth.data_processor.DataCurator._quality_filter" />

### \_quality\_filter

```python theme={"system"}
def _quality_filter(self, examples: List[Dict[str, Any]]):
```

Filter examples based on quality criteria.

**Parameters:**

* **examples** (List\[Dict\[str, Any]]): List of examples to filter.

**Returns:**

List\[Dict\[str, Any]]: Examples that pass quality checks.

<a id="camel.datagen.source2synth.data_processor.DataCurator._check_qa_quality" />

### \_check\_qa\_quality

```python theme={"system"}
def _check_qa_quality(self, qa_pairs: List[Dict[str, str]]):
```

Check the quality of question-answer pairs.

**Parameters:**

* **qa\_pairs** (List\[Dict\[str, str]]): List of QA pairs to check.

**Returns:**

bool: True if QA pairs meet quality criteria, False otherwise.

<a id="camel.datagen.source2synth.data_processor.DataCurator._complexity_filter" />

### \_complexity\_filter

```python theme={"system"}
def _complexity_filter(self, examples: List[Dict[str, Any]]):
```

Filter examples based on complexity threshold.

Removes examples with complexity scores below the configured threshold.

**Parameters:**

* **examples** (List\[Dict\[str, Any]]): List of examples to filter.

**Returns:**

List\[Dict\[str, Any]]: Examples meeting complexity threshold.

<a id="camel.datagen.source2synth.data_processor.DataCurator._remove_duplicates" />

### \_remove\_duplicates

```python theme={"system"}
def _remove_duplicates(self, examples: List[Dict[str, Any]]):
```

Remove duplicate examples from the dataset.

**Parameters:**

* **examples** (List\[Dict\[str, Any]]): List of examples to deduplicate.

**Returns:**

List\[Dict\[str, Any]]: Deduplicated examples.

<a id="camel.datagen.source2synth.data_processor.DataCurator._sample_dataset" />

### \_sample\_dataset

```python theme={"system"}
def _sample_dataset(self, examples: List[Dict[str, Any]]):
```

Sample examples to match target dataset size.

**Parameters:**

* **examples** (List\[Dict\[str, Any]]): List of examples to sample from.

**Returns:**

List\[Dict\[str, Any]]: Sampled dataset of target size or smaller.
