camel.datasets package#
Submodules#
camel.datasets.base module#
Module contents#
- class camel.datasets.BaseGenerator(seed: int = 42, cache: str | Path | None = None, data_path: str | Path | None = None, **kwargs)[source]#
Bases: ABC, IterableDataset
Abstract base class for data generators.
This class defines the interface for generating synthetic datapoints. Concrete implementations should provide specific generation strategies.
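A minimal subclass sketch (hypothetical class and names; it assumes self._data is the internal list that backs sample() and save_to_jsonl(), as the notes on those methods suggest):
```python
import random
from typing import List

from camel.datasets import BaseGenerator, DataPoint


class ArithmeticGenerator(BaseGenerator):
    # Hypothetical generator producing simple addition problems.
    async def generate_new(self, n: int, **kwargs) -> List[DataPoint]:
        points: List[DataPoint] = []
        for _ in range(n):
            a, b = random.randint(0, 99), random.randint(0, 99)
            points.append(
                DataPoint(question=f"What is {a} + {b}?", final_answer=str(a + b))
            )
        # Assumption: _data is the buffer consumed by sample() and
        # save_to_jsonl(); see the notes on those methods below.
        self._data.extend(points)
        return points
```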
- async async_sample() DataPoint [source]#
Returns the next datapoint from the current dataset asynchronously.
- Returns:
The next datapoint.
- Return type:
DataPoint
Note
This method is intended for asynchronous contexts. Use ‘sample’ in synchronous contexts.
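For example, in an async context (continuing the hypothetical ArithmeticGenerator sketch above, and assuming the internal buffer must be populated first):
```python
import asyncio


async def main() -> None:
    generator = ArithmeticGenerator()
    await generator.generate_new(3)  # populate the internal buffer
    dp = await generator.async_sample()
    print(dp.question, "->", dp.final_answer)


asyncio.run(main())
```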
- flush(file_path: str | Path) None [source]#
Flush the current data to a JSONL file and clear the data.
- Parameters:
file_path (Union[str, Path]) – Path to save the JSONL file.
Notes
Uses save_to_jsonl to save self._data.
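For example:
```python
# Writes the buffered datapoints to disk, then clears the buffer.
generator.flush("checkpoint.jsonl")
```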
- abstract async generate_new(n: int, **kwargs) List[DataPoint] [source]#
Generate n new datapoints.
- Parameters:
n (int) – Number of datapoints to generate.
**kwargs – Additional generation parameters.
- Returns:
A list of newly generated datapoints.
- Return type:
List[DataPoint]
- sample() DataPoint [source]#
Returns the next datapoint from the current dataset synchronously.
- Raises:
RuntimeError – If called in an async context.
- Returns:
The next DataPoint.
- Return type:
DataPoint
Note
This method is intended for synchronous contexts. Use ‘async_sample’ in asynchronous contexts to avoid blocking or runtime errors.
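In plain synchronous code, for example:
```python
# Safe outside a running event loop; raises RuntimeError inside one.
dp = generator.sample()
```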
- save_to_jsonl(file_path: str | Path) None [source]#
Saves the generated datapoints to a JSONL (JSON Lines) file.
Each datapoint is stored as a separate JSON object on a new line.
- Parameters:
file_path (Union[str, Path]) – Path to save the JSONL file.
- Raises:
ValueError – If no datapoints have been generated.
IOError – If there is an issue writing to the file.
Notes
Uses self._data, which contains the generated datapoints.
Appends to the file if it already exists.
Ensures compatibility with large datasets by using JSONL format.
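For example:
```python
from pathlib import Path

# Each buffered DataPoint becomes one JSON object per line. Existing
# files are appended to, so remove stale files for a fresh copy.
generator.save_to_jsonl(Path("datapoints.jsonl"))
```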
- class camel.datasets.DataPoint(*, question: str, final_answer: str, rationale: str | None = None, metadata: Dict[str, Any] | None = None)[source]#
Bases: BaseModel
A single data point in the dataset.
- question#
The primary question or issue to be addressed.
- Type:
str
- final_answer#
The final answer.
- Type:
str
- rationale#
Logical reasoning or explanation behind the answer. (default: None)
- Type:
Optional[str]
- metadata#
Additional metadata about the data point. (default: None)
- Type:
Optional[Dict[str, Any]]
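For example, constructing a datapoint directly (illustrative values):
```python
from camel.datasets import DataPoint

dp = DataPoint(
    question="What is 2 + 2?",
    final_answer="4",
    rationale="2 + 2 = 4 by basic addition.",
    metadata={"source": "handwritten"},
)
```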
- final_answer: str#
- classmethod from_dict(data: Dict[str, Any]) DataPoint [source]#
Create a DataPoint from a dictionary.
- Parameters:
data (Dict[str, Any]) – Dictionary containing DataPoint fields.
- Returns:
New DataPoint instance.
- Return type:
DataPoint
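For example:
```python
# Only question and final_answer are required; other fields default to None.
dp = DataPoint.from_dict({"question": "What is 2 + 2?", "final_answer": "4"})
```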
- metadata: Dict[str, Any] | None#
- model_computed_fields: ClassVar[Dict[str, ComputedFieldInfo]] = {}#
A dictionary of computed field names and their corresponding ComputedFieldInfo objects.
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- model_fields: ClassVar[Dict[str, FieldInfo]] = {'final_answer': FieldInfo(annotation=str, required=True, description='The final answer.'), 'metadata': FieldInfo(annotation=Union[Dict[str, Any], NoneType], required=False, default=None, description='Additional metadata about the data point.'), 'question': FieldInfo(annotation=str, required=True, description='The primary question or issue to be addressed.'), 'rationale': FieldInfo(annotation=Union[str, NoneType], required=False, default=None, description='Logical reasoning or explanation behind the answer.')}#
Metadata about the fields defined on the model, mapping of field names to [FieldInfo][pydantic.fields.FieldInfo] objects.
This replaces Model.__fields__ from Pydantic V1.
- question: str#
- rationale: str | None#
- class camel.datasets.FewShotGenerator(seed_dataset: StaticDataset, verifier: BaseVerifier, model: BaseModelBackend, seed: int = 42, **kwargs)[source]#
Bases: BaseGenerator
A generator for creating synthetic datapoints using few-shot learning.
This class leverages a seed dataset, an agent, and a verifier to generate new synthetic datapoints on demand through few-shot prompting.
- async generate_new(n: int, max_retries: int = 10, num_examples: int = 3, **kwargs) List[DataPoint] [source]#
Generates and validates n new datapoints through few-shot prompting, with a retry limit.
- Steps:
1. Samples examples from the seed dataset.
2. Constructs a prompt using the selected examples.
3. Uses an agent to generate a new datapoint, consisting of a question and code to solve the question.
4. Executes the code using a verifier to obtain a pseudo ground truth.
5. Stores valid datapoints in memory.
- Parameters:
n (int) – Number of valid datapoints to generate.
max_retries (int) – Maximum number of retries before stopping. (default: 10)
num_examples (int) – Number of examples to sample from the seed dataset for few-shot prompting. (default: 3)
**kwargs – Additional generation parameters.
- Returns:
A list of newly generated valid datapoints.
- Return type:
List[DataPoint]
- Raises:
TypeError – If the agent’s output is not a dictionary (or does not match the expected format).
KeyError – If required keys are missing from the response.
AttributeError – If the verifier response lacks the expected attributes.
ValidationError – If a datapoint fails schema validation.
RuntimeError – If retries are exhausted before n valid datapoints are generated.
Notes
- Retries on validation failures until n valid datapoints exist or max_retries is reached, whichever comes first.
- If retries are exhausted before reaching n, a RuntimeError is raised.
- Metadata includes a timestamp for tracking datapoint creation.
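A usage sketch under stated assumptions: my_verifier and my_model are placeholders for concrete BaseVerifier and BaseModelBackend instances, which this page does not construct.
```python
import asyncio

from camel.datasets import FewShotGenerator, StaticDataset

seed = StaticDataset([
    {"question": "What is 1 + 1?", "final_answer": "2"},
    {"question": "What is 3 * 3?", "final_answer": "9"},
    {"question": "What is 10 - 4?", "final_answer": "6"},
])
# my_verifier / my_model: hypothetical placeholders for concrete
# BaseVerifier and BaseModelBackend instances.
generator = FewShotGenerator(
    seed_dataset=seed, verifier=my_verifier, model=my_model
)
new_points = asyncio.run(generator.generate_new(n=5, num_examples=3))
```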
- class camel.datasets.StaticDataset(data: datasets.Dataset | torch.utils.data.Dataset | Path | List[Dict[str, Any]], seed: int = 42, min_samples: int = 1, strict: bool = False, **kwargs)[source]#
Bases: Dataset
A static dataset containing a list of datapoints. Ensures that all items adhere to the DataPoint schema. This dataset extends Dataset from PyTorch and should be used when its size is fixed at runtime.
This class can initialize from Hugging Face Datasets, PyTorch Datasets, JSON file paths, or lists of dictionaries, converting them into a consistent internal format.
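For example, from a list of dictionaries (and assuming the usual map-style Dataset protocol of __len__ and __getitem__):
```python
from camel.datasets import StaticDataset

dataset = StaticDataset(
    [
        {"question": "What is 2 + 2?", "final_answer": "4"},
        {"question": "What is the capital of France?", "final_answer": "Paris"},
    ],
    seed=0,
)
# Items are validated against the DataPoint schema on construction.
print(len(dataset), dataset[0].question)
```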
- property metadata: Dict[str, Any]#
Retrieve dataset metadata.
- Returns:
A copy of the dataset metadata dictionary.
- Return type:
Dict[str, Any]
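Since a copy is returned, callers can mutate it without affecting the dataset:
```python
info = dataset.metadata
info["note"] = "local edit"  # mutates only the returned copy
```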