camel.datasets package#
Submodules#
camel.datasets.base module#
Module contents#
- class camel.datasets.BaseGenerator(seed: int = 42, buffer: int = 20, cache: str | Path | None = None, data_path: str | Path | None = None, **kwargs)[source]#
Bases:
ABC, IterableDataset
Abstract base class for data generators.
This class defines the interface for generating synthetic datapoints. Concrete implementations should provide specific generation strategies.
- async async_sample() DataPoint [source]#
Returns the next datapoint from the current dataset asynchronously.
- Returns:
The next datapoint.
- Return type:
DataPoint
Note
This method is intended for asynchronous contexts. Use `sample()` in synchronous contexts.
- flush(file_path: str | Path) None [source]#
Flush the current data to a JSONL file and clear the data.
- Parameters:
file_path (Union[str, Path]) – Path to save the JSONL file.
Notes
Uses save_to_jsonl to save self._data.
- abstract async generate_new(n: int, **kwargs) None [source]#
Generate n new datapoints and append them to self._data.
Subclass implementations must generate the specified number of datapoints and append them directly to the self._data list. This method should not return the datapoints; the iterator relies on self._data being populated.
- Parameters:
n (int) – Number of datapoints to generate and append.
**kwargs – Additional generation parameters.
- Returns:
This method should not return anything.
- Return type:
None
Example
```python
async def generate_new(self, n: int, **kwargs) -> None:
    new_points = [DataPoint(...) for _ in range(n)]
    self._data.extend(new_points)
```
- sample() DataPoint [source]#
Returns the next datapoint from the current dataset synchronously.
- Raises:
RuntimeError β If called in an async context.
- Returns:
The next DataPoint.
- Return type:
DataPoint
Note
This method is intended for synchronous contexts. Use `async_sample()` in asynchronous contexts to avoid blocking or runtime errors.
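The sync/async split above can be illustrated with a minimal stand-in class (not camel's implementation; `TinyGenerator` and its buffer are hypothetical), showing one way a synchronous `sample()` can detect a running event loop and raise the documented RuntimeError:

```python
import asyncio
from collections import deque

class TinyGenerator:
    """Illustrative stand-in for BaseGenerator's sync/async sampling."""

    def __init__(self, items):
        self._buffer = deque(items)

    def sample(self):
        # Guard: refuse to run synchronously inside an async context,
        # mirroring the RuntimeError documented for sample().
        try:
            asyncio.get_running_loop()
        except RuntimeError:
            pass  # no running loop: safe to proceed synchronously
        else:
            raise RuntimeError("Use async_sample() in async contexts.")
        return self._buffer.popleft()

    async def async_sample(self):
        return self._buffer.popleft()

gen = TinyGenerator(["dp1", "dp2"])
print(gen.sample())                     # dp1
print(asyncio.run(gen.async_sample()))  # dp2
```

Calling `gen.sample()` from inside a coroutine would raise RuntimeError, matching the behavior documented for `sample()` above.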
- save_to_jsonl(file_path: str | Path) None [source]#
Saves the generated datapoints to a JSONL (JSON Lines) file.
Each datapoint is stored as a separate JSON object on a new line.
- Parameters:
file_path (Union[str, Path]) – Path to save the JSONL file.
- Raises:
ValueError – If no datapoints have been generated.
IOError – If there is an issue writing to the file.
Notes
Uses self._data, which contains the generated datapoints.
Appends to the file if it already exists.
Ensures compatibility with large datasets by using JSONL format.
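The JSONL behavior described above (one JSON object per line, append if the file exists, ValueError when empty) can be sketched with the standard library alone; this standalone `save_to_jsonl` function is an assumption-level sketch, not camel's actual implementation:

```python
import json
import tempfile
from pathlib import Path

def save_to_jsonl(datapoints, file_path):
    """Append each datapoint as one JSON object per line (JSONL)."""
    if not datapoints:
        raise ValueError("No datapoints have been generated.")
    with Path(file_path).open("a", encoding="utf-8") as f:
        for dp in datapoints:
            f.write(json.dumps(dp) + "\n")

out = Path(tempfile.mkdtemp()) / "data.jsonl"
save_to_jsonl([{"question": "1+1?", "final_answer": "2"}], out)
save_to_jsonl([{"question": "2*3?", "final_answer": "6"}], out)  # appends
print(out.read_text().count("\n"))  # 2
```

Because each record is a self-contained line, large datasets can be written and read incrementally without loading the whole file into memory.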
- class camel.datasets.DataPoint(*, question: str, final_answer: str, rationale: str | None = None, metadata: Dict[str, Any] | None = None)[source]#
Bases:
BaseModel
A single data point in the dataset.
- question#
The primary question or issue to be addressed.
- Type:
str
- final_answer#
The final answer.
- Type:
str
- rationale#
Logical reasoning or explanation behind the answer. (default: None)
- Type:
Optional[str]
- metadata#
Additional metadata about the data point. (default: None)
- Type:
Optional[Dict[str, Any]]
- final_answer: str#
- classmethod from_dict(data: Dict[str, Any]) DataPoint [source]#
Create a DataPoint from a dictionary.
- Parameters:
data (Dict[str, Any]) – Dictionary containing DataPoint fields.
- Returns:
New DataPoint instance.
- Return type:
DataPoint
- metadata: Dict[str, Any] | None#
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- question: str#
- rationale: str | None#
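The field layout and `from_dict` validation above can be mirrored with a plain dataclass; this `SimpleDataPoint` is a hypothetical stdlib-only analogue of the pydantic-based DataPoint, shown only to illustrate the schema:

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

@dataclass
class SimpleDataPoint:
    """Plain-dataclass analogue of camel.datasets.DataPoint."""
    question: str
    final_answer: str
    rationale: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SimpleDataPoint":
        # Reject dictionaries missing the required fields, roughly
        # mirroring pydantic's validation of the real DataPoint.
        for key in ("question", "final_answer"):
            if key not in data:
                raise ValueError(f"missing required field: {key}")
        return cls(**data)

dp = SimpleDataPoint.from_dict(
    {"question": "What is 2 + 3?", "final_answer": "5"}
)
print(dp.final_answer)  # 5
```

The real class, being a pydantic BaseModel, additionally enforces field types and raises ValidationError rather than ValueError on bad input.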
- class camel.datasets.FewShotGenerator(seed_dataset: StaticDataset, verifier: BaseVerifier, model: BaseModelBackend, seed: int = 42, **kwargs)[source]#
Bases:
BaseGenerator
A generator for creating synthetic datapoints using few-shot learning.
This class leverages a seed dataset, an agent, and a verifier to generate new synthetic datapoints on demand through few-shot prompting.
- async generate_new(n: int, max_retries: int = 10, num_examples: int = 3, **kwargs) None [source]#
Generates and validates n new datapoints through few-shot prompting, with a retry limit.
- Steps:
1. Samples examples from the seed dataset.
2. Constructs a prompt using the selected examples.
3. Uses an agent to generate a new datapoint, consisting of a question and code to solve the question.
4. Executes the code using a verifier to obtain pseudo ground truth.
5. Stores valid datapoints in memory.
- Parameters:
n (int) – Number of valid datapoints to generate.
max_retries (int) – Maximum number of retries before stopping. (default: 10)
num_examples (int) – Number of examples to sample from the seed dataset for few-shot prompting. (default: 3)
**kwargs – Additional generation parameters.
- Returns:
None. The generated datapoints are appended to self._data rather than returned.
- Return type:
None
- Raises:
TypeError – If the agent's output is not a dictionary (or does not match the expected format).
KeyError – If required keys are missing from the response.
AttributeError – If the verifier response lacks expected attributes.
ValidationError – If a datapoint fails schema validation.
RuntimeError – If retries are exhausted before n valid datapoints are generated.
Notes
- Retries on validation failures until n valid datapoints exist or max_retries is reached, whichever comes first.
- If retries are exhausted before reaching n, a RuntimeError is raised.
- Metadata includes a timestamp for tracking datapoint creation.
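The retry discipline described in the notes above can be sketched generically; `generate_until`, `make_candidate`, and `is_valid` are hypothetical names standing in for the agent call and verifier check, and this is not camel's code:

```python
from itertools import count

def generate_until(n, max_retries, make_candidate, is_valid):
    """Collect n valid items, counting failed attempts against max_retries."""
    accepted, retries = [], 0
    while len(accepted) < n:
        if retries >= max_retries:
            raise RuntimeError(
                f"Exhausted {max_retries} retries with "
                f"{len(accepted)}/{n} valid datapoints."
            )
        candidate = make_candidate()
        if is_valid(candidate):
            accepted.append(candidate)
        else:
            retries += 1
    return accepted

counter = count(1)
points = generate_until(
    n=3,
    max_retries=10,
    make_candidate=lambda: next(counter),
    is_valid=lambda x: x % 2 == 0,  # stand-in for verifier approval
)
print(points)  # [2, 4, 6]
```

Only failed validations consume the retry budget here, matching the documented "retries on validation failures" behavior; a RuntimeError fires once the budget is spent before n items are accepted.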
- class camel.datasets.SelfInstructGenerator(seed_dataset: StaticDataset, verifier: BaseVerifier, instruction_agent: ChatAgent | None = None, rationale_agent: ChatAgent | None = None, seed: int = 42, **kwargs)[source]#
Bases:
BaseGenerator
A generator for creating synthetic datapoints using self-instruct.
It utilizes both a human-provided dataset (seed_dataset) and generated machine instructions (machine_instructions) to produce new, synthetic datapoints that include a question, a computed rationale (code), and a final answer (from a verifier).
- class QuestionSchema(*, question: str)[source]#
Bases:
BaseModel
Schema for the generated question.
- question#
The question generated by the model.
- Type:
str
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- question: str#
- class RationaleSchema(*, code: str)[source]#
Bases:
BaseModel
Schema for the generated rationale code.
- code#
The generated code without any formatting.
- Type:
str
- code: str#
- model_config: ClassVar[ConfigDict] = {}#
Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].
- default_instruction_agent() ChatAgent [source]#
Create the default instruction generation agent.
This agent is configured with a moderate temperature setting to encourage creative and diverse instruction generation behavior.
- Returns:
An agent with the default instruction prompt.
- Return type:
ChatAgent
- default_rationale_agent() ChatAgent [source]#
Create the default rationale generation agent.
This agent is configured with a deterministic (zero temperature) setting to ensure consistent and precise rationale generation based on a given instruction and package list.
- Returns:
An agent with the default rationale prompt.
- Return type:
ChatAgent
- static format_support_block(dp: DataPoint) str [source]#
Format a DataPoint into a few-shot example block.
- Parameters:
dp (DataPoint) – A data point.
- Returns:
A formatted string containing the question and its corresponding code block in Markdown-style Python format.
- Return type:
str
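The shape of such a few-shot block can be sketched as follows; this standalone function takes raw strings rather than a DataPoint, and the exact labels and layout are assumptions, not the method's actual output format:

```python
def format_support_block(question: str, code: str) -> str:
    """Render a question and its solution code as a Markdown example block."""
    fence = "`" * 3  # build the code fence without a literal triple backtick
    return f"Question: {question}\n{fence}python\n{code}\n{fence}"

block = format_support_block("What is 2 + 2?", "print(2 + 2)")
print(block)
```

Blocks like this are concatenated into the few-shot portion of the generation prompt, giving the model concrete question/solution pairs to imitate.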
- async generate_new(n: int, max_retries: int = 10, human_sample_count: int = 3, machine_sample_count: int = 1, **kwargs) None [source]#
Generates and validates n new datapoints through self-instruct prompting, with a retry limit.
- Parameters:
n (int) – The number of valid datapoints to generate.
max_retries (int) – Maximum number of retries before stopping. (default: 10)
human_sample_count (int) – Number of human examples to sample. (default: 3)
machine_sample_count (int) – Number of machine examples to sample. (default: 1)
**kwargs – Additional keyword arguments.
Notes
- Retries on validation failures until n valid datapoints exist or max_retries is reached, whichever comes first.
- If retries are exhausted before reaching n, a RuntimeError is raised.
- Metadata includes a timestamp for tracking datapoint creation.
- generate_new_instruction(agent: ChatAgent, support_human_dps: list[DataPoint], support_machine_dps: list[DataPoint]) str [source]#
Generate a new instruction using self-instruct prompting.
- generate_rationale(question: str, agent: ChatAgent | None = None, support_human_dps: list[DataPoint] | None = None) str [source]#
Generate rationale code (solution) for the given question.
- Parameters:
question (str) – The question to be solved.
agent (Optional[ChatAgent]) – The agent to use for generating the rationale. If None is provided, the default rationale agent will be used. (default: None)
support_human_dps (Optional[list[DataPoint]]) – List of human examples to sample. (default: None)
- Returns:
The generated code solution as a string.
- Return type:
str
- class camel.datasets.StaticDataset(data: Dataset | Dataset | Path | List[Dict[str, Any]], seed: int = 42, min_samples: int = 1, strict: bool = False, **kwargs)[source]#
Bases:
Dataset
A static dataset containing a list of datapoints. Ensures that all items adhere to the DataPoint schema. This dataset extends
Dataset
from PyTorch and should be used when its size is fixed at runtime. This class can initialize from Hugging Face Datasets, PyTorch Datasets, JSON file paths, or lists of dictionaries, converting them into a consistent internal format.
- property metadata: Dict[str, Any]#
Retrieve dataset metadata.
- Returns:
A copy of the dataset metadata dictionary.
- Return type:
Dict[str, Any]
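The validation that StaticDataset performs at construction time can be illustrated with a minimal sketch over a list of dictionaries; `TinyStaticDataset` and its checks are hypothetical simplifications of the real class, which also handles Hugging Face and PyTorch inputs:

```python
import random
from typing import Any, Dict, List

class TinyStaticDataset:
    """Minimal sketch of StaticDataset's validation over a list of dicts."""

    REQUIRED = ("question", "final_answer")

    def __init__(self, data: List[Dict[str, Any]], seed: int = 42,
                 min_samples: int = 1):
        if len(data) < min_samples:
            raise ValueError(
                f"Need at least {min_samples} samples, got {len(data)}."
            )
        for i, item in enumerate(data):
            missing = [k for k in self.REQUIRED if k not in item]
            if missing:
                raise ValueError(f"Item {i} is missing fields: {missing}")
        self._data = list(data)
        self._rng = random.Random(seed)  # seeded for reproducible sampling

    def __len__(self) -> int:
        return len(self._data)

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        return self._data[idx]

ds = TinyStaticDataset(
    [{"question": "1+1?", "final_answer": "2"},
     {"question": "2*3?", "final_answer": "6"}]
)
print(len(ds))  # 2
```

Validating every item up front means downstream generators can sample from the dataset without re-checking the DataPoint schema on each access.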