camel.datasets package#

Submodules#

camel.datasets.base module#

Module contents#

class camel.datasets.BaseGenerator(seed: int = 42, buffer: int = 20, cache: str | Path | None = None, data_path: str | Path | None = None, **kwargs)[source]#

Bases: ABC, IterableDataset

Abstract base class for data generators.

This class defines the interface for generating synthetic datapoints. Concrete implementations should provide specific generation strategies.

async async_sample() DataPoint[source]#

Returns the next datapoint from the current dataset asynchronously.

Returns:

The next datapoint.

Return type:

DataPoint

Note

This method is intended for asynchronous contexts. Use sample() in synchronous contexts.

flush(file_path: str | Path) None[source]#

Flush the current data to a JSONL file and clear the data.

Parameters:

file_path (Union[str, Path]) – Path to save the JSONL file.

Notes

  • Uses save_to_jsonl to save self._data.

abstract async generate_new(n: int, **kwargs) None[source]#

Generate n new datapoints and append them to self._data.

Subclass implementations must generate the specified number of datapoints and append them directly to the self._data list. This method should not return the datapoints; the iterator relies on self._data being populated.

Parameters:
  • n (int) – Number of datapoints to generate and append.

  • **kwargs – Additional generation parameters.

Returns:

This method should not return anything.

Return type:

None

Example

```python
async def generate_new(self, n: int, **kwargs) -> None:
    new_points = [DataPoint(...) for _ in range(n)]
    self._data.extend(new_points)
```

sample() DataPoint[source]#

Returns the next datapoint from the current dataset synchronously.

Raises:

RuntimeError – If called in an async context.

Returns:

The next DataPoint.

Return type:

DataPoint

Note

This method is intended for synchronous contexts. Use async_sample() in asynchronous contexts to avoid blocking or runtime errors.
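The synchronous/asynchronous split above can be sketched with a guard that refuses to run inside an active event loop. This is a hypothetical illustration of the pattern, not the library's actual implementation:

```python
import asyncio

def sample_sync():
    # Hypothetical guard: if an event loop is already running, the caller
    # is in an async context and should use async_sample() instead.
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return "datapoint"  # no running loop: safe to sample synchronously
    raise RuntimeError(
        "sample() called inside an async context; use async_sample() instead"
    )
```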

save_to_jsonl(file_path: str | Path) None[source]#

Saves the generated datapoints to a JSONL (JSON Lines) file.

Each datapoint is stored as a separate JSON object on a new line.

Parameters:

file_path (Union[str, Path]) – Path to save the JSONL file.

Raises:
  • ValueError – If no datapoints have been generated.

  • IOError – If there is an issue writing to the file.

Notes

  • Uses self._data, which contains the generated datapoints.

  • Appends to the file if it already exists.

  • Ensures compatibility with large datasets by using JSONL format.
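The behavior described in the notes can be sketched with the standard library alone. This is a hypothetical helper mirroring the documented contract (one JSON object per line, append if the file exists, error on empty data), not the library's code:

```python
import json
from pathlib import Path

def save_jsonl(datapoints, file_path):
    # Hypothetical sketch: write each datapoint as one JSON object per line
    # (JSONL), appending to the file if it already exists.
    if not datapoints:
        raise ValueError("No datapoints to save.")
    with Path(file_path).open("a", encoding="utf-8") as f:
        for dp in datapoints:
            f.write(json.dumps(dp) + "\n")
```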

class camel.datasets.DataPoint(*, question: str, final_answer: str, rationale: str | None = None, metadata: Dict[str, Any] | None = None)[source]#

Bases: BaseModel

A single data point in the dataset.

question#

The primary question or issue to be addressed.

Type:

str

final_answer#

The final answer.

Type:

str

rationale#

Logical reasoning or explanation behind the answer. (default: None)

Type:

Optional[str]

metadata#

Additional metadata about the data point. (default: None)

Type:

Optional[Dict[str, Any]]

final_answer: str#
classmethod from_dict(data: Dict[str, Any]) DataPoint[source]#

Create a DataPoint from a dictionary.

Parameters:

data (Dict[str, Any]) – Dictionary containing DataPoint fields.

Returns:

New DataPoint instance.

Return type:

DataPoint

metadata: Dict[str, Any] | None#
model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

question: str#
rationale: str | None#
to_dict() Dict[str, Any][source]#

Convert DataPoint to a dictionary.

Returns:

Dictionary representation of the DataPoint.

Return type:

Dict[str, Any]
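The to_dict/from_dict round trip can be illustrated with a simplified stand-in. The real DataPoint is a pydantic BaseModel with validation; this plain-dataclass sketch only mirrors the documented field names:

```python
from dataclasses import dataclass, asdict
from typing import Any, Dict, Optional

@dataclass
class DataPointSketch:
    # Simplified stand-in for camel.datasets.DataPoint; field names and
    # defaults follow the documented schema.
    question: str
    final_answer: str
    rationale: Optional[str] = None
    metadata: Optional[Dict[str, Any]] = None

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "DataPointSketch":
        # Build an instance from a dictionary of fields.
        return cls(**data)

    def to_dict(self) -> Dict[str, Any]:
        # Convert back to a plain dictionary representation.
        return asdict(self)
```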

class camel.datasets.FewShotGenerator(seed_dataset: StaticDataset, verifier: BaseVerifier, model: BaseModelBackend, seed: int = 42, **kwargs)[source]#

Bases: BaseGenerator

A generator for creating synthetic datapoints using few-shot learning.

This class leverages a seed dataset, an agent, and a verifier to generate new synthetic datapoints on demand through few-shot prompting.

async generate_new(n: int, max_retries: int = 10, num_examples: int = 3, **kwargs) None[source]#

Generates and validates n new datapoints through few-shot prompting, with a retry limit.

Steps:
  1. Samples examples from the seed dataset.

  2. Constructs a prompt using the selected examples.

  3. Uses an agent to generate a new datapoint, consisting of a question and code to solve the question.

  4. Executes the code using a verifier to get a pseudo ground truth.

  5. Stores valid datapoints in memory.

Parameters:
  • n (int) – Number of valid datapoints to generate.

  • max_retries (int) – Maximum number of retries before stopping. (default: 10)

  • num_examples (int) – Number of examples to sample from the seed dataset for few-shot prompting. (default: 3)

  • **kwargs – Additional generation parameters.

Returns:

None. Valid datapoints are appended to self._data rather than returned.

Return type:

None

Raises:
  • TypeError – If the agent's output is not a dictionary (or does not match the expected format).

  • KeyError – If required keys are missing from the response.

  • AttributeError – If the verifier response lacks expected attributes.

  • ValidationError – If a datapoint fails schema validation.

  • RuntimeError – If retries are exhausted before n valid datapoints are generated.

Notes

  • Retries on validation failures until n valid datapoints exist or max_retries is reached, whichever comes first.

  • If retries are exhausted before reaching n, a RuntimeError is raised.

  • Metadata includes a timestamp for tracking datapoint creation.
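The retry policy described in the notes can be sketched generically. Here make_candidate and validate are hypothetical stand-ins for the agent's generation step and the verifier's check; the real generator also builds few-shot prompts and records metadata:

```python
def generate_with_retries(make_candidate, validate, n, max_retries=10):
    # Hypothetical sketch of the documented retry policy: keep producing
    # candidates until n pass validation or max_retries attempts are used.
    valid, attempts = [], 0
    while len(valid) < n:
        if attempts >= max_retries:
            raise RuntimeError("Retries exhausted before generating n datapoints.")
        attempts += 1
        candidate = make_candidate()
        if validate(candidate):
            valid.append(candidate)
    return valid
```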

class camel.datasets.SelfInstructGenerator(seed_dataset: StaticDataset, verifier: BaseVerifier, instruction_agent: ChatAgent | None = None, rationale_agent: ChatAgent | None = None, seed: int = 42, **kwargs)[source]#

Bases: BaseGenerator

A generator for creating synthetic datapoints using self-instruct.

It utilizes both a human-provided dataset (seed_dataset) and generated machine instructions (machine_instructions) to produce new, synthetic datapoints that include a question, a computed rationale (code), and a final answer (from a verifier).

class QuestionSchema(*, question: str)[source]#

Bases: BaseModel

Schema for the generated question.

question#

The question generated by the model.

Type:

str

model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

question: str#
class RationaleSchema(*, code: str)[source]#

Bases: BaseModel

Schema for the generated rationale code.

code#

The generated code without any formatting.

Type:

str

code: str#
model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

default_instruction_agent() ChatAgent[source]#

Create the default instruction generation agent.

This agent is configured with a moderate temperature setting to encourage creative and diverse instruction generation behavior.

Returns:

An agent with the default instruction prompt.

Return type:

ChatAgent

default_rationale_agent() ChatAgent[source]#

Create the default rationale generation agent.

This agent is configured with a deterministic (zero temperature) setting to ensure consistent and precise rationale generation based on a given instruction and package list.

Returns:

An agent with the default rationale prompt.

Return type:

ChatAgent

static format_support_block(dp: DataPoint) str[source]#

Format a DataPoint into a few-shot example block.

Parameters:

dp (DataPoint) – A data point.

Returns:

A formatted string containing the question and its corresponding code block in Markdown-style Python format.

Return type:

str
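A formatter of this shape might look as follows. The exact layout is an assumption for illustration; only the general idea (question plus a Markdown-fenced Python code block) comes from the description above:

```python
def format_support_block(question, code):
    # Hypothetical formatter: renders a question and its solution code as a
    # Markdown-style few-shot example block.
    return f"Question: {question}\nCode:\n```python\n{code}\n```"
```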

async generate_new(n: int, max_retries: int = 10, human_sample_count: int = 3, machine_sample_count: int = 1, **kwargs) None[source]#

Generates and validates n new datapoints through self-instruct prompting, with a retry limit.

Parameters:
  • n (int) – The number of valid datapoints to generate.

  • max_retries (int) – Maximum number of retries before stopping. (default: 10)

  • human_sample_count (int) – Number of human examples to sample. (default: 3)

  • machine_sample_count (int) – Number of machine examples to sample. (default: 1)

  • **kwargs – Additional keyword arguments.

Notes

  • Retries on validation failures until n valid datapoints exist or max_retries is reached, whichever comes first.

  • If retries are exhausted before reaching n, a RuntimeError is raised.

  • Metadata includes a timestamp for tracking datapoint creation.

generate_new_instruction(agent: ChatAgent, support_human_dps: list[DataPoint], support_machine_dps: list[DataPoint]) str[source]#

Generate a new instruction using self-instruct prompting.

Parameters:
  • agent (ChatAgent) – The agent to use for generating the instruction.

  • support_human_dps (list[DataPoint]) – List of human examples to sample.

  • support_machine_dps (list[DataPoint]) – List of machine examples to sample.

Returns:

The newly generated question.

Return type:

str
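Mixing human and machine examples into a single few-shot prompt might be assembled like this. The prompt wording is entirely hypothetical; only the idea of combining both example pools comes from the method description:

```python
def build_self_instruct_prompt(human_examples, machine_examples):
    # Hypothetical prompt assembly: list human-written and machine-generated
    # questions as few-shot examples, then ask for a new one in the same style.
    lines = ["Here are some example questions:"]
    for q in human_examples + machine_examples:
        lines.append(f"- {q}")
    lines.append("Write one new question in the same style:")
    return "\n".join(lines)
```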

generate_rationale(question: str, agent: ChatAgent | None = None, support_human_dps: list[DataPoint] | None = None) str[source]#

Generate rationale code (solution) for the given question.

Parameters:
  • question (str) – The question to be solved.

  • agent (Optional[ChatAgent]) – The agent to use for generating the rationale. If None is provided, the default rationale agent will be used. (default: None)

  • support_human_dps (Optional[list[DataPoint]]) – List of human examples to sample. (default: None)

Returns:

The generated code solution as a string.

Return type:

str

class camel.datasets.StaticDataset(data: Dataset | Dataset | Path | List[Dict[str, Any]], seed: int = 42, min_samples: int = 1, strict: bool = False, **kwargs)[source]#

Bases: Dataset

A static dataset containing a list of datapoints. Ensures that all items adhere to the DataPoint schema. This dataset extends Dataset from PyTorch and should be used when its size is fixed at runtime.

This class can initialize from Hugging Face Datasets, PyTorch Datasets, JSON file paths, or lists of dictionaries, converting them into a consistent internal format.
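For the list-of-dictionaries input path, the conversion into a consistent internal format can be sketched as a normalization pass. This is a hypothetical helper; the real class also accepts Hugging Face and PyTorch datasets and validates items against the full DataPoint schema:

```python
def normalize_records(records, strict=False):
    # Hypothetical normalization pass: keep only records carrying the
    # required DataPoint fields; in strict mode, raise on bad items instead
    # of skipping them.
    required = {"question", "final_answer"}
    out = []
    for r in records:
        if not required <= r.keys():
            if strict:
                raise ValueError(f"missing fields: {required - r.keys()}")
            continue
        out.append(r)
    return out
```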

property metadata: Dict[str, Any]#

Retrieve dataset metadata.

Returns:

A copy of the dataset metadata dictionary.

Return type:

Dict[str, Any]

sample() DataPoint[source]#

Sample a random datapoint from the dataset.

Returns:

A randomly sampled DataPoint.

Return type:

DataPoint

Raises:

RuntimeError – If the dataset is empty and no samples can be drawn.
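Seeded random sampling with an empty-dataset guard, as described above, can be sketched like this. The closure-based shape is an illustrative assumption; the real class is a PyTorch-style Dataset seeded via its constructor's seed parameter:

```python
import random

def make_sampler(data, seed=42):
    # Hypothetical sketch of StaticDataset.sample: a seeded RNG drawing
    # uniformly from a fixed list, raising if the dataset is empty.
    rng = random.Random(seed)

    def sample():
        if not data:
            raise RuntimeError("Dataset is empty, no samples can be drawn.")
        return data[rng.randrange(len(data))]

    return sample
```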