camel.datasets.base_generator
BaseGenerator
Abstract base class for data generators.
This class defines the interface for generating synthetic datapoints. Concrete implementations should provide specific generation strategies.
__init__
Initialize the base generator.
Parameters:
- seed (int): Random seed for reproducibility. (default: 42)
- buffer (int): Number of DataPoints to generate when the iterator runs out of DataPoints in data. (default: 20)
- cache (Union[str, Path, None]): Optional path to save generated datapoints during iteration. If None is provided, datapoints are discarded after every 100 yields.
- data_path (Union[str, Path, None]): Optional path to a JSONL file to initialize the dataset from.
- **kwargs: Additional generator parameters.
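As a rough illustration of what the seed parameter provides, the sketch below seeds a private random.Random instance, which is one common way to make generation repeatable. ToySeeded and its draw method are hypothetical names introduced for this example and are not part of camel; only the standard-library random module is used.

```python
import random


class ToySeeded:
    """Hypothetical stand-in, not the camel BaseGenerator."""

    def __init__(self, seed: int = 42):
        # A private RNG keeps the global random state untouched.
        self._rng = random.Random(seed)

    def draw(self, n: int) -> list:
        # Draw n pseudo-random integers from the seeded RNG.
        return [self._rng.randint(0, 999) for _ in range(n)]


first = ToySeeded(seed=42).draw(5)
second = ToySeeded(seed=42).draw(5)
print(first == second)  # two instances with the same seed agree
```

Two instances built with the same seed produce the same sequence, which is the property the seed parameter is meant to give.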
__aiter__
Async iterator that yields datapoints dynamically.
If a data_path was provided during initialization, those datapoints are yielded first. When self._data is empty, a new batch of buffer datapoints (20 by default) is generated. Every 100 yields, the batch is appended to the JSONL file given by cache, or discarded if cache is None.
Yields: DataPoint: A single datapoint.
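The refill-on-empty behaviour described above can be sketched with an async generator. This is a minimal stand-in under stated assumptions, not the camel implementation: ToyAsyncGenerator, its _generate helper, the refills counter, and the take function are all hypothetical, and datapoints are plain dicts rather than DataPoint objects.

```python
import asyncio


class ToyAsyncGenerator:
    """Hypothetical stand-in, not the camel BaseGenerator."""

    def __init__(self, initial=None, buffer: int = 20):
        self._data = list(initial or [])  # preloaded datapoints are served first
        self._buffer = buffer
        self._next_id = 0
        self.refills = 0

    async def _generate(self, n: int) -> list:
        # Placeholder for an awaitable generation step (for example a model call).
        await asyncio.sleep(0)
        batch = [{"id": self._next_id + i, "source": "generated"} for i in range(n)]
        self._next_id += n
        self.refills += 1
        return batch

    async def __aiter__(self):
        while True:
            if not self._data:
                # Refill only once the internal list is exhausted.
                self._data.extend(await self._generate(self._buffer))
            yield self._data.pop(0)


async def take(gen, n: int) -> list:
    # Collect the first n datapoints from an endless async iterator.
    out = []
    async for item in gen:
        out.append(item)
        if len(out) == n:
            break
    return out


preloaded = [{"id": -1, "source": "file"}, {"id": -2, "source": "file"}]
gen = ToyAsyncGenerator(initial=preloaded, buffer=20)
items = asyncio.run(take(gen, 25))
```

Because the iterator never terminates on its own, the consumer decides when to stop; here take breaks out after 25 datapoints.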
__iter__
Synchronous iterator for PyTorch IterableDataset compatibility.
If a data_path was provided during initialization, those datapoints are yielded first. When self._data is empty, a new batch of buffer datapoints (20 by default) is generated. Every 100 yields, the batch is appended to the JSONL file given by cache, or discarded if cache is None.
Yields: DataPoint: A single datapoint.
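To make the "every 100 yields" rule concrete, here is a small synchronous sketch that appends each completed batch to a cache file, or drops it when no cache is set. ToySyncGenerator, _generate, and flush_every are hypothetical names, and the exact point at which the real class writes its batch is an assumption.

```python
import itertools
import json
import tempfile
from pathlib import Path


class ToySyncGenerator:
    """Hypothetical stand-in, not the camel BaseGenerator."""

    def __init__(self, buffer: int = 20, cache=None, flush_every: int = 100):
        self._data = []
        self._buffer = buffer
        self._cache = Path(cache) if cache else None
        self._flush_every = flush_every
        self._next_id = 0

    def _generate(self, n: int) -> list:
        batch = [{"id": self._next_id + i} for i in range(n)]
        self._next_id += n
        return batch

    def __iter__(self):
        batch = []  # datapoints yielded since the last cache write
        while True:
            if not self._data:
                self._data.extend(self._generate(self._buffer))
            item = self._data.pop(0)
            batch.append(item)
            if len(batch) == self._flush_every:
                if self._cache is not None:
                    # Append so that earlier batches are preserved.
                    with self._cache.open("a", encoding="utf-8") as f:
                        for dp in batch:
                            f.write(json.dumps(dp) + "\n")
                batch = []  # the batch is dropped from memory either way
            yield item


with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "cache.jsonl"
    gen = ToySyncGenerator(cache=cache_path)
    taken = list(itertools.islice(gen, 250))
    cached_lines = cache_path.read_text(encoding="utf-8").splitlines()
```

Taking 250 datapoints completes two full batches, so 200 lines reach the cache file and the remaining 50 are still pending.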
sample
Returns:
DataPoint: The next DataPoint.
Note: This method is intended for synchronous contexts. Use async_sample in asynchronous contexts to avoid blocking or runtime errors.
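One way such a runtime error can arise is when a synchronous wrapper drives a coroutine with asyncio.run, which refuses to start inside an already running event loop. The sketch below assumes that design purely for illustration; ToySampler and call_from_coroutine are hypothetical, and the source does not state how sample is implemented.

```python
import asyncio


class ToySampler:
    """Hypothetical stand-in, not the camel BaseGenerator."""

    def __init__(self):
        self._counter = 0

    async def async_sample(self) -> dict:
        await asyncio.sleep(0)  # placeholder for real asynchronous work
        self._counter += 1
        return {"id": self._counter}

    def sample(self) -> dict:
        # Assumed design: the sync wrapper runs the coroutine to completion.
        coro = self.async_sample()
        try:
            return asyncio.run(coro)
        except RuntimeError:
            coro.close()  # avoid a "never awaited" warning before re-raising
            raise


async def call_from_coroutine(sampler: ToySampler) -> str:
    try:
        sampler.sample()  # wrong call in an async context
    except RuntimeError as exc:
        return type(exc).__name__
    return "ok"


sampler = ToySampler()
first = sampler.sample()                             # fine: no loop is running
result = asyncio.run(call_from_coroutine(sampler))   # fails inside a loop
second = asyncio.run(sampler.async_sample())         # the async path works
```

In this sketch the synchronous call succeeds at top level, raises RuntimeError when made from inside a coroutine, and the awaitable variant is the one that works under a running loop.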
save_to_jsonl
Saves the generated datapoints to a JSONL (JSON Lines) file.
Each datapoint is stored as a separate JSON object on a new line.
Parameters:
- file_path (Union[str, Path]): Path to save the JSONL file.
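The one-object-per-line layout can be sketched with the standard json module. save_to_jsonl_sketch is a hypothetical free function that takes the records explicitly; opening the file in append mode is an assumption based on the "appended" wording in the iterator descriptions, and records are plain dicts here.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Union


def save_to_jsonl_sketch(
    data: List[Dict[str, Any]], file_path: Union[str, Path]
) -> None:
    """Write each record as one JSON object per line."""
    path = Path(file_path)
    # Append mode is an assumption; it keeps previously saved lines.
    with path.open("a", encoding="utf-8") as f:
        for record in data:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


records = [
    {"question": "2 + 2?", "final_answer": "4"},
    {"question": "Capital of France?", "final_answer": "Paris"},
]
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "datapoints.jsonl"
    save_to_jsonl_sketch(records, target)
    lines = target.read_text(encoding="utf-8").splitlines()
```

Each line of the resulting file parses on its own, which is what makes JSONL convenient for incremental writes.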
flush
Flush the current data to a JSONL file and clear the data.
Parameters:
- file_path (Union[str, Path]): Path to save the JSONL file.
Notes:
- Uses save_to_jsonl to save self._data.
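The save-then-clear order of flush can be sketched as follows. ToyBuffer is a hypothetical stand-in that stores plain dicts, and the append mode inside its save_to_jsonl is an assumption rather than documented behaviour.

```python
import json
import tempfile
from pathlib import Path


class ToyBuffer:
    """Hypothetical stand-in, not the camel BaseGenerator."""

    def __init__(self):
        self._data = []

    def save_to_jsonl(self, file_path) -> None:
        # Assumed to append, so repeated flushes accumulate in one file.
        with Path(file_path).open("a", encoding="utf-8") as f:
            for dp in self._data:
                f.write(json.dumps(dp) + "\n")

    def flush(self, file_path) -> None:
        self.save_to_jsonl(file_path)  # persist first
        self._data = []                # then clear, so nothing is written twice


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "flushed.jsonl"
    buf = ToyBuffer()
    buf._data = [{"id": 1}, {"id": 2}]
    buf.flush(out)
    buf._data.append({"id": 3})
    buf.flush(out)
    flushed = [
        json.loads(line)
        for line in out.read_text(encoding="utf-8").splitlines()
    ]
```

Clearing only after the write has returned means a failed save leaves the in-memory data intact in this sketch.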
_init_from_jsonl
Load and parse a dataset from a JSONL file.
Parameters:
- file_path (Path): Path to the JSONL file.
Returns:
List[Dict[str, Any]]: A list of datapoint dictionaries.
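Loading is the mirror image of saving: read the file line by line and parse each line as JSON. load_jsonl_sketch below is a hypothetical helper; skipping blank lines and raising ValueError with the line number are choices made for this example, not documented behaviour of _init_from_jsonl.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def load_jsonl_sketch(file_path: Path) -> List[Dict[str, Any]]:
    """Parse a JSONL file into a list of dictionaries."""
    records: List[Dict[str, Any]] = []
    with file_path.open("r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines (an assumption)
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"Invalid JSON on line {line_number}") from exc
    return records


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "seed.jsonl"
    source.write_text(
        '{"question": "2 + 2?", "final_answer": "4"}\n'
        "\n"
        '{"question": "3 * 3?", "final_answer": "9"}\n',
        encoding="utf-8",
    )
    loaded = load_jsonl_sketch(source)
```

Reporting the offending line number makes a malformed seed file much easier to repair than a bare decoding error.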