BaseGenerator

class BaseGenerator(ABC, IterableDataset):

Abstract base class for data generators.

This class defines the interface for generating synthetic datapoints. Concrete implementations should provide specific generation strategies.

__init__

def __init__(
    self,
    seed: int = 42,
    buffer: int = 20,
    cache: Union[str, Path, None] = None,
    data_path: Union[str, Path, None] = None,
    **kwargs
):

Initialize the base generator.

Parameters:

  • seed (int): Random seed for reproducibility. (default: 42)
  • buffer (int): Number of DataPoints to generate when the iterator runs out of DataPoints in self._data. (default: 20)
  • cache (Union[str, Path, None]): Optional path to save generated datapoints during iteration. If None, datapoints are not saved and are discarded after every 100 generations. (default: None)
  • data_path (Union[str, Path, None]): Optional path to a JSONL file to initialize the dataset from. (default: None)
  • **kwargs: Additional generator parameters.

__aiter__

def __aiter__(self):

Async iterator that yields datapoints dynamically.

If a data_path was provided during initialization, the datapoints loaded from that file are yielded first. When self._data is empty, buffer new datapoints (20 by default) are generated. After every 100 yields, the batch is appended to the JSONL file at cache, or discarded if cache is None.

Yields: DataPoint: A single datapoint.
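The refill behaviour described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the library's implementation: ToyAsyncGenerator, generate_new, and _stream are hypothetical names, and plain dicts stand in for DataPoint objects.

```python
import asyncio
from typing import Any, AsyncIterator, Dict, List


class ToyAsyncGenerator:
    """Hypothetical stand-in; it only mimics the documented refill logic."""

    def __init__(self, buffer: int = 20) -> None:
        self.buffer = buffer
        self._data: List[Dict[str, Any]] = []
        self._counter = 0

    async def generate_new(self, n: int) -> None:
        # Assumed hook: a concrete generator would produce real datapoints here.
        for _ in range(n):
            self._data.append({"id": self._counter})
            self._counter += 1

    def __aiter__(self) -> AsyncIterator[Dict[str, Any]]:
        return self._stream()

    async def _stream(self) -> AsyncIterator[Dict[str, Any]]:
        while True:
            # Refill with `buffer` new items whenever the internal list is empty.
            if not self._data:
                await self.generate_new(self.buffer)
            yield self._data.pop(0)


async def take(gen: ToyAsyncGenerator, k: int) -> List[Dict[str, Any]]:
    # The stream is endless, so the consumer decides when to stop.
    out: List[Dict[str, Any]] = []
    async for datapoint in gen:
        out.append(datapoint)
        if len(out) == k:
            break
    return out


first_five = asyncio.run(take(ToyAsyncGenerator(buffer=2), 5))
```

With buffer=2, the sketch refills three times to serve five items, which shows that the consumer never sees the refill boundary.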

__iter__

def __iter__(self):

Synchronous iterator for PyTorch IterableDataset compatibility.

If a data_path was provided during initialization, the datapoints loaded from that file are yielded first. When self._data is empty, buffer new datapoints (20 by default) are generated. After every 100 yields, the batch is appended to the JSONL file at cache, or discarded if cache is None.

Yields: DataPoint: A single datapoint.
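One way to picture the order of operations (preloaded datapoints first, then refills, with a batch written out every 100 yields) is the sketch below. It is an assumption-laden illustration rather than the actual method: iterate_with_cache and its parameters are hypothetical, and generated datapoints are simple numbered dicts.

```python
import json
import tempfile
from itertools import islice
from pathlib import Path
from typing import Any, Dict, Iterator, List, Optional


def iterate_with_cache(
    preloaded: List[Dict[str, Any]],
    cache: Optional[Path],
    buffer: int = 20,
    batch_size: int = 100,
) -> Iterator[Dict[str, Any]]:
    data = list(preloaded)  # datapoints loaded from data_path come first
    batch: List[Dict[str, Any]] = []
    next_id = len(data)
    while True:
        if not data:
            # Refill with `buffer` toy datapoints once the list is exhausted.
            data = [{"id": next_id + i} for i in range(buffer)]
            next_id += buffer
        datapoint = data.pop(0)
        yield datapoint
        batch.append(datapoint)
        if len(batch) == batch_size:
            if cache is not None:
                # Append the finished batch, one JSON object per line.
                with cache.open("a", encoding="utf-8") as f:
                    for item in batch:
                        f.write(json.dumps(item) + "\n")
            batch = []  # discarded either way, so memory stays bounded


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "cache.jsonl"
    stream = iterate_with_cache([{"id": 0}, {"id": 1}], cache_file)
    seen = list(islice(stream, 250))
    cached_lines = cache_file.read_text(encoding="utf-8").splitlines()
```

Because the stream never terminates on its own, the usage bounds it with itertools.islice. Only complete batches reach the file, so 250 yields leave 200 cached lines in this sketch.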

sample

def sample(self):

Returns:

DataPoint: The next DataPoint.

Note: This method is intended for synchronous contexts. In asynchronous contexts, use async_sample instead to avoid blocking or runtime errors.
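The kind of runtime error the note warns about can be reproduced with a generic sketch. It assumes, purely for illustration, that a synchronous wrapper drives a coroutine with asyncio.run; the source does not state how sample is implemented. The names make_datapoint, sample_sync, and sample_async are hypothetical.

```python
import asyncio
from typing import Any, Dict, Optional, Tuple


async def make_datapoint() -> Dict[str, Any]:
    # Placeholder for asynchronous generation work.
    await asyncio.sleep(0)
    return {"id": 0}


def sample_sync() -> Dict[str, Any]:
    # Assumed pattern: a sync wrapper that starts its own event loop.
    coro = make_datapoint()
    try:
        return asyncio.run(coro)
    except RuntimeError:
        coro.close()  # avoid a "never awaited" warning before re-raising
        raise


async def sample_async() -> Dict[str, Any]:
    # The async variant simply awaits on the already running loop.
    return await make_datapoint()


async def caller() -> Tuple[Optional[str], Dict[str, Any]]:
    error: Optional[str] = None
    try:
        sample_sync()  # fails: an event loop is already running here
    except RuntimeError as exc:
        error = str(exc)
    value = await sample_async()
    return error, value


outside_loop = sample_sync()          # fine in a plain synchronous context
error_inside, value_inside = asyncio.run(caller())
```

Calling the synchronous wrapper from inside a coroutine raises RuntimeError in this sketch, while awaiting the asynchronous variant succeeds.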

save_to_jsonl

def save_to_jsonl(self, file_path: Union[str, Path]):

Saves the generated datapoints to a JSONL (JSON Lines) file.

Each datapoint is stored as a separate JSON object on a new line.

Parameters:

  • file_path (Union[str, Path]): Path to save the JSONL file.
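The one-object-per-line layout can be illustrated with the standard library alone. This is a sketch, not the method's actual body: write_jsonl is a hypothetical helper, plain dicts with made-up question and answer fields stand in for DataPoint objects, and opening the file in append mode is an assumption.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterable, Union


def write_jsonl(records: Iterable[Dict[str, Any]], file_path: Union[str, Path]) -> None:
    path = Path(file_path)
    # Append mode is assumed here; the documented method may behave differently.
    with path.open("a", encoding="utf-8") as f:
        for record in records:
            # Each record becomes exactly one line of JSON.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "datapoints.jsonl"
    write_jsonl(
        [
            {"question": "2+2?", "answer": "4"},
            {"question": "3*3?", "answer": "9"},
        ],
        out_file,
    )
    written_lines = out_file.read_text(encoding="utf-8").splitlines()
```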

flush

def flush(self, file_path: Union[str, Path]):

Flush the current data to a JSONL file and clear the data.

Parameters:

  • file_path (Union[str, Path]): Path to save the JSONL file.

Notes:

  • Uses save_to_jsonl to save self._data.
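A save-then-clear flush can be sketched as below. ToyBuffer is a hypothetical stand-in that mimics only the documented behaviour; its save_to_jsonl is a simplified assumption and uses plain dicts instead of DataPoint objects.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Union


class ToyBuffer:
    """Hypothetical stand-in that mimics only the flush behaviour."""

    def __init__(self) -> None:
        self._data: List[Dict[str, Any]] = []

    def save_to_jsonl(self, file_path: Union[str, Path]) -> None:
        # Simplified writer: one JSON object per line, appended to the file.
        with Path(file_path).open("a", encoding="utf-8") as f:
            for record in self._data:
                f.write(json.dumps(record) + "\n")

    def flush(self, file_path: Union[str, Path]) -> None:
        # Persist first, then clear, so nothing is dropped if saving fails.
        self.save_to_jsonl(file_path)
        self._data = []


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "flushed.jsonl"
    buf = ToyBuffer()
    buf._data = [{"id": 0}, {"id": 1}]
    buf.flush(target)
    flushed_lines = target.read_text(encoding="utf-8").splitlines()
    remaining = list(buf._data)
```

Saving before clearing is the design choice worth noting: if the write raises, the in-memory datapoints are still available.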

_init_from_jsonl

def _init_from_jsonl(self, file_path: Path):

Load and parse a dataset from a JSONL file.

Parameters:

  • file_path (Path): Path to the JSONL file.

Returns:

List[Dict[str, Any]]: A list of datapoint dictionaries.
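Loading such a file back into a list of dictionaries might look like the sketch below. read_jsonl is a hypothetical helper, and skipping blank lines and raising ValueError on malformed lines are assumptions, since the source does not describe how errors are handled.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def read_jsonl(file_path: Path) -> List[Dict[str, Any]]:
    records: List[Dict[str, Any]] = []
    with file_path.open("r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # assumption: blank lines are ignored
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the offending line so a bad file is easy to repair.
                raise ValueError(f"Invalid JSON on line {line_number}: {exc}") from exc
    return records


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "seed.jsonl"
    source.write_text('{"id": 0}\n\n{"id": 1}\n', encoding="utf-8")
    loaded = read_jsonl(source)
```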