BaseGenerator

class BaseGenerator(ABC, IterableDataset):
Abstract base class for data generators. This class defines the interface for generating synthetic datapoints. Concrete implementations should provide specific generation strategies.

__init__

def __init__(
    self,
    seed: int = 42,
    buffer: int = 20,
    cache: Union[str, Path, None] = None,
    data_path: Union[str, Path, None] = None,
    **kwargs
):
Initialize the base generator. Parameters:
  • seed (int): Random seed for reproducibility. (default: 42)
  • buffer (int): Number of DataPoints to generate when the iterator runs out of DataPoints in self._data. (default: 20)
  • cache (Union[str, Path, None]): Optional path to save generated datapoints during iteration. If None is provided, datapoints will be discarded every 100 generations.
  • data_path (Union[str, Path, None]): Optional path to a JSONL file to initialize the dataset from.
  • **kwargs: Additional generator parameters.

__aiter__

def __aiter__(self):
Async iterator that yields datapoints dynamically. If a data_path was provided during initialization, those datapoints are yielded first. When self._data is empty, buffer new datapoints are generated (20 by default). Every 100 yields, the batch is appended to the cache JSONL file, or discarded if cache is None. Yields: DataPoint: A single datapoint.

__iter__

def __iter__(self):
Synchronous iterator for PyTorch IterableDataset compatibility. If a data_path was provided during initialization, those datapoints are yielded first. When self._data is empty, buffer new datapoints are generated (20 by default). Every 100 yields, the batch is appended to the cache JSONL file, or discarded if cache is None. Yields: DataPoint: A single datapoint.
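
The refill-and-flush cycle described for the iterators can be illustrated with a small standalone sketch. ToyGenerator, its _generate helper, and the {"id": ...} datapoints are hypothetical stand-ins, not the library's classes; the sketch only assumes the behavior stated above (refill when the pool is empty, write or discard every 100 yields).

```python
import json
import tempfile
from itertools import islice
from pathlib import Path
from typing import Any, Dict, Iterator, List, Optional


class ToyGenerator:
    """Hypothetical stand-in that mimics the documented iteration behavior."""

    def __init__(self, buffer: int = 20, cache: Optional[Path] = None) -> None:
        self.buffer = buffer
        self.cache = cache
        self._data: List[Dict[str, Any]] = []
        self._counter = 0  # only used to make the toy datapoints distinct

    def _generate(self, n: int) -> List[Dict[str, Any]]:
        # Placeholder for a concrete generation strategy (an assumption).
        start = self._counter
        self._counter += n
        return [{"id": i} for i in range(start, start + n)]

    def __iter__(self) -> Iterator[Dict[str, Any]]:
        batch: List[Dict[str, Any]] = []
        while True:
            if not self._data:
                # Refill the in-memory pool with `buffer` new datapoints.
                self._data.extend(self._generate(self.buffer))
            item = self._data.pop(0)
            batch.append(item)
            if len(batch) == 100:
                # Every 100 datapoints: append to the cache file, or discard.
                if self.cache is not None:
                    with self.cache.open("a", encoding="utf-8") as f:
                        for datapoint in batch:
                            f.write(json.dumps(datapoint) + "\n")
                batch = []
            yield item


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "cache.jsonl"
    gen = ToyGenerator(buffer=20, cache=cache_file)
    first = list(islice(gen, 100))  # triggers five refills and one cache write
    lines_written = len(cache_file.read_text(encoding="utf-8").splitlines())
```

Because the iterator never terminates on its own, callers bound it explicitly, here with itertools.islice.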

sample

def sample(self):
Returns: DataPoint: The next DataPoint. Note: This method is intended for synchronous contexts. Use async_sample in asynchronous contexts to avoid blocking or runtime errors.
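
One common source of the runtime errors mentioned in the note is a synchronous wrapper that starts its own event loop. The sketch below assumes that pattern purely for illustration; the source does not say how sample is implemented, and _next_item, sync_sample, and caller_in_async_context are hypothetical names.

```python
import asyncio
from typing import Any, Dict


async def _next_item() -> Dict[str, Any]:
    # Hypothetical async producer standing in for datapoint generation.
    await asyncio.sleep(0)
    return {"id": 0}


def sync_sample() -> Dict[str, Any]:
    # Hypothetical sync wrapper: runs the coroutine on a fresh event loop.
    coro = _next_item()
    try:
        return asyncio.run(coro)
    except RuntimeError:
        coro.close()  # the coroutine never started; close it cleanly
        raise


async def caller_in_async_context() -> str:
    # Calling the sync wrapper while a loop is already running fails,
    # because asyncio.run() refuses to nest inside a running event loop.
    try:
        sync_sample()
    except RuntimeError as exc:
        return type(exc).__name__
    return "ok"


result_sync = sync_sample()                             # plain synchronous call
result_async = asyncio.run(caller_in_async_context())   # call from async code
```

In the synchronous call the wrapper returns a datapoint; inside a running loop it raises RuntimeError, which is the situation an awaitable variant is meant to avoid.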

save_to_jsonl

def save_to_jsonl(self, file_path: Union[str, Path]):
Saves the generated datapoints to a JSONL (JSON Lines) file. Each datapoint is stored as a separate JSON object on a new line. Parameters:
  • file_path (Union[str, Path]): Path to save the JSONL file.
Note:
  • Uses self._data, which contains the generated datapoints.
  • Appends to the file if it already exists.
  • Ensures compatibility with large datasets by using JSONL format.
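
The append-one-object-per-line behavior in the notes above can be sketched as follows. append_jsonl is a hypothetical helper, not the library method, and the question/answer fields are made up for the example.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Union


def append_jsonl(records: List[Dict[str, Any]], file_path: Union[str, Path]) -> None:
    # Mode "a" creates the file if needed and appends if it already exists.
    with Path(file_path).open("a", encoding="utf-8") as f:
        for record in records:
            # One JSON object per line, so the file can be read back lazily.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data.jsonl"
    append_jsonl([{"question": "1+1", "answer": "2"}], target)
    append_jsonl([{"question": "2+2", "answer": "4"}], target)  # second call appends
    saved_lines = target.read_text(encoding="utf-8").splitlines()
```

Since each line is an independent JSON document, a reader can process the file line by line without loading the whole dataset into memory.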

flush

def flush(self, file_path: Union[str, Path]):
Flush the current data to a JSONL file and clear the data. Parameters:
  • file_path (Union[str, Path]): Path to save the JSONL file.
Note:
  • Uses save_to_jsonl to save self._data.
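
A minimal sketch of the save-then-clear sequence, assuming flush does nothing beyond what is stated above. ToyStore is a hypothetical holder class written for this example.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Union


class ToyStore:
    """Hypothetical holder for generated datapoints (not the library class)."""

    def __init__(self) -> None:
        self._data: List[Dict[str, Any]] = []

    def save_to_jsonl(self, file_path: Union[str, Path]) -> None:
        # Append each datapoint as one JSON object per line.
        with Path(file_path).open("a", encoding="utf-8") as f:
            for datapoint in self._data:
                f.write(json.dumps(datapoint) + "\n")

    def flush(self, file_path: Union[str, Path]) -> None:
        # Persist first, then clear, so data is kept in memory if the write fails.
        self.save_to_jsonl(file_path)
        self._data.clear()


with tempfile.TemporaryDirectory() as tmp:
    out_file = Path(tmp) / "flushed.jsonl"
    store = ToyStore()
    store._data.extend([{"id": 1}, {"id": 2}])
    store.flush(out_file)
    remaining = len(store._data)
    flushed_lines = out_file.read_text(encoding="utf-8").splitlines()
```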

_init_from_jsonl

def _init_from_jsonl(self, file_path: Path):
Load and parse a dataset from a JSONL file. Parameters:
  • file_path (Path): Path to the JSONL file.
Returns: List[Dict[str, Any]]: A list of datapoint dictionaries.
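
Loading a JSONL file into a list of dictionaries might look like the sketch below. load_jsonl is a hypothetical helper; skipping blank lines and raising ValueError on malformed input are assumptions of this example, not documented behavior.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def load_jsonl(file_path: Path) -> List[Dict[str, Any]]:
    records: List[Dict[str, Any]] = []
    with file_path.open("r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # assumption: tolerate blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Assumption: report the offending line instead of skipping it.
                raise ValueError(f"Invalid JSON on line {line_number}") from exc
    return records


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "seed.jsonl"
    source.write_text('{"id": 1}\n\n{"id": 2}\n', encoding="utf-8")
    loaded = load_jsonl(source)
```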