StaticDataset

class StaticDataset(Dataset):

A static dataset containing a list of datapoints. Ensures that all items adhere to the DataPoint schema. This dataset extends :obj:Dataset from PyTorch and should be used when its size is fixed at runtime.

This class can be initialized from Hugging Face Datasets, PyTorch Datasets, JSON or JSONL file paths, or lists of dictionaries. Each input is converted into a consistent internal format.

__init__

def __init__(
    self,
    data: Union[HFDataset, Dataset, Path, List[Dict[str, Any]]],
    seed: int = 42,
    min_samples: int = 1,
    strict: bool = False,
    **kwargs
):

Initialize the static dataset and validate integrity.

Parameters:

  • data (Union[HFDataset, Dataset, Path, List[Dict[str, Any]]]): Input data, which can be one of the following:
      - A Hugging Face Dataset (:obj:HFDataset).
      - A PyTorch Dataset (:obj:torch.utils.data.Dataset).
      - A :obj:Path object representing a JSON or JSONL file.
      - A list of dictionaries with :obj:DataPoint-compatible fields.
  • seed (int): Random seed for reproducibility. (default: :obj:42)
  • min_samples (int): Minimum required number of samples. (default: :obj:1)
  • strict (bool): Whether to raise an error on invalid datapoints (:obj:True) or skip/filter them (:obj:False). (default: :obj:False)
  • **kwargs: Additional dataset parameters.
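
The interaction between strict and min_samples can be illustrated with a minimal sketch. The DataPoint stand-in, its two fields, and the helper validate_items are hypothetical and exist only for this example; the actual schema and error types used by StaticDataset may differ.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class DataPoint:
    """Hypothetical stand-in for the real DataPoint schema."""

    question: str
    final_answer: str


def validate_items(
    raw: List[Dict[str, Any]], min_samples: int = 1, strict: bool = False
) -> List[DataPoint]:
    """Validate raw dictionaries, raising (strict) or skipping (non-strict)."""
    valid: List[DataPoint] = []
    for index, item in enumerate(raw):
        try:
            valid.append(DataPoint(**item))
        except TypeError as exc:
            if strict:
                raise ValueError(f"Invalid datapoint at index {index}") from exc
            # Non-strict mode: drop the invalid item and keep going.
    if len(valid) < min_samples:
        raise ValueError(
            f"Only {len(valid)} valid samples, need at least {min_samples}"
        )
    return valid


rows = [
    {"question": "2 + 2?", "final_answer": "4"},
    {"question": "no answer given"},  # invalid: final_answer is missing
]
kept = validate_items(rows, strict=False)  # the invalid row is filtered out
```

With strict=True the same input would raise on the second row instead of silently dropping it.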

_init_data

def _init_data(
    self,
    data: Union[HFDataset, Dataset, Path, List[Dict[str, Any]]]
):

Convert input data from various formats into a list of :obj:DataPoint instances.

Parameters:

  • data (Union[HFDataset, Dataset, Path, List[Dict[str, Any]]]): Input dataset in one of the supported formats.

Returns:

List[DataPoint]: A list of validated :obj:DataPoint instances.
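
As a rough illustration of how one argument can be routed to a format-specific loader, the sketch below returns the name of the private method that would handle the input. The helper pick_loader is hypothetical, the choice by file extension is an assumption, and the Hugging Face and PyTorch branches are left out so the example needs no third-party packages.

```python
from pathlib import Path
from typing import Any, Dict, List, Union

Record = Dict[str, Any]


def pick_loader(data: Union[Path, List[Record]]) -> str:
    """Return the name of the loader method that would handle ``data``."""
    if isinstance(data, Path):
        # Assumption: the file extension selects the JSON or JSONL loader.
        if data.suffix.lower() == ".jsonl":
            return "_init_from_jsonl_path"
        return "_init_from_json_path"
    if isinstance(data, list):
        return "_init_from_list"
    raise TypeError(f"Unsupported data type: {type(data).__name__}")


loader_for_list = pick_loader([{"question": "q", "final_answer": "a"}])
loader_for_jsonl = pick_loader(Path("records.jsonl"))
```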

__len__

def __len__(self):

Return the size of the dataset.

__getitem__

def __getitem__(self, idx: Union[int, slice]):

Retrieve a datapoint or a batch of datapoints by index or slice.

Parameters:

  • idx (Union[int, slice]): Index or slice of the datapoint(s).

Returns:

Union[DataPoint, List[DataPoint]]: A single :obj:DataPoint for an integer index, or a list of :obj:DataPoint objects for a slice.
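
Index and slice access can be sketched with a small container. TinyDataset is hypothetical and stores plain strings instead of DataPoint objects. The sketch assumes that an integer index yields a single item while a slice yields a list, and it delegates bounds handling to the underlying Python list.

```python
from typing import List, Union


class TinyDataset:
    """Hypothetical container mirroring int and slice indexing."""

    def __init__(self, items: List[str]) -> None:
        self._items = list(items)

    def __len__(self) -> int:
        return len(self._items)

    def __getitem__(self, idx: Union[int, slice]) -> Union[str, List[str]]:
        if isinstance(idx, slice):
            # A slice returns a (possibly empty) list of items.
            return self._items[idx]
        if isinstance(idx, int):
            # An out-of-range integer raises IndexError, as with a list.
            return self._items[idx]
        raise TypeError(f"Unsupported index type: {type(idx).__name__}")


ds = TinyDataset(["a", "b", "c", "d"])
first = ds[0]
middle = ds[1:3]
```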

sample

def sample(self):

Sample a random datapoint from the dataset.

Returns:

DataPoint: A randomly sampled :obj:DataPoint.
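
One way a seed can make sampling reproducible is to keep a dedicated random number generator per dataset. The class below is a hypothetical sketch and not the library's implementation; it only shows that two samplers built with the same seed draw the same sequence.

```python
import random
from typing import List


class SeededSampler:
    """Hypothetical sampler that owns a private, seeded RNG."""

    def __init__(self, items: List[str], seed: int = 42) -> None:
        self._items = list(items)
        self._rng = random.Random(seed)  # isolated from the global RNG

    def sample(self) -> str:
        if not self._items:
            raise RuntimeError("Cannot sample from an empty dataset")
        index = self._rng.randint(0, len(self._items) - 1)
        return self._items[index]


items = ["a", "b", "c", "d", "e"]
run_one = [SeededSampler(items, seed=42).sample() for _ in range(1)]
sampler_a = SeededSampler(items, seed=42)
sampler_b = SeededSampler(items, seed=42)
draws_a = [sampler_a.sample() for _ in range(5)]
draws_b = [sampler_b.sample() for _ in range(5)]
```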

metadata

def metadata(self):

Retrieve the dataset metadata.

Returns:

Dict[str, Any]: A copy of the dataset metadata dictionary.
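
Returning a copy means callers cannot change the dataset's internal metadata by editing the returned dictionary. The sketch below uses a hypothetical MetadataHolder class and a shallow copy; whether the real method copies shallowly or deeply is not stated here and is an assumption.

```python
from typing import Any, Dict


class MetadataHolder:
    """Hypothetical class that hands out copies of its metadata."""

    def __init__(self, size: int) -> None:
        self._metadata: Dict[str, Any] = {"size": size}

    def metadata(self) -> Dict[str, Any]:
        # Shallow copy: top-level edits by the caller stay local.
        return self._metadata.copy()


holder = MetadataHolder(size=3)
snapshot = holder.metadata()
snapshot["size"] = 0  # modifies only the caller's copy
```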

_init_from_hf_dataset

def _init_from_hf_dataset(self, data: HFDataset):

Convert a Hugging Face dataset into a list of dictionaries.

Parameters:

  • data (HFDataset): A Hugging Face dataset.

Returns:

List[Dict[str, Any]]: A list of dictionaries representing the dataset, where each dictionary corresponds to a datapoint.

_init_from_pytorch_dataset

def _init_from_pytorch_dataset(self, data: Dataset):

Convert a PyTorch dataset into a list of dictionaries.

Parameters:

  • data (Dataset): A PyTorch dataset.

Returns:

List[Dict[str, Any]]: A list of dictionaries representing the dataset.
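
A map-style dataset exposes __len__ and __getitem__, so converting it to a list of dictionaries amounts to iterating over its indices. The sketch below uses a hypothetical MapStyle class in place of a real torch.utils.data.Dataset subclass so that it runs without PyTorch, and it assumes that every item is already a dictionary.

```python
from typing import Any, Dict, List


class MapStyle:
    """Hypothetical stand-in for a map-style PyTorch dataset."""

    def __init__(self, rows: List[Dict[str, Any]]) -> None:
        self._rows = rows

    def __len__(self) -> int:
        return len(self._rows)

    def __getitem__(self, index: int) -> Dict[str, Any]:
        return self._rows[index]


def to_dicts(dataset: Any) -> List[Dict[str, Any]]:
    """Collect every item of a sized, indexable dataset as a dict."""
    records: List[Dict[str, Any]] = []
    for index in range(len(dataset)):
        item = dataset[index]
        if not isinstance(item, dict):
            raise TypeError(f"Item {index} is not a dictionary")
        records.append(dict(item))  # copy so the source is not aliased
    return records


source = MapStyle([{"question": "q1", "final_answer": "a1"}])
records = to_dicts(source)
```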

_init_from_json_path

def _init_from_json_path(self, data: Path):

Load and parse a dataset from a JSON file.

Parameters:

  • data (Path): Path to the JSON file.

Returns:

List[Dict[str, Any]]: A list of datapoint dictionaries.

_init_from_jsonl_path

def _init_from_jsonl_path(self, data: Path):

Load and parse a dataset from a JSONL file.

Parameters:

  • data (Path): Path to the JSONL file.

Returns:

List[Dict[str, Any]]: A list of datapoint dictionaries.
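
The difference between the JSON and JSONL loaders is that a JSON file holds one top-level array, whereas a JSONL file holds one object per line. The functions below are a sketch using only the standard library; the names load_json and load_jsonl, the skipping of blank lines, and the error messages are assumptions.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def load_json(path: Path) -> List[Dict[str, Any]]:
    """Read a JSON file whose top-level value is a list of objects."""
    with path.open("r", encoding="utf-8") as f:
        loaded = json.load(f)
    if not isinstance(loaded, list):
        raise ValueError("JSON file must contain a list of dictionaries")
    return loaded


def load_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Read a JSONL file with one JSON object per non-empty line."""
    records: List[Dict[str, Any]] = []
    with path.open("r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # assumption: blank lines are ignored
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"Invalid JSON on line {line_number}") from exc
    return records


with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "data.json"
    jsonl_path = Path(tmp) / "data.jsonl"
    json_path.write_text('[{"question": "q1"}, {"question": "q2"}]', encoding="utf-8")
    jsonl_path.write_text('{"question": "q1"}\n\n{"question": "q2"}\n', encoding="utf-8")
    from_json = load_json(json_path)
    from_jsonl = load_jsonl(jsonl_path)
```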

_init_from_list

def _init_from_list(self, data: List[Dict[str, Any]]):

Validate and convert a list of dictionaries into a dataset.

Parameters:

  • data (List[Dict[str, Any]]): A list of dictionaries where each dictionary must be a valid :obj:DataPoint.

Returns:

List[Dict[str, Any]]: The validated list of dictionaries.

save_to_json

def save_to_json(self, file_path: Union[str, Path]):

Save the dataset to a local JSON file.

Parameters:

  • file_path (Union[str, Path]): Path to the output JSON file. If a string is provided, it will be converted to a Path object.
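
Accepting either a string or a Path usually comes down to normalizing the argument once at the top of the method. The helper save_records below is a hypothetical sketch; creating missing parent directories and the chosen json.dump options are assumptions, not documented behavior.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Union


def save_records(
    records: List[Dict[str, Any]], file_path: Union[str, Path]
) -> Path:
    """Write records to a JSON file and return the normalized path."""
    path = Path(file_path)  # a str argument is converted to a Path here
    # Assumption: missing parent directories are created on demand.
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)
    return path


with tempfile.TemporaryDirectory() as tmp:
    target = save_records([{"question": "q1"}], f"{tmp}/out/data.json")
    round_trip = json.loads(target.read_text(encoding="utf-8"))
```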

save_to_huggingface

def save_to_huggingface(
    self,
    dataset_name: str,
    token: Optional[str] = None,
    filepath: str = 'records/records.json',
    private: bool = False,
    description: Optional[str] = None,
    license: Optional[str] = None,
    version: Optional[str] = None,
    tags: Optional[List[str]] = None,
    language: Optional[List[str]] = None,
    task_categories: Optional[List[str]] = None,
    authors: Optional[List[str]] = None,
    **kwargs: Any
):

Save the dataset to the Hugging Face Hub using the project’s HuggingFaceDatasetManager.

Parameters:

  • dataset_name (str): The name of the dataset on the Hugging Face Hub. Should be in the format ‘username/dataset_name’.
  • token (Optional[str]): The Hugging Face API token. If not provided, the token will be read from the environment variable HF_TOKEN. (default: :obj:None)
  • filepath (str): The path in the repository where the dataset will be saved. (default: :obj:"records/records.json")
  • private (bool): Whether the dataset should be private. (default: :obj:False)
  • description (Optional[str]): A description of the dataset. (default: :obj:None)
  • license (Optional[str]): The license of the dataset. (default: :obj:None)
  • version (Optional[str]): The version of the dataset. (default: :obj:None)
  • tags (Optional[List[str]]): A list of tags for the dataset. (default: :obj:None)
  • language (Optional[List[str]]): A list of languages the dataset is in. (default: :obj:None)
  • task_categories (Optional[List[str]]): A list of task categories. (default: :obj:None)
  • authors (Optional[List[str]]): A list of authors of the dataset. (default: :obj:None)
  • **kwargs (Any): Additional keyword arguments to pass to the Hugging Face API.

Returns:

str: The URL of the dataset on the Hugging Face Hub.
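
Two of the parameter rules above, the HF_TOKEN fallback and the ‘username/dataset_name’ format, can be sketched without any network access. The helpers resolve_token and split_repo_id are hypothetical, and raising ValueError when no token is available or the name is malformed is an assumption about how such input might be rejected.

```python
import os
from typing import Optional, Tuple


def resolve_token(token: Optional[str] = None) -> str:
    """Prefer an explicit token, otherwise fall back to HF_TOKEN."""
    resolved = token or os.environ.get("HF_TOKEN")
    if not resolved:
        # Assumption: a missing token is treated as an error.
        raise ValueError("No token given and HF_TOKEN is not set")
    return resolved


def split_repo_id(dataset_name: str) -> Tuple[str, str]:
    """Split 'username/dataset_name' into its two parts."""
    parts = dataset_name.split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError("Expected the format 'username/dataset_name'")
    return parts[0], parts[1]


owner, name = split_repo_id("username/dataset_name")
```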