RagasFields

class RagasFields:

Constants for RAGAS evaluation field names.

annotate_dataset

def annotate_dataset(
    dataset: Dataset,
    context_call: Optional[Callable[[Dict[str, Any]], List[str]]],
    answer_call: Optional[Callable[[Dict[str, Any]], str]]
):

Annotate the dataset by adding contexts and/or answers using the provided functions.

Parameters:

  • dataset (Dataset): The input dataset to annotate.
  • context_call (Optional[Callable[[Dict[str, Any]], List[str]]]): Function that generates a list of contexts for each example.
  • answer_call (Optional[Callable[[Dict[str, Any]], str]]): Function that generates an answer for each example.

Returns:

Dataset: The annotated dataset with added contexts and/or answers.
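
A minimal usage sketch is shown below. The "question" column and both callables are hypothetical placeholders, and the exact column names the function adds to the dataset are not specified here.

from datasets import Dataset

ds = Dataset.from_dict({"question": ["What is retrieval-augmented generation?"]})

def make_contexts(example):
    # Placeholder: a real context_call would retrieve passages for the example.
    return ["RAG combines a retriever with a generator model."]

def make_answer(example):
    # Placeholder: a real answer_call would query an LLM.
    return "RAG augments generation with retrieved context."

annotated = annotate_dataset(ds, make_contexts, make_answer)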

rmse

def rmse(input_trues: Sequence[float], input_preds: Sequence[float]):

Calculate Root Mean Squared Error (RMSE).

Parameters:

  • input_trues (Sequence[float]): Ground truth values.
  • input_preds (Sequence[float]): Predicted values.

Returns:

Optional[float]: RMSE value, or None if inputs have different lengths.
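
RMSE is the square root of the mean squared difference between predictions and ground truth. A minimal sketch of the documented contract (None on mismatched lengths) follows; it is illustrative, not the library's implementation.

import math
from typing import Optional, Sequence

def rmse_sketch(trues: Sequence[float], preds: Sequence[float]) -> Optional[float]:
    # Mirror the documented contract: mismatched lengths yield None.
    # (Empty input is also treated as undefined in this sketch.)
    if len(trues) != len(preds) or not trues:
        return None
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(trues, preds)) / len(trues))

rmse_sketch([1.0, 0.0, 1.0], [0.9, 0.2, 0.7])  # ~0.216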

auroc

def auroc(trues: Sequence[bool], preds: Sequence[float]):

Calculate Area Under Receiver Operating Characteristic Curve (AUROC).

Parameters:

  • trues (Sequence[bool]): Ground truth binary values.
  • preds (Sequence[float]): Predicted probability values.

Returns:

float: AUROC score.
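
AUROC equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, with ties counted as one half. The pairwise (Mann-Whitney) sketch below illustrates this; it is not the library's implementation and assumes both classes are present.

from typing import Sequence

def auroc_sketch(trues: Sequence[bool], preds: Sequence[float]) -> float:
    # Fraction of (positive, negative) pairs ranked correctly; ties count 0.5.
    pos = [p for t, p in zip(trues, preds) if t]
    neg = [p for t, p in zip(trues, preds) if not t]
    total = sum(1.0 if pp > pn else 0.5 if pp == pn else 0.0
                for pp in pos for pn in neg)
    return total / (len(pos) * len(neg))

auroc_sketch([True, False, True, False], [0.9, 0.1, 0.8, 0.4])  # 1.0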

ragas_calculate_metrics

def ragas_calculate_metrics(
    dataset: Dataset,
    pred_context_relevance_field: Optional[str],
    pred_faithfulness_field: Optional[str],
    metrics_to_evaluate: Optional[List[str]] = None,
    ground_truth_context_relevance_field: str = 'relevance_score',
    ground_truth_faithfulness_field: str = 'adherence_score'
):

Calculate RAGAS evaluation metrics.

Parameters:

  • dataset (Dataset): The dataset containing predictions and ground truth.
  • pred_context_relevance_field (Optional[str]): Field name for predicted context relevance.
  • pred_faithfulness_field (Optional[str]): Field name for predicted faithfulness.
  • metrics_to_evaluate (Optional[List[str]]): List of metrics to evaluate.
  • ground_truth_context_relevance_field (str): Field name for ground truth relevance.
  • ground_truth_faithfulness_field (str): Field name for ground truth adherence.

Returns:

Dict[str, Optional[float]]: Dictionary of calculated metrics.
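
A usage sketch follows. The prediction field names are hypothetical; the ground-truth field names default to 'relevance_score' and 'adherence_score' as in the signature above, and the keys of the returned dictionary depend on the metrics evaluated.

metrics = ragas_calculate_metrics(
    dataset=annotated,                                 # e.g. output of annotate_dataset
    pred_context_relevance_field="context_relevance",  # hypothetical column name
    pred_faithfulness_field="faithfulness",            # hypothetical column name
)
print(metrics)  # mapping of metric names to scores, with None where undefined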

ragas_evaluate_dataset

def ragas_evaluate_dataset(
    dataset: Dataset,
    contexts_field_name: Optional[str],
    answer_field_name: Optional[str],
    metrics_to_evaluate: Optional[List[str]] = None
):

Evaluate the dataset using RAGAS metrics.

Parameters:

  • dataset (Dataset): Input dataset to evaluate.
  • contexts_field_name (Optional[str]): Field name containing contexts.
  • answer_field_name (Optional[str]): Field name containing answers.
  • metrics_to_evaluate (Optional[List[str]]): List of metrics to evaluate.

Returns:

Dataset: Dataset with added evaluation metrics.
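
A usage sketch, with assumed column names:

evaluated = ragas_evaluate_dataset(
    dataset=annotated,
    contexts_field_name="contexts",  # assumed column name
    answer_field_name="answer",      # assumed column name
    metrics_to_evaluate=None,        # None falls back to the default metric set
)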

RAGBenchBenchmark

class RAGBenchBenchmark(BaseBenchmark):

RAGBench Benchmark for evaluating RAG performance.

This benchmark uses the rungalileo/ragbench dataset to evaluate retrieval-augmented generation (RAG) systems. It measures the context relevance and faithfulness metrics described in https://arxiv.org/abs/2407.11005.

Parameters:

  • processes (int, optional): Number of processes for parallel processing.
  • subset (str, optional): Dataset subset to use (e.g., “hotpotqa”).
  • split (str, optional): Dataset split to use (e.g., “test”).

__init__

def __init__(
    self,
    processes: int = 1,
    subset: Literal['covidqa', 'cuad', 'delucionqa', 'emanual', 'expertqa', 'finqa', 'hagrid', 'hotpotqa', 'msmarco', 'pubmedqa', 'tatqa', 'techqa'] = 'hotpotqa',
    split: Literal['train', 'test', 'validation'] = 'test'
):

download

def download(self):

Download the RAGBench dataset.

load

def load(self, force_download: bool = False):

Load the RAGBench dataset.

Parameters:

  • force_download (bool, optional): Whether to force a re-download of the data.

run

def run(self, agent: ChatAgent, auto_retriever: AutoRetriever):

Run the benchmark evaluation.

Parameters:

  • agent (ChatAgent): Chat agent for generating answers.
  • auto_retriever (AutoRetriever): Retriever for finding relevant contexts.

Returns:

Dict[str, Optional[float]]: Dictionary of evaluation metrics.
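
An end-to-end sketch is shown below. The import paths and default constructors are assumptions based on CAMEL's public layout; in practice ChatAgent and AutoRetriever usually need model and storage configuration.

from camel.agents import ChatAgent
from camel.benchmarks import RAGBenchBenchmark  # assumed import path
from camel.retrievers import AutoRetriever

benchmark = RAGBenchBenchmark(processes=1, subset="hotpotqa", split="test")
benchmark.load()  # pass force_download=True to re-fetch the data

agent = ChatAgent()          # configure the model and system message as needed
retriever = AutoRetriever()  # storage configuration omitted in this sketch

results = benchmark.run(agent, retriever)
print(results)  # dictionary of evaluation metrics, e.g. AUROC and RMSE values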