QueryResponse

class QueryResponse(BaseModel):

A structured query response for benchmark evaluation.

This class defines the expected format for model responses to benchmark questions, including explanation, exact answer, and confidence score.
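As a rough sketch, the response schema can be pictured as a Pydantic model along the following lines (the field names and descriptions are illustrative assumptions, not the definitive definition):

```python
from pydantic import BaseModel, Field

class QueryResponse(BaseModel):
    # Field names below are assumptions based on the description above.
    explanation: str = Field(description="Reasoning that supports the final answer.")
    exact_answer: str = Field(description="The succinct, final answer.")
    confidence: str = Field(description="Confidence in the answer, e.g. '95%'.")
```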

GradingResponse

class GradingResponse(BaseModel):

A structured grading response for evaluating model answers.

This class defines the expected format for grading responses, including extracted answer, reasoning about correctness, binary correctness judgment, and confidence score extraction.
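A comparable sketch for the grading schema, again with assumed field names:

```python
from pydantic import BaseModel, Field

class GradingResponse(BaseModel):
    # Field names below are assumptions based on the description above.
    extracted_final_answer: str = Field(description="Answer extracted from the model response.")
    reasoning: str = Field(description="Why the extracted answer is or is not correct.")
    correct: str = Field(description="Binary correctness judgment, e.g. 'yes' or 'no'.")
    confidence: int = Field(description="Confidence score extracted from the response.")
```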

SingleEvalResult

class SingleEvalResult(BaseModel):

Result of evaluating a single benchmark sample.

This class stores the evaluation results for a single benchmark example, including score, HTML representation, conversation history, and metrics.

EvalResult

class EvalResult(BaseModel):

Result of running a complete benchmark evaluation.

This class aggregates results from multiple sample evaluations, storing the overall score, detailed metrics, HTML reports, and conversation logs.

JinjaEnv

class JinjaEnv:

A class that encapsulates the Jinja environment setup.

init

def __init__(self):

Initialize the JinjaEnv instance if not already initialized.

new

def __new__(cls):

Implement the singleton pattern to ensure that only one instance exists.
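The __new__/__init__ pair follows the standard Python singleton idiom. A generic sketch of that idiom (not the class's literal implementation):

```python
import jinja2

class JinjaEnv:
    _instance = None  # the single shared instance

    def __new__(cls):
        # Create the instance only on the first call; reuse it afterwards.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        # Guard against re-initialization on repeated construction.
        if not getattr(self, "_initialized", False):
            self._env = jinja2.Environment(autoescape=True)
            self._initialized = True
```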

get_instance

def get_instance(cls):

Return the singleton JinjaEnv instance, creating it if necessary.

Returns:

JinjaEnv: The singleton instance.

env

def env(self):

Return the underlying Jinja environment.

Returns:

jinja2.Environment: The Jinja environment instance.

from_string

def from_string(self, template_str):

Create a template from the given string.

Parameters:

  • template_str (str): The template string.

Returns:

jinja2.Template: The compiled template.
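Typical usage, combining get_instance and from_string (the template string is only an illustration):

```python
jinja = JinjaEnv.get_instance()              # always returns the same instance
template = jinja.from_string("Hello, {{ name }}!")
print(template.render(name="world"))         # -> Hello, world!
```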

message_to_html

def message_to_html(message: Message):

Generate an HTML snippet (wrapped in a <div>) for a message.

Parameters:

  • message (Message): The message to convert to HTML.

Returns:

str: The HTML representation of the message.

derive_key

def derive_key(password: str, length: int):

Derive a fixed-length key from the password using SHA256.

decrypt

def decrypt(ciphertext_b64: str, password: str):

Decrypt a base64-encoded ciphertext by XOR-ing it with a key derived from the password.
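Together, derive_key and decrypt implement a simple XOR cipher keyed by a SHA256-derived byte string. A minimal, self-contained sketch of such a scheme (the exact key-expansion details in the benchmark may differ):

```python
import base64
import hashlib

def derive_key(password: str, length: int) -> bytes:
    # Hash the password, then repeat/truncate the digest to the requested length.
    digest = hashlib.sha256(password.encode()).digest()
    return (digest * (length // len(digest) + 1))[:length]

def decrypt(ciphertext_b64: str, password: str) -> str:
    # XOR every ciphertext byte with the matching byte of the derived key.
    data = base64.b64decode(ciphertext_b64)
    key = derive_key(password, len(data))
    return bytes(c ^ k for c, k in zip(data, key)).decode()
```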

_compute_stat

def _compute_stat(values: list, stat: str):

Compute the named statistic (e.g. mean or std) over a list of values.
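One plausible implementation simply dispatches on the statistic name (shown as an assumption; the real function may support additional statistics):

```python
import numpy as np

def _compute_stat(values: list, stat: str):
    # Dispatch on the requested statistic name.
    if stat == "mean":
        return np.mean(values)
    if stat == "std":
        return np.std(values)
    raise ValueError(f"Unknown statistic: {stat}")
```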

aggregate_results

def aggregate_results(
    single_eval_results: List[SingleEvalResult],
    default_stats: Tuple[str, str] = ('mean', 'std'),
    name2stats: Optional[Dict[str, Tuple[str]]] = None
):

Aggregate results from multiple evaluations into a single EvalResult.

Parameters:

  • single_eval_results (List[SingleEvalResult]): A list of SingleEvalResult objects.
  • default_stats (Tuple[str, str]): A tuple of default statistics to compute. (default: :obj:("mean", "std"))
  • name2stats (Optional[Dict[str, Tuple[str]]]): A dictionary mapping metric names to statistics to compute. (default: :obj:None)

Returns:

EvalResult: An EvalResult object containing aggregated results.
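A hypothetical call, assuming SingleEvalResult exposes the score and metrics fields described above:

```python
per_sample = [
    SingleEvalResult(score=1.0, metrics={"is_correct": 1.0}),
    SingleEvalResult(score=0.0, metrics={"is_correct": 0.0}),
]
result = aggregate_results(per_sample)   # mean and std computed for each metric
print(result.score, result.metrics)      # aggregated score and per-metric statistics
```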

BrowseCompBenchmark

class BrowseCompBenchmark(BaseBenchmark):

BrowseComp Benchmark for evaluating browser-based comprehension tasks.

This benchmark evaluates the ability of language models to comprehend and answer questions based on browser-based content, measuring accuracy and performance.

init

def __init__(
    self,
    save_to: str,
    processes: int = 1,
    num_examples: Optional[int] = None,
    n_repeats: int = 1
):

Initialize the BrowseComp benchmark.

Parameters:

  • save_to (str): The file to save the results.
  • processes (int, optional): The number of processes to use for parallel processing. (default: :obj:1)
  • num_examples (Optional[int]): Number of examples to evaluate. If None, all examples are used. Controls the sample size for testing. (default: :obj:None)
  • n_repeats (int, optional): Number of times to repeat each example. Useful for evaluating consistency across multiple runs. (default: :obj:1)

download

def download(self):

Download the BrowseComp dataset.

Returns:

self: The benchmark instance.

load

def load(self):

Load the BrowseComp dataset.

Returns:

self: The benchmark instance.

train

def train(self):

BrowseComp is an evaluation-only benchmark and does not provide a training split.

run

def run(
    self,
    pipeline_template: Union[ChatAgent, RolePlaying, Workforce],
    chat_turn_limit: int = 10,
    roleplaying_summarizer: Optional[ChatAgent] = None,
    task_json_formatter: Optional[ChatAgent] = None
):

Run the benchmark by processing each example in parallel.

This method applies the provided pipeline to each example in the dataset using a process pool for parallel execution. It shows progress using tqdm and stores the results in self._raw_results.

Parameters:

  • pipeline_template (Union[ChatAgent, RolePlaying, Workforce]): The template agent or framework to use for processing examples. Can be a ChatAgent, RolePlaying, or Workforce instance that will be cloned for each example.
  • chat_turn_limit (int): Maximum number of conversation turns allowed when using RolePlaying pipeline. (default: :obj:10)
  • roleplaying_summarizer (Optional[ChatAgent]): Optional ChatAgent to summarize RolePlaying conversations. If None and RolePlaying is used, a default summarizer will be created. (default: :obj:None)
  • task_json_formatter (Optional[ChatAgent]): Optional ChatAgent to format task JSON. If None and Workforce is used, a default formatter will be created. (default: :obj:None)
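Putting the pieces together, a minimal end-to-end run might look like the following (import paths and agent construction are assumptions; adapt them to your setup):

```python
from camel.agents import ChatAgent
from camel.benchmarks import BrowseCompBenchmark

# Evaluate a small sample with a single ChatAgent pipeline.
benchmark = BrowseCompBenchmark(save_to="browsecomp_report.html", num_examples=5)
benchmark.load()

agent = ChatAgent("You are a careful research assistant.")
benchmark.run(pipeline_template=agent)   # process examples in parallel
benchmark.validate()                     # grade raw results and build the report
```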

make_report

def make_report(self, eval_result: EvalResult):

Create a standalone HTML report from an EvalResult.

validate

def validate(self, grader: Optional[ChatAgent] = None):

Validate the raw results using the GRADER_TEMPLATE and ChatAgent.

This method evaluates the correctness of each response using multiple threads. A dedicated chat agent is created in each thread to compare the raw result with the expected answer, and the grading results are aggregated into a report.

Parameters:

  • grader (Optional[ChatAgent]): The ChatAgent used for validation. If None, a default agent is created in each thread. If provided, it serves as a template and is cloned into a new agent for each thread. (default: :obj:None)