A structured query response for benchmark evaluation. This class defines the expected format for model responses to benchmark questions, including explanation, exact answer, and confidence score.
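As a rough illustration, the response format can be pictured as a Pydantic schema along these lines (the field names below are assumptions inferred from the description, not the library's exact definitions):

```python
from pydantic import BaseModel


class QueryResponse(BaseModel):
    # Hypothetical field names inferred from the description above.
    explanation: str   # reasoning supporting the final answer
    exact_answer: str  # the concise final answer string
    confidence: str    # e.g. a percentage such as "85%"
```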
A structured grading response for evaluating model answers. This class defines the expected format for grading responses, including extracted answer, reasoning about correctness, binary correctness judgment, and confidence score extraction.
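A comparable sketch for the grading schema, again with assumed field names:

```python
from pydantic import BaseModel


class GradingResponse(BaseModel):
    # Hypothetical field names inferred from the description above.
    extracted_final_answer: str  # answer pulled out of the model response
    reasoning: str               # why the extracted answer is or is not correct
    correct: str                 # binary judgment, e.g. "yes" or "no"
    confidence: str              # confidence score extracted from the response
```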
Result of evaluating a single benchmark sample. This class stores the evaluation results for a single benchmark example, including score, HTML representation, conversation history, and metrics.
Result of running a complete benchmark evaluation. This class aggregates results from multiple sample evaluations, storing the overall score, detailed metrics, HTML reports, and conversation logs.
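The two result containers can be pictured as simple dataclasses; this is only a sketch, and the field names and types are assumptions based on the descriptions above:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SingleEvalResult:
    # Hypothetical fields for one evaluated sample.
    score: Optional[float] = None  # 1.0 if correct, 0.0 otherwise
    html: str = ""                 # HTML rendering of the exchange
    convo: List[Dict[str, str]] = field(default_factory=list)  # conversation history
    metrics: Dict[str, float] = field(default_factory=dict)    # per-sample metrics


@dataclass
class EvalResult:
    # Hypothetical fields aggregating all evaluated samples.
    score: Optional[float] = None  # overall accuracy across samples
    metrics: Dict[str, float] = field(default_factory=dict)     # aggregated metrics
    htmls: List[str] = field(default_factory=list)              # per-sample HTML reports
    convos: List[List[Dict[str, str]]] = field(default_factory=list)  # conversation logs
```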
BrowseComp Benchmark for evaluating browser-based comprehension tasks. This benchmark evaluates the ability of language models to comprehend and answer questions based on browser-based content, measuring accuracy and performance.
def __init__(self, save_to: str, processes: int = 1, num_examples: Optional[int] = None, n_repeats: int = 1):
Initialize the BrowseComp benchmark (a construction example follows the parameter list).

Parameters:
save_to (str): The file to save the results to.
processes (int, optional): The number of processes to use for parallel processing. (default: 1)
num_examples (Optional[int]): Number of examples to evaluate. If None, all examples are used. Controls the sample size for testing. (default: None)
n_repeats (int, optional): Number of times to repeat each example. Useful for evaluating consistency across multiple runs. (default: 1)
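A minimal construction sketch, assuming the class is importable from camel.benchmarks (the import path and output file name are assumptions):

```python
from camel.benchmarks import BrowseCompBenchmark  # import path is an assumption

# Evaluate 50 examples with 4 worker processes, one pass per example.
benchmark = BrowseCompBenchmark(
    save_to="browsecomp_results.html",
    processes=4,
    num_examples=50,
    n_repeats=1,
)
```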
Run the benchmark by processing each example in parallel. This method applies the provided pipeline to each example in the dataset using a process pool for parallel execution, shows progress with tqdm, and stores the results in self._raw_results (a usage sketch follows the parameter list).

Parameters:
pipeline_template (Union[ChatAgent, RolePlaying, Workforce]): The template agent or framework to use for processing examples. Can be a ChatAgent, RolePlaying, or Workforce instance that will be cloned for each example.
chat_turn_limit (int): Maximum number of conversation turns allowed when using a RolePlaying pipeline. (default: 10)
roleplaying_summarizer (Optional[ChatAgent]): Optional ChatAgent to summarize RolePlaying conversations. If None and RolePlaying is used, a default summarizer will be created. (default: None)
task_json_formatter (Optional[ChatAgent]): Optional ChatAgent to format task JSON. If None and Workforce is used, a default formatter will be created. (default: None)
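For example, a single ChatAgent can serve as the pipeline template and is cloned for each example (the import path, system prompt, and constructor arguments here are illustrative):

```python
from camel.agents import ChatAgent  # import path is an assumption

# The agent acts as a template; the benchmark clones it per example.
solver = ChatAgent(
    system_message="Answer browsing-comprehension questions concisely.",
)
benchmark.run(pipeline_template=solver)
```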
Validate the raw results using the GRADER_TEMPLATE and a ChatAgent. This method evaluates the correctness of each response using multi-threading: a dedicated chat agent is created in each thread to compare the raw result with the expected answer, and the grading results are aggregated into a report (a usage sketch follows the parameter description).

Parameters:
grader (Optional[ChatAgent]): The ChatAgent used for validation. If None, a default agent will be created in each thread. If provided, it is used as a template and cloned into a new agent in each thread. (default: None)
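A grading call might look like the following; omitting the grader lets each worker thread create its own default grading agent, while a provided agent is cloned per thread (the import path and system prompt are illustrative):

```python
from camel.agents import ChatAgent  # import path is an assumption

# Option 1: rely on the default grader created inside each thread.
benchmark.validate()

# Option 2: supply a template grader that will be cloned per thread.
grader = ChatAgent(
    system_message="Grade the answer strictly against the reference answer.",
)
benchmark.validate(grader=grader)
```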