camel.benchmarks.browsecomp
QueryResponse
A structured query response for benchmark evaluation.
This class defines the expected format for model responses to benchmark questions, including explanation, exact answer, and confidence score.
GradingResponse
A structured grading response for evaluating model answers.
This class defines the expected format for grading responses, including extracted answer, reasoning about correctness, binary correctness judgment, and confidence score extraction.
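For orientation, the two schemas can be sketched as Pydantic models. The field names below (explanation, exact_answer, extracted_final_answer, correct, confidence) are inferred from the descriptions above, not confirmed against the source.

```python
from pydantic import BaseModel

class QueryResponse(BaseModel):
    # Field names are assumptions based on the documented contents.
    explanation: str   # reasoning that supports the answer
    exact_answer: str  # the concise final answer
    confidence: str    # confidence score, e.g. "95%"

class GradingResponse(BaseModel):
    extracted_final_answer: str  # answer extracted from the model output
    reasoning: str               # reasoning about correctness
    correct: str                 # binary correctness judgment ("yes"/"no")
    confidence: str              # confidence score extracted from the response
```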
SingleEvalResult
Result of evaluating a single benchmark sample.
This class stores the evaluation results for a single benchmark example, including score, HTML representation, conversation history, and metrics.
EvalResult
Result of running a complete benchmark evaluation.
This class aggregates results from multiple sample evaluations, storing the overall score, detailed metrics, HTML reports, and conversation logs.
JinjaEnv
A class that encapsulates the Jinja environment setup.
init
Initialize the JinjaEnv instance if not already initialized.
new
Implement singleton pattern to ensure only one instance exists.
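A minimal sketch of the pattern described here, assuming an `_instance` class attribute and a lazily built `jinja2.Environment`:

```python
import jinja2

class JinjaEnv:
    _instance = None  # assumed cache attribute for the singleton

    def __new__(cls, *args, **kwargs):
        # Reuse the existing instance if one has already been created.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        # Guard so repeated construction does not rebuild the environment.
        if not getattr(self, "_initialized", False):
            self._env = jinja2.Environment(autoescape=True)
            self._initialized = True
```

JinjaEnv() therefore returns the same object on every call, so template state is shared across the benchmark.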
get_instance
Returns:
JinjaEnv: The singleton instance.
env
Returns:
jinja2.Environment: The Jinja environment instance.
from_string
Create a template from the given string.
Parameters:
- template_str (str): The template string.
Returns:
jinja2.Template: The compiled template.
message_to_html
Generate an HTML snippet (inside a <div>) for a message.
Parameters:
- message (Message): The message to convert to HTML.
Returns:
str: The HTML representation of the message.
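For illustration only, rendering a message through a Jinja template might look like the sketch below; the template string and message fields are assumptions, not the library's actual markup.

```python
from jinja2 import Environment

env = Environment(autoescape=True)
template = env.from_string(
    "<div class='message {{ role }}'><b>{{ role }}</b>: {{ content }}</div>"
)
print(template.render(role="assistant", content="Paris is the capital of France."))
```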
derive_key
Derive a fixed-length key from the password using SHA256.
decrypt
Decrypt base64-encoded ciphertext with XOR.
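These two helpers follow a simple scheme: stretch a SHA256 digest of the password to the ciphertext length, then XOR byte-wise. A plausible sketch (details in the library may differ):

```python
import base64
import hashlib

def derive_key(password: str, length: int) -> bytes:
    # SHA256 yields 32 bytes; repeat and truncate to the requested length.
    digest = hashlib.sha256(password.encode()).digest()
    return (digest * (length // len(digest) + 1))[:length]

def decrypt(ciphertext_b64: str, password: str) -> str:
    # Undo base64, then XOR each byte against the derived key stream.
    encrypted = base64.b64decode(ciphertext_b64)
    key = derive_key(password, len(encrypted))
    return bytes(a ^ b for a, b in zip(encrypted, key)).decode()

# XOR is symmetric, so applying the same key stream twice round-trips:
secret = base64.b64encode(
    bytes(a ^ b for a, b in zip(b"hello", derive_key("pw", 5)))
).decode()
print(decrypt(secret, "pw"))  # -> hello
```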
_compute_stat
aggregate_results
Aggregate results from multiple evaluations into a single EvalResult.
Parameters:
- single_eval_results (List[SingleEvalResult]): A list of SingleEvalResult objects.
- default_stats (Tuple[str, str]): A tuple of default statistics to compute. (default: :obj:`("mean", "std")`)
- name2stats (Optional[Dict[str, Tuple[str]]]): A dictionary mapping metric names to the statistics to compute for them. (default: :obj:`None`)
Returns:
EvalResult: An EvalResult object containing the aggregated results.
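Conceptually, aggregation collects each metric across samples and applies the requested statistics. A simplified sketch, assuming "metric:stat" keys in the output (the real field layout may differ):

```python
from statistics import mean, stdev

def aggregate(metrics_per_sample, default_stats=("mean", "std")):
    # metrics_per_sample: one {metric_name: value} dict per evaluated sample.
    values: dict[str, list[float]] = {}
    for sample in metrics_per_sample:
        for name, value in sample.items():
            values.setdefault(name, []).append(value)
    stat_fns = {"mean": mean, "std": lambda v: stdev(v) if len(v) > 1 else 0.0}
    return {
        f"{name}:{stat}": stat_fns[stat](vals)
        for name, vals in values.items()
        for stat in default_stats
    }

print(aggregate([{"score": 1.0}, {"score": 0.0}]))
# {'score:mean': 0.5, 'score:std': 0.7071067811865476}
```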
BrowseCompBenchmark
BrowseComp Benchmark for evaluating browser-based comprehension tasks.
This benchmark evaluates the ability of language models to comprehend and answer questions based on browser-based content, measuring accuracy and performance.
init
Initialize the BrowseComp benchmark.
Parameters:
- save_to (str): The file to save the results.
- processes (int, optional): The number of processes to use for parallel processing. (default: :obj:`1`)
- num_examples (Optional[int]): Number of examples to evaluate. If None, all examples are used. Controls the sample size for testing. (default: :obj:`None`)
- n_repeats (int, optional): Number of times to repeat each example. Useful for evaluating consistency across multiple runs. (default: :obj:`1`)
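A typical construction, using only the parameters documented above (the import path follows the module name at the top of this page):

```python
from camel.benchmarks.browsecomp import BrowseCompBenchmark

# Evaluate 20 sampled questions once each, with 4 worker processes.
benchmark = BrowseCompBenchmark(
    save_to="browsecomp_results.html",
    processes=4,
    num_examples=20,
    n_repeats=1,
)
```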
download
Download the benchmark dataset.
Returns:
self: The benchmark instance.
load
Load the benchmark dataset.
Returns:
self: The benchmark instance.
train
run
Run the benchmark by processing each example in parallel.
This method applies the provided pipeline to each example in the dataset using a process pool for parallel execution. It shows progress using tqdm and stores the results in self._raw_results.
Parameters:
- pipeline_template (Union[ChatAgent, RolePlaying, Workforce]): The template agent or framework to use for processing examples. Can be a ChatAgent, RolePlaying, or Workforce instance that will be cloned for each example.
- chat_turn_limit (int): Maximum number of conversation turns allowed when using a RolePlaying pipeline. (default: :obj:`10`)
- roleplaying_summarizer (Optional[ChatAgent]): Optional ChatAgent to summarize RolePlaying conversations. If None and RolePlaying is used, a default summarizer will be created. (default: :obj:`None`)
- task_json_formatter (Optional[ChatAgent]): Optional ChatAgent to format task JSON. If None and Workforce is used, a default formatter will be created. (default: :obj:`None`)
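A hedged usage sketch with a ChatAgent pipeline; the system prompt is illustrative, and the agent is cloned per example as described above:

```python
from camel.agents import ChatAgent

pipeline = ChatAgent("You are a careful research assistant.")
benchmark.run(pipeline_template=pipeline, chat_turn_limit=10)
```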
make_report
Create a standalone HTML report from an EvalResult.
validate
Validate the raw results using the GRADER_TEMPLATE and ChatAgent.
This method evaluates the correctness of each response using multiple threads. A dedicated chat agent is created in each thread; it compares the raw result with the expected answer, and the grading results are aggregated into a report.
Parameters:
- grader (Optional[ChatAgent]): The ChatAgent used for validation. If None, a default agent is created in each thread. If provided, it serves as a template that is cloned into a new agent for each thread. (default: :obj:`None`)
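Grading is then typically triggered with the default grader, for example:

```python
# None creates a default grader agent in each worker thread; pass a
# ChatAgent to use a custom grading template instead.
benchmark.validate(grader=None)
```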