# camel.benchmarks.browsecomp

<a id="camel.benchmarks.browsecomp" />

<a id="camel.benchmarks.browsecomp.QueryResponse" />

## QueryResponse

```python theme={"system"}
class QueryResponse(BaseModel):
```

A structured query response for benchmark evaluation.

This class defines the expected format for model responses to benchmark
questions, including explanation, exact answer, and confidence score.

<a id="camel.benchmarks.browsecomp.GradingResponse" />

## GradingResponse

```python theme={"system"}
class GradingResponse(BaseModel):
```

A structured grading response for evaluating model answers.

This class defines the expected format for grading responses, including
extracted answer, reasoning about correctness, binary correctness judgment,
and confidence score extraction.

<a id="camel.benchmarks.browsecomp.SingleEvalResult" />

## SingleEvalResult

```python theme={"system"}
class SingleEvalResult(BaseModel):
```

Result of evaluating a single benchmark sample.

This class stores the evaluation results for a single benchmark example,
including score, HTML representation, conversation history, and metrics.

<a id="camel.benchmarks.browsecomp.EvalResult" />

## EvalResult

```python theme={"system"}
class EvalResult(BaseModel):
```

Result of running a complete benchmark evaluation.

This class aggregates results from multiple sample evaluations, storing
the overall score, detailed metrics, HTML reports, and conversation logs.

<a id="camel.benchmarks.browsecomp.JinjaEnv" />

## JinjaEnv

```python theme={"system"}
class JinjaEnv:
```

A class that encapsulates the Jinja environment setup.

<a id="camel.benchmarks.browsecomp.JinjaEnv.__init__" />

### \_\_init\_\_

```python theme={"system"}
def __init__(self):
```

Initialize the JinjaEnv instance if not already initialized.

<a id="camel.benchmarks.browsecomp.JinjaEnv.__new__" />

### \_\_new\_\_

```python theme={"system"}
def __new__(cls):
```

Implement the singleton pattern to ensure only one instance exists.
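The singleton behaviour described for `__new__` and `__init__` can be illustrated with a minimal, self-contained sketch. `SingletonEnv`, `_instance`, `_initialized`, and `settings` are hypothetical names chosen for this illustration; the actual `JinjaEnv` internals may differ.

```python
class SingletonEnv:
    """Hypothetical stand-in for a singleton class such as JinjaEnv."""

    _instance = None  # the single shared instance, created lazily

    def __new__(cls):
        # Allocate the object only on the first call; reuse it afterwards.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._initialized = False
        return cls._instance

    def __init__(self):
        # __init__ runs on every SingletonEnv() call, so guard the setup.
        if self._initialized:
            return
        self.settings = {}  # placeholder for expensive one-time setup
        self._initialized = True

    @classmethod
    def get_instance(cls):
        # Convenience accessor; equivalent to calling the class directly.
        return cls()


first = SingletonEnv()
second = SingletonEnv.get_instance()
first.settings["autoescape"] = True
```

Because `__init__` is invoked on every construction, the guard on `_initialized` is what keeps state set on `first` visible through `second` instead of being reset.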

<a id="camel.benchmarks.browsecomp.JinjaEnv.get_instance" />

### get\_instance

```python theme={"system"}
def get_instance(cls):
```

**Returns:**

JinjaEnv: The singleton instance.

<a id="camel.benchmarks.browsecomp.JinjaEnv.env" />

### env

```python theme={"system"}
def env(self):
```

**Returns:**

jinja2.Environment: The Jinja environment instance.

<a id="camel.benchmarks.browsecomp.JinjaEnv.from_string" />

### from\_string

```python theme={"system"}
def from_string(self, template_str):
```

Create a template from the given string.

**Parameters:**

* **template\_str** (str): The template string.

**Returns:**

jinja2.Template: The compiled template.

<a id="camel.benchmarks.browsecomp.JinjaEnv.message_to_html" />

### message\_to\_html

```python theme={"system"}
def message_to_html(message: Message):
```

Generate an HTML snippet (inside a `<div>`) for a message.

**Parameters:**

* **message** (Message): The message to convert to HTML.

**Returns:**

str: The HTML representation of the message.

<a id="camel.benchmarks.browsecomp.derive_key" />

## derive\_key

```python theme={"system"}
def derive_key(password: str, length: int):
```

Derive a fixed-length key from the password using SHA-256.

<a id="camel.benchmarks.browsecomp.decrypt" />

## decrypt

```python theme={"system"}
def decrypt(ciphertext_b64: str, password: str):
```

Decrypt base64-encoded ciphertext with XOR.
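The two helpers above can be sketched as follows. This is an illustration under stated assumptions, not the module's confirmed implementation: it assumes the key is the SHA-256 digest of the password repeated (and truncated) to the requested length, and that the plaintext is UTF-8. The `encrypt` function is a hypothetical helper added only to demonstrate a round trip.

```python
import base64
import hashlib


def derive_key(password: str, length: int) -> bytes:
    # Hash the password, then repeat the 32-byte digest to cover `length` bytes.
    # (Repetition as the stretching scheme is an assumption of this sketch.)
    digest = hashlib.sha256(password.encode()).digest()
    return digest * (length // len(digest)) + digest[: length % len(digest)]


def decrypt(ciphertext_b64: str, password: str) -> str:
    # Decode the base64 payload and XOR it byte by byte with the derived key.
    encrypted = base64.b64decode(ciphertext_b64)
    key = derive_key(password, len(encrypted))
    return bytes(a ^ b for a, b in zip(encrypted, key)).decode()


def encrypt(plaintext: str, password: str) -> str:
    # Hypothetical helper for the demo only; XOR is its own inverse.
    data = plaintext.encode()
    key = derive_key(password, len(data))
    return base64.b64encode(bytes(a ^ b for a, b in zip(data, key))).decode()


token = encrypt("What is the capital of France?", "example-password")
recovered = decrypt(token, "example-password")
```

XOR with a repeated key only obfuscates the data, which is sufficient for keeping benchmark questions out of plain-text crawls but should not be treated as real encryption.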

<a id="camel.benchmarks.browsecomp._compute_stat" />

## \_compute\_stat

```python theme={"system"}
def _compute_stat(values: list, stat: str):
```

<a id="camel.benchmarks.browsecomp.aggregate_results" />

## aggregate\_results

```python theme={"system"}
def aggregate_results(
    single_eval_results: List[SingleEvalResult],
    default_stats: Tuple[str, str] = ('mean', 'std'),
    name2stats: Optional[Dict[str, Tuple[str]]] = None
):
```

Aggregate results from multiple evaluations into a single EvalResult.

**Parameters:**

* **single\_eval\_results** (List\[SingleEvalResult]): A list of `SingleEvalResult` objects.
* **default\_stats** (Tuple\[str, str]): A tuple of default statistics to compute. (default: :obj:`("mean", "std")`)
* **name2stats** (Optional\[Dict\[str, Tuple\[str]]]): A dictionary mapping metric names to statistics to compute. (default: :obj:`None`)

**Returns:**

EvalResult: An `EvalResult` object containing aggregated results.
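To make the roles of `default_stats` and `name2stats` concrete, here is a minimal sketch that operates on plain dictionaries of metrics instead of `SingleEvalResult` objects. The function names, the supported statistic names, the use of the population standard deviation, and the `name:stat` key format are all assumptions made for this illustration.

```python
from statistics import fmean, pstdev
from typing import Dict, List, Optional, Tuple


def compute_stat(values: List[float], stat: str) -> float:
    # The set of supported statistic names is an assumption of this sketch.
    if stat == "mean":
        return fmean(values)
    if stat == "std":
        return pstdev(values)  # population standard deviation
    if stat == "min":
        return min(values)
    if stat == "max":
        return max(values)
    raise ValueError(f"Unknown stat: {stat!r}")


def aggregate_metrics(
    per_sample: List[Dict[str, float]],
    default_stats: Tuple[str, ...] = ("mean", "std"),
    name2stats: Optional[Dict[str, Tuple[str, ...]]] = None,
) -> Dict[str, float]:
    name2stats = name2stats or {}
    # Group values by metric name across all samples.
    name2values: Dict[str, List[float]] = {}
    for metrics in per_sample:
        for name, value in metrics.items():
            name2values.setdefault(name, []).append(value)
    # Apply per-metric statistics, falling back to the defaults.
    final: Dict[str, float] = {}
    for name, values in name2values.items():
        for stat in name2stats.get(name, default_stats):
            key = name if stat == "mean" else f"{name}:{stat}"
            final[key] = compute_stat(values, stat)
    return final


summary = aggregate_metrics(
    [{"is_correct": 1.0}, {"is_correct": 0.0}],
    name2stats={"is_correct": ("mean", "max")},
)
defaults = aggregate_metrics([{"score": 2.0}, {"score": 4.0}])
```

In this sketch, a metric listed in `name2stats` uses only the statistics given there, while every other metric falls back to `default_stats`.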

<a id="camel.benchmarks.browsecomp.BrowseCompBenchmark" />

## BrowseCompBenchmark

```python theme={"system"}
class BrowseCompBenchmark(BaseBenchmark):
```

BrowseComp Benchmark for evaluating browser-based comprehension tasks.

This benchmark evaluates the ability of language models to comprehend and
answer questions based on browser-based content, measuring accuracy and
performance.

<a id="camel.benchmarks.browsecomp.BrowseCompBenchmark.__init__" />

### \_\_init\_\_

```python theme={"system"}
def __init__(
    self,
    save_to: str,
    processes: int = 1,
    num_examples: Optional[int] = None,
    n_repeats: int = 1
):
```

Initialize the BrowseComp benchmark.

**Parameters:**

* **save\_to** (str): The file to save the results to.
* **processes** (int, optional): The number of processes to use for parallel processing. (default: :obj:`1`)
* **num\_examples** (Optional\[int]): Number of examples to evaluate. If None, all examples are used. Controls the sample size for testing. (default: :obj:`None`)
* **n\_repeats** (int, optional): Number of times to repeat each example. Useful for evaluating consistency across multiple runs. (default: :obj:`1`)

<a id="camel.benchmarks.browsecomp.BrowseCompBenchmark.download" />

### download

```python theme={"system"}
def download(self):
```

**Returns:**

self: The benchmark instance.

<a id="camel.benchmarks.browsecomp.BrowseCompBenchmark.load" />

### load

```python theme={"system"}
def load(self):
```

**Returns:**

self: The benchmark instance.

<a id="camel.benchmarks.browsecomp.BrowseCompBenchmark.train" />

### train

```python theme={"system"}
def train(self):
```

<a id="camel.benchmarks.browsecomp.BrowseCompBenchmark.run" />

### run

```python theme={"system"}
def run(
    self,
    pipeline_template: Union[ChatAgent, RolePlaying, Workforce],
    chat_turn_limit: int = 10,
    roleplaying_summarizer: Optional[ChatAgent] = None,
    task_json_formatter: Optional[ChatAgent] = None
):
```

Run the benchmark by processing each example in parallel.

This method applies the provided pipeline to each example in the
dataset using a process pool for parallel execution. It shows progress
using tqdm and stores the results in `self._raw_results`.

**Parameters:**

* **pipeline\_template** (Union\[ChatAgent, RolePlaying, Workforce]): The template agent or framework to use for processing examples. Can be a ChatAgent, RolePlaying, or Workforce instance that will be cloned for each example.
* **chat\_turn\_limit** (int): Maximum number of conversation turns allowed when using RolePlaying pipeline. (default: :obj:`10`)
* **roleplaying\_summarizer** (Optional\[ChatAgent]): Optional ChatAgent to summarize RolePlaying conversations. If None and RolePlaying is used, a default summarizer will be created. (default: :obj:`None`)
* **task\_json\_formatter** (Optional\[ChatAgent]): Optional ChatAgent to format task JSON. If None and Workforce is used, a default formatter will be created. (default: :obj:`None`)

<a id="camel.benchmarks.browsecomp.BrowseCompBenchmark.make_report" />

### make\_report

```python theme={"system"}
def make_report(self, eval_result: EvalResult):
```

Create a standalone HTML report from an EvalResult.

<a id="camel.benchmarks.browsecomp.BrowseCompBenchmark.validate" />

### validate

```python theme={"system"}
def validate(self, grader: Optional[ChatAgent] = None):
```

Validate the raw results using the `GRADER_TEMPLATE` and a ChatAgent.

This method grades the responses in parallel using multiple threads,
with a dedicated chat agent created in each thread. Each agent compares
a raw result with the expected answer, and the grading results are
aggregated into a report.

**Parameters:**

* **grader** (Optional\[ChatAgent]): The ChatAgent used for validation. If None, a default agent is created in each thread. If provided, the agent is used as a template and cloned into a new agent in each thread. (default: :obj:`None`)
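Putting the documented methods together, a typical workflow might look like the sketch below. It relies only on the signatures listed on this page; the helper name `run_browsecomp`, the report filename, and the argument values are hypothetical, and whether `load()` or `download()` must be called before `run()` is not stated here. The import is deferred so that defining the helper does not itself require `camel` to be installed.

```python
def run_browsecomp(agent, grader=None, save_to="browsecomp_report.html"):
    """Run and grade a small BrowseComp evaluation (illustrative helper).

    `agent` is expected to be a ChatAgent, RolePlaying, or Workforce
    instance; `grader` is an optional ChatAgent used as a grading template.
    """
    # Imported lazily; requires the camel package at call time.
    from camel.benchmarks.browsecomp import BrowseCompBenchmark

    benchmark = BrowseCompBenchmark(
        save_to,          # file the results are saved to
        processes=2,      # size of the process pool used by run()
        num_examples=5,   # evaluate a small sample instead of all examples
        n_repeats=1,      # evaluate each example once
    )
    # Apply a clone of the pipeline to every example in parallel.
    benchmark.run(pipeline_template=agent, chat_turn_limit=10)
    # Grade the raw results; with grader=None a default agent is created.
    benchmark.validate(grader=grader)
    return benchmark
```

Keeping `num_examples` small is a convenient way to check the pipeline end to end before committing to a full run.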
