`) for a message.

**Parameters:**

* **message** (Message): The message to convert to HTML.

**Returns:**

str: The HTML representation of the message.

## derive\_key

```python theme={"system"}
def derive_key(password: str, length: int):
```

Derive a fixed-length key from the password using SHA256.

## decrypt

```python theme={"system"}
def decrypt(ciphertext_b64: str, password: str):
```

Decrypt base64-encoded ciphertext with XOR.

## \_compute\_stat

```python theme={"system"}
def _compute_stat(values: list, stat: str):
```

## aggregate\_results

```python theme={"system"}
def aggregate_results(
    single_eval_results: List[SingleEvalResult],
    default_stats: Tuple[str, str] = ('mean', 'std'),
    name2stats: Optional[Dict[str, Tuple[str]]] = None
):
```

Aggregate results from multiple evaluations into a single EvalResult.

**Parameters:**

* **single\_eval\_results** (List\[SingleEvalResult]): A list of `SingleEvalResult` objects.
* **default\_stats** (Tuple\[str, str]): A tuple of default statistics to compute. (default: :obj:`("mean", "std")`)
* **name2stats** (Optional\[Dict\[str, Tuple\[str]]]): A dictionary mapping metric names to statistics to compute. (default: :obj:`None`)

**Returns:**

EvalResult: An `EvalResult` object containing aggregated results.

## BrowseCompBenchmark

```python theme={"system"}
class BrowseCompBenchmark(BaseBenchmark):
```

BrowseComp Benchmark for evaluating browser-based comprehension tasks.

This benchmark evaluates the ability of language models to comprehend and answer questions based on browser-based content, measuring accuracy and performance.

### \_\_init\_\_

```python theme={"system"}
def __init__(
    self,
    save_to: str,
    processes: int = 1,
    num_examples: Optional[int] = None,
    n_repeats: int = 1
):
```

Initialize the BrowseComp benchmark.

**Parameters:**

* **save\_to** (str): The file to save the results.
* **processes** (int, optional): The number of processes to use for parallel processing.
  (default: :obj:`1`)
* **num\_examples** (Optional\[int]): Number of examples to evaluate. If None, all examples are used. Controls the sample size for testing. (default: :obj:`None`)
* **n\_repeats** (int, optional): Number of times to repeat each example. Useful for evaluating consistency across multiple runs. (default: :obj:`1`)

### download

```python theme={"system"}
def download(self):
```

**Returns:**

self: The benchmark instance

### load

```python theme={"system"}
def load(self):
```

**Returns:**

self: The benchmark instance

### train

```python theme={"system"}
def train(self):
```

### run

```python theme={"system"}
def run(
    self,
    pipeline_template: Union[ChatAgent, RolePlaying, Workforce],
    chat_turn_limit: int = 10,
    roleplaying_summarizer: Optional[ChatAgent] = None,
    task_json_formatter: Optional[ChatAgent] = None
):
```

Run the benchmark by processing each example in parallel.

This method applies the provided pipeline to each example in the dataset using a process pool for parallel execution. It shows progress using tqdm and stores the results in self.\_raw\_results.

**Parameters:**

* **pipeline\_template** (Union\[ChatAgent, RolePlaying, Workforce]): The template agent or framework to use for processing examples. Can be a ChatAgent, RolePlaying, or Workforce instance that will be cloned for each example.
* **chat\_turn\_limit** (int): Maximum number of conversation turns allowed when using RolePlaying pipeline. (default: :obj:`10`)
* **roleplaying\_summarizer** (Optional\[ChatAgent]): Optional ChatAgent to summarize RolePlaying conversations. If None and RolePlaying is used, a default summarizer will be created. (default: :obj:`None`)
* **task\_json\_formatter** (Optional\[ChatAgent]): Optional ChatAgent to format task JSON. If None and Workforce is used, a default formatter will be created.
  (default: :obj:`None`)

### make\_report

```python theme={"system"}
def make_report(self, eval_result: EvalResult):
```

Create a standalone HTML report from an EvalResult.

### validate

```python theme={"system"}
def validate(self, grader: Optional[ChatAgent] = None):
```

Validate the raw results using the GRADER\_TEMPLATE and ChatAgent.

This method evaluates the correctness of each response using multiple threads. A dedicated chat agent is created in each thread and compares the raw result with the expected answer. The grading results are aggregated into a report.

**Parameters:**

* **grader** (Optional\[ChatAgent]): The ChatAgent used for validation. If None, a default agent is created in each thread. If provided, the agent is used as a template and cloned into a new agent in each thread. (default: :obj:`None`)
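
The clone-per-worker pattern described for `validate` can be illustrated with a minimal, self-contained sketch. It does not use CAMEL itself: `StubGrader`, `grade_all`, and the exact-match comparison are hypothetical stand-ins for the real `ChatAgent` and `GRADER_TEMPLATE`, and for simplicity the sketch clones the template once per graded item rather than once per thread.

```python
import copy
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class StubGrader:
    """Hypothetical stand-in for a grading agent; not a CAMEL class."""

    history: List[str] = field(default_factory=list)

    def clone(self) -> "StubGrader":
        # A deep copy gives every worker its own history list,
        # so no state is shared between threads.
        return copy.deepcopy(self)

    def grade(self, response: str, expected: str) -> bool:
        # Assumption: exact match after normalisation stands in for
        # the model-based comparison a real grader would perform.
        self.history.append(response)
        return response.strip().lower() == expected.strip().lower()


def grade_all(
    raw_results: List[Tuple[str, str]],
    template: Optional[StubGrader] = None,
    max_workers: int = 4,
) -> List[bool]:
    """Grade (response, expected) pairs in a thread pool."""

    def grade_one(item: Tuple[str, str]) -> bool:
        # Clone the template if one was given, else build a default grader.
        grader = template.clone() if template is not None else StubGrader()
        response, expected = item
        return grader.grade(response, expected)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order.
        return list(pool.map(grade_one, raw_results))


template = StubGrader()
raw_results = [("Paris", "paris"), ("42", "41"), (" yes ", "yes")]
verdicts = grade_all(raw_results, template=template)
accuracy = sum(verdicts) / len(verdicts)
```

Because each worker grades with its own clone, the template's `history` stays empty after the run, which is the property that makes sharing one template across threads safe in this sketch.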