Documentation Index
Fetch the complete documentation index at: https://docs.camel-ai.org/llms.txt
Use this file to discover all available pages before exploring further.
QueryResponse
class QueryResponse(BaseModel):
A structured query response for benchmark evaluation.
This class defines the expected format for model responses to benchmark
questions, including explanation, exact answer, and confidence score.
GradingResponse
class GradingResponse(BaseModel):
A structured grading response for evaluating model answers.
This class defines the expected format for grading responses, including
extracted answer, reasoning about correctness, binary correctness judgment,
and confidence score extraction.
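Taken together, the two models above map naturally onto Pydantic. A minimal sketch (the field names here are assumptions, not the exact CAMEL definitions):

```python
from pydantic import BaseModel, Field

# Hypothetical field names; see the CAMEL source for the real models.
class QueryResponse(BaseModel):
    explanation: str = Field(description="Step-by-step reasoning behind the answer.")
    exact_answer: str = Field(description="The final, concise answer.")
    confidence: str = Field(description="Confidence score, e.g. '80%'.")

class GradingResponse(BaseModel):
    extracted_final_answer: str = Field(description="Answer extracted from the model response.")
    reasoning: str = Field(description="Reasoning about whether the answer is correct.")
    correct: str = Field(description="Binary correctness judgment: 'yes' or 'no'.")
    confidence: str = Field(description="Confidence score extracted from the response.")
```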
SingleEvalResult
class SingleEvalResult(BaseModel):
Result of evaluating a single benchmark sample.
This class stores the evaluation results for a single benchmark example,
including score, HTML representation, conversation history, and metrics.
EvalResult
class EvalResult(BaseModel):
Result of running a complete benchmark evaluation.
This class aggregates results from multiple sample evaluations, storing
the overall score, detailed metrics, HTML reports, and conversation logs.
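Likewise, the two result containers can be pictured as simple Pydantic models (illustrative field names only):

```python
from typing import Dict, List, Optional
from pydantic import BaseModel

class SingleEvalResult(BaseModel):
    score: Optional[float] = None       # score for this one sample
    html: Optional[str] = None          # HTML rendering of the exchange
    convo: List[dict] = []              # conversation history
    metrics: Dict[str, float] = {}      # per-sample metrics

class EvalResult(BaseModel):
    score: Optional[float] = None       # overall benchmark score
    metrics: Dict[str, float] = {}      # aggregated metrics
    htmls: List[str] = []               # per-sample HTML reports
    convos: List[List[dict]] = []       # all conversation logs
```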
JinjaEnv
A class that encapsulates the Jinja environment setup.
init
Initialize the JinjaEnv instance if not already initialized.
new
Implement the singleton pattern to ensure only one instance exists.
get_instance
Returns:
JinjaEnv: The singleton instance.
env
Returns:
jinja2.Environment: The Jinja environment instance.
from_string
def from_string(self, template_str):
Create a template from the given string.
Parameters:
- template_str (str): The template string.
Returns:
jinja2.Template: The compiled template.
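The singleton behaviour described above can be sketched as follows; this is a minimal illustration of the __new__ pattern, not the exact CAMEL implementation:

```python
import jinja2

class JinjaEnv:
    _instance = None

    def __new__(cls, *args, **kwargs):
        # Singleton: create the instance once, then always return it.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __init__(self):
        # Guard so repeated construction does not rebuild the environment.
        if not getattr(self, "_initialized", False):
            self._env = jinja2.Environment()
            self._initialized = True

    @classmethod
    def get_instance(cls):
        return cls()

    @property
    def env(self):
        return self._env

    def from_string(self, template_str):
        return self._env.from_string(template_str)

# Every access path yields the same instance:
assert JinjaEnv.get_instance() is JinjaEnv()
```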
message_to_html
def message_to_html(message: Message):
Generate an HTML snippet (inside a <div>) for a message.
Parameters:
- message (Message): The message to convert to HTML.
Returns:
str: The HTML representation of the message.
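A plausible implementation, assuming Message exposes role and content attributes (both names are illustrative):

```python
import html

def message_to_html(message) -> str:
    # Escape user/model text so it cannot inject markup into the report.
    return (
        f"<div class='message {html.escape(message.role)}'>"
        f"{html.escape(message.content)}"
        f"</div>"
    )
```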
derive_key
def derive_key(password: str, length: int):
Derive a fixed-length key from the password using SHA256.
decrypt
def decrypt(ciphertext_b64: str, password: str):
Decrypt base64-encoded ciphertext with XOR.
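A sketch of how SHA256 key derivation and XOR decryption typically fit together, consistent with the two signatures above (the actual code may differ in detail):

```python
import base64
import hashlib

def derive_key(password: str, length: int) -> bytes:
    # Stretch the 32-byte SHA256 digest by repetition to cover `length` bytes.
    digest = hashlib.sha256(password.encode()).digest()
    return (digest * (length // len(digest) + 1))[:length]

def decrypt(ciphertext_b64: str, password: str) -> str:
    data = base64.b64decode(ciphertext_b64)
    key = derive_key(password, len(data))
    # XOR is its own inverse, so decryption repeats the encryption step.
    return bytes(c ^ k for c, k in zip(data, key)).decode()
```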
_compute_stat
def _compute_stat(values: list, stat: str):
Compute the named statistic (e.g., "mean" or "std") over a list of values.
aggregate_results
def aggregate_results(
single_eval_results: List[SingleEvalResult],
default_stats: Tuple[str, str] = ('mean', 'std'),
name2stats: Optional[Dict[str, Tuple[str]]] = None
):
Aggregate results from multiple evaluations into a single EvalResult.
Parameters:
- single_eval_results (List[SingleEvalResult]): A list of
SingleEvalResult objects.
- default_stats (Tuple[str, str]): A tuple of default statistics to compute. (default: ("mean", "std"))
- name2stats (Optional[Dict[str, Tuple[str]]]): A dictionary mapping metric names to the statistics to compute. (default: None)
Returns:
EvalResult: An EvalResult object containing aggregated results.
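To make the aggregation concrete, here is a rough sketch of how per-sample metrics could be reduced to summary statistics (simplified relative to the real implementation; it assumes each SingleEvalResult carries a metrics dict):

```python
import statistics
from collections import defaultdict

def aggregate_metrics(single_eval_results, default_stats=("mean", "std")):
    # Collect each metric's values across all samples.
    values = defaultdict(list)
    for result in single_eval_results:
        for name, value in result.metrics.items():
            values[name].append(value)
    stat_fns = {"mean": statistics.mean, "std": statistics.pstdev}
    # Emit one entry per metric/statistic pair, e.g. "score:std".
    return {
        f"{name}:{stat}": stat_fns[stat](vals)
        for name, vals in values.items()
        for stat in default_stats
    }
```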
BrowseCompBenchmark
class BrowseCompBenchmark(BaseBenchmark):
BrowseComp Benchmark for evaluating browser-based comprehension tasks.
This benchmark evaluates the ability of language models to comprehend and
answer questions based on browser-based content, measuring accuracy and
performance.
init
def __init__(
self,
save_to: str,
processes: int = 1,
num_examples: Optional[int] = None,
n_repeats: int = 1
):
Initialize the BrowseComp benchmark.
Parameters:
- save_to (str): The file to save the results.
- processes (int, optional): The number of processes to use for parallel processing. (default: 1)
- num_examples (Optional[int]): Number of examples to evaluate. If None, all examples are used. Controls the sample size for testing. (default: None)
- n_repeats (int, optional): Number of times to repeat each example. Useful for evaluating consistency across multiple runs. (default: 1)
download
Returns:
self: The benchmark instance.
load
Returns:
self: The benchmark instance.
train
run
def run(
self,
pipeline_template: Union[ChatAgent, RolePlaying, Workforce],
chat_turn_limit: int = 10,
roleplaying_summarizer: Optional[ChatAgent] = None,
task_json_formatter: Optional[ChatAgent] = None
):
Run the benchmark by processing each example in parallel.
This method applies the provided pipeline to each example in the
dataset using a process pool for parallel execution. It shows progress
using tqdm and stores the results in self._raw_results.
Parameters:
- pipeline_template (Union[ChatAgent, RolePlaying, Workforce]): The template agent or framework to use for processing examples. Can be a ChatAgent, RolePlaying, or Workforce instance that will be cloned for each example.
- chat_turn_limit (int): Maximum number of conversation turns allowed when using a RolePlaying pipeline. (default: 10)
- roleplaying_summarizer (Optional[ChatAgent]): Optional ChatAgent to summarize RolePlaying conversations. If None and RolePlaying is used, a default summarizer will be created. (default: None)
- task_json_formatter (Optional[ChatAgent]): Optional ChatAgent to format task JSON. If None and Workforce is used, a default formatter will be created. (default: None)
make_report
def make_report(self, eval_result: EvalResult):
Create a standalone HTML report from an EvalResult.
validate
def validate(self, grader: Optional[ChatAgent] = None):
Validate the raw results using the GRADER_TEMPLATE and ChatAgent.
This method evaluates the correctness of each response using multiple threads, with a dedicated chat agent created in each thread. Each agent compares a raw result against the expected answer, and the grading results are aggregated into a report.
Parameters:
- grader (Optional[ChatAgent]): The ChatAgent used for validation. If None, a default agent is created in each thread. If provided, the agent is used as a template and cloned into a new agent for each thread. (default: None)
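Putting the pieces together, a typical end-to-end invocation might look like this (the import path and agent construction are assumptions; check the documentation index linked at the top for the authoritative setup):

```python
from camel.agents import ChatAgent
from camel.benchmarks import BrowseCompBenchmark  # import path assumed

# A plain ChatAgent serves as the pipeline template; it is cloned per example.
agent = ChatAgent("You are a helpful research assistant.")

benchmark = BrowseCompBenchmark(
    save_to="browsecomp_results.html",  # output path is illustrative
    processes=4,                        # parallel worker processes
    num_examples=20,                    # small sample for a quick test run
)
benchmark.run(pipeline_template=agent)
benchmark.validate()  # grades raw results; a default grader is created per thread
```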