Camel.benchmarks.apibank
process_messages
Processes chat history into a structured format for further use.
Parameters:
- chat_history (List[Dict[str, Any]): A list of dictionaries representing the chat history.
- prompt (str): A prompt to be set as the system message.
Returns:
List[Dict[str, str]]: A list of dictionaries representing the processed messages, where each dictionary has:
- ‘role’: The role of the message (‘system’, ‘user’, or ‘assistant’).
- ‘content’: The content of the message, including formatted API responses when applicable.
APIBankBenchmark
API-Bank Benchmark adapted from API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
<https://github.com/AlibabaResearch/DAMO-ConvAI/tree/main/api-bank>
.
Parameters:
- save_to (str): The file to save the results.
- processes (int, optional): The number of processes to use. (default: :obj:
1
)
init
Initialize the APIBank benchmark.
Parameters:
- save_to (str): The file to save the results.
- processes (int, optional): The number of processes to use for parallel processing. (default: :obj:
1
)
download
Download APIBank dataset and code from Github.
load
Load the APIBank Benchmark dataset.
Parameters:
- level (str): Level to run benchmark on.
- force_download (bool, optional): Whether to force download the data.
run
Run the benchmark.
Parameters:
- agent (ChatAgent): The agent to run the benchmark.
- level (
Literal['level-1', 'level-2']
): The level to run the benchmark on. - randomize (bool, optional): Whether to randomize the data.
- api_test_enabled (bool): Whether to test API calling (
True
) or response (False
) (default: :obj:False
) - subset (Optional[int], optional): The subset of data to run. (default: :obj:
None
)
Returns:
Dict[str, Any]: The results of the benchmark.
agent_call
Add messages to agent memory and get response.
calculate_rouge_l_score
Calculate rouge l score between hypothesis and reference.
get_api_call
Parse api call from model output.
APIBankSample
APIBank sample used to load the datasets.
init
repr
from_chat_history
Evaluator
Evaluator for APIBank benchmark.