StableToolBench

Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models

Zhicheng Guo1, Sijie Cheng1,2, Hao Wang3, Shihao Liang4, Yujia Qin1
Peng Li1, Zhiyuan Liu1, Maosong Sun1, Yang Liu1
1Tsinghua University 2 01.AI 3Google 4The University of Hong Kong
{guo-zc21,csj23}@mails.tsinghua.edu.cn

Abstract

Large Language Models (LLMs) have witnessed remarkable advancements in recent years, prompting the exploration of tool learning, which integrates LLMs with external tools to address diverse real-world challenges. Assessing the capability of LLMs to utilise tools necessitates large-scale and stable benchmarks. However, previous works relied on either hand-crafted online tools with limited scale, or large-scale real online APIs suffering from instability of API status. To address this problem, we introduce StableToolBench, a benchmark evolving from ToolBench that proposes a virtual API server and a stable evaluation system. The virtual API server contains a caching system and API simulators which are complementary to alleviate changes in API status. Meanwhile, the stable evaluation system designs solvable pass and win rates using GPT-4 as the automatic evaluator to eliminate randomness during evaluation. Experimental results demonstrate the stability of StableToolBench, and we further discuss the effectiveness of the API simulators, the caching system, and the evaluation system.

Instability in ToolBench

Benchmarks are designed to evaluate the performance of various models consistently over time. To test the consistency of ToolBench, we reproduce the model performances and record any variations. As depicted in Figure 1, a notable decline in the performance of all methods over time is observed, which raises concerns about the stability of ToolBench as a benchmark. See our paper for a detailed analysis of how API status and the evaluation system in ToolBench affect the stability of the benchmark.

Figure 1: Comparison of performance (Pass Rate) reported in the paper and reproduced by us on the I1-Instruction group of ToolBench.

StableToolBench

The Virtual API Server

To stabilise the API server, we propose a virtual API server. It comprises two primary components: a caching system and an API simulator. The caching system stores responses from all API calls, ensuring consistency and reducing latency. It is populated with data from both the training and test phases and continuously updated to maintain scalability and quality. The API simulator, powered by a large language model (gpt-4-turbo), simulates responses for API calls that are not in the cache or whose APIs are unavailable. It uses documentation and real API call examples as few-shot prompts so that the simulated responses closely mimic real API behavior. Together, these components work under specific calling rules: the cache is checked first, then a real API call is attempted, and only if necessary does the system resort to a simulated response. This integrated approach aims to balance stability and reality in API behaviors, significantly enhancing the benchmark's reliability and effectiveness.

Figure 2: The process of calling APIs in our proposed virtual API server.
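
For concreteness, the calling rules can be sketched in a few lines of Python. The helpers real_call and simulate below are hypothetical stand-ins for the real HTTP request and the LLM-based simulator; this is an illustration of the fallback order, not the released implementation.

import json
from typing import Callable, Optional

def call_api(
    request: dict,
    cache: dict,
    real_call: Callable[[dict], Optional[dict]],
    simulate: Callable[[dict], dict],
) -> dict:
    # The full request (tool, API name, arguments) identifies a cache entry.
    key = json.dumps(request, sort_keys=True)

    # 1. Cache hit: return the stored response for consistency and low latency.
    if key in cache:
        return cache[key]

    # 2. Cache miss: try the real API and keep its response if the call succeeds.
    try:
        response = real_call(request)
    except Exception:
        response = None  # real API down, changed, or rate-limited

    # 3. Otherwise fall back to the LLM simulator, which is prompted with the
    #    API documentation and cached real responses as few-shot examples.
    if response is None:
        response = simulate(request)

    cache[key] = response  # newly observed responses keep the cache growing
    return response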

The Stable Evaluation System

Solvable Task Filtration. Since the solvability of tasks in the original ToolBench induces significant instability, we filter out unsolvable tasks in advance. This process is executed using GPT-4, Gemini Pro, and Claude 2. Each task from the dataset is evaluated by these models, and its solvability is determined through majority voting. A task is classified as solvable if it provides all the necessary and valid information required for completion and can be resolved with the available tools. Human evaluation shows that these models can effectively filter out unsolvable tasks, ensuring the stability of the benchmark.

Split | I1 Instruction | I1 Category | I1 Tool | I2 Instruction | I2 Category | I3 Instruction | Total
Full | 200 | 200 | 200 | 200 | 200 | 100 | 1100
Solvable | 163 | 153 | 158 | 106 | 124 | 61 | 765

Table 1: Summary of task statistics before and after filtration.
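
The filtration step can be illustrated with a short sketch. The ask_judge helper below is a hypothetical placeholder for prompting one judge model with the solvability criteria and parsing a boolean verdict; it is not the released script.

from typing import Callable, Iterable, List

JUDGES = ["gpt-4", "gemini-pro", "claude-2"]

def is_solvable(task: str, ask_judge: Callable[[str, str], bool]) -> bool:
    # Each judge checks that the task provides all necessary, valid
    # information and can be resolved with the available tools.
    votes = sum(ask_judge(model, task) for model in JUDGES)
    return votes >= 2  # majority vote among the three judges

def filter_solvable(tasks: Iterable[str], ask_judge: Callable[[str, str], bool]) -> List[str]:
    return [task for task in tasks if is_solvable(task, ask_judge)]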

Metrics (SoPR and SoWR). Due to the limitations of gpt-3.5-turbo-16k as an evaluator, we uniformly adopt gpt-4-turbo-preview as the automatic evaluator. SoPR is in essence the Pass Rate computed with all tasks solvable, and it assesses answers using the same prompt as ToolBench. The evaluator labels each answer as Solved, Unsure, or Unsolved, contributing scores of 1, 0.5, and 0 respectively to the overall SoPR calculation. For SoWR, when one answer is solved and the other is unsolved, the solved one wins; under other circumstances, gpt-4-turbo-preview makes the win-lose decision.
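
The aggregation of evaluator labels into the two metrics reads roughly as follows. This sketch assumes answers have already been labelled by gpt-4-turbo-preview; the exact evaluation prompts are in the paper and repository.

from typing import List

# Label-to-score mapping used for SoPR.
SOPR_SCORE = {"Solved": 1.0, "Unsure": 0.5, "Unsolved": 0.0}

def solvable_pass_rate(labels: List[str]) -> float:
    # Average per-task score over the solvable tasks only.
    return sum(SOPR_SCORE[label] for label in labels) / len(labels)

def beats_reference(label: str, ref_label: str, judge_prefers_candidate: bool) -> bool:
    # A solved answer beats an unsolved one outright; otherwise defer to the
    # evaluator's explicit win-lose judgement.
    if label == "Solved" and ref_label == "Unsolved":
        return True
    if label == "Unsolved" and ref_label == "Solved":
        return False
    return judge_prefers_candidate

def solvable_win_rate(wins: List[bool]) -> float:
    return sum(wins) / len(wins)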

Stability of Our System

We randomly select some tools and manually make them unavailable during the run (see our paper for detailed configurations). Compared to the real API system (Figure 3), the results obtained on our system (Figure 4) are much more stable. Even when 50% of APIs are unavailable, the changes in performance are not significant and fall within the range of variance.
Figure 3: SoPR change when manually making APIs down with real online API system on I1 Instruction.
Figure 4: SoPR change when manually making APIs down with our virtual online API system.
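
As a rough illustration of this protocol, the sketch below disables a random fraction of tools and re-runs the evaluation. The run_benchmark helper is a hypothetical stand-in for evaluating one method with only the listed tools reachable and returning its SoPR; the actual configurations are described in the paper.

import random
from statistics import mean, stdev
from typing import Callable, Sequence, Set, Tuple

def sopr_under_downtime(
    tools: Sequence[str],
    down_ratio: float,
    run_benchmark: Callable[[Set[str]], float],
    runs: int = 3,
    seed: int = 0,
) -> Tuple[float, float]:
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        # Mark a random fraction of tools as unavailable for this run.
        down = set(rng.sample(list(tools), int(len(tools) * down_ratio)))
        available = set(tools) - down
        scores.append(run_benchmark(available))
    # Stability check: the mean SoPR should change little and stay within
    # the run-to-run variance as down_ratio grows.
    return mean(scores), stdev(scores)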

Leaderboard

Method | I1 Instruction | I1 Category | I1 Tool | I2 Category | I2 Instruction | I3 Instruction | Average
GPT-3.5-Turbo-0613 (CoT) | 55.9±1.0 | 50.8±0.8 | 55.9±1.0 | 44.1±0.8 | 36.2±0.4 | 51.4±1.5 | 49.1±1.0
GPT-3.5-Turbo-0613 (DFS) | 66.4±1.5 | 64.3±1.0 | 67.2±2.4 | 67.7±0.8 | 61.5±1.0 | 81.4±1.5 | 68.1±1.4
GPT-4-0613 (CoT) | 50.7±0.4 | 57.1±0.3 | 51.9±0.3 | 55.0±1.1 | 61.6±0.8 | 56.3±0.8 | 55.4±0.6
GPT-4-0613 (DFS) | 65.5±1.1 | 62.0±1.7 | 72.1±1.6 | 70.8±1.3 | 73.1±1.4 | 74.9±1.5 | 69.7±1.4
ToolLLaMA v2 (CoT) | 37.2±0.1 | 42.3±0.4 | 43.0±0.5 | 37.4±0.4 | 33.6±1.2 | 39.6±1.0 | 38.9±0.6
ToolLLaMA v2 (DFS) | 59.8±1.5 | 59.5±1.4 | 65.7±1.1 | 56.5±0.3 | 47.6±0.4 | 62.8±1.9 | 58.7±1.1
GPT-3.5-Turbo-1106 (CoT) | 51.3±0.6 | 48.8±0.3 | 59.9±0.8 | 50.8±0.7 | 43.2±0.8 | 58.5±0.8 | 52.1±0.7
GPT-3.5-Turbo-1106 (DFS) | 67.8±0.9 | 67.2±0.3 | 72.9±0.7 | 63.2±1.0 | 70.9±0.4 | 77.6±0.8 | 69.9±0.7
GPT-4-Turbo-Preview (CoT) | 63.1±1.0 | 64.5±0.5 | 55.3±0.3 | 63.0±0.8 | 57.3±0.8 | 61.7±0.8 | 60.8±0.7
GPT-4-Turbo-Preview (DFS) | 70.8±1.0 | 71.1±0.7 | 70.4±1.2 | 70.4±1.3 | 71.7±0.4 | 84.7±1.7 | 73.2±1.1

Table 2: Solvable Pass Rate (SoPR) scores. In this experiment, we run all models once, evaluate them three times, and report the average results.

Method | I1 Instruction | I1 Category | I1 Tool | I2 Category | I2 Instruction | I3 Instruction | Average
GPT-3.5-Turbo-0613 (DFS) | 57.7 | 60.8 | 61.4 | 66.1 | 63.2 | 70.5 | 63.3
GPT-4-0613 (CoT) | 50.3 | 54.2 | 50.6 | 50.0 | 64.2 | 55.7 | 54.2
GPT-4-0613 (DFS) | 57.1 | 60.1 | 57.0 | 64.5 | 74.5 | 72.1 | 64.2
ToolLLaMA v2 (CoT) | 35.0 | 30.7 | 37.3 | 31.5 | 36.8 | 23.0 | 32.4
ToolLLaMA v2 (DFS) | 43.6 | 45.1 | 38.6 | 42.7 | 53.8 | 45.9 | 44.9
GPT-3.5-Turbo-1106 (CoT) | 46.6 | 45.1 | 48.1 | 44.4 | 37.7 | 52.5 | 45.7
GPT-3.5-Turbo-1106 (DFS) | 56.4 | 54.2 | 51.9 | 54.0 | 62.3 | 72.1 | 58.5
GPT-4-Turbo-Preview (CoT) | 68.7 | 71.9 | 58.2 | 71.0 | 76.4 | 73.8 | 70.0
GPT-4-Turbo-Preview (DFS) | 66.9 | 73.9 | 68.4 | 72.6 | 78.3 | 77.0 | 72.9

Table 3: Solvable Win Rate (SoWR) scores. We run all models once against GPT-3.5-Turbo-0613 (CoT) and evaluate three times. We follow the ToolBench implementation and take the most frequent result for each query during evaluation.

BibTeX

If you like our project, please consider citing our work as follows.
@misc{guo2024stabletoolbench,
  title={StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models}, 
  author={Zhicheng Guo and Sijie Cheng and Hao Wang and Shihao Liang and Yujia Qin and Peng Li and Zhiyuan Liu and Maosong Sun and Yang Liu},
  year={2024},
  eprint={2403.07714},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}