SimulBench

Evaluating LLMs with Diverse Simulation Tasks

Qi Jia (4), Xiang Yue (2,4), Tianyu Zheng (4), Jie Huang (3), Bill Yuchen Lin (1)

(1) Allen Institute for AI, (2) Ohio State University, (3) University of Illinois at Urbana-Champaign, (4) Multimodal Art Projection Research Community

Examples of LLMs' performance on simulation tasks from SimulBench.

Introduction

We introduce SimulBench, a benchmark designed to evaluate large language models (LLMs) across a diverse collection of simulation scenarios, such as acting as a Linux terminal or playing text games with users. While these simulation tasks serve as effective measures of an LLM's general intelligence, they are seldom incorporated into existing benchmarks. A major challenge is to develop a shared evaluation environment for testing different LLMs in multi-turn interactions between users and AI. To tackle this issue, we suggest using a fixed LLM as a user agent to engage with multiple target LLMs under evaluation. To facilitate automatic assessment on SimulBench, GPT-4 is employed as the evaluator, tasked with reviewing the quality of the multi-turn dialogues between the user agent and the target LLMs. Our comprehensive experiments indicate that these simulation tasks continue to pose a significant challenge for even the most advanced open LLMs. For example, GPT-4-turbo outperforms Llama-2-70b-chat by a margin of 37.95%.

SimulBench

Motivation

The ability of large language models (LLMs) to simulate complex tasks is pivotal in driving the evolution of AI towards achieving general intelligence. These models exhibit remarkable versatility by adeptly assuming a wide range of roles—from acting as a Linux terminal to serving as an investment manager—highlighting their adaptability across various domains. Such flexibility underscores their potential for broad implementation. Consequently, the development of a benchmark dataset for simulation tasks is imperative in nurturing LLMs' progression towards becoming true generalists.

Nonetheless, existing benchmarks do not fully evaluate this potential. Current evaluations mainly focus on single-turn, static interactions between users and LLMs. While MT-bench attempts to consider multi-turn interactions with 80 examples, its reliance on predefined second queries fails to effectively examine the dynamic responses of different LLMs when engaging with users in complex, long-horizon simulation tasks.

In addition, these benchmarks primarily concentrate on tasks related to general information retrieval and creative writing, with less emphasis on complex simulation abilities. The existing role-playing benchmarks tend to focus on replicating the language styles of famous characters, an aspect that barely begins to explore the full range of simulation capabilities. This highlights the need for more comprehensive benchmarks that can thoroughly assess the wide-ranging simulation potential of LLMs.

Tasks for SimulBench

We have gathered 168 distinct simulation tasks that require ChatGPT to act in a variety of roles. These roles include acting as a Linux terminal, an SQL executor, a player for text-based games such as tic-tac-toe, an interviewer for specific positions such as a software developer, a personal investment manager, a generator of passwords with particular constraints, an ASCII art creator, a predictor of chemical reactions, and more. Each task specification comes with a role description, output requirements, and an initial user request.
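
To make the task format concrete, below is a hypothetical sketch of how a single task specification could be represented in Python; the field names and the example wording are our own illustration and not necessarily the dataset's actual schema.

# Hypothetical representation of one SimulBench task specification.
# The three fields mirror the components described above; names are illustrative.
password_task = {
    "role_description": (
        "I want you to act as a password generator. You will create passwords "
        "that satisfy the constraints the user specifies."
    ),
    "output_requirements": (
        "Reply with the password only, with no explanation or extra text."
    ),
    "initial_request": (
        "Generate a 16-character password containing at least one uppercase "
        "letter, one digit, and one special character."
    ),
}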

Dynamic Multi-Turn Evaluation

MT-bench tests LLMs in a two-turn conversation where the second turn is predefined. In contrast, SimulBench requires multiple turns between users and LLMs. Depending on the task type and the context-window limit, some tasks involve conversations exceeding 5 turns, with the majority spanning more than 2 turns. To replicate realistic usage scenarios of LLMs, we employ OpenAI's GPT-3.5 to simulate a user interacting continuously with the LLM under evaluation. After gathering the dialogues between the user agent and each target LLM, we follow the methodology of previous studies and use GPT-4 to assess and rate the quality of these conversations.
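
To illustrate the evaluation loop, here is a minimal sketch of how the user agent, the target LLM, and the judge could be wired together with the OpenAI Python client; the prompts, turn limits, and model names below are illustrative assumptions rather than the exact SimulBench implementation.

# Minimal sketch (not the exact SimulBench pipeline): a fixed user-agent model
# drives a multi-turn dialogue with the target model, and a judge model rates it.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chat(model, messages):
    """Single chat-completion call; reused for the user agent, target, and judge."""
    resp = client.chat.completions.create(model=model, messages=messages)
    return resp.choices[0].message.content

def run_simulation(task, max_turns=3,
                   user_agent="gpt-3.5-turbo", target="gpt-4-turbo"):
    # Dialogue as seen by the target LLM: task role description plus first request.
    dialogue = [{"role": "system", "content": task["role_description"]},
                {"role": "user", "content": task["initial_request"]}]
    for turn in range(max_turns):
        dialogue.append({"role": "assistant", "content": chat(target, dialogue)})
        if turn == max_turns - 1:
            break  # stop after the target's final reply
        # Ask the user agent to write a plausible next user message.
        transcript = "\n".join(f"{m['role']}: {m['content']}" for m in dialogue[1:])
        follow_up = chat(user_agent, [
            {"role": "system",
             "content": "You are a user interacting with an AI simulator. "
                        "Write only the next user message for this dialogue."},
            {"role": "user", "content": transcript}])
        dialogue.append({"role": "user", "content": follow_up})
    return dialogue

def judge_dialogue(dialogue, judge_model="gpt-4"):
    # The judge reviews the whole transcript and returns a 1-10 quality rating.
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in dialogue)
    return chat(judge_model, [
        {"role": "system",
         "content": "Rate the assistant's simulation quality on a 1-10 scale."},
        {"role": "user", "content": transcript}])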

Experimental Results and Findings

Our study analyzed four widely used open LLMs, specifically Llama-2-chat (7B/13B/70B) and Mixtral (8x7B). These models are often ranked highly on existing leaderboards such as Chatbot Arena. Our findings reveal a substantial performance gap between these open-source LLMs and GPT-4-turbo. Even the largest open LLM, Llama-2-70b-chat, was surpassed by GPT-4-turbo by a margin of 37.95% on the hard subset of SimulBench.

We also observed that LLMs, including GPT-4-turbo, tend to perform worse on simulation tasks that require objective outputs. For instance, on simulations of simple but useful objective tools, such as generating a random password under character constraints, GPT-4-turbo scored only 8.20, a conspicuous drop from the 9.23 it achieved on subjective tasks. In addition, we noticed a significant decrease in LLMs' performance when simulation tasks demand long-horizon memory, complex reasoning, and compositional instructions, especially for open LLMs. Mixtral achieved 9.07 on general role-playing tasks but only 7.43 on simulation tasks that require a self-contained system (such as a Linux terminal).

Main Results

This section presents the results of benchmarking the LLMs on SimulBench. We break the simulation tasks into different categories for more in-depth analysis.

Based on the nature of the output information, we have bifurcated the tasks into two distinct categories:

  • Objective: This category includes 90 tasks that involve information which can be fact-checked or verified according to set rules and facts;
  • Subjective: This category, on the other hand, comprises 78 tasks that primarily focus on feelings, opinions, or emotions.

We categorize the simulation tasks into three distinct types based on their targets:

  • System: In 21 instances, the task requires simulating a real or game-based system, platform, or environment, among others;
  • Tool: In 38 instances, the task is to simulate practical tools;
  • Role: In 109 instances, the task involves impersonating a real person, a specific profession, or a fictional character.

We further curated a challenging (hard) subset of SimulBench, consisting of tasks that receive lower scores and show a notable gap between the proprietary model and the open-source models.
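
As a rough sketch of how such a hard subset could be selected from per-task scores, the snippet below keeps tasks where the open models score low and the proprietary model holds a clear lead; the thresholds and model names are illustrative assumptions, not the criteria actually used to build the hard subset.

# Illustrative only: `scores` maps each task id to a {model_name: score} dict.
def select_hard_subset(scores, open_score_max=8.0, min_gap=1.0,
                       proprietary="gpt-4-turbo"):
    hard = []
    for task_id, per_model in scores.items():
        best_open = max(v for k, v in per_model.items() if k != proprietary)
        low_open_scores = best_open < open_score_max
        clear_gap = per_model[proprietary] - best_open >= min_gap
        if low_open_scores and clear_gap:
            hard.append(task_id)
    return hard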

Overall, we find that LLMs struggle with objective simulation tasks and with accurately simulating systems and tools.

Micro-averaged scores of different LLMs on SimulBench. The Objective and Subjective columns group tasks by information property; the System, Tool, and Role columns group them by simulation target.

Model               Objective  Subjective  System  Tool  Role  All   Hard
GPT-4 Turbo         8.89       9.21        8.90    8.55  9.23  9.04  8.47
Mixtral             7.69       9.00        6.81    7.84  8.75  8.30  6.14
Llama-2-70B-chat    7.69       9.00        6.81    7.84  8.75  8.30  6.14
Llama-2-13B-chat    7.37       8.93        5.48    7.24  8.89  8.09  4.93
Llama-2-7B-chat     7.38       8.87        6.29    7.13  8.74  8.07  5.00

BibTeX


      @article{simulbench2024,
        title={SimulBench: Evaluating LLMs with Diverse Simulation Tasks},
        author={Qi Jia and Xiang Yue and Tianyu Zheng and Jie Huang and Bill Yuchen Lin},
        year={2024},
        eprint={},
        archivePrefix={arXiv},
        primaryClass={cs.CL}
      }