WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

PromptBench: Evaluation of Large Language Models

Tags: Large Language Models, Benchmarking

This note is for Zhu, K., Zhao, Q., Chen, H., Wang, J., & Xie, X. (2024). PromptBench: A Unified Library for Evaluation of Large Language Models.

PromptBench: a unified library to evaluate LLMs

it consists of several key components that can be easily used and extended by researchers (a quick-start sketch follows this list):

  • prompt construction
  • prompt engineering
  • dataset and model loading
  • adversarial prompt attack
  • dynamic evaluation protocols
  • analysis tools

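As a rough sketch of how these components are exposed, the library is imported as a single package; the attribute names below (pb.SUPPORTED_MODELS, pb.SUPPORTED_DATASETS) follow my recollection of the repo README and may differ across versions.

```python
import promptbench as pb

# list the models and datasets the library ships with
# (attribute names assumed from the repo README)
print(pb.SUPPORTED_MODELS)
print(pb.SUPPORTED_DATASETS)
```
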
PromptBench is designed as an open, general, and flexible codebase for research purposes.

Introduction

current LLMs are sensitive to prompts, vulnerable to adversarial prompt attacks, and exposed to test-set data contamination, which pose severe security and privacy issues

various prompt learning algorithms have been developed based on different evaluation metrics, such as

  • BDPL (Diao et al., 2022)
  • GrIPS (Prasad et al., 2022)
  • Plum (Pan et al., 2023)

existing libraries, such as

  • LlamaIndex (Liu, 2022)
  • Semantic Kernel
  • LangChain

LlamaIndex and LangChain enhance LLM applications by incorporating databases and various data sources

Semantic Kernel aims to merge AI services with programming languages for versatile AI app development.

Eval-Harness: offers a framework for evaluating generative language models

Zeno: an AI evaluation platform supporting interaction and visualization, but it is not easy to customize

LiteLLM: implements a unified API for calling different LLM service providers

the paper introduces PromptBench, a unified Python library to evaluate LLMs from comprehensive dimensions

  • not only for standard model evaluations but also for advanced scenarios including adversarial prompt attacks and dynamic evaluations
  • allows for the incorporation of new evaluation protocols
  • it covers a wide range of LLMs and evaluation datasets, spanning diverse tasks, evaluation protocols, adversarial prompt attacks, and prompt engineering techniques
  • it also supports several analysis tools for interpreting the results
  • the library is designed in a modular fashion

PromptBench

2.1 Components

Models

currently, it supports a diverse range of LLMs and VLMs (vision-language models), including the

  • Llama2 series
  • Mixtral series
  • LLaVA series

it provides unified LLMModel and VLMModel interfaces that allow easy construction and inference of a model with a specified maximum number of generated tokens and generation temperature (see the sketch below)
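A minimal sketch of constructing a model through these interfaces; the exact keyword names (max_new_tokens, temperature) follow my recollection of the library's README and may differ between versions.

```python
import promptbench as pb

# load an LLM with a cap on generated tokens and a near-greedy temperature
llm = pb.LLMModel(model="google/flan-t5-large",
                  max_new_tokens=10,
                  temperature=0.0001)

# inference is a plain call on the input text
print(llm("Classify the sentence as positive or negative: the movie was great."))

# vision-language models go through the analogous VLMModel interface
# (constructor arguments assumed to mirror LLMModel)
# vlm = pb.VLMModel(model="llava-1.5-7b", max_new_tokens=50)
```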

Datasets and tasks

diverse challenges across 12 tasks and 22 public datasets (a dataset-loading sketch follows the task list below)

the supported tasks include

  • fundamental NLP tasks such as
    • sentiment analysis
    • grammar correctness
    • duplicate sentence detection
  • complex challenges involving
    • natural language inference
    • multi-task knowledge
    • reading comprehension
  • specialized areas
    • translation
    • mathematical problem-solving
    • various forms of reasoning—logical, commonsense, symbolic, and algorithmic
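
Loading a dataset is a one-liner through pb.DatasetLoader; the field names ("content", "label") below follow the sst2 example in the library's documentation and may differ for other tasks.

```python
import promptbench as pb

# load the SST-2 sentiment analysis dataset
dataset = pb.DatasetLoader.load_dataset("sst2")

# each item behaves like a dict; for sst2 the fields are the sentence and its label
sample = dataset[0]
print(sample["content"], sample["label"])
```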

Prompts and prompt engineering

it offers a suite of 4 distinct prompt types, and users have the flexibility to craft custom prompts using the Prompt interface

  • task-oriented prompts are structured to clearly delineate the specific task expected of the model
  • role-oriented prompts position the model in a defined role, such as an expert, advisor, or translator

these prompt categories are adaptable for both zero-shot and few-shot learning contexts
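
A sketch of the two prompt categories via the Prompt interface; the {content} placeholder convention is taken from the library's documented examples, and the prompt texts here are my own illustrations.

```python
import promptbench as pb

prompts = pb.Prompt([
    # task-oriented: state the task explicitly
    "Classify the sentence as positive or negative: {content}",
    # role-oriented: put the model in a defined role before stating the task
    "As a sentiment analysis expert, judge whether the following sentence "
    "is positive or negative: {content}",
])

for prompt in prompts:
    print(prompt)
```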

Adversarial prompt attacks

to investigate the robustness of LLMs to prompts, it integrates 4 types of attacks (a usage sketch follows the list)

  • character-level attacks: manipulate texts by introducing typos or errors to words
  • word-level attacks: replace words with synonyms or contextually similar words to deceive LLMs
  • sentence-level attacks: append irrelevant or extraneous sentences to the end of the prompts, intending to distract LLMs
  • semantic-level attacks: simulate the linguistic behavior of people from different countries
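
A hedged sketch of running an adversarial prompt attack. I am not certain of the attack interface, so the class name pb.Attack, its argument order, the attack-name strings, and the eval_func signature below are assumptions for illustration only.

```python
import promptbench as pb

dataset = pb.DatasetLoader.load_dataset("sst2")
model = pb.LLMModel(model="google/flan-t5-large", max_new_tokens=10, temperature=0.0001)
prompt = "Classify the sentence as positive or negative: {content}"

def eval_func(prompt, dataset, model):
    # assumption: the attack expects a callback that scores a candidate prompt,
    # e.g. classification accuracy of `model` on `dataset` under `prompt`
    return 0.0  # placeholder score for this sketch

# attacks are selected by name, e.g. a character-, word-, sentence-,
# or semantic-level perturbation (names assumed, e.g. "stresstest")
attack = pb.Attack(model, "stresstest", dataset, prompt, eval_func, verbose=True)
print(attack.attack())  # the perturbed prompt and the resulting performance drop
```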

Different evaluation protocols

by default, it supports the standard protocol, i.e., direct inference

it further supports dynamic and semantic evaluation protocols by dynamically generating testing data
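
The dynamic protocols avoid test-set contamination by generating fresh test samples at evaluation time. I am not sure of the library's exact interface for this, so the following is a generic, library-independent illustration of the idea rather than PromptBench's API.

```python
import random

def make_arithmetic_item(rng: random.Random):
    """Generate a fresh question/answer pair that cannot have appeared
    verbatim in any pre-training corpus."""
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    return f"What is {a} + {b}? Answer with a number only.", a + b

rng = random.Random()  # new seed every run -> new test set every run
dynamic_testset = [make_arithmetic_item(rng) for _ in range(100)]

# each (question, answer) pair is then fed to the model under the usual pipeline
for question, answer in dynamic_testset[:3]:
    print(question, "->", answer)
```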

Analysis tools

  • sweep running
  • attention visualization analysis
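
I do not know the exact interfaces of these analysis tools, so here is only a generic sketch of what sweep running amounts to: evaluating the cross-product of models and prompts and collecting the scores (the evaluate helper is hypothetical).

```python
import itertools

models = ["llama2-7b-chat", "mistral-7b-instruct"]  # model names for illustration
prompts = ["Classify: {content}", "As an expert, classify: {content}"]

def evaluate(model_name, prompt):
    # hypothetical helper: run the standard pipeline and return accuracy
    return 0.0

# a sweep is just the cross-product of configurations, collected into a table
results = {(m, p): evaluate(m, p) for m, p in itertools.product(models, prompts)}
for (m, p), acc in results.items():
    print(f"{m:25s} | {p:40s} | {acc:.3f}")
```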

2.2 Evaluation pipeline

  1. specify the task and load the corresponding dataset via pb.DatasetLoader
  2. customize the LLM via pb.LLMModel
  3. define the prompt for the specified dataset via pb.Prompt
  4. define the input and output processing functions, as well as the evaluation function via pb.metrics (an end-to-end sketch follows)
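
Putting the four steps together, a minimal end-to-end sketch adapted from my recollection of the library's documented sst2 example; helper names such as pb.InputProcess, pb.OutputProcess, and the metrics entry point (pb.Eval here) may differ across versions.

```python
import promptbench as pb

# 1. load the dataset for the chosen task
dataset = pb.DatasetLoader.load_dataset("sst2")

# 2. load the model
model = pb.LLMModel(model="google/flan-t5-large", max_new_tokens=10, temperature=0.0001)

# 3. define prompts for the dataset
prompts = pb.Prompt(["Classify the sentence as positive or negative: {content}"])

# map the model's text output to a label id
def proj_func(pred):
    return {"positive": 1, "negative": 0}.get(pred, -1)

# 4. input/output processing plus the evaluation metric
for prompt in prompts:
    preds, labels = [], []
    for data in dataset:
        input_text = pb.InputProcess.basic_format(prompt, data)   # fill {content}
        raw_pred = model(input_text)
        preds.append(pb.OutputProcess.cls(raw_pred, proj_func))   # normalize + project
        labels.append(data["label"])
    acc = pb.Eval.compute_cls_accuracy(preds, labels)
    print(f"{acc:.3f}  {prompt}")
```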

2.3 Supported research topics

(Figure from the paper: overview of the research topics supported by PromptBench.)

Conclusion and Discussion

limitations:

  • may not cover all evaluation scenarios
  • some metrics might miss nuanced performance differences
  • the effectiveness depends on the quality and diversity of datasets and prompts
