PromptBench: Evaluation of Large Language Models
This note covers Zhu, K., Zhao, Q., Chen, H., Wang, J., & Xie, X. (2024). PromptBench: A Unified Library for Evaluation of Large Language Models.
PromptBench: a unified library to evaluate LLMs
it consists of several key components that can be easily used and extended by researchers
- prompt construction
- prompt engineering
- dataset and model loading
- adversarial prompt attack
- dynamic evaluation protocols
- analysis tools
PromptBench is designed as an open, general, and flexible codebase for research purposes.
Introduction
current LLMs are sensitive to prompts, vulnerable to adversarial prompt attacks, and exposed to test-set data contamination, which pose severe security and privacy issues
various prompt learning algorithms have been developed based on different evaluation metrics, such as
- BDPL (Diao et al., 2022)
- GrIPS (Prasad et al., 2022)
- Plum (Pan et al., 2023)
existing libraries, such as
- LlamaIndex (Liu, 2022)
- Semantic Kernel
- LangChain
LlamaIndex and LangChain enhance LLM applications by incorporating databases and various data sources
Semantic Kernel aims to merge AI services with programming languages for versatile AI app development.
Eval-Harness: offers a framework for evaluating generative language models
Zeno: an AI evaluation platform supporting interaction and visualization, but it is not easy to customize
LiteLLM: implements a unified API for calling different LLM service providers
the paper introduces PromptBench, a unified Python library to evaluate LLMs from comprehensive dimensions
- not only for standard model evaluations but also for advanced scenarios including adversarial prompt attacks and dynamic evaluations
- allows for the incorporation of new evaluation protocols
- it supports a wide range of LLMs and evaluation datasets, covering diverse tasks, evaluation protocols, adversarial prompt attacks, and prompt engineering techniques
- it also supports several analysis tools for interpreting the results
- the library is designed in a modular fashion
PromptBench
2.1 Components
Models
currently, it supports a diverse range of LLMs and VLMs (vision-language models), including
- the Llama2 series
- the Mixtral series
- the LLaVA series
it provides unified LLMModel and VLMModel interfaces that allow easy construction and inference of a model with a specified maximum number of generated tokens and a generation temperature
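A minimal sketch of model construction and inference, based on the quick-start usage shown in the PromptBench repository; the model name and the argument names (max_new_tokens, temperature) are taken from that example and may differ across versions:

```python
import promptbench as pb

# Models supported by the installed version of the library.
print(pb.SUPPORTED_MODELS)

# Construct an LLM with a cap on newly generated tokens and a decoding temperature.
model = pb.LLMModel(model="google/flan-t5-large",
                    max_new_tokens=10,
                    temperature=0.0001)

# The model object is callable: pass a prompt string, get the generation back.
print(model("Translate English to German: Good morning."))
```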
Datasets and tasks
diverse challenges across 12 tasks and 22 public datasets (a dataset-loading sketch follows the task list below)
the supported tasks include
- fundamental NLP tasks such as
- sentiment analysis
- grammar correctness
- duplicate sentence detection
- complex challenges involving
- natural language inference
- multi-task knowledge
- reading comprehension
- specialized areas
- translation
- mathematical problem-solving
- various forms of reasoning—logical, commonsense, symbolic, and algorithmic
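A minimal dataset-loading sketch, again following the repository's quick-start; pb.SUPPORTED_DATASETS and the "sst2" dataset name are assumptions drawn from that example:

```python
import promptbench as pb

# Datasets bundled with the installed version of the library.
print(pb.SUPPORTED_DATASETS)

# Load one of them, e.g. SST-2 for sentiment analysis; each sample is a dict
# holding the input text and its label.
dataset = pb.DatasetLoader.load_dataset("sst2")
print(len(dataset), dataset[0])
```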
Prompts and prompt engineering
offers a suite of 4 distinct prompt types, and users have the flexibility to craft custom prompts using the Prompt interface (a prompt-construction sketch follows this list)
- task-oriented prompts are structured to clearly delineate the specific task expected of the model
- role-oriented prompts position the model in a defined role, such as an expert, advisor, or translator
these prompt categories are adaptable for both zero-shot and few-shot learning contexts
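A sketch of constructing task-oriented and role-oriented prompts with pb.Prompt; the {content} placeholder convention follows the repository example, and the role-oriented wording here is only an illustration:

```python
import promptbench as pb

# pb.Prompt wraps one or more prompt templates; at evaluation time the
# {content} placeholder is filled with each dataset sample.
prompts = pb.Prompt([
    # task-oriented: states the task directly
    "Classify the sentence as positive or negative: {content}",
    # role-oriented: places the model in a role before stating the task
    "As a sentiment analysis expert, decide whether the following sentence is positive or negative: {content}",
])

for prompt in prompts:
    print(prompt)
```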
Adversarial prompt attacks
to investigate the robustness of LLMs to prompts, it integrates 4 types of attacks (a character-level perturbation is sketched after this list)
- character-level attacks: manipulate texts by introducing typos or errors to words
- word-level attacks: replace words with synonyms or contextually similar words to deceive LLMs
- sentence-level attacks: append irrelevant or extraneous sentences to the end of the prompts, intending to distract LLMs
- semantic-level attacks: simulate the linguistic behavior of people from different countries
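As a rough illustration of what a character-level attack does to a prompt, here is a minimal, hypothetical typo-injection function. This is not PromptBench's attack API; the library's character-level attacks (e.g., DeepWordBug-style edits) additionally search for the perturbations that hurt performance most.

```python
import random

def char_level_typo(prompt: str, rate: float = 0.1, seed: int = 0) -> str:
    """Introduce simple typos by swapping adjacent characters inside words.

    Illustrative only: real character-level attacks pick which characters to
    perturb so as to maximize the drop in task accuracy.
    """
    rng = random.Random(seed)
    words = prompt.split()
    for i, word in enumerate(words):
        if len(word) > 3 and rng.random() < rate:
            j = rng.randrange(len(word) - 1)
            words[i] = word[:j] + word[j + 1] + word[j] + word[j + 2:]
    return " ".join(words)

print(char_level_typo("Classify the sentence as positive or negative:", rate=0.5))
```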
Different evaluation protocols
by default, it supports the standard protocol, i.e., direct inference
it further supports dynamic and semantic evaluation protocols by dynamically generating testing data, as illustrated below
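To make the dynamic protocol concrete, here is a small, purely illustrative generator of fresh arithmetic test items; PromptBench's dynamic evaluation (DyVal) generates samples from graph structures with controllable complexity, which this sketch does not implement.

```python
import random

def make_arithmetic_item(rng: random.Random, depth: int = 2):
    """Generate a fresh arithmetic question and its ground-truth answer.

    Illustrative only: nests random binary operations to the given depth so
    that test items are created on the fly rather than drawn from a fixed set.
    """
    if depth == 0:
        v = rng.randint(1, 9)
        return str(v), v
    left_s, left_v = make_arithmetic_item(rng, depth - 1)
    right_s, right_v = make_arithmetic_item(rng, depth - 1)
    op = rng.choice(["+", "-", "*"])
    value = {"+": left_v + right_v, "-": left_v - right_v, "*": left_v * right_v}[op]
    return f"({left_s} {op} {right_s})", value

rng = random.Random(0)
question, answer = make_arithmetic_item(rng, depth=2)
print(f"What is {question}?  ->  {answer}")
```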
Analysis tools
- sweep running
- attention visualization analysis
2.2 Evaluation pipeline
- specify the task and load the dataset via pb.DatasetLoader
- customize the LLM using pb.LLMModel
- define the prompt for the specified dataset via pb.Prompt
- define the input and output processing functions, as well as the evaluation function via pb.metrics (an end-to-end sketch follows below)
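An end-to-end sketch of the pipeline, mirroring the quick-start example in the PromptBench repository; the helper names pb.InputProcess, pb.OutputProcess, and pb.Eval (rather than pb.metrics) are taken from that example and may vary by version, and proj_func is a user-defined mapping from generated text to label ids:

```python
import promptbench as pb
from tqdm import tqdm

# 1. Load a dataset (SST-2, sentiment analysis) and a model.
dataset = pb.DatasetLoader.load_dataset("sst2")
model = pb.LLMModel(model="google/flan-t5-large", max_new_tokens=10, temperature=0.0001)

# 2. Define one or more prompts for the task.
prompts = pb.Prompt(["Classify the sentence as positive or negative: {content}"])

# User-defined projection from the model's text output to a label id.
def proj_func(pred):
    mapping = {"negative": 0, "positive": 1}
    return mapping.get(pred.strip().lower(), -1)

# 3. Run inference with input/output processing, then 4. evaluate.
for prompt in prompts:
    preds, labels = [], []
    for data in tqdm(dataset):
        input_text = pb.InputProcess.basic_format(prompt, data)  # fill {content} with the sample
        raw_pred = model(input_text)
        preds.append(pb.OutputProcess.cls(raw_pred, proj_func))  # clean and project the output
        labels.append(data["label"])
    score = pb.Eval.compute_cls_accuracy(preds, labels)
    print(f"{score:.3f}  {prompt}")
```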
2.3 Supported research topics
Conclusion and Discussion
limitations:
- may not cover all evaluation scenarios
- some metrics might miss nuanced performance differences
- the effectiveness depends on the quality and diversity of datasets and prompts