WeiYa's Work Yard

A traveler with endless curiosity, who fell into the ocean of statistics, tries to write down his ideas and notes to save himself.

PromptBench: Evaluation of Large Language Models

Tags: Large Language Models, Benchmarking

This note is for Zhu, K., Zhao, Q., Chen, H., Wang, J., & Xie, X. (2024). PromptBench: A Unified Library for Evaluation of Large Language Models.

PromptBench: a unified library to evaluate LLMs

it consists of several key components that can be easily used and extended by researchers (a quick-start sketch follows this list):

  • prompt construction
  • prompt engineering
  • dataset and model loading
  • adversarial prompt attack
  • dynamic evaluation protocols
  • analysis tools

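As a rough sketch of how these components are exposed, the library is imported as a single package; the attribute names below (pb.SUPPORTED_MODELS, pb.SUPPORTED_DATASETS) follow my recollection of the repo README and may differ across versions.

```python
import promptbench as pb

# list the models and datasets the library ships with
# (attribute names assumed from the repo README)
print(pb.SUPPORTED_MODELS)
print(pb.SUPPORTED_DATASETS)
```
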
PromptBench is designed as an open, general, and flexible codebase for research purposes.

Introduction

current LLMs are sensitive to prompts, vulnerable to adversarial prompt attacks, and exposed to test-set data contamination, which pose severe security and privacy issues

various prompt learning algorithms have been developed based on different evaluation metrics, such as

  • BDPL (Diao et al., 2022)
  • GrIPS (Prasad et al., 2022)
  • Plum (Pan et al., 2023)

existing libraries, such as

  • LlamaIndex (Liu, 2022)
  • Semantic Kernel
  • LangChain

LlamaIndex and LangChain enhance LLM applications by incorporating databases and various data sources

Semantic Kernel aims to merge AI services with programming languages for versatile AI app development.

Eval-Harness: offers a framework for evaluating generative language models

Zeno: an AI evaluation platform supporting interaction and visualization, but it is not easy to customize

LiteLLM: implements a unified API for calling different LLM service providers

the paper introduces PromptBench, a unified Python library to evaluate LLMs from comprehensive dimensions

  • not only for standard model evaluations but also for advanced scenarios including adversarial prompt attacks and dynamic evaluations
  • allows for the incorporation of new evaluation protocols
  • it covers a wide range of LLMs and evaluation datasets, spanning diverse tasks, evaluation protocols, adversarial prompt attacks, and prompt engineering techniques
  • it also supports several analysis tools for interpreting the results
  • the library is designed in a modular fashion

PromptBench

2.1 Components

Models

currently, it supports a diverse range of LLMs and VLMs (vision-language models), including the

  • Llama2 series
  • Mixtral series
  • LLaVA series

it provides unified LLMModel and VLMModel interfaces that allow easy construction and inference of a model with a specified maximum number of generated tokens and generation temperature (see the sketch below)
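A minimal sketch of constructing a model through these interfaces; the exact keyword names (max_new_tokens, temperature) follow my recollection of the library's README and may differ between versions.

```python
import promptbench as pb

# load an LLM with a cap on generated tokens and a near-greedy temperature
llm = pb.LLMModel(model="google/flan-t5-large",
                  max_new_tokens=10,
                  temperature=0.0001)

# inference is a plain call on the input text
print(llm("Classify the sentence as positive or negative: the movie was great."))

# vision-language models go through the analogous VLMModel interface
# (constructor arguments assumed to mirror LLMModel)
# vlm = pb.VLMModel(model="llava-1.5-7b", max_new_tokens=50)
```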

Datasets and tasks

diverse challenges across 12 tasks and 22 public datasets (a dataset-loading sketch follows the task list below)

the supported tasks include

  • fundamental NLP tasks such as
    • sentiment analysis
    • grammar correctness
    • duplicate sentence detection
  • complex challenges involving
    • natural language inference
    • multi-task knowledge
    • reading comprehension
  • specialized areas
    • translation
    • mathematical problem-solving
    • various forms of reasoning—logical, commonsense, symbolic, and algorithmic
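
Loading a dataset is a one-liner through pb.DatasetLoader; the field names ("content", "label") below follow the sst2 example in the library's documentation and may differ for other tasks.

```python
import promptbench as pb

# load the SST-2 sentiment analysis dataset
dataset = pb.DatasetLoader.load_dataset("sst2")

# each item behaves like a dict; for sst2 the fields are the sentence and its label
sample = dataset[0]
print(sample["content"], sample["label"])
```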

Prompts and prompt engineering

it offers a suite of 4 distinct prompt types, and users have the flexibility to craft custom prompts using the Prompt interface

  • task-oriented prompts are structured to clearly delineate the specific task expected of the model
  • role-oriented prompts position the model in a defined role, such as an expert, advisor, or translator

these prompt categories are adaptable for both zero-shot and few-shot learning contexts
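
A sketch of the two prompt categories via the Prompt interface; the {content} placeholder convention is taken from the library's documented examples, and the prompt texts here are my own illustrations.

```python
import promptbench as pb

prompts = pb.Prompt([
    # task-oriented: state the task explicitly
    "Classify the sentence as positive or negative: {content}",
    # role-oriented: put the model in a defined role before stating the task
    "As a sentiment analysis expert, judge whether the following sentence "
    "is positive or negative: {content}",
])

for prompt in prompts:
    print(prompt)
```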

Adversarial prompt attacks

to investigate the robustness of LLMs to prompts, it integrates 4 types of attacks (a usage sketch follows the list)

  • character-level attacks: manipulate texts by introducing typos or errors to words
  • word-level attacks: replace words with synonyms or contextually similar words to deceive LLMs
  • sentence-level attacks: append irrelevant or extraneous sentences to the end of the prompts, intending to distract LLMs
  • semantic-level attacks: simulate the linguistic behavior of people from different countries
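
A hedged sketch of running an adversarial prompt attack. I am not certain of the attack interface, so the class name pb.Attack, its argument order, the attack-name strings, and the eval_func signature below are assumptions for illustration only.

```python
import promptbench as pb

dataset = pb.DatasetLoader.load_dataset("sst2")
model = pb.LLMModel(model="google/flan-t5-large", max_new_tokens=10, temperature=0.0001)
prompt = "Classify the sentence as positive or negative: {content}"

def eval_func(prompt, dataset, model):
    # assumption: the attack expects a callback that scores a candidate prompt,
    # e.g. classification accuracy of `model` on `dataset` under `prompt`
    return 0.0  # placeholder score for this sketch

# attacks are selected by name, e.g. a character-, word-, sentence-,
# or semantic-level perturbation (names assumed, e.g. "stresstest")
attack = pb.Attack(model, "stresstest", dataset, prompt, eval_func, verbose=True)
print(attack.attack())  # the perturbed prompt and the resulting performance drop
```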

Different evaluation protocols

by default, it supports the standard protocol, i.e., direct inference

it further supports dynamic and semantic evaluation protocols by dynamically generating testing data
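
The dynamic protocols avoid test-set contamination by generating fresh test samples at evaluation time. I am not sure of the library's exact interface for this, so the following is a generic, library-independent illustration of the idea rather than PromptBench's API.

```python
import random

def make_arithmetic_item(rng: random.Random):
    """Generate a fresh question/answer pair that cannot have appeared
    verbatim in any pre-training corpus."""
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    return f"What is {a} + {b}? Answer with a number only.", a + b

rng = random.Random()  # new seed every run -> new test set every run
dynamic_testset = [make_arithmetic_item(rng) for _ in range(100)]

# each (question, answer) pair is then fed to the model under the usual pipeline
for question, answer in dynamic_testset[:3]:
    print(question, "->", answer)
```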

Analysis tools

  • sweep running
  • attention visualization analysis
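
I do not know the exact interfaces of these analysis tools, so here is only a generic sketch of what sweep running amounts to: evaluating the cross-product of models and prompts and collecting the scores (the evaluate helper is hypothetical).

```python
import itertools

models = ["llama2-7b-chat", "mistral-7b-instruct"]  # model names for illustration
prompts = ["Classify: {content}", "As an expert, classify: {content}"]

def evaluate(model_name, prompt):
    # hypothetical helper: run the standard pipeline and return accuracy
    return 0.0

# a sweep is just the cross-product of configurations, collected into a table
results = {(m, p): evaluate(m, p) for m, p in itertools.product(models, prompts)}
for (m, p), acc in results.items():
    print(f"{m:25s} | {p:40s} | {acc:.3f}")
```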

2.2 Evaluation pipeline

  1. specify the task and load the corresponding dataset via pb.DatasetLoader
  2. customize the LLM via pb.LLMModel
  3. define the prompt for the specified dataset via pb.Prompt
  4. define the input and output processing functions, as well as the evaluation function via pb.metrics (an end-to-end sketch follows)
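
Putting the four steps together, a minimal end-to-end sketch adapted from my recollection of the library's documented sst2 example; helper names such as pb.InputProcess, pb.OutputProcess, and the metrics entry point (pb.Eval here) may differ across versions.

```python
import promptbench as pb

# 1. load the dataset for the chosen task
dataset = pb.DatasetLoader.load_dataset("sst2")

# 2. load the model
model = pb.LLMModel(model="google/flan-t5-large", max_new_tokens=10, temperature=0.0001)

# 3. define prompts for the dataset
prompts = pb.Prompt(["Classify the sentence as positive or negative: {content}"])

# map the model's text output to a label id
def proj_func(pred):
    return {"positive": 1, "negative": 0}.get(pred, -1)

# 4. input/output processing plus the evaluation metric
for prompt in prompts:
    preds, labels = [], []
    for data in dataset:
        input_text = pb.InputProcess.basic_format(prompt, data)   # fill {content}
        raw_pred = model(input_text)
        preds.append(pb.OutputProcess.cls(raw_pred, proj_func))   # normalize + project
        labels.append(data["label"])
    acc = pb.Eval.compute_cls_accuracy(preds, labels)
    print(f"{acc:.3f}  {prompt}")
```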

2.3 Supported research topics

(Figure from the paper: overview of the research topics supported by PromptBench.)

Conclusion and Discussion

limitations:

  • may not cover all evaluation scenarios
  • some metrics might miss nuanced performance differences
  • the effectiveness depends on the quality and diversity of datasets and prompts
