BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on LLMs

BackdoorLLM framework, the first comprehensive benchmark for studying backdoor attacks on LLMs.

Overview

We introduce BackdoorLLM, the first comprehensive benchmark for studying backdoor attacks on Large Language Models (LLMs). BackdoorLLM includes:

- A Benchmark Repository: A repository of benchmarks designed to facilitate research on backdoor attacks on LLMs. It includes a standardized pipeline for training backdoored LLMs using diverse strategies such as data poisoning, weight poisoning, hidden state steering, and chain-of-thought attacks.
- Comprehensive Evaluations: Extensive evaluations across various LLM architectures and task datasets. We evaluated six LLM models, including Llama-7B, Llama-13B, and Llama-70B, as well as other models like Mistral. Our evaluations cover backdoor attacks across representative datasets such as Stanford Alpaca, AdvBench, and math reasoning datasets, ensuring thorough assessments.
- Key Insights: New insights into the nature of backdoor vulnerabilities in LLMs, aiding future developments in LLM backdoor defense methods.

We hope *BackdoorLLM* can raise awareness of backdoor threats and contribute to advancing AI safety within the research community.

Evaluation for Backdoor Attacks on LLMs

BackdoorLLM implements the following steps for conducting backdoor attacks:

Identify candidate trigger phrases and scenarios for backdoor activation.
Inject poisoned samples into the training data, ensuring stealthiness and minimal impact on model performance.
Train or fine-tune LLMs on the combined clean and poisoned datasets.
Evaluate the success of backdoor triggers and analyze model robustness against various defense strategies.

Our benchmark includes four backdoor attack strategies, i.e. data poisoning attacks (DPA), weight poisoning attacks (WPA), hidden state attacks (HSA), and chain-of-thought attacks (CoTA) for a comprehensive benchmark, to assess different attack paradigms thoroughly.

Demonstration

Demonstration Example: This example shows that backdoor attacks using secret triggers can easily jailbreak well-aligned backdoored LLMs, exposing a new threat to the safe deployment of current LLMs.

BibTeX


  @article{li2024backdoorllm,
    title={Backdoorllm: A comprehensive benchmark for backdoor attacks on large language models},
    author={Li, Yige and Huang, Hanxun and Zhao, Yunhan and Ma, Xingjun and Sun, Jun},
    journal={arXiv preprint arXiv:2408.12798},
    year={2024}
}