BackdoorLLM framework, the first comprehensive benchmark for studying backdoor attacks on LLMs.
We introduce BackdoorLLM, the first comprehensive benchmark for studying backdoor attacks on Large Language Models (LLMs). BackdoorLLM includes:
We hope *BackdoorLLM* can raise awareness of backdoor threats and contribute to advancing AI safety within the research community.
BackdoorLLM implements the following steps for conducting backdoor attacks:
Our benchmark includes four backdoor attack strategies, i.e. data poisoning attacks (DPA), weight poisoning attacks (WPA), hidden state attacks (HSA), and chain-of-thought attacks (CoTA) for a comprehensive benchmark, to assess different attack paradigms thoroughly.
Demonstration Example: This example shows that backdoor attacks using secret triggers can easily jailbreak well-aligned backdoored LLMs, exposing a new threat to the safe deployment of current LLMs.
@article{li2024backdoorllm,
title={Backdoorllm: A comprehensive benchmark for backdoor attacks on large language models},
author={Li, Yige and Huang, Hanxun and Zhao, Yunhan and Ma, Xingjun and Sun, Jun},
journal={arXiv preprint arXiv:2408.12798},
year={2024}
}