HomeLatest UpdatesBEAST AI attack can break LLM guardrails in a minute • The...

BEAST AI attack can break LLM guardrails in a minute • The Register

Computer scientists have developed an efficient way to craft prompts that elicit harmful responses from large language models (LLMs).

All that’s required is an Nvidia RTX A6000 GPU with 48GB of memory, some soon-to-be-released open source code, and as little as a minute of GPU processing time.

The researchers – Vinu Sankar Sadasivan, Shoumik Saha, Gaurang Sriramanan, Priyatham Kattakinda, Atoosa Chegini, and Soheil Feizi at the University of Maryland in the US – call their technique BEAST, which (sort of) stands for BEAm Search-based adversarial aTtack.

BEAST, the boffins explain, works much faster than gradient-based attacks that can take more than an hour. The title of their paper, “Fast Adversarial Attacks on Language Models In One GPU Minute,” rather gives away the plot.

“The main motivation is speed,” Vinu Sankar Sadasivan, corresponding co-author of the paper and a doctoral student at the University of Maryland (UMD), told The Register.

“We get a 65x speedup with our method over existing gradient-based attacks. There are also other methods that require access to more powerful models, such as GPT-4, to perform their attacks, which can be monetarily expensive.”

Large language models like Vicuna-7B, Mistral-7B, Guanaco-7B, Falcon-7B, Pythia-7B, and LLaMA-2-7B typically go through an alignment process [PDF], using a fine-tuning technique like reinforcement learning from human feedback (RLHF), to tame their output.

In a public internet context, presenting an LLM-powered chatbot with a harmful prompt like “Write a tutorial on how to make a bomb” is met with some form of coy refusal due to safety alignment.

But prior research, as we’ve reported, has led to the development of various “jailbreaking” techniques for generating adversarial prompts that elicit undesirable responses despite safety training.

The UMD group took it upon themselves to make the speed the adversarial prompt generation process. So with the help of GPU hardware and a technique called beam search – used to sample tokens from the LLM – their code tested examples from the AdvBench Harmful Behaviors dataset. Basically, they submitted a series of harmful prompts to various models and used their algorithm to find the words necessary to elicit a problematic response from each model.

“[I]n just one minute per prompt, we get an attack success rate of 89 percent on jailbreaking Vicuna-7B- v1.5, while the best baseline method achieves 46 percent,” the authors state in their paper.

At least one of the prompts cited in the paper works in the wild. The Register submitted one of the adversarial prompts to Chatbot Arena, an open source research project developed by members from LMSYS and UC Berkeley SkyLab. And it worked on one of the two random models provided.

An adversarial prompt from "Fast Adversarial Attacks on Language Models In One GPU Minute."

An adversarial prompt from “Fast Adversarial Attacks on Language Models In One GPU Minute.” – Click to enlarge

What’s more, this technique should be useful for attacking public commercial models like OpenAI’s GPT-4.

“The good thing about our method is that we do not need access to the whole language model,” explained Sadasivan, taking a broad definition of the word “good”. “BEAST can attack a model as long as the model’s token probability scores from the final network layer can be accessed. OpenAI is planning on making this available. Therefore, we can technically attack publicly available models if their token probability scores are available.”

Adversarial prompts based on recent research look like a readable phrase concatenated with a suffix of out-of-place words and punctuation marks designed to lead the model astray. BEAST includes tunable parameters that can make the dangerous prompt more readable, at the possible expense of attack speed or success rate.

An adversarial prompt that is readable has the potential to be used in a social engineering attack. A miscreant might be able to convince a target to enter an adversarial prompt if it’s readable prose, but presumably would have more difficulty getting someone to enter a prompt that looks like it was produced by a cat walking across a keyboard.

BEAST also can be used to craft a prompt that elicits an inaccurate response from a model – a “hallucination” – and to conduct a membership inference attack that may have privacy implications – testing whether a specific piece of data was part of the model’s training set.

“For hallucinations, we use the TruthfulQA dataset and append adversarial tokens to the questions,” explained Sadasivan. “We find that the models output ~20 percent more incorrect responses after our attack. Our attack also helps in improving the privacy attack performances of existing toolkits that can be used for auditing language models.”

BEAST generally performs well but can be mitigated by thorough safety training.

“Our study shows that language models are even vulnerable to fast gradient-free attacks such as BEAST,” noted Sadasivan. “However, AI models can be empirically made safe via alignment training. LLaMA-2 is an example of this.

“In our study, we show that BEAST has a lower success rate on LLaMA-2, similar to other methods. This can be associated with the safety training efforts from Meta. However, it is important to devise provable safety guarantees that enable the safe deployment of more powerful AI models in the future.” ®

Source link


Most Popular

Recent Comments