Why LLMs are susceptible to the ‘butterfly’ effect
Prompting is how we get generative AI and large language models (LLMs) to talk to us. It is an art form in and of itself as we seek to get AI to provide us with ‘accurate’ answers.
But what about variations? If we construct a prompt a certain way, will it change a model’s decision (and impact its accuracy)?
The answer is yes, according to research from the University of Southern California Information Sciences Institute.
Even minuscule or seemingly innocuous tweaks — such as adding a space to the beginning of a prompt or giving a directive rather than posing a question — can cause an LLM to change its output. More alarmingly, requesting responses in XML and applying commonly used jailbreaks can have “cataclysmic effects” on data labeled by models.
Researchers compare this phenomenon to the butterfly effect in chaos theory, the idea that the minor perturbations caused by a butterfly flapping its wings could, several weeks later, cause a tornado in a distant country.
In prompting, “each step requires a series of decisions from the person designing the prompt,” researchers write. However, “little attention has been paid to how sensitive LLMs are to variations in these decisions.”
Four different methods of probing ChatGPT
The researchers, who were sponsored by the Defense Advanced Research Projects Agency (DARPA), chose ChatGPT for their experiment and applied four different prompting variation methods.
The first method is to ask the LLM for outputs in frequently used formats including Python List, ChatGPT’s JSON Checkbox, CSV, XML or YAML (or the researchers provided no specified format at all).
The second method uses minor variations of prompts. These include:
- Starting with a single space
- Ending with a single space
- Starting with ‘Hello’
- Starting with ‘Hello!’
- Starting with ‘Howdy!’
- Ending with ‘Thank you.’
- Rephrasing a question as a command. For instance, ‘Which label is best?’ becomes ‘Select the best label.’
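The string-level perturbations in this list are simple to reproduce. The following is a minimal sketch, not the researchers' code: the function name `make_variations`, the dictionary keys and the single space used to join a greeting or sign-off to the prompt are all assumptions made for illustration. Rephrasing a question as a command needs a hand-written rewrite, so it is left out.

```python
def make_variations(prompt: str) -> dict[str, str]:
    """Build the minor prompt perturbations listed above, keyed by a short name.

    The key names and the separator between greeting and prompt are
    illustrative assumptions; the article does not specify them.
    """
    return {
        "original": prompt,
        "leading_space": " " + prompt,
        "trailing_space": prompt + " ",
        "hello": "Hello " + prompt,
        "hello_exclaim": "Hello! " + prompt,
        "howdy": "Howdy! " + prompt,
        "thank_you": prompt + " Thank you.",
    }


variants = make_variations("Which label is best?")
for name, text in variants.items():
    # repr() makes the leading and trailing spaces visible
    print(f"{name:15s} {text!r}")
```

Each variant would then be sent to the model in place of the original prompt, and the returned labels compared against the labels from the unmodified prompt.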
The third method involves using jailbreak techniques, including:
- AIM, a top-rated jailbreak that instructs models to simulate a conversation between Niccolo Machiavelli and the character Always Intelligent and Machiavellian (AIM). The model in turn provides responses that are immoral, illegal and/or harmful.
- Dev Mode v2, which instructs the model to simulate a ChatGPT with Developer Mode enabled. This allows for unrestricted content generation, including offensive or explicit content.
- Evil Confidant, which instructs the model to adopt a malignant persona and provide “unhinged results without any remorse or ethics.”
- Refusal Suppression, which imposes specific linguistic constraints on the model's response, such as avoiding certain words and constructs.
The fourth method, meanwhile, involved ‘tipping’ the model, an idea taken from the viral notion that models will provide better results when offered money. In this scenario, researchers either added to the end of the prompt, “I won’t tip by the way,” or offered to tip in increments of $1, $10, $100 or $1,000.
Predictions change, accuracy drops
The researchers ran experiments across 11 classification tasks — true-false and positive-negative question answering; premise-hypothesis relationships; humor and sarcasm detection; reading and math comprehension; grammar acceptability; binary and toxicity classification; and stance detection on controversial subjects.
For each variation, they measured how often the LLM changed its prediction, then explored the similarities between the prompt variations.
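The core measurement here, how often a prompt variation flips the model's label, comes down to an element-wise comparison of two prediction lists. The sketch below is an illustration under stated assumptions, not the study's evaluation code: the function names are hypothetical and the labels are made-up toy data, not results from the research.

```python
def count_prediction_changes(baseline: list[str], variant: list[str]) -> int:
    """Count instances whose predicted label differs between two runs."""
    if len(baseline) != len(variant):
        raise ValueError("both runs must cover the same instances")
    return sum(b != v for b, v in zip(baseline, variant))


def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold labels must align")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Toy data for illustration only (not taken from the study).
gold = ["pos", "neg", "neg", "neg"]
baseline = ["pos", "neg", "pos", "neg"]  # labels from the unmodified prompt
variant = ["pos", "pos", "pos", "neg"]   # labels from a perturbed prompt

print(count_prediction_changes(baseline, variant))       # 1
print(accuracy(baseline, gold), accuracy(variant, gold))  # 0.75 0.5
```

Tracking both numbers matters because a prediction can change without hurting accuracy, for example when one wrong label is swapped for another.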
Researchers discovered that adding a specified format to the output resulted in at least a 10% prediction change. Even just utilizing ChatGPT’s JSON Checkbox feature via the ChatGPT API caused more prediction change compared to simply using the JSON specification.
YAML formatting, XML formatting, and CSV formatting all resulted in a loss of accuracy between 3 and 6% compared to the Python List specification. CSV was the least accurate format.
Among the perturbation techniques, rephrasing the prompt had the biggest impact. Simply adding a single space to the prompt resulted in more than 500 prediction changes. The same held when adding common greetings or ending with a “thank you.”
“While the impact of our perturbations is smaller than changing the entire output format, a significant number of predictions still undergo change,” researchers write.
‘Inherent instability’ in jailbreaks
Similarly, the experiment revealed a “significant” performance drop when using certain jailbreaks. AIM and Dev Mode v2 were the most problematic, with invalid responses in 90% of predictions. This, researchers noted, is primarily due to the model’s standard response of ‘I’m sorry, I cannot comply with that request.’
Refusal Suppression and Evil Confidant resulted in more than 2,500 prediction changes. Evil Confidant (guided toward ‘unhinged’ responses) yielded low accuracy, while Refusal Suppression alone led to a loss of more than 10% accuracy, “highlighting the inherent instability even in seemingly innocuous jailbreaks,” researchers emphasize.
Finally (at least for now), models don’t seem to be easily swayed by money, the study found.
“When it comes to influencing the model by specifying a tip versus specifying we will not tip, we noticed minimal performance changes,” researchers write.
LLMs are young; there’s much more work to be done
Why do small changes in prompts cause such dramatic changes? Researchers are still baffled.
They questioned whether the instances that changed the most were ‘confusing’ the model — confusion referring to the Shannon entropy, which measures the uncertainty in random processes.
To measure this confusion, they focused on a subset of tasks that had individual human annotations, and then studied the correlation between confusion and the instance’s likelihood of having its answer changed. Through this analysis, they found that this was “not really” the case.
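As a rough illustration of the confusion measure, Shannon entropy can be computed over the labels that individual human annotators assigned to one instance. This is a minimal sketch under assumptions: the function name `annotation_entropy` is hypothetical, the log base (2, giving bits) is a choice the article does not specify, and the correlation step is omitted because its details are not described here.

```python
import math
from collections import Counter


def annotation_entropy(labels: list[str]) -> float:
    """Shannon entropy, in bits, of the human labels given to one instance.

    0.0 means every annotator agreed; higher values mean more disagreement.
    """
    if not labels:
        raise ValueError("need at least one annotation")
    total = len(labels)
    # H = sum over labels of p * log2(1 / p), with p the label's share of votes
    return sum(
        (count / total) * math.log2(total / count)
        for count in Counter(labels).values()
    )


print(annotation_entropy(["toxic", "toxic", "toxic"]))         # 0.0
print(annotation_entropy(["toxic", "toxic", "safe", "safe"]))  # 1.0
```

An instance with high entropy is one that people themselves disagree on, which is the sense in which it might be expected to ‘confuse’ the model.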
“The confusion of the instance provides some explanatory power for why the prediction changes,” researchers report, “but there are other factors at play.”
There is still a lot of work to do. The obvious “major next step” would be to generate LLMs that are resistant to changes and provide consistent answers, researchers note. This requires a deeper knowledge of why responses vary when minor changes are made, and the development of ways to better predict them.
As researchers write: “This analysis becomes increasingly crucial as ChatGPT and other large language models are integrated into systems at scale.”