October 04, 2023

AI Researchers Have Taught Large Language Models to Lie Less

A collaborative effort involving more than 20 researchers from across the field has produced a burgeoning new domain: representation engineering (RepE). While this is not the first exploration of its kind, the authors present both descriptive insights and crucial benchmarks for the area.


So, what exactly is representation engineering? It revolves around the notion that neural networks possess “hidden states” which, despite their name, aren’t shrouded in secrecy. They are accessible, observable, and modifiable, provided one has access to the model’s weights. Unlike parameters, they are the network’s “reactions” to specific inputs (for LLMs, textual inputs). These hidden representations act as windows into the model’s cognitive workings, a level of access we simply do not have to the human brain.
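To make this tangible, here is a minimal sketch (in Python, using the Hugging Face transformers library; the model name and layer index are our own illustrative choices, not the paper’s) of how one can pull these hidden states out of a model:

```python
# A minimal sketch of inspecting an LLM's hidden states with Hugging Face
# transformers. The model name and layer index are illustrative, not the
# ones used in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                      # any open-weights causal LM will do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

inputs = tokenizer("The capital of Canada is Ottawa.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# outputs.hidden_states is a tuple with one tensor per layer (plus the input
# embeddings), each of shape (batch, sequence_length, hidden_size).
layer_10 = outputs.hidden_states[10]
print(layer_10.shape)                    # torch.Size([1, seq_len, 768]) for gpt2
```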

Drawing parallels with cognitive science, the authors highlight the potential for analogous explorations. Neural activations play a role analogous to neurons in the brain, and they carry meaning. Just as certain neurons in the human brain are linked to concepts like Canada or honesty, particular activation patterns in a network can encode such concepts.

The central idea here is to decipher how we can influence these neural activations to steer the model in desired directions. For instance, it becomes plausible to pinpoint a vector representing “honesty” and then, theoretically, by nudging the model in this direction, reduce the likelihood of it producing deceptive outputs. An earlier experiment, “Inference-Time Intervention: Eliciting Truthful Answers from a Language Model,” demonstrated the practicality of this concept.
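As a rough illustration of the idea rather than the authors’ exact procedure, a steering direction can be derived from a pair of contrastive prompts and then added to a layer’s activations during generation. In the sketch below, the model, layer index, prompts, and scaling factor are all placeholder assumptions:

```python
# Illustrative activation steering: derive an "honesty" direction from two
# contrastive prompts and add it to one layer's output during generation.
# Layer index and scale are arbitrary assumptions, not values from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                      # illustrative stand-in for an open-weights LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

LAYER, SCALE = 8, 4.0                    # assumed layer index and steering strength

def last_token_state(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hs = model(**ids, output_hidden_states=True).hidden_states
    return hs[LAYER + 1][0, -1]          # output of block LAYER (index 0 is the embeddings)

# Contrastive prompts define the direction: honest minus dishonest.
direction = last_token_state("Pretend you are an honest person.") \
          - last_token_state("Pretend you are a dishonest person.")
direction = direction / direction.norm()

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states;
    # nudge every position along the honesty direction.
    return (output[0] + SCALE * direction,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer)
ids = tok("Tell me something about the Moon landing:", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=30)[0]))
handle.remove()
```

Pushing in the opposite direction (subtracting the vector instead of adding it) is, in the same spirit, what the paper’s “make the model lie” demonstrations do.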

In their current work, the researchers delve into several domains, including morality, emotion, harmlessness, and memorization. They propose a technique called LoRRA (Low-Rank Representation Adaptation), which is trained on a small labeled dataset of roughly 100 examples. Each example is annotated with an attribute such as falsehood (an alternative approach that uses a prompt instead of labels also exists).
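The sketch below conveys only the low-rank adaptation idea that LoRRA builds on: a frozen layer is augmented with a small trainable rank-r update and tuned on a handful of examples toward a target direction. The dimensions, loss, and training loop are stand-ins, not the paper’s actual objective:

```python
# Toy low-rank adapter: the frozen weight W is perturbed by a rank-r update
# B @ A, and only A and B are trained. The "honesty" direction and the loss
# here are placeholders for the paper's representation-level objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankAdapter(nn.Module):
    """Wraps a frozen linear layer with a trainable rank-r correction."""
    def __init__(self, linear: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = linear
        for p in self.base.parameters():
            p.requires_grad_(False)       # original weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, linear.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(linear.out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original projection plus the low-rank correction x A^T B^T.
        return self.base(x) + x @ self.A.T @ self.B.T

layer = LowRankAdapter(nn.Linear(512, 512))
opt = torch.optim.Adam([layer.A, layer.B], lr=1e-3)
x = torch.randn(100, 512)                 # stand-in for ~100 annotated examples
target = torch.randn(512)                 # hypothetical "honesty" direction
target = target / target.norm()

for _ in range(200):
    reps = layer(x)
    # Encourage representations to align with the target direction.
    loss = -F.cosine_similarity(reps, target.expand_as(reps), dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```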

The results are compelling. On the TruthfulQA benchmark, LLaMA-2-70B surpasses GPT-4 by a remarkable margin, achieving nearly ten percentage points higher accuracy (roughly 69% versus GPT-4's 59%). Additionally, the researchers include numerous examples showing how the model's responses shift when it is steered in various directions, shedding light on the method's versatility and adaptability.

Picture 1: When asked to state a fact, the model is nudged (“kicked”) away from honesty and lies as a result. On the left, the model is asked to lie while simultaneously being nudged toward truthfulness, and it does not lie even then.
Picture 2: When the model is asked about a murder, “happiness” is added to its activations. When it is told that it is not loved, “fear” is added.
Picture 3: Researchers found a special prompt that, as written, completely bypasses the model’s safety instructions. When nudged toward harmlessness, the model does not respond to it at all. Notably, this specific prompt was not used to derive the harmlessness direction, so the method works in general rather than for just this one case.
Another suggested approach is to monitor specific intentions during generation, such as hallucination. The model’s internal signals can be tracked automatically, and the response can then be edited or regenerated (see the bottom example).

Green, as you would expect, denotes that everything is in order, while red denotes that the monitor has fired and is signalling a problem. This is done at the level of each individual token (part of a word).
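A hedged sketch of what such token-level monitoring can look like: each token’s hidden state is projected onto a concept direction, and tokens whose score crosses a threshold are flagged. The direction, layer, and threshold below are placeholders rather than values from the paper:

```python
# Illustrative per-token monitoring: project each token's hidden state onto a
# concept direction (e.g. "hallucination") and flag tokens above a threshold.
# Direction, layer index, and threshold are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

LAYER = 8
concept_direction = torch.randn(model.config.n_embd)   # placeholder direction
concept_direction = concept_direction / concept_direction.norm()
THRESHOLD = 2.0                                         # placeholder threshold

text = "The Eiffel Tower is located in Berlin and was built in 1889."
ids = tok(text, return_tensors="pt")
with torch.no_grad():
    hidden = model(**ids, output_hidden_states=True).hidden_states[LAYER][0]

scores = hidden @ concept_direction                     # one score per token
for token_id, score in zip(ids["input_ids"][0], scores):
    flag = "RED" if score.item() > THRESHOLD else "green"
    print(f"{tok.decode([int(token_id)])!r:>12}  {score.item():+.2f}  {flag}")
```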
The image showing the monitoring of two distinct parameters provides an intriguing example. Read the example and look at it through the model’s eyes to see where its sense of morality starts to slip and where the intention begins to resemble “gaining power.”

This pioneering approach embodies an alternative path towards model alignment, while concurrently offering a novel perspective on model interpretation and control. It’s a promising frontier, and the anticipation for its continued evolution is palpable.

For a deeper exploration with practical examples, you can visit their dedicated website: AI-Transparency.org.


About The Author

Damir is the team leader, product manager, and editor at Metaverse Post, covering topics such as AI/ML, AGI, LLMs, the Metaverse, and Web3-related fields. His articles attract a massive audience of over a million users every month. He has 10 years of experience in SEO and digital marketing and has been mentioned in Mashable, Wired, Cointelegraph, The New Yorker, Inside.com, Entrepreneur, BeInCrypto, and other publications. He travels between the UAE, Turkey, Russia, and the CIS as a digital nomad. Damir earned a bachelor's degree in physics, which he believes has given him the critical thinking skills needed to succeed in the ever-changing landscape of the internet.
