Meta proposes scalable memory layers to improve knowledge and reduce hallucinations
As companies continue to deploy large language models (LLMs) in various applications, they face the challenge of improving the models' factual knowledge and reducing hallucinations. In a new paper, researchers at Meta AI propose "scalable memory layers," which could be one of several possible solutions to this problem.
Scalable memory layers add more parameters to LLMs to increase their learning capacity without requiring additional compute resources. The architecture is useful for applications where you can spare extra memory for factual knowledge but also want the inference speed of nimbler models.
Dense and memory layers
Traditional language models use “dense layers” to encode vast amounts of information in their parameters. In dense layers, all parameters are used at their full capacity and are mostly activated at the same time during inference. Dense layers can learn complex functions, but increasing their capacity requires additional computational and energy resources.
For simple factual knowledge, however, much simpler layers with associative memory architectures would be more efficient. This is the role memory layers fill: they encode and retrieve information through simple sparse activations and key-value lookup mechanisms. Sparse layers take up more memory than dense layers but use only a small portion of their parameters at once, which makes them much more compute-efficient.
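To make the idea concrete, here is a minimal sketch of a memory layer as a sparse key-value lookup, written in PyTorch. It is not Meta's implementation; the class and parameter names are hypothetical, and it only illustrates the sparse-lookup pattern described above.

```python
# Illustrative sketch of a memory layer as a sparse key-value lookup
# (hypothetical names; not Meta's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMemoryLayer(nn.Module):
    def __init__(self, dim: int, num_keys: int, top_k: int = 4):
        super().__init__()
        # Large learnable key/value tables: heavy on memory capacity.
        self.keys = nn.Parameter(torch.randn(num_keys, dim) * 0.02)
        self.values = nn.Parameter(torch.randn(num_keys, dim) * 0.02)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim). Score the query against every key,
        # then keep only the top-k matches.
        scores = x @ self.keys.T                       # (batch, seq, num_keys)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)        # (batch, seq, top_k)
        selected = self.values[top_idx]                # (batch, seq, top_k, dim)
        # Weighted sum of only the top-k values: a sparse activation,
        # so compute per token stays small even if num_keys is huge.
        return (weights.unsqueeze(-1) * selected).sum(dim=-2)
```

A dense feed-forward block of comparable parameter count would touch every weight on every token, whereas this layer reads only `top_k` rows of the value table per token.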
Memory layers have existed for several years but are rarely used in modern deep learning architectures, partly because they are not optimized for current hardware accelerators. Current frontier LLMs typically use some form of “mixture-of-experts” (MoE) architecture, which relies on a mechanism similar to memory layers. MoE models are composed of smaller expert components that specialize in certain tasks, and a routing mechanism determines which expert is activated at inference time based on the input sequence. PEER, an architecture recently developed by Google DeepMind, extends MoE to millions of experts, providing more granular control over the parameters that become activated during inference.
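For comparison, the following is a minimal sketch of MoE-style routing (hypothetical names, not the PEER or Meta design): a learned gate sends each token to a single small expert, so only a fraction of the model's parameters run per token.

```python
# Minimal sketch of MoE-style routing for comparison (hypothetical names).
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim). Route each token to its single best-scoring expert.
        expert_idx = self.gate(x).argmax(dim=-1)       # (tokens,)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out
```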
Upgrading memory layers
Memory layers are light on compute but heavy on memory, which presents specific challenges for current hardware and software frameworks. The Meta researchers have proposed several modifications to overcome these challenges.
First, the researchers configured the memory layers for parallelization, distributing them across multiple GPUs to store millions of key-value pairs without changing the other layers of the model. They also developed a special CUDA kernel to handle high-memory-bandwidth operations, along with a parameter-sharing mechanism under which the keys and values used for lookups are shared across memory layers.
These modifications make it possible to implement memory layers within LLMs without slowing down the model.
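The parameter-sharing idea can be pictured with the hedged sketch below. The multi-GPU distribution and the custom CUDA kernel are not reproduced here, and the names are hypothetical; the point is simply that several memory layers can read from one shared key/value table, so adding layers does not add key/value parameters.

```python
# Hedged sketch of the parameter-sharing idea only (hypothetical names;
# the multi-GPU sharding and custom CUDA kernel are not reproduced).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedMemoryBank(nn.Module):
    def __init__(self, dim: int, num_keys: int):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_keys, dim) * 0.02)
        self.values = nn.Parameter(torch.randn(num_keys, dim) * 0.02)

class SharedMemoryLayer(nn.Module):
    def __init__(self, bank: SharedMemoryBank, top_k: int = 4):
        super().__init__()
        self.bank = bank      # the same bank object is passed to every layer
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = x @ self.bank.keys.T
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)
        return (weights.unsqueeze(-1) * self.bank.values[top_idx]).sum(dim=-2)

# Usage: one bank, reused by several layers in a stack. PyTorch deduplicates
# shared parameters, so the key/value tables are stored and trained once
# even though three layers use them.
bank = SharedMemoryBank(dim=512, num_keys=100_000)
layers = nn.ModuleList([SharedMemoryLayer(bank) for _ in range(3)])
```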
“Memory layers with their sparse activations nicely complement dense networks, providing increased capacity for knowledge acquisition while being light on compute,” the researchers write. They can be easily scaled and offer practitioners a new way to balance memory and compute. The researchers compared memory-enhanced LLMs with dense LLMs, MoE, and PEER models on several tasks, including factual question answering, common-sense and scientific world knowledge, and coding.
Their findings show that memory models perform significantly better than dense baselines and can compete with models that use 2X to 4X more compute. They also match the performance of MoE models with the same compute budget and parameter count, and their performance is particularly impressive on tasks that require factual knowledge. Moreover, the researchers found that the benefits of memory models remain consistent across model sizes as they scaled their experiments from 134 million to 8 billion parameters.
“Given these findings, we strongly advocate that memory layers should be integrated into all next generation AI architectures,” the researchers write, while adding that there is still a lot more room for improvement. “In particular, we hope that new learning methods can be developed to push the effectiveness of these layers even further, enabling less forgetting, fewer hallucinations and continual learning.”