A few days ago, Meta released a novel architecture they call Large Concept Models (LCM). Unlike typical Transformer-based models, LCM leverages only a small portion of the Transformer machinery. Everything else in LCM is a fresh twist on how we usually think about language modeling, focusing on sentence-level embeddings in a hyperbolic representation space.
This README will walk you through:
- What LCM is
- How it differs from Transformers
- Key takeaways from the original paper
- Experiments and enhancements
- Notable results
Table of Contents
- Introduction
- Understanding the LCM Architecture
- Why It’s Different from Transformers
- Enhancing the Baseline
- Taking It to the Next Level
- Benchmark Results
- Conclusions and Future Work
- How to Cite
- License
Large Concept Models represent an experimental approach to language modeling in a sentence representation space, as opposed to the more common token-level (word- or subword-level) approaches we see in Transformers. Instead of dealing with tokens step-by-step, LCM can take an entire sentence, embed it, perform all of its transformations in a hyperbolic latent space, then map it back out to a sentence embedding.
In short, LCM is a big leap in how we conceptualize language modeling:
- It’s not purely Transformer-based.
- It auto-regressively processes sentence embeddings, not word tokens.
- Its hidden dimension can be treated as a probabilistic “box”, allowing for unique geometric transformations within the architecture.
LCM doesn’t operate on tokens the way Transformers do. Instead, each sentence is represented by a single embedding vector. For instance, you can feed in embeddings derived from GloVe (like glove.6B.300d) or other sentence-level embeddings; a minimal sketch of this preparation step follows the list below. These embeddings serve as the inputs to LCM.
- Sentence Embedding – Each input sentence is mapped to a high-dimensional embedding vector.
- No Standard Tokenization – LCM relies on entire sentence embeddings, skipping the usual subword or token-based approach.
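As a rough illustration (not the repo’s actual preprocessing), here is one way such single-vector inputs could be prepared: average glove.6B.300d word vectors per sentence. The file name, averaging scheme, and example sentences are placeholders, not details from the paper.

```python
import numpy as np

def load_glove(path="glove.6B.300d.txt"):
    """Load GloVe word vectors into a dict: token -> 300-d numpy array."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            token, *values = line.rstrip().split(" ")
            vectors[token] = np.asarray(values, dtype=np.float32)
    return vectors

def sentence_embedding(sentence, vectors, dim=300):
    """Average a sentence's word vectors into one fixed-size sentence vector."""
    hits = [vectors[t] for t in sentence.lower().split() if t in vectors]
    return np.mean(hits, axis=0) if hits else np.zeros(dim, dtype=np.float32)

# Each sentence becomes a single vector; a batch of sentences is a (num_sentences, 300) array.
glove = load_glove()
batch = np.stack([sentence_embedding(s, glove) for s in
                  ["the cat sat on the mat", "concept models embed whole sentences"]])
```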
The Hidden Hyperbolic Space
The big twist in LCM is that it uses a hyperbolic space to hold intermediate representations. You can think of it as a “3D (or multi-dimensional) digital box” that introduces a probabilistic or “quantum-inspired” aspect. Instead of your usual hidden layers, the architecture uses:
- Linear layers for direct transformations.
- Probabilistic ‘Box’ for the hidden dimension.
This hidden hyperbolic representation is somewhat analogous to putting network weights in “Schrodinger’s box.” In a typical neural net, you have a straightforward chain of linear and nonlinear transformations. In LCM, you still get the linear transformations, but they’re passed through this mysterious hyperbolic subspace which becomes the source of the model’s “probabilistic magic.”
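To make the idea a bit more concrete, here is a minimal PyTorch sketch of one possible reading of “linear layers plus a probabilistic box”: the linear map produces the mean of a Gaussian from which the hidden state is sampled. The Gaussian interpretation, layer sizes, and class name are assumptions for illustration, not confirmed details of the architecture.

```python
import torch
import torch.nn as nn

class ProbabilisticBox(nn.Module):
    """A linear map whose output is treated as the mean of a Gaussian 'box';
    sampling from that box gives the stochastic hidden representation."""
    def __init__(self, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.to_hidden = nn.Linear(embed_dim, hidden_dim)        # direct linear transformation
        self.log_sigma = nn.Parameter(torch.zeros(hidden_dim))   # learned spread of the box

    def forward(self, x):
        mu = self.to_hidden(x)
        if self.training:                       # sample inside the "box" while training
            return mu + torch.randn_like(mu) * self.log_sigma.exp()
        return mu                               # collapse to the mean at inference time
```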
LCM retains a Transformer decoder component to handle its auto-regressive generation. However, this decoder is simpler than typical Transformer blocks because it’s only one part of the overall architecture. Here’s how it plugs into the pipeline:
- Map input sentence embeddings to the model's hidden dimension (hyperbolic space).
- Pass those representations through a small Transformer decoder block for auto-regressive generation.
- Project outputs back to the embedding space.
After the decoder, LCM projects the hidden representation back to a sentence embedding. Because LCM is trained on sentence-level embeddings, the output also corresponds to a sentence-level vector, effectively capturing an entire sentence in a single shot.
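Here is a hedged sketch of how those three steps could be wired together in PyTorch. The decoder is stood in for by a single causally masked self-attention layer (nn.TransformerEncoderLayer), and the dimensions are illustrative rather than the paper’s actual configuration.

```python
import torch
import torch.nn as nn

class MiniLCM(nn.Module):
    """Minimal sketch of the pipeline: sentence embeddings -> hidden space ->
    causally masked attention block -> projection back to embedding space."""
    def __init__(self, embed_dim=300, hidden_dim=512, num_heads=8):
        super().__init__()
        self.to_hidden = nn.Linear(embed_dim, hidden_dim)        # 1. map into the hidden space
        self.decoder = nn.TransformerEncoderLayer(               # 2. small decoder-style block
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.to_embed = nn.Linear(hidden_dim, embed_dim)         # 3. project back to embeddings

    def forward(self, sent_embs):
        # sent_embs: (batch, num_sentences, embed_dim) -- one vector per sentence
        h = self.to_hidden(sent_embs)
        n = h.size(1)
        causal = torch.triu(torch.full((n, n), float("-inf"), device=h.device), diagonal=1)
        h = self.decoder(h, src_mask=causal)   # each position attends only to earlier sentences
        return self.to_embed(h)                # predicted embedding for the next sentence
```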
- Sentence-Level vs. Token-Level: LCM ingests entire sentences as single vectors, whereas Transformers typically handle sequences of tokens.
- Hyperbolic Subspace: LCM places the hidden dimension inside a hyperbolic or curved manifold, unlike Transformers, which generally use Euclidean geometry throughout.
- Probabilistic “Box”: Instead of the standard “matrix multiply -> activation -> skip-connection” pattern, LCM has a specialized subspace that acts as a probabilistic container (or “Schrodinger’s box”) for each hidden vector.
When experimenting with the initial LCM layout (embeddings + hyperbolic hidden space + Transformer decoder + final projection), there are a few immediate enhancements that can significantly boost performance.
One of the first experiments involved two changes (a code sketch follows the list):
- Introducing Curvature: Mapping the hidden dimension to a Poincaré Ball (a specific model of hyperbolic geometry), effectively bending the space for better representation power.
- Injecting Noise: Adding Gaussian noise to embeddings during training to toughen the model against overfitting and make the learned representations more robust.
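A minimal sketch of how these two tweaks might look in PyTorch, using the standard exponential map at the origin of the Poincaré ball. The noise level, curvature value, and the exact placement of the noise are illustrative assumptions.

```python
import torch
import torch.nn as nn

def expmap0(v, c=1.0, eps=1e-6):
    """Exponential map at the origin of a Poincaré ball with curvature c:
    bends a Euclidean vector v onto the curved manifold."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

class NoisyHyperbolicProjection(nn.Module):
    """Add Gaussian noise to the incoming embeddings (training only), then map
    the linearly transformed hidden state onto the Poincaré ball."""
    def __init__(self, embed_dim=300, hidden_dim=512, noise_std=0.01, curvature=1.0):
        super().__init__()
        self.to_hidden = nn.Linear(embed_dim, hidden_dim)
        self.noise_std = noise_std
        self.curvature = curvature

    def forward(self, x):
        if self.training and self.noise_std > 0:
            x = x + torch.randn_like(x) * self.noise_std   # robustness / anti-overfitting noise
        return expmap0(self.to_hidden(x), c=self.curvature)
```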
Switching the attention scores from the typical dot product to cosine similarity improved alignment between embeddings in the hidden space. This helps the model measure how close two sentence embeddings are by angle rather than raw magnitude, as sketched below.
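In code, the swap might look like this minimal sketch; the temperature scaling is my own addition to keep the softmax over [-1, 1] scores from being too flat, not something taken from the paper.

```python
import torch.nn.functional as F

def cosine_attention(q, k, v, temperature=10.0):
    """Attention whose scores are cosine similarities between queries and keys
    (direction only), instead of raw dot products (direction times magnitude)."""
    q = F.normalize(q, dim=-1)                        # unit-length queries
    k = F.normalize(k, dim=-1)                        # unit-length keys
    scores = q @ k.transpose(-2, -1) * temperature    # cosine in [-1, 1], scaled before softmax
    return scores.softmax(dim=-1) @ v
```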
Results with these enhancements:
- The model quickly reached near 100% accuracy on small-scale tasks.
- Loss dropped significantly, reflecting more stable training.
To push LCM off the rails (in a good way), the next experiment replaced the hidden “box” with a 3D fractal shape—specifically a cube made out of pyramids. This choice effectively fractalizes the hyperbolic dimension:
- Shape: The hidden dimension is now a 3D fractal (pyramids fused into a cube).
- Fractal Probabilistic Box: The network can roam around in a fractal manifold, combining the notion of “Schrodinger’s box” with self-similar geometry.
The intuition: If the hyperbolic dimension is the place where the model’s “probabilistic magic” occurs, then giving it a rich, fractal geometry might further boost representational capacity.
A final layer of complexity was added by letting the curvature of the hyperbolic space be self-adjusting. Rather than a fixed curvature of ~0.55, the model can tweak it up or down at each epoch. Experimental runs showed that LCM consistently dialed the curvature upward, hinting at deeper potential for optimization.
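One simple way to make the curvature trainable is to register it as a parameter and keep it positive with a softplus. The initialization below is only meant to land near the ~0.5 starting value mentioned in the results, and the exact parameterization is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableCurvatureBall(nn.Module):
    """Poincaré-ball projection whose curvature is a trainable parameter,
    so the optimizer can nudge the geometry up or down during training."""
    def __init__(self, init_raw=-0.43):                    # softplus(-0.43) ~= 0.5 (illustrative)
        super().__init__()
        self.raw_c = nn.Parameter(torch.tensor(init_raw))  # unconstrained underlying parameter

    def curvature(self):
        return F.softplus(self.raw_c)                      # keep the effective curvature positive

    def forward(self, v, eps=1e-6):
        c = self.curvature()
        sqrt_c = c.sqrt()
        norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
        return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)   # exp map at the origin
```

Registered this way, the curvature is updated by the optimizer like any other weight, which is how the per-epoch adjustments reported below come about.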
- Loss started around 1.0 and quickly dropped to 0.86 in just 5 epochs.
- Accuracy soared to 100% (with small fluctuations) after a handful of epochs.
- Initial Loss began an order of magnitude lower (~0.10 vs. 1.0).
- Ended near 0.09, which is extremely close to perfect reconstruction.
- Accuracy locked in at 100% by around 14 epochs and stayed there.
- The model spontaneously adjusted the curvature from ~0.51 to ~0.53 each epoch, consistently pushing the geometry in ways that improved performance.
These benchmarks already surpass the typical performance curves you’d see from a similarly sized Transformer on the same data.
LCM is a remarkably fresh approach to language modeling. It’s simultaneously:
- Intuitive: The architecture is still linear transformations plus a “probabilistic hidden space.”
- Unique: It treats entire sentences, rather than token sequences, and uses a hyperbolic manifold.
- Powerful: Early experiments, even those done with simple tweaks, show near 100% accuracy and extremely low loss.
Potential avenues for further research:
- Larger-Scale Models: The largest LCM from Meta is ~7B parameters, but there’s no reason not to go bigger.
- Curvature Exploration: Automated curvature tuning might yield further gains.
- Alternative Geometry: Fractal polyhedra, higher-dimensional fractals, or other exotic shapes for the hidden dimension.
- Integration with Other Embeddings: Testing BERT-based or sentence-transformer embeddings as input seeds.
Overall, LCM promises a new frontier in how we conceptualize “hidden layers” and “latent spaces” for language modeling. If you enjoy playing with architecture-level changes in NLP systems, LCM is definitely one to watch...
If you use or reference LCM in your own work, please cite the original paper from Meta. (Once they provide a formal citation format, we’ll include it here.)
```bibtex
@misc{metaLCM,
  author       = {Meta AI Researchers},
  title        = {Large Concept Models},
  year         = {2024},
  howpublished = {\url{https://github.com/meta-ai/LCM}}
}
```