For the past few months I’ve been working on Context for AI models. I genuinely believe this is one of the biggest opportunities in the industry right now.
Because context is messy.
Most models start to choke around ~100k tokens. And “just do RAG” often feels underwhelming: not because RAG is bad, but because RAG alone is not memory. It’s a search tool.
So… how should we approach context?
I think the best starting point is the human brain: it layers short-term, working, and long-term memory, and each layer trades capacity for fidelity.
Now, what tools do we actually have in engineering to approximate that?
We have three primitives:
1) Compaction (summaries / rolling state)
This is why tools like Cursor / Claude Code can keep going: they compress the past into something smaller.
Downside: extreme compaction drifts. It loses details. It loses grounding.
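To make the idea concrete, here’s a minimal sketch of a rolling-compaction buffer. It is not how Cursor or Claude Code actually implement this; `summarize` is a placeholder for a real LLM summarization call, and the hard length cap is what makes the compaction lossy (the drift described above).

```python
# Minimal sketch of rolling compaction (illustrative, not any tool's real code).
# Recent turns stay verbatim; older turns are folded into a lossy summary.

def summarize(old_summary: str, dropped_turns: list[str]) -> str:
    # Placeholder: a real system would call a model here.
    merged = " ".join(dropped_turns)
    return (old_summary + " | " + merged)[:200]  # hard cap = lossy compaction

class RollingContext:
    def __init__(self, max_recent: int = 4):
        self.max_recent = max_recent
        self.summary = ""
        self.recent: list[str] = []

    def add(self, turn: str) -> None:
        self.recent.append(turn)
        if len(self.recent) > self.max_recent:
            # Fold everything beyond the recent window into the summary.
            dropped = self.recent[: -self.max_recent]
            self.recent = self.recent[-self.max_recent :]
            self.summary = summarize(self.summary, dropped)

    def prompt(self) -> str:
        # What actually goes to the model: compressed past + verbatim present.
        return f"[summary] {self.summary}\n" + "\n".join(self.recent)
```

The failure mode is visible in the cap: once `summarize` truncates, details are gone for good.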
2) Compression (real compression, not summarization)
This is the exciting part. Recently, beacon-style compression showed you can compress tokens by roughly 8x and keep inference fast. This opens the door to mid-term context at a scale that feels like “working memory”, not “tiny window”.
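The footprint math is worth seeing. The toy below is plain mean pooling, not the learned beacon mechanism itself; it only illustrates the shape change an ~8x ratio implies: n token vectors become ceil(n/8) condensed vectors.

```python
# Toy illustration of ~8x token compression (NOT the actual beacon method):
# collapse each group of `ratio` token vectors into one pooled vector.

def compress(tokens: list[list[float]], ratio: int = 8) -> list[list[float]]:
    out = []
    for i in range(0, len(tokens), ratio):
        group = tokens[i : i + ratio]
        dim = len(group[0])
        # Mean-pool the group dimension-wise into a single condensed vector.
        out.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return out
```

At an 8x ratio, a 100k-token history shrinks to ~12.5k condensed positions, which is what makes “working memory” scale plausible.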
3) RAG (retrieval + citations)
RAG is great for long-term. It’s not perfect and not always precise, but it’s the best tool we have to search large archives and bring back evidence.
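The retrieve-then-cite shape can be sketched in a few lines. Real systems use embedding models and a vector store; this toy uses bag-of-words overlap instead, and the archive, doc ids, and `retrieve` signature are all made up for illustration.

```python
# Minimal sketch of retrieval with citations over a tiny in-memory archive.
# Bag-of-words cosine similarity stands in for real embeddings.

from collections import Counter
import math

def score(query: str, doc: str) -> float:
    q, d = Counter(query.lower().split()), Counter(doc.lower().split())
    overlap = sum((q & d).values())  # shared word count
    return overlap / math.sqrt((sum(q.values()) * sum(d.values())) or 1)

def retrieve(query: str, archive: dict[str, str], k: int = 2):
    ranked = sorted(archive, key=lambda doc_id: score(query, archive[doc_id]),
                    reverse=True)
    # Return evidence WITH citations (doc ids), not just text.
    return [(doc_id, archive[doc_id]) for doc_id in ranked[:k]]
```

The key design point is the return shape: evidence paired with a doc id, so the model’s answer can point back at its source.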