1M Tokens: It’s Not the Size, It’s How You Use It
By Avi Cavale

Consider what a durable piece of knowledge actually costs. A typed decision item — “we use leaky bucket for rate limiting, chosen over token bucket for smoother traffic shaping” — is about 40 tokens. The conversation where that decision was discussed, with all the back-and-forth, the code examples, the alternatives explored, the tangents? Maybe 50K tokens. The 40-token version is more useful to future sessions than the 50K-token version, because it’s structured, searchable, and directly applicable.
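To make “typed” concrete, here is one shape such a decision item could take. The schema and field names are my own illustration, not any particular tool’s format:

```python
from dataclasses import dataclass

# Hypothetical schema for a typed decision item; field names are illustrative.
@dataclass
class Decision:
    topic: str      # what the decision is about
    choice: str     # what was decided
    rejected: str   # the alternative considered
    rationale: str  # why the choice won
    scope: str      # where it applies

d = Decision(
    topic="rate limiting",
    choice="leaky bucket",
    rejected="token bucket",
    rationale="smoother traffic shaping",
    scope="API gateway",
)

# Rendered for injection into a future session, it stays around 40 tokens:
print(f"Decision[{d.scope}]: use {d.choice} for {d.topic}, "
      f"chosen over {d.rejected} for {d.rationale}.")
```

The point of the structure is not the dataclass; it’s that each field answers a question a future session will ask.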
I recently watched an engineer celebrate that a bigger window let them delete 4,700 lines of state-management scaffolding. I read the claim three times, trying to figure out if I was missing something. You had 4,700 lines of scaffolding… and your solution was to not manage state at all? To dump everything into a bigger pile and hope the model figures it out?
I’ve been building on top of these models for a while now, and I’m increasingly convinced this is a trap. Not because bigger windows are bad — they’re fine. But because they solve the wrong problem, and solving the wrong problem convincingly is worse than not solving it at all.

The bigger-is-better fallacy

500 tokens of the right knowledge outperforms 500K tokens of raw transcript.
A bigger window treats the symptom (not enough room) instead of the disease (no durable knowledge). And it treats it expensively.
This is the difference between recall and understanding. A 1M-token window gives you recall — the raw material is somewhere in there. Understanding requires structure — knowledge that’s been extracted, typed, scoped, and made retrievable.
Settling for recall and calling it understanding isn’t progress. It’s giving up on architecture.

What a bigger window can’t do

It can’t remember yesterday. 1M tokens is 1M tokens of this session. Close the tab, it’s gone. The decision your teammate made last week about the database schema? Not in the window. Never was. A bigger window doesn’t create memory. It creates a bigger ephemeral scratch pad.
It can’t represent knowledge that isn’t code. Why did the team choose Postgres over DynamoDB? Who’s the expert on the billing system? What happened in the last production incident? You could have a 10M-token window — none of this would be in it, because it doesn’t exist in files.
The shift isn’t 200K to 1M. It’s from stateless to stateful. From tools that forget to tools that learn. That’s not a context window problem. That’s an architecture problem. And architecture doesn’t get solved by making the window bigger.

The state management symptom

The window size arms race is a distraction. The vendors want you to think the constraint is window size because they can sell you a bigger window. The actual constraint is that AI tools are stateless — they have no persistent memory, no organizational knowledge, no compounding intelligence.
A bigger window doesn’t eliminate this problem — it lets you defer it. Instead of hitting the wall at 200K tokens, you hit it at 1M. But the architecture is still wrong. You’re still building on top of a system that forgets everything.
A system that extracts the 500 tokens of durable knowledge from Monday’s session and injects them into Tuesday’s session is infinitely more valuable than one that could hold 10M tokens but forgets everything overnight.
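What would it take to build that? A minimal sketch of the extract-and-inject loop, where the storage format, the extraction heuristic, and the injection step are all assumptions for illustration rather than any real product’s API:

```python
# Illustrative sketch of a cross-session memory loop; the storage format,
# extraction heuristic, and injection format are all assumptions.
import json
from pathlib import Path

STORE = Path("durable_knowledge.json")  # hypothetical persistent store

def extract(session_notes: list[str]) -> list[str]:
    """Keep only lines explicitly marked durable (decision:/pattern:/fix:)."""
    markers = ("decision:", "pattern:", "fix:")
    return [n for n in session_notes if n.lower().startswith(markers)]

def persist(items: list[str]) -> None:
    known = json.loads(STORE.read_text()) if STORE.exists() else []
    STORE.write_text(json.dumps(sorted(set(known + items))))

def inject() -> str:
    """Build the few hundred tokens of context the next session starts with."""
    if not STORE.exists():
        return ""
    return "Known decisions and patterns:\n" + "\n".join(json.loads(STORE.read_text()))

# Monday: 50K tokens of conversation yield a few durable lines.
persist(extract([
    "decision: use leaky bucket for rate limiting",
    "let me look at the middleware again...",   # transcript noise, dropped
    "fix: clamp retry jitter to 100ms",
]))

# Tuesday: a new session opens with the distilled knowledge, not the transcript.
print(inject())
```

The distillation step is crude here on purpose: even a naive marker-based filter preserves more across sessions than the biggest window does, because the window preserves nothing.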

The number I keep coming back to

The number is 40 versus 50,000: the typed decision item versus the transcript it was buried in. And when someone tells you “the bigger window made our state management unnecessary” — what they’re really saying is that they replaced engineered state management with hoping the model can find things in a larger pile of raw transcript. I’ve seen how that ends. It ends with the model confidently citing something from turn 3 that was corrected in turn 47, because both are in context and the model has no way to know which one is current.
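That failure has a mechanical cause. In a raw transcript, a statement and its correction coexist with equal standing; in typed state, knowledge is keyed, so a correction overwrites what it supersedes. A toy illustration with invented keys and values:

```python
# In a transcript, a statement and its later correction coexist with equal
# standing; the model must guess which one is current.
transcript = [
    ("turn 3",  "timeout is 30s"),
    ("turn 47", "correction: timeout is 10s"),
]
stale_copies = [text for _, text in transcript if "timeout" in text]  # both survive

# In typed state, knowledge is keyed, so the correction overwrites the original.
typed_state: dict[str, str] = {}
typed_state["timeout"] = "30s"  # turn 3
typed_state["timeout"] = "10s"  # turn 47: supersedes, no stale copy remains

print(len(stale_copies), typed_state["timeout"])
```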
And it can’t scale with cost. A full 1M-token request at current pricing is 10-50x more expensive than a focused 20K-token request, and you’re paying that premium for context the model probably won’t use. I did the math on our workload once: if we stuffed maximum context into every request instead of injecting only what’s relevant, our LLM bill would be roughly 30x higher. For worse results.
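The arithmetic is easy to check. The per-token price below is a placeholder, because only the ratio between the two strategies matters:

```python
# Placeholder price; only the ratio between the two strategies matters.
price_per_1k_input_tokens = 0.01  # dollars, hypothetical

def request_cost(context_tokens: int) -> float:
    return context_tokens / 1000 * price_per_1k_input_tokens

stuffed = request_cost(1_000_000)  # max out the 1M window on every request
focused = request_cost(20_000)     # inject only what's relevant

print(f"stuffed: ${stuffed:.2f}, focused: ${focused:.2f}, "
      f"ratio: {stuffed / focused:.0f}x")
# 50x is the top of the 10-50x range; real bills land lower because not
# every request can actually fill the window, which is how you get to ~30x.
```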
I’ve watched this happen on my own team. An engineer has a productive Monday session — makes decisions, discovers patterns, fixes subtle bugs. Tuesday morning, new session. The AI doesn’t know any of it. The engineer re-explains the relevant parts. Wednesday, same thing. By Friday, they’ve spent hours re-teaching instead of building.

The compounding problem

Take a real task: “add rate limiting to the API gateway.” The relevant context is maybe 2,000 tokens — the gateway middleware file, the team’s convention for cross-cutting concerns, and the decision from six months ago to use a particular algorithm. Everything else in your codebase is noise.
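Selecting those 2,000 tokens is a retrieval problem, not a window-size problem. A toy sketch of the selection step — the keyword scoring is deliberately naive and the file names are invented; a real system would use embeddings or a typed knowledge store:

```python
# Toy relevance filter: keep only candidates that share terms with the task,
# within a budget. This only illustrates that the task itself defines a small
# relevant set; production systems would score with embeddings instead.
def assemble_context(task: str, candidates: dict[str, str],
                     budget_chars: int = 8000) -> str:
    terms = set(task.lower().split())
    scored = sorted(
        candidates.items(),
        key=lambda kv: -len(terms & set(kv[1].lower().split())),
    )
    picked, used = [], 0
    for name, text in scored:
        if terms & set(text.lower().split()) and used + len(text) <= budget_chars:
            picked.append(f"## {name}\n{text}")
            used += len(text)
    return "\n\n".join(picked)

ctx = assemble_context(
    "add rate limiting to the API gateway",
    {
        "gateway/middleware.py": "middleware chain for the API gateway",
        "docs/conventions.md": "cross-cutting concerns live in gateway middleware",
        "billing/invoice.py": "monthly invoice generation",  # noise, excluded
    },
)
print(ctx)
```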
Session 1: you discuss the architecture and make decisions. 1M tokens of context. Session 2: you start fresh. Zero tokens of context. Everything from session 1 is gone. You spend 20 minutes rediscovering what you decided yesterday.
Every few months, a model vendor announces a bigger context window. 32K became 128K. 128K became 200K. Now it’s 1M. And every time, the same chorus: “This changes everything. Now we can fit the whole codebase in context.”
I saw a tweet last week from someone who was genuinely excited about the 1M token context window. “We had 4,700 lines of markdown state machines,” they wrote, “and the 1M window made it unnecessary.”
That tweet about 4,700 lines of state management stuck with me because it reveals the real issue. Those 4,700 lines existed because the tool has no memory. The context window was being used as a substitute for persistent state. The engineer was hand-building a memory system in markdown because the product didn’t have one.

Where this leaves us

At 200K tokens, you stuff in 50K of source files and hope the model finds the right ones. At 1M tokens, you stuff in 250K of source files and hope the model finds the right ones. The search space got 5x bigger. The signal didn’t change. You’re paying 5x more for the model to attend to 5x more irrelevant code.
Here’s what keeps bugging me. No matter how big the window gets, it can’t solve the actual problems: no persistent memory, no organizational knowledge, no compounding intelligence.
The deepest issue with bigger windows, and the one nobody talks about, is that they don’t compound. The 1,000th session with compounding knowledge has access to everything the organization learned in the first 999. The 1,000th session with a 1M-token window starts from scratch — again.
