Isn't the ability to store past reasoning in an external system to avoid having to do the computation all over again precisely what a memory is though?
Sure, mathematically KV-caching is equivalent to doing prefilling at every token. But the important part of my message was the attention.
A plan/reasoning formed during the forward pass of token 0 can be attended to by subsequent (or parallel, if you don't use the cache) passes for tokens 1, …, n. So you cannot consider token n to be starting from scratch in terms of reasoning/planning: it can reuse what was already planned during previous tokens.
If you think about inference with KV-caching, even though you are right that mathematically it's just an optimization, it makes this behavior much easier to reason about: the KV cache is a store of past internal states that the model can attend to at subsequent tokens, which allows those subsequent tokens' hidden states to be more than just a repetition of what the model already reasoned about in the past.
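To make the equivalence concrete, here is a toy single-head sketch (NumPy, with random matrices standing in for learned projection weights): causal attention computed in one full "prefill" pass gives the same outputs as step-by-step decoding that only attends to a growing KV cache.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # head dimension (illustrative)
n = 5  # sequence length

# Toy single-head projections; in a real model these are learned weights.
x = rng.normal(size=(n, d))   # token hidden states
Wq = rng.normal(size=(d, d))
Wk = rng.normal(size=(d, d))
Wv = rng.normal(size=(d, d))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Full prefill: causal attention over the whole sequence at once.
Q, K, V = x @ Wq, x @ Wk, x @ Wv
scores = Q @ K.T / np.sqrt(d)
scores[np.triu(np.ones((n, n), dtype=bool), k=1)] = -np.inf  # causal mask
full_out = softmax(scores) @ V

# Incremental decoding: each step attends only to the cached K/V of past tokens.
k_cache, v_cache, outs = [], [], []
for t in range(n):
    q, k, v = x[t] @ Wq, x[t] @ Wk, x[t] @ Wv
    k_cache.append(k)  # the KV cache: stored internal states from earlier passes
    v_cache.append(v)
    Kc, Vc = np.stack(k_cache), np.stack(v_cache)
    outs.append(softmax(q @ Kc.T / np.sqrt(d)) @ Vc)
inc_out = np.stack(outs)

print(np.allclose(full_out, inc_out))  # True: the cache is purely an optimization
```

The point of the sketch: the keys/values written at token 0 sit in the cache and are read by every later step, which is exactly the "store of past internal states" behavior described above.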
It's still recalculating; it's just that the intermediate steps are cached.