Isn't the ability to store past reasoning in an external system, so you avoid having to redo the computation, precisely what a memory is, though?
Mathematically, sure, KV-caching is equivalent to redoing prefill at every token; it's just an optimization. But the important part of my message was the attention.
A plan or piece of reasoning formed during the forward pass of token 0 can be attended to by the subsequent (or parallel, if you don't want to use the cache) passes of tokens 1, …, n. So you cannot consider token n to be starting from scratch in terms of reasoning/planning: it can reuse what was already planned during previous tokens.
If you think about inference with KV-caching, even though you are right that mathematically it's just an optimization, it makes this behavior much easier to reason about: the KV cache is a store of past internal states that the model can attend to when generating subsequent tokens, which lets each subsequent token's hidden states be more than a repetition of what the model already reasoned about in the past.
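To make the equivalence concrete, here is a minimal sketch, assuming PyTorch; the names (`attend`, `k_cache`, etc.) are illustrative, not any real library's API. It computes single-head attention for the last token two ways, full recompute over the prefix versus incremental decoding against a cache of past keys/values, and checks that the outputs match:

```python
# Minimal sketch (assumes PyTorch): single-head causal attention computed
# two ways, full recompute vs. incremental decoding with a KV cache.
# All names here are illustrative, not a real framework API.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 16                      # head dimension
n = 5                       # sequence length
x = torch.randn(n, d)       # token hidden states entering the attention layer
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def attend(q, K, V):
    # Scaled dot-product attention of one query against all keys/values.
    scores = (q @ K.T) / d ** 0.5
    return F.softmax(scores, dim=-1) @ V

# (a) Full "prefill" recompute: keys/values for the whole prefix are
# rebuilt from scratch before attending with the last token's query.
K_full, V_full = x @ Wk, x @ Wv
out_full = attend(x[-1] @ Wq, K_full, V_full)

# (b) Incremental decoding: each token's key/value is appended to the
# cache once and reused; past forward-pass states are never recomputed.
k_cache, v_cache = [], []
for t in range(n):
    k_cache.append(x[t] @ Wk)
    v_cache.append(x[t] @ Wv)
out_cached = attend(x[-1] @ Wq, torch.stack(k_cache), torch.stack(v_cache))

assert torch.allclose(out_full, out_cached)  # same result, far less compute
```

The assert passing is the whole point: the cache changes only how much work is redone, while the states token n attends to are exactly the ones produced during earlier tokens' forward passes.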
It's correct to state that the LLM starts anew for each token.
The workaround for this is to pass the existing plan back to it as part of the context.
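For instance, a hypothetical sketch of that workaround (`client` and `generate` are placeholders, not a real API):

```python
# Hypothetical sketch: the model keeps no state between calls, so any plan
# it produced earlier must be re-injected into the prompt by the caller.
plan = "1) parse input  2) validate fields  3) emit report"
prompt = (
    "You previously made this plan:\n"
    f"{plan}\n"
    "Continue from step 2."
)
# reply = client.generate(prompt)  # `client` stands in for any LLM API
```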