We use an adaptive classifier to learn how many tokens the model takes to respond correctly on a known dataset. I used https://huggingface.co/adaptive-classifier/llm-router for the experiments; it is based on DistilBERT.
Query complexity in this context is based on how many tokens it took for the model to respond to a query correctly, measured against a ground-truth dataset like GSM8K. The adaptive classifier is trained over this dataset, and we then use it at inference time for classification.
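Roughly, the classification step looks like the sketch below. This is a minimal sketch, assuming the adaptive-classifier package's from_pretrained / predict / add_examples interface and the HIGH/LOW labels on the llm-router model card; the token budgets are illustrative values, not what AutoThink actually uses.

```python
# Minimal sketch; method names and labels are assumptions, check the
# adaptive-classifier README for the exact interface.
# pip install adaptive-classifier
from adaptive_classifier import AdaptiveClassifier

# Load the pretrained (DistilBERT-based) complexity/router classifier from the Hub
classifier = AdaptiveClassifier.from_pretrained("adaptive-classifier/llm-router")

# Classify an incoming query; predictions are (label, confidence) pairs
query = "If a train travels 120 km in 1.5 hours, what is its average speed?"
label, confidence = classifier.predict(query)[0]

# Map the predicted complexity to a thinking-token budget (illustrative numbers)
token_budget = 4096 if label == "HIGH" else 512
print(label, confidence, token_budget)

# Because the classifier is adaptive, new labelled examples can be added
# without retraining from scratch
classifier.add_examples(["What is 2 + 2?"], ["LOW"])
```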
Yes, if you only care about correctness, you always use the maximum possible inference compute. Everything that does not do that is trading correctness for speed.
Yes, the goal here is to avoid overthinking and be as efficient as possible in terms of the minimal tokens required to solve a query. Often, queries that require too many tokens are unlikely to lead to correct answers anyway; otherwise they would show up when we are learning the classifier.
If you ask it to rethink the problem because you've found a flaw, does it bump up the complexity and actually think about it? Like a person might give you a quick answer to something, and questioning that answer would cause them to think more deeply about it.
The short answer is that, in general, yes, it helps improve accuracy; there is a whole line of work on self-consistency and critique that supports this. Many of those approaches are already implemented in optillm.
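For reference, here is a rough sketch of how one of those approaches can be invoked through optillm's OpenAI-compatible proxy. The "self_consistency" slug, the port, and the backing model are my assumptions; check the optillm README for the exact approach names.

```python
# Sketch: selecting a self-consistency style approach via the optillm proxy.
# Assumes optillm is running locally as an OpenAI-compatible server; the slug,
# port, and backend model below are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local optillm endpoint
    api_key="optillm",                    # placeholder; the proxy forwards to the real backend
)

response = client.chat.completions.create(
    # Prefixing the model name selects the inference-time approach in optillm
    model="self_consistency-gpt-4o-mini",
    messages=[
        {"role": "user", "content": "I think your earlier answer has a flaw; rework the problem step by step."},
    ],
)
print(response.choices[0].message.content)
```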
If compute is limited, then dedicating more resources to the questions that are more likely to need it will increase correctness overall, even if it may decrease correctness for some individual responses.
This sounds like an interesting idea; can you elaborate more, maybe with a concrete example? I am wondering if this could be implemented easily as a plugin in optillm.
The motivation for AutoThink came from watching how current reasoning models waste computation - they spend the same amount of "thinking time" on "what's 2+2?" as they do on complex mathematical proofs. This seemed obviously inefficient.
The breakthrough was combining two techniques I'd been working on separately: adaptive classification (which can learn new categories without retraining) and an open source implementation of Pivotal Token Search from Microsoft's Phi-4 paper. When I put them together with dynamic token budgeting, the performance gains were much better than expected.
What surprised me most was that the technique actually uses fewer tokens on average while improving performance. The adaptive allocation means simple queries finish faster, offsetting the extra computation on complex ones.
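If it helps to picture the budgeting part, here is a rough sketch of capping a model's <think> block at the classifier's budget. The model name, tag handling, and two-phase generation are my own simplification for illustration, not the actual AutoThink/optillm code.

```python
# Sketch: enforce a per-query thinking budget on a <think>-style reasoning model.
# Model name and budgets are placeholders; the real budgeting logic differs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # any model that emits <think> blocks
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def generate_with_budget(prompt: str, thinking_budget: int, answer_budget: int = 256) -> str:
    # Phase 1: let the model think, but cap the reasoning tokens at the budget
    ids = tok.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    )
    thought = model.generate(ids, max_new_tokens=thinking_budget, do_sample=False)

    # Phase 2: if the budget ran out before the model closed its reasoning,
    # force the block closed, then generate the final answer
    if "</think>" not in tok.decode(thought[0, ids.shape[-1]:]):
        close = tok("\n</think>\n", return_tensors="pt", add_special_tokens=False).input_ids
        thought = torch.cat([thought, close], dim=-1)
    out = model.generate(thought, max_new_tokens=answer_budget, do_sample=False)
    return tok.decode(out[0, ids.shape[-1]:], skip_special_tokens=True)

# A LOW-complexity query gets a small budget; a HIGH one would get a larger budget
print(generate_with_budget("What is 2 + 2?", thinking_budget=64))
```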
A few technical notes:
- The steering vectors are small (typically <1MB per pattern) and add minimal memory overhead
- Classification adds about 10ms latency, which is negligible
- Target layer selection matters - I found middle layers (15-20) work best for most models
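To make the target-layer point concrete, here is a rough sketch of what injecting a steering vector at a middle decoder layer can look like with a forward hook. The model, layer index, vector, and scaling factor below are illustrative placeholders, not the AutoThink implementation.

```python
# Sketch: add a steering vector to the hidden states of one transformer layer.
# Everything below (model, layer, vector, alpha) is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # any HF causal LM exposing model.model.layers
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

target_layer = 15                        # middle layers (~15-20) worked best in my runs
hidden = model.config.hidden_size
steering_vector = torch.randn(hidden)    # in practice a learned per-pattern vector (<1MB)
steering_vector = steering_vector / steering_vector.norm()
alpha = 4.0                              # steering strength (illustrative)

def add_steering(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states
    hs = output[0] if isinstance(output, tuple) else output
    hs = hs + alpha * steering_vector.to(hs.dtype)
    return (hs,) + output[1:] if isinstance(output, tuple) else hs

handle = model.model.layers[target_layer].register_forward_hook(add_steering)
ids = tok("Explain why the sky is blue.", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=64)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```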
I'd love feedback on:
- Have you tried similar adaptive approaches with your models?
- What other reasoning patterns would be useful to steer toward?
- Ideas for automatically detecting the optimal target layer?
Thanks for checking it out! Happy to answer any questions about the implementation or results.
> they spend the same amount of "thinking time" on "what's 2+2?" as they do on complex mathematical proofs.
Not anymore. Have you seen Gemini 2.5 Pro? Ask it simple questions and it almost doesn't "think". Ask it a coding question and it'll write a long reasoning article. I think the same goes for o3.
The original o1 also didn't do this. Neither did the actual DeepSeek R1. You could even get it to answer immediately without any reasoning tokens. These highly distilled versions just lost most of their common sense for this.
Huge assumption; there is a wide range of parameters that go into how accurate you need a response to be, depending on context. Just as surely as there are questions where you need a 100% accurate response regardless of response time, I'm sure there are questions at the other extreme.
In this situation you would have someone with actual knowledge of the mechanics involved do the computation using the actual data (e.g., what's the mass of the train? Which kind of brakes does it have?) instead of asking an LLM and trusting it to give the correct answer without checking.
Assuming you could find an expert like that in time, and that they will then be able to understand and solve the problem fast enough to still be helpful.
If you need the answer within a couple of hours, you can probably get it from an expert; if you need an actionable answer within minutes, based on some back-of-the-envelope calculations, then a SOTA LLM is a much safer bet than flagging whoever seems the smartest in the room and asking them for help.
What I really don't like is that I can't manually decide how much thinking Gemini should allocate to a prompt. You're right that sometimes it doesn't think, but for me this also happens on complex queries where I WOULD want it to think. Even things like "super think about this" etc. don't help; it just refuses to.
Yes, we started with the idea of trying to replicate similar control over the thinking process for open reasoning models. They also announced the Deep Think approach at I/O, which goes even further and combines parallel CoTs at inference.
Definitely, in my experience. Elsewhere in the thread, OP says that open models/systems don't do this, in which case this seems like important work toward making open alternatives competitive.
Congratulations! Any work to optimise efficiency w.r.t. LLMs is much appreciated.
So far I've taken only a lazy approach to optimising local LLMs: sending small queries to my M4 Mac Mini running MLX models and sending larger queries to my Nvidia 4090. It's remarkable how efficient the M4 is compared to the Nvidia card, and I think Apple is heading in the right direction with MLX.
I would read about AutoThink and try to integrate it with my workflow.
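The "lazy routing" amounts to something like the sketch below; the endpoints, ports, model name, and length cutoff are placeholders rather than my actual config, and something like AutoThink's classifier could replace the crude length heuristic.

```python
# Sketch: route short prompts to a local MLX server and longer ones to the GPU box.
# All endpoints, ports, and the threshold are placeholders.
from openai import OpenAI

MLX_BACKEND = OpenAI(base_url="http://mac-mini.local:8080/v1", api_key="none")
CUDA_BACKEND = OpenAI(base_url="http://gpu-box.local:8000/v1", api_key="none")

def route(prompt: str, threshold_chars: int = 400) -> str:
    """Crude size-based routing; a complexity classifier could replace this check."""
    backend = MLX_BACKEND if len(prompt) < threshold_chars else CUDA_BACKEND
    resp = backend.chat.completions.create(
        model="local-model",  # whatever model name your server exposes
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(route("What's the capital of France?"))
```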
I have thought it might be worth seeding responses with the output of non-reasoning models: after the user prompt, inject a block like "a non-reasoning model thought this: ... stuff ... Was that what the user wanted?" For the instances where the non-reasoning version was sufficient, it might help the reasoning model get to the point earlier.
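Something like this rough sketch; the model names and the injected wording are just placeholders for the idea, not a tested setup.

```python
# Sketch: seed a reasoning model with a cheap non-reasoning model's draft answer.
# Both model names and the injected text are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible backend with credentials in the environment

def answer_with_seed(user_prompt: str) -> str:
    # 1. Get a cheap draft from a non-reasoning model
    draft = client.chat.completions.create(
        model="small-non-reasoning-model",
        messages=[{"role": "user", "content": user_prompt}],
    ).choices[0].message.content

    # 2. Inject the draft after the user prompt for the reasoning model
    seeded = (
        f"{user_prompt}\n\n"
        f"A non-reasoning model thought this: {draft}\n"
        f"Was that what the user wanted? If so, confirm it briefly; "
        f"if not, reason further and correct it."
    )
    final = client.chat.completions.create(
        model="reasoning-model",
        messages=[{"role": "user", "content": seeded}],
    )
    return final.choices[0].message.content
```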
I actually managed to replicate the new SOTA for circle packing in unit squares as found in the AlphaEvolve paper: 2.635 for 26 circles in a unit square. It took about 800 iterations to find the best program, which itself uses an optimisation phase, and running it led to the optimal packing in one of its runs.
I think you've nailed the key point. A lot of "coding" isn't actually writing code, but understanding the problem space and designing a good solution. If I'm spending too long wrestling with the implementation, it's usually a sign that I didn't fully grasp the problem upfront or my design is flawed. Good tooling helps, for sure, but it's no substitute for solid problem analysis.
It's true, predicting Nvidia's downfall has become a recurring theme. It's easy to underestimate a company that consistently adapts and innovates. Maybe the narrative isn't about "stealing their lunch" but rather carving out specialized niches.