We use an adaptive classifier to learn how many tokens the model takes to respond correctly on a known dataset. I used https://huggingface.co/adaptive-classifier/llm-router for the experiments; it is based on DistilBERT.
Query complexity in this context is based on how many tokens it took for the model to respond to a query correctly, measured against a ground-truth dataset like GSM8K. The adaptive classifier is trained over this dataset, and we then use it at inference time for classification.
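Roughly, the classification step looks like the sketch below. This is a minimal sketch, assuming the adaptive-classifier package's from_pretrained / predict / add_examples interface and the HIGH/LOW labels on the llm-router model card; the token budgets are illustrative values, not what AutoThink actually uses.

```python
# Minimal sketch; method names and labels are assumptions, check the
# adaptive-classifier README for the exact interface.
# pip install adaptive-classifier
from adaptive_classifier import AdaptiveClassifier

# Load the pretrained (DistilBERT-based) complexity/router classifier from the Hub
classifier = AdaptiveClassifier.from_pretrained("adaptive-classifier/llm-router")

# Classify an incoming query; predictions are (label, confidence) pairs
query = "If a train travels 120 km in 1.5 hours, what is its average speed?"
label, confidence = classifier.predict(query)[0]

# Map the predicted complexity to a thinking-token budget (illustrative numbers)
token_budget = 4096 if label == "HIGH" else 512
print(label, confidence, token_budget)

# Because the classifier is adaptive, new labelled examples can be added
# without retraining from scratch
classifier.add_examples(["What is 2 + 2?"], ["LOW"])
```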
Yes, if you only care about correctness, you always use the maximum possible inference compute. Everything that does not do that is trading correctness for speed.
Yes, the goal here is to avoid overthinking and be as efficient as possible in terms of the minimal tokens required to solve a query. Often, queries that require too many tokens are unlikely to lead to correct answers anyway; otherwise they would show up when we are learning the classifier.
If you ask it to rethink the problem because you've found a flaw, does it bump up the complexity and actually think about it? Like a person might give you a quick answer to something, and questioning that answer would cause them to think more deeply about it.
The short answer is that, in general, yes, it helps improve accuracy; there is a whole line of work on self-consistency and critique that supports this. Many of those approaches are already implemented in optillm.
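For reference, here is a rough sketch of how one of those approaches can be invoked through optillm's OpenAI-compatible proxy. The "self_consistency" slug, the port, and the backing model are my assumptions; check the optillm README for the exact approach names.

```python
# Sketch: selecting a self-consistency style approach via the optillm proxy.
# Assumes optillm is running locally as an OpenAI-compatible server; the slug,
# port, and backend model below are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local optillm endpoint
    api_key="optillm",                    # placeholder; the proxy forwards to the real backend
)

response = client.chat.completions.create(
    # Prefixing the model name selects the inference-time approach in optillm
    model="self_consistency-gpt-4o-mini",
    messages=[
        {"role": "user", "content": "I think your earlier answer has a flaw; rework the problem step by step."},
    ],
)
print(response.choices[0].message.content)
```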
If compute is limited, then dedicating more resources to the questions that are more likely to need it will increase correctness overall, even if it may decrease correctness for some individual responses.
This sounds like an interesting idea; can you elaborate more, maybe with a concrete example? I am wondering if this could be implemented easily as a plugin in optillm.
The motivation for AutoThink came from watching how current reasoning models waste computation - they spend the same amount of "thinking time" on "what's 2+2?" as they do on complex mathematical proofs. This seemed obviously inefficient.
The breakthrough was combining two techniques I'd been working on separately: adaptive classification (which can learn new categories without retraining) and an open source implementation of Pivotal Token Search from Microsoft's Phi-4 paper. When I put them together with dynamic token budgeting, the performance gains were much better than expected.
What surprised me most was that the technique actually uses fewer tokens on average while improving performance. The adaptive allocation means simple queries finish faster, offsetting the extra computation on complex ones.
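If it helps to picture the budgeting part, here is a rough sketch of capping a model's <think> block at the classifier's budget. The model name, tag handling, and two-phase generation are my own simplification for illustration, not the actual AutoThink/optillm code.

```python
# Sketch: enforce a per-query thinking budget on a <think>-style reasoning model.
# Model name and budgets are placeholders; the real budgeting logic differs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # any model that emits <think> blocks
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def generate_with_budget(prompt: str, thinking_budget: int, answer_budget: int = 256) -> str:
    # Phase 1: let the model think, but cap the reasoning tokens at the budget
    ids = tok.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True,
        return_tensors="pt",
    )
    thought = model.generate(ids, max_new_tokens=thinking_budget, do_sample=False)

    # Phase 2: if the budget ran out before the model closed its reasoning,
    # force the block closed, then generate the final answer
    if "</think>" not in tok.decode(thought[0, ids.shape[-1]:]):
        close = tok("\n</think>\n", return_tensors="pt", add_special_tokens=False).input_ids
        thought = torch.cat([thought, close], dim=-1)
    out = model.generate(thought, max_new_tokens=answer_budget, do_sample=False)
    return tok.decode(out[0, ids.shape[-1]:], skip_special_tokens=True)

# A LOW-complexity query gets a small budget; a HIGH one would get a larger budget
print(generate_with_budget("What is 2 + 2?", thinking_budget=64))
```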
A few technical notes:
- The steering vectors are small (typically <1MB per pattern) and add minimal memory overhead
- Classification adds about 10ms latency, which is negligible
- Target layer selection matters - I found middle layers (15-20) work best for most models
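To make the target-layer point concrete, here is a rough sketch of what injecting a steering vector at a middle decoder layer can look like with a forward hook. The model, layer index, vector, and scaling factor below are illustrative placeholders, not the AutoThink implementation.

```python
# Sketch: add a steering vector to the hidden states of one transformer layer.
# Everything below (model, layer, vector, alpha) is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # any HF causal LM exposing model.model.layers
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

target_layer = 15                        # middle layers (~15-20) worked best in my runs
hidden = model.config.hidden_size
steering_vector = torch.randn(hidden)    # in practice a learned per-pattern vector (<1MB)
steering_vector = steering_vector / steering_vector.norm()
alpha = 4.0                              # steering strength (illustrative)

def add_steering(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states
    hs = output[0] if isinstance(output, tuple) else output
    hs = hs + alpha * steering_vector.to(hs.dtype)
    return (hs,) + output[1:] if isinstance(output, tuple) else hs

handle = model.model.layers[target_layer].register_forward_hook(add_steering)
ids = tok("Explain why the sky is blue.", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=64)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```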
I'd love feedback on:
- Have you tried similar adaptive approaches with your models?
- What other reasoning patterns would be useful to steer toward?
- Ideas for automatically detecting the optimal target layer?
Thanks for checking it out! Happy to answer any questions about the implementation or results.
> they spend the same amount of "thinking time" on "what's 2+2?" as they do on complex mathematical proofs.
Not anymore. Have you seen Gemini 2.5 Pro? Ask it simple questions and it almost doesn't "think". Ask it a coding question and it'll write a long reasoning article. I think the same goes for o3.
The original o1 also didn't do this. Neither did the actual DeepSeek R1. You could even get it to answer immediately without any reasoning tokens. These highly distilled versions just lost most of their common sense for this.
Huge assumption; there is a wide range of parameters that go into how accurate you need a response to be, depending on context. Just as surely as there are questions where you need a 100% accurate response regardless of response time, I'm sure there are questions at the other extreme.
In this situation you would have someone with actual knowledge of the mechanics involved do the computation using the actual data (e.g., what's the mass of the train? Which kind of brakes does it have?) instead of asking an LLM and trusting it to give the correct answer without checking.
Assuming you could find an expert like that in time, and that they will then be able to understand and solve the problem fast enough to still be helpful.
If you need the answer within a couple of hours, you can probably get it from an expert; if you need an actionable answer within minutes, based on some back-of-the-envelope calculations, then a SOTA LLM is a much safer bet than flagging whoever seems the smartest in the room and asking them for help.
What I really don't like is that I can't manually decide how much thinking Gemini should allocate to a prompt. You're right that sometimes it doesn't think, but for me this also happens on complex queries where I WOULD want it to think. Even things like "super think about this" etc. don't help; it just refuses to.
Yes, we started with the idea of trying to replicate similar control over the thinking process for open reasoning models. They also announced the Deep Think approach at I/O, which goes even further and combines parallel CoTs at inference.
Definitely, in my experience. Elsewhere in the thread, OP says that open models/systems don't do this, in which case this seems like important work toward making open alternatives competitive.
Congratulations! Any work to optimise efficiency w.r.t. LLMs is much appreciated.
So far I've taken only a lazy approach to optimising local LLMs: sending small queries to my M4 Mac Mini running MLX models and sending larger queries to my Nvidia 4090. It's remarkable how efficient the M4 is compared to the Nvidia card, and I think Apple is heading in the right direction with MLX.
I would read about AutoThink and try to integrate it with my workflow.
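The "lazy routing" amounts to something like the sketch below; the endpoints, ports, model name, and length cutoff are placeholders rather than my actual config, and something like AutoThink's classifier could replace the crude length heuristic.

```python
# Sketch: route short prompts to a local MLX server and longer ones to the GPU box.
# All endpoints, ports, and the threshold are placeholders.
from openai import OpenAI

MLX_BACKEND = OpenAI(base_url="http://mac-mini.local:8080/v1", api_key="none")
CUDA_BACKEND = OpenAI(base_url="http://gpu-box.local:8000/v1", api_key="none")

def route(prompt: str, threshold_chars: int = 400) -> str:
    """Crude size-based routing; a complexity classifier could replace this check."""
    backend = MLX_BACKEND if len(prompt) < threshold_chars else CUDA_BACKEND
    resp = backend.chat.completions.create(
        model="local-model",  # whatever model name your server exposes
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

print(route("What's the capital of France?"))
```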
I have thought it might be worth seeding responses with the output of non-reasoning models: after the user prompt, inject a block like "a non-reasoning model thought this: ... stuff ... Was that what the user wanted?" For the instances where the non-reasoning version was sufficient, it might help the reasoning model get to the point earlier.
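Something like this rough sketch; the model names and the injected wording are just placeholders for the idea, not a tested setup.

```python
# Sketch: seed a reasoning model with a cheap non-reasoning model's draft answer.
# Both model names and the injected text are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible backend with credentials in the environment

def answer_with_seed(user_prompt: str) -> str:
    # 1. Get a cheap draft from a non-reasoning model
    draft = client.chat.completions.create(
        model="small-non-reasoning-model",
        messages=[{"role": "user", "content": user_prompt}],
    ).choices[0].message.content

    # 2. Inject the draft after the user prompt for the reasoning model
    seeded = (
        f"{user_prompt}\n\n"
        f"A non-reasoning model thought this: {draft}\n"
        f"Was that what the user wanted? If so, confirm it briefly; "
        f"if not, reason further and correct it."
    )
    final = client.chat.completions.create(
        model="reasoning-model",
        messages=[{"role": "user", "content": seeded}],
    )
    return final.choices[0].message.content
```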
I actually managed to replicate the new SOTA for circle packing in unit squares as found in the AlphaEvolve paper: 2.635 for 26 circles in a unit square. It took about 800 iterations to find the best program, which itself uses an optimisation phase, and running it led to the optimal packing in one of its runs.
I think you've nailed the key point. A lot of "coding" isn't actually writing code, but understanding the problem space and designing a good solution. If I'm spending too long wrestling with the implementation, it's usually a sign that I didn't fully grasp the problem upfront or my design is flawed. Good tooling helps, for sure, but it's no substitute for solid problem analysis.
It's true, predicting Nvidia's downfall has become a recurring theme. It's easy to underestimate a company that consistently adapts and innovates. Maybe the narrative isn't about "stealing their lunch" but rather carving out specialized niches.