
When mammals hunt other mammals, strange things can happen.



If you actually read the Claude article, it says the same things as the Cognition article; it just uses a different definition of multi-agent.


It is by design. OpenAI is not going to reveal any architectural innovation they have made in their own commercial models.


Maybe not an architectural innovation, but both the Harmony format and the split into system/developer/user messages (instead of just system/user) are novel in the released-weights world, and different enough that I'm still in the process of updating my libraries so I can run fair benchmarks...
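For anyone who hasn't looked at it yet, the surface-level change is roughly this - a sketch of what the role split looks like through a chat-style API; the actual Harmony rendering adds its own special tokens and channels underneath:

    # Rough sketch of the system/developer/user role split (chat-API view only);
    # the underlying Harmony format adds special tokens and channels on top of this.
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "developer", "content": "Answer concisely and show your working."},
        {"role": "user", "content": "What is 17 * 24?"},
    ]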


You can run in two modes; by default you run in inference mode, without learning, so the changes you made will be used. If you switch to learning mode, the strategies are updated, refined, and merged based on a config that you can control.

    # How often to perform maintenance operations (merge, prune)
    MAINTENANCE_INTERVAL = 40

    # Strategy selection thresholds
    STRATEGY_CREATION_THRESHOLD = 0.7     # Higher threshold to avoid creating similar strategies
    STRATEGY_MERGING_THRESHOLD = 0.6      # Lower threshold to merge more similar strategies
    MIN_SUCCESS_RATE_FOR_INFERENCE = 0.4  # Minimum success rate for a strategy to be used during inference

The configs are all defined here - https://github.com/codelion/optillm/blob/main/optillm/plugin...
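In inference mode the plugin essentially just filters for proven strategies before building the prompt. Something like this sketch (illustrative only, with made-up names - not the actual plugin code):

    # Sketch: in inference mode only strategies that have already proven
    # themselves are applied to a new query (constants/names are illustrative).
    MIN_ATTEMPTS_FOR_INFERENCE = 5        # a strategy needs at least 5 attempts...
    MIN_SUCCESS_RATE_FOR_INFERENCE = 0.4  # ...and a 40% success rate before it is used
    MAX_STRATEGIES_PER_QUERY = 3          # cap so the system prompt stays short

    def select_strategies(candidates):
        proven = [s for s in candidates
                  if s.attempts >= MIN_ATTEMPTS_FOR_INFERENCE
                  and s.success_rate >= MIN_SUCCESS_RATE_FOR_INFERENCE]
        proven.sort(key=lambda s: s.success_rate, reverse=True)
        return proven[:MAX_STRATEGIES_PER_QUERY]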


We do not allow the strategies to keep growing; there is a refinement phase where we refine and merge existing strategies. The experiments were run with this config - https://github.com/codelion/optillm/blob/main/optillm/plugin... which allows a maximum of 10 strategies of each type.
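In learning mode the maintenance step works roughly like this (again just a sketch of the idea with hypothetical names, not the plugin's actual code):

    # Sketch: periodic maintenance in learning mode keeps the strategy pool bounded.
    MAINTENANCE_INTERVAL = 40         # run merge/prune every N queries
    STRATEGY_MERGING_THRESHOLD = 0.6  # merge strategies at least this similar
    MAX_STRATEGIES_PER_TYPE = 10      # cap used in the reported experiments

    def maintain(strategy_db, query_count):
        if query_count % MAINTENANCE_INTERVAL != 0:
            return
        # Collapse near-duplicate strategies into one refined strategy...
        strategy_db.merge_similar(threshold=STRATEGY_MERGING_THRESHOLD)
        # ...then prune the weakest so no problem type exceeds the cap.
        strategy_db.prune(max_per_type=MAX_STRATEGIES_PER_TYPE)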


Re-reading the problem apparently works well - https://arxiv.org/abs/2309.06275

Here the system seems to have discovered this strategy on its own. The prompts are generic because there is a step during learning that refines and combines them. I haven't experimented yet with adding all prompts to every query; given how large the context would get, it will be interesting to see.


Okay, but it looks like in the paper, they are actually adding the question twice in the prompt, not just instructing the model to read it twice. Or am I missing something?
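From my reading, the template in the paper looks roughly like this (a paraphrase, not the exact wording - see the paper for the precise prompt):

    # RE2-style prompt as described in the paper (rough paraphrase):
    # the question itself is repeated, prefixed by "Read the question again:".
    question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"
    prompt = (
        f"{question}\n"
        f"Read the question again: {question}\n"
        "Let's think step by step."
    )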


We have some examples in the plugin README: https://github.com/codelion/optillm/tree/main/optillm/plugin...

E.g., this was the strategy discovered by optillm for solving word problems:

*Refined Strategy for Solving Word Problems:*

1. *Understand:*
   * Read the problem carefully (multiple times).
   * Identify the question (what are you trying to find?).
   * List all given information (facts, numbers, units).
   * Clarify ambiguous terms/units.

2. *Organize Information & Identify Unknowns:*
   * Choose an organization method (e.g., table, diagram, list, drawing).
   * Clearly identify the unknowns (what you need to solve for).

3. *Plan and Translate:*
   * Define all variables with units (e.g., `p = number of pennies`, `c = number of compartments`).
   * Identify relationships between knowns and unknowns.
   * Convert units if necessary.
   * Write equations or expressions, including units, that relate the knowns and unknowns.
   * Ensure units are consistent throughout the equations.
   * Outline the solution steps.

4. *Solve:*
   * Show work step-by-step.
   * Track units throughout calculations.
   * Calculate accurately.
   * Solve for the unknowns.

5. *Evaluate and Verify:*
   * Check if the answer is reasonable.
   * Verify the answer.

6. *Summarize:*
   * State the answer with units.

The full list of strategies discovered is available here - https://github.com/codelion/optillm/blob/main/optillm/plugin...
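At inference time the selected strategies just get appended to the system prompt before the request goes to the model, roughly like this (an illustrative sketch, not the exact plugin code):

    # Sketch: augment the system prompt with the selected learned strategies
    # (function and attribute names here are illustrative).
    def build_system_prompt(base_prompt, strategies):
        parts = [base_prompt, "Apply the following problem-solving strategies:"]
        for i, s in enumerate(strategies, 1):
            parts.append(f"Strategy {i}:\n{s.text}")
        return "\n\n".join(parts)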


Optillm works with llama.cpp, but this approach is implemented as a decoding strategy in PyTorch, so at the moment you will need to use the local inference server in optillm to use it.


Thanks for checking this out! A few additional details that didn't fit in the main post:

The system maintains two separate limits: a storage limit (max 10 strategies per problem type in the database) and an inference limit (max 3 strategies applied per query). This keeps the database manageable while ensuring the system prompt doesn't get too long.

One interesting finding was that strategies only get used for inference once they have at least 5 attempts and a 40% success rate. This prevents the system from applying unproven strategies to new problems.

The approach works particularly well with reasoning models like DeepSeek-R1 and QwQ - the learned strategies seem to guide their thinking process effectively.

I'm especially curious about:

1. How this might work with different model families

2. Whether the community sees value in sharing strategy databases between users

3. Ideas for extending beyond text-based reasoning to multimodal problems

The plugin integrates with our broader optillm project, which has other inference optimization techniques. You can combine SPL with methods like mixture-of-agents or MCTS using the "&" operator.
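As a concrete example, with optillm running as a local OpenAI-compatible proxy, a combined call looks roughly like this (the base URL, port, and exact technique slugs depend on your setup - treat this as a sketch rather than copy-paste):

    from openai import OpenAI

    # optillm exposes an OpenAI-compatible endpoint; base_url/api_key depend on your setup.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="optillm")

    # Techniques are chained in the model-name prefix with "&"; "spl&moa" is meant as an
    # illustration of the combination - check the optillm README for the exact slugs.
    response = client.chat.completions.create(
        model="spl&moa-gpt-4o-mini",
        messages=[{"role": "user", "content": "If 3 pencils cost 45 cents, how much do 8 pencils cost?"}],
    )
    print(response.choices[0].message.content)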

Next I'm thinking about meta-learning - having the system learn how to create better strategies more efficiently. Also exploring collaborative strategy sharing.

Would love to hear thoughts on the approach or if anyone has ideas for other problem domains where this might be useful!


Hey, yes - the reported results do not impose any time or token limit for the benchmarks. We ran our baseline with the same config (0.6 temperature and 32k max_tokens), but we set a timeout of 600 seconds per instance; otherwise it would take forever to benchmark with the resources we had. There is a note on this in the implementation details section of the paper.


GPQA-Diamond is 200 questions. Any GPU since 2019 with 12GB of VRAM should be able to run tens if not hundreds of queries for a 1.5B model in parallel.


If we try to benchmark GPQA-Diamond with DeepSeek-R1 in the suggested configuration of 0.6 temperature and 32k max_tokens, and assume every instance uses the maximum tokens, it will require 6.4M tokens. Without batching, on a single H100 at 80 tok/s that will take roughly 22 hours to run. And to run with a 32k context length on a single H100, a 1.5B model will require ~15-20 GB of VRAM, so you cannot run tens or hundreds of queries in parallel.
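The back-of-the-envelope arithmetic:

    # Back-of-the-envelope check of the numbers above
    questions = 200            # GPQA-Diamond
    max_tokens = 32_000        # worst case per instance
    total_tokens = questions * max_tokens   # 6,400,000 tokens
    hours = total_tokens / 80 / 3600        # at 80 tok/s, run sequentially
    print(total_tokens, round(hours, 1))    # 6400000 tokens, ~22.2 hours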

MMLU-Pro is 12,000 instances. To avoid this, we set a 600-second timeout for each instance.

