FRESH

Hacker News

Home

Kimi K2.6 just beat Claude, GPT-5.5, and Gemini in a coding challenge

349 points by bazlightyear

by 0xbadcafebee

10 subcomments

These posts are going to be a constant for the next year, because there's no objective way to compare models (past low-level numbers like token generation speed, average reasoning token amount, # of parameters, active experts, etc). They're all quite different in a lot of ways, they're used for many different things by different people, and they're not deterministic. So you're constantly gonna see benchmarks and tests and proclamations of "THIS model beat THAT model!", with people racing around trying to find the best one.
But there is no best one. There's just the best one for you, based on whatever your criteria is. It's likely we'll end up in a "Windows vs MacOS vs Linux" style world, where people stick to their camps that do a particular thing a particular way.

by gertlabs

6 subcomments

I'm glad we're seeing a shift towards objectively scored tests.
We've been doing this at scale at https://gertlabs.com/rankings, and although the author looks to be running unique one-off samples, it's not surprising to see how well Kimi K2.6 performed. Based on our testing, for coding especially, Kimi is within statistical uncertainty of MiMo V2.5 Pro for top open weights model, and performs much better with tools than DeepSeek V4 Pro.
GPT 5.5 has a comfortable lead, but Kimi is on par with or better than Opus 4.6. The problem with Kimi 2.6 is that it's one of the slower models we've tested.

by noashavit

0 subcomment

Why are we still discussing which model is “the best”. The model is just one small part of what you need. Think agentic harness, data governance, AI guardrails, machine access controls, etc

by ninjahawk1

7 subcomments

At the current rate, open sourced models are expected to surpass cloud models within a couple years based on a study I read a couple days ago.
Looking back at chatGPT and claude a couple years ago, very small Qwen models are basically equal in coding to what those cloud based models could do then. Also factoring in scaling laws, a 9b going to 18b is roughly a 40% increase, whereas 18b to 35b is 20%, I expect there will be a change of at least price in cloud based models.
Adobe used to be $600 per month, then it became $20 when distribution scaled.

by sieve

1 subcomments

Kimi is really good.
I have been using Sonnet and others (DeepSeek, ChatGPT, MiniMax, Qwen) for my compiler/vm project and the Claude Pro plan is mostly unusable for any serious coding effort. So I use it in chat mode in the browser where it cannot needlessly read your entire project, and use Kimi on the OpenCode Go plan with pi.
Kimi consistently exceeded Sonnet on the C+Python project. Never had to worry about it doing anything other than what I asked it to do. GLM crapped the bed once or twice. Kimi never did.

by magicalhippo

6 subcomments

In a single challenge, measured by how performant the solution was.
Kimi K2.6 is definitely a frontier-sized model, so on the one hand it's not that surprising it's up there with the closed frontier models.
Being open is nice though, even though it doesn't matter that much for folks like me with a single consumer GPU.

by slashdave

4 subcomments

I was surprised by the ranking, until I read what the test was. Not horribly relevant for coding.
The current ranking of all tests makes more sense (well, except for how well Gemini does)
https://aicc.rayonnant.ai

by prirun

1 subcomments

I don't know much about the AI field, but it seems to me that trying to train any model to be all things to all people is a really dumb idea. It requires huge financial resources and is causing extreme shortages/market distortions in any resource used by an AI company - RAM, SSDs, data centers, etc.
In the real world, you don't hire a plumber and expect him to also do your landscaping, fix your car, and tailor your clothes. It would seem like a much better use of resources if I could download an app that specialized in shell, Python, and C coding for example, or maybe even that would be 3 apps that communicated. Maybe I could even run them on a regular machine with 16GB of RAM. I don't need one huge model that can do that and code in Fortran, COBOL, and Lisp.
As humans, we've done pretty well by specializing. I hope this gets explored more with smaller, focused AI models vs the current path of one model to rule them all that can only be run in a data center the size of a country.

by yanis_t

0 subcomment

Anecdotal, but having used Claude Code exclusively for the last several months, I was pleasantly surprised by how capable Pi + Kimi K2.6 is. It's also much faster (via OpenRouter), and at a fraction of the cost.

by ponyous

2 subcomments

Kimi is nowhere near GPT or Opus unfortunately. I really wish it was. I’m running evals where models have to generate code that produces 3D models and it’s obvious that it lacks spatial understanding and makes many more code errors before it succeeds.
Maybe it’s better in one particular case here and there and I think this blog post is example of that.

by aykutseker

0 subcomment

This seems less like Kimi is better at coding than Claude and more like Kimi found the right strategy for this particular game.
Still interesting though. The fact that an open weight model is close enough for that to matter is probably the real story.

by pllbnk

0 subcomment

What would prevent Anthropic or Open AI take one of the open source models, tune it to "feel" like one of their models and expose them under their respective names (GPT or Claude) to save on training and inference, but keep their brand name's popularity? If the models are not objectively worse and are cheaper to run, then nobody would complain, or complaining baseline would stay the same.

by adrian_b

0 subcomment

> Xiaomi confirming that weights for their newer V2.5 Pro model are dropping soon
This has already happened.
I have downloaded both the big Pro model and the smaller but multimodal MiMo-V2.5.
https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro
https://huggingface.co/XiaomiMiMo/MiMo-V2.5
https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro-Base
https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Base
The download of MiMo-V2.5-Pro takes 963 GB, while that of MiMo-V2.5 takes 295 GB.
For comparison, the download of Kimi-K2.6 takes 555 GB.

by justech

0 subcomment

I’ve been maining Kimi k2.6 through opencode go and openrouter for a week and I can say it’s the same experience as when I was maining Sonnet 3.5/4 late last year.
Not as good or as fast as Claude Code on Opus now but definitely enough for casual/hobby use. The best part is multiple choices for providers, if opencode gimps their service, I’ll switch

by codedokode

0 subcomment

It's interesting that OpenAI promised to make AI accessible for everyone, but China actually did it.

by zmmmmm

0 subcomment

I've been switching across all different models this week with OpenCode and Pi - we're in an interesting place now because the open models are definitely "good enough" for a wide range of coding tasks and MUCH cheaper. They certainly aren't AS good, especially once you get into unfamiliar territory - custom enterprise frameworks etc where model knowledge falls off and general intelligence kicks in. But then, with time people will build up custom skills and agent files for those. And the open models will also get better.
I could easily see us in a place 2 years from now where this coding application is fully commoditised.

by jrecyclebin

1 subcomments

I absolutely love Kimi's personality - some of the things it says are so out there! And it's been great for very focused, iterative work.
Its weakness is that it seems to yak on-and-on when it needs to plan out something big or read through and make sense of how to use a niche piece of a complex library. To the point where it can fill up its 256k window - and rack up a build. (No cache.) I have had better experience with GLM 5.1 in those cases.
Anyone out there relate?

by kmkrworks

0 subcomment

I don't feel like this is an optimal way of comparing models. I really don't think any metric as of now has the ability to list down the best model as of now. It prioritizes tasks over the overall ability, and I don't even think it's possible to.

by bazlightyear

0 subcomment

BTW it looks like Kimi won the subsequent challenge too https://aicc.rayonnant.ai/challenges/hexquerques/

by elromulous

1 subcomments

Is the site just slashdotted rn? Can anyone get to it?

by syntex

0 subcomment

These benchmarks means very little. The real test is model + harness so agentic system that can fulfill given goals.

by raffael_de

0 subcomment

I find this practically relevant insofar K2.6 is offered by Kagi in the 2 lower priced plans for $5 and $10 per month.

by PedroBatista

1 subcomments

Great to know, but what was the cost both in terms of $$ and tokens used?
Not to invalidate these benchmark results because they are useful, but the real usefulness it what they are capable to do when real people interact with them at scale.
Regardless, these are good news, because now that Microsoft is basically giving up their all-in strategy with Github's Copilot and Anthropic is playing the "I'm too good for you" game, it's about time for them to get pressed into not making this AI world into a divide between the haves and the have-nots.

by _pdp_

1 subcomments

Kimi is capable model but it needs a very good harness. With a good harness it is a very capable model. But it can get into all kinds of issues (loops and such) something that frontier models do not.
As I said, you can blame the model, but it is nothing that the harness cannot take care of more deterministically.

by Frannky

2 subcomments

I have to try Kimi. I was looking for an alternative. If you have any experience, advice, please share. I saw Kimi is at the top of the Open Router ranking.

by gizmodo59

0 subcomment

“I did not wake up to be a loser. This loser attitude, makes no sense to me.” - Frontier Model Labs (original quote by Jensen in a podcast)

by ajdegol

0 subcomment

I’ve been wondering about potential regression in coding models.
The initial models were corrected by programmers which gave a very high quality feedback signal. Whereas with vibe coding on the rise, you’ll lose that signal.

0 subcomment

by SomaticPirate

0 subcomment

This seems to be testing the models on leetcode style prompts that also require the model to implement TCP calls to send the results. Interesting but probably not a apples to apples comparison. The fact only Grok qualified for the first one seems suspect

by alex7o

0 subcomment

I don't know about you but kimi 2.6 from the kimi subscription has been absolutely bad and useless for the past 1 week so I canceled my sub and stopped using it.

by beering

1 subcomments

I’m a little confused as to the setup. It was asking each model to one-shot a script and then the scripts faced off? Were the models given a computer environment? Or a test server to iterate against?

by koala-news

0 subcomment

In my opinion, this kind of comparison is not very meaningful.

by imrozim

0 subcomment

Same experience here i use open router with claude as fallback for my startup. Is Kimi if close in quality the cost is difference hard to ignore

by warabe

0 subcomment

I’m not trying to add fuel to the fire, but will OpenAI and Anthropic’s IPO go smoothly?

by bjoli

0 subcomment

As a musician, I find the butchering of musical notation on Kimis pricing page extremely off-putting.

by muti

0 subcomment

Doesn't seem like a very insightful result. Kimi won with the naive strategy. Other models didn't slide tiles at all or didn't demonstrate understanding of the rules, claiming words that lost points. A strategy that did nothing would beat them.
We know these models can solve much more difficult problems, something isn't right.

by slopinthebag

1 subcomments

Amazing. To me it feels like GLM 5.1, Kimi 2.6, DeepSeek 4 are all competitive both with each other and with the American models. Truly a great time to be alive.
I would like to see more effort making the flash variants work for coding. They are super economical to use to brute force boilerplate and drudgery, and I wonder just how good they can be with the right harness, if it provides the right UX for the steering they require.
As much as vibe coding has captured the zeitgeist, I think long term using them as tools to generate code at the hands of skilled developers makes more sense. Companies can only go so long spending obscene amounts of money for subpar unmaintainable code.

by jakemanger

2 subcomments

What's the GPU VRAM requirements for this thing?
Awesome to have a open model that can compete, but damn it would be so much better if you could run it locally. Otherwise, it's almost so difficult to run (e.g. self host) that it's just way more convenient to pay OpenAI, Claude, etc

by gherkinnn

0 subcomment

I never looked in to the details of these benchmarks, I live with the assumptions that most benchmarks of any kind are gamed and useless.
What I do see in my own work and that of others around me, is that Claude consistently outperforms Gemini and to a lesser extent Codex.
With Claude eating tokens with declining return, concessions have to be made and Codex is a usable middle ground.
I use Kimi in Kagi's Assistant for non-code or generic programming questions and am quite happy with its no-bullshit responses.

by rvz

0 subcomment

So we are now at the point where open weight models are rapidly catching up to the frontier models.
They are at best 30 days behind, and at worst case 2 months behind. The last issue is being able to run the best one on conventional hardware without a rack of GPUs.
The Macbooks, and Mac minis are behind on hardware but eventually in the next 2 years at worst will make it possible thanks to the advancements of the M-series machines.
All of this is why companies like Anthropic feel like they have to use "safety" to stop you from running local models on your machine and get you hooked on their casino wasting tokens with a slot machine named Claude.

by pbreit

1 subcomments

All my co-workers say Claude blows away Gemini. Is it really that good? How can I do Kimi?

by VeejayRampay

0 subcomment

crazy how people on hacker news, who just gobble up anything if it's from openai or anthropic suddenly become monocled sceptics when chinese open models are "winning"

by wg0

0 subcomment

About 40% of stock market consists of about 7 or 8 companies. Those companies that are all into AI circular deals collectively trillions of dollars in valuations.
Now imagine a company burning 200,000/month on AI spend. Real numbers. Not every company is but some are.
Why such a company won't deploy an open weight model (Kimi 2.6 or Deepseek v4) on their own hardware (rented or otherwise) to save about 2.4 million dollars a year?
And these are the landmines Chinese cleverly did set up. Not saying intentionally or otherwise.
But end result is that good luck recouping your investments, you can pretty much kiss goodbye to any ROI. The bucket has a hole at the bottom and the bubble bust is guaranteed.
PS: Without open weight models too the economics do not make sense neither the code generated by these SOTA models is reliable enough to be deployed as is. Anyone claiming otherwise either hasn't worked on a real software stack with real users OR didn't use AI long enough to witness the AI slop and how hard it is to untangle or de-slopify the AI generated code therefore these trillion dollar valuations are absurd anyway.

by childintime

1 subcomments

Is there a lo-slop model that stands out when using Zig?

by PunchyHamster

0 subcomment

That is not a programming challenge, the fuck

by qakajjqj

0 subcomment

Yes gimini is a programming application

by walrus01

2 subcomments

People thinking to self-host Kimi K2.6 had better be prepared for how big it is.
Q8 K XL quantization for instance is around 600GB on disk. I would bet about 700GB of VRAM needed.
Quantizations lower than Q8 are probably worthless for quality.
Or 2.05TB on disk for the full precision GGUF.
https://huggingface.co/unsloth/Kimi-K2.6-GGUF
If you can afford the hardware to run Kimi K2.6 at any decent speed for more than 1 simultaneous user, you probably have a whole team of people on staff who are already very familiar with how to benchmark it vs Claude, GPT-5.5, etc.

by ant6n

1 subcomments

What I would like to see is a comparison of how well the models work in long running conversations:
```
  * do they lie and gaslight

  *  do they start breaking down on very long chats (forget old context, just get dumber)

  * do they constantly try to tell me how smart I am vs solving the problem (yes man)

  * do they follow conventions, parameters set out early in the prompts, or forget them

  * if they cant read a given file (like pdf), do they lie about it

  * is there a branch function to go back to earlier state of conversation

  * what is the quality of the presentation of results (structure, wording, excessive use of tables, appropriate use of headings)

  * how does the bot deal with user frustration (empathy?)
```
For example Chatpgt 5.5 is fairly smart, but presentation of results is kind of poor and unstructured, and unnecessarily long. It will break down on long conversations (the long answers dont help here), and it can’t deal with that except lying and gaslighting. It also has very little empathy, and mostly ignores user frustration. But at least theres branching, so one can go back without completely starting over.
Gemini doesnt feel quite as smart these days. It does well with very long conversations. Except it has bugs where all context gets lost or pruned, and it will lie and gaslight about it. Theres also no branching, so once context is lost you have to start over. Presentation is decent. Empathy is fairly good, except if users get frustrated, it gets more and more flustered and breaks down.

by surrTurr

0 subcomment

[dead]

by ibrahimhossain

0 subcomment

[flagged]

by Rekindle8090

0 subcomment

[dead]

by tim0414

0 subcomment

[dead]

by chillfox

0 subcomment

Meanwhile, I can’t get kimi k2.6 to edit a heredoc in a shell script without it fucking it up.

by plexescor

0 subcomment

I always though claude is the goat, but i guess its time to change the notion and try Kimi K2.6