FRESH

Hacker News

Home

Case study: Creative math – How AI fakes proofs

122 points by musculus

by simonw

6 subcomments

Somewhat ironic that the author calls out model mistakes and then presents https://tomaszmachnik.pl/gemini-fix-en.html - a technique they claim reduces hallucinations which looks wildly superstitious to me.
It involves spinning a whole yarn to the model about how it was trained to compete against other models but now it's won so it's safe for it to admit when it doesn't know something.
I call this a superstition because the author provides no proof that all of that lengthy argument with the model is necessary. Does replacing that lengthy text with "if you aren't sure of the answer say you don't know" have the same exact effect?

by v_CodeSentinal

6 subcomments

This is the classic 'plausible hallucination' problem. In my own testing with coding agents, we see this constantly—LLMs will invent a method that sounds correct but doesn't exist in the library.
The only fix is tight verification loops. You can't trust the generative step without a deterministic compilation/execution step immediately following it. The model needs to be punished/corrected by the environment, not just by the prompter.

by threethirtytwo

1 subcomments

You don’t need a test to know this we already know there’s heavy reinforcement training done on these models so it optimizes for passing the training. Passing the training means convincing the person rating the answers and that the answer is good.
The keyword is convince. So it just needs to convince people that’s it’s right.
It is optimizing for convincing people. Out of all answers that can convince people some can be actual correct answers, others can be wrong answers.

by comex

3 subcomments

I like how this article was itself clearly written with the help of an LLM.
(You can particularly tell from the "Conclusions" section. The formatting, where each list item starts with a few-word bolded summary, is already a strong hint, but the real issue is the repetitiveness of the list items. For bonus points there's a "not X, but Y", as well as a dash, albeit not an em dash.)

by benreesman

1 subcomments

They can all write lean4 now, don't accept numbers that don't carry proofs. The CAS I use for builds has a coeffect discharge cert in the attestation header, couple lines of code. Graded monads are a snap in CIC.

by zadwang

1 subcomments

The simpler and I think correct conclusion is that the LLM simply does not reason in our sense of the word. It mimics the reasoning pattern and try to get it right but could not.

by mlpoknbji

0 subcomment

This also can be observed with more advanced math proofs. ChatGPT 5.2 pro is the best public model at math at the moment, but if pushed out of its comfort zone will make simple (and hard to spot) errors like stating an inequality but then applying it in a later step with the inequality reversed (not justified).

by tombert

3 subcomments

I remember when ChatGPT first came out, I asked it for a proof for Fermat's Last Theorem, which it happily gave me.
It was fascinating, because it was doing a lot of understandable mistakes that 7th graders make. For example, I don't remember the surrounding context but it decided that you could break `sqrt(x^2 + y^2)` into `sqrt(x^2) + sqrt(y^2) => x + y`. It's interesting because it was one of those "ASSUME FALSE" proofs; if you can assume false, then mathematical proofs become considerably easier.

by bwfan123

0 subcomment

I am actually surprised that the LLM came so close. I doubt it had examples in its training set for these numbers. This goes to the heart of "know-how". The LLM should should have said: "I am not sure" but instead gets into rhetoric to justify itself. It actually mimics human behavior for motivated reasoning. At orgs, management is impressed with this overconfident motivated reasoner as it mirrors themselves. To hell with the facts, and the truth, persuation is all that matters.

by aniijbod

0 subcomment

In the theory of the psychology of creativity, there are phenomena which constitute distortions of the motivational setting for creative problem-solving which are referred to as 'extrinsic rewards'. Management theory bumped into this kind of phenomenon with the advent of the introduction of the first appearance of 'gamification' as a motivational toolkit, where 'scores' and 'badges' were awarded to participants in online activities. The psychological community reacted to this by pointing out that earlier research had shown that whilst extrinsics can indeed (at least initially) boost participation by introducing notions of competitiveness, it turned out that they were ultimately poor substitutes for the far more sustainable and productive intrinsic motivational factors, like curiosity, if it could be stimulated effectively (something which itself inevitably required more creativity on the part of the designer of the motivational resources). It seems that the motivational analogue in inference engines is an extrinsic reward process.

by godelski

1 subcomments

I thought it funny a few weeks ago Karpathy shared a sample od NanoBannana solving some physics problems but despite getting the right output it isn't get the right answers.
I think it's quite illustrative of the problem even with coding LLMs. Code and math proofs aren't so different, what matters is the steps to generate the output. All that matters far more than the actual output. The output is meaningless if the steps to get there aren't correct. You can't just jump to the last line of a proof to determine its correctness and similarly you can't just look at a program's output to determine its correctness.
Checking output is a great way to invalidate them but do nothing to validate.
Maybe what surprised me most is that the mistakes NanoBananna made are simple enough that I'm absolutely positive Karpathy could have caught them. Even if his physics is very rusty. I'm often left wondering if people really are true believers and becoming blind to the mistakes or if they don't care. It's fine to make mistakes but I rarely see corrections and let's be honest here, these are mistakes that people of this caliber should not be making.
I expect most people here can find multiple mistakes with the physics problem. One can be found if you know what the derivative of e^x is and another can be found if you can count how many i's there are.
The AI cheats because it's focused on the output, not the answer. We won't solve this problem till we recognize the output and answer aren't synonymous
https://xcancel.com/karpathy/status/1992655330002817095

by RugnirViking

0 subcomment

it seems to me like this is very much an artefact of the left-to-right top-down writing method of the program. Once its committed to a token earlier in its response it kinda just has to go with it. Thats why im so interested in those LLM models that work more like stable diffusion, where they can go back and iterate repeatedly on the output.

by James_K

0 subcomment

What's interesting about this is that a human would hypothetically produce a similar error, but in practice would reject the question as beyond their means. I'd assume something about supervised learning makes the models overestimate their abilities. It probably learns that “good” responses attempt to answer the question rather than giving up.

by citizenpaul

0 subcomment

>STEP 2: The Shock (Reality Check)
I've found a funny and simple technique for this. Just write "what the F$CK" and it will often seem to unstick from repetitiveness or refusals(i cant do that).
Actually just writing the word F#ck often will do it. Works on coding too.

by inimino

0 subcomment

So many words to say "it predicts the later tokens in light of those already emitted."

0 subcomment

by semessier

3 subcomments

that's not a proof

by segmondy

0 subcomment

if you want to do math proofs use AI built for proof
https://huggingface.co/deepseek-ai/DeepSeek-Math-V2
https://huggingface.co/deepseek-ai/DeepSeek-Prover-V2-671B

by fragmede

3 subcomments

> a session with Gemini 2.5 Pro (without Code Execution tools)
How good are you at programming on a whiteboard? How good is anybody? With code execution tools withheld from me, I'll freely admit that I'm pretty shit at programming. Hell, I barely remember the syntax in some of the more esoteric, unpracticed places of my knowledge. Thus, it's hard not to see case studies like this as dunking on a blindfolded free throw shooter, and calling it analysis.

by calhoun137

0 subcomment

My experience leads to the same conclusion that the models are very good at math reasoning, but you have to really know what you are doing and be aware of the blatant lies that result from poorly phrased queries.
I recently prompted Gemini Deep Research to “solve the Riemann Hypothesis” using a specific strategy and it just lied and fabricated the result of a theorem in its output, which otherwise looked very professional.

by drumnerd

0 subcomment

This is so obvious I am amazed it warrants a post.

by zkmon

2 subcomments

We are entering into a probabilistic era where things are not strictly black and white. Things are not binary. There is no absolute fake.
A mathematical proof is an assertion that a given statement belongs to the world defined by a set of axioms and existing proofs. This world need not have strict boundaries. Proofs can have probabilities. Maybe Reimann's hypothesis has a probability of 0.999 of belonging to that mathematical box. New proofs that would have their own probability which is a product of probabilities of the proofs they depend on. We should attach a probability and move on. Just like how we assert that some number is probably prime.

0 subcomment

by aathanor

0 subcomment

I’m not a coder, but I’ve been working extensively on the philosophical aspects of AI. Many technical people are influenced by an algorithmic view of intelligence, primarily because this aligns with programming and the general understanding of reasoning. However, pattern recognition, which is fundamental to LLMs, is not algorithmic. Consider this: a LLM constructs a virtual textual world where landscapes and objects are represented as text, and words are the building blocks of these features. It’s a vast 700+D mathematical space, but visualizing it as a virtual reality environment can help us comprehend its workings. When you provide a prompt, you essentially direct the LLM’s attention to a specific region within this space, where an immense number of sentences exist in various shapes and forms (textual shapes). All potential answers generated by the LLM are contained within this immediate landscape, centered around your prompt’s position. They are all readily available to the LLM at once.
There are certain methods (I would describe them as less algorithmic and more akin to selection criteria or boundaries) that enable the LLM to identify a coherent sequence of sentences as a feature closer to your prompt within this landscape. These methods involve some level of noise (temperature) and other factors. As a result, the LLM generates your text answer. There’s no reasoning involved; it’s simply searching for patterns that align with your prompt. (It’s not at all based on statistics and probabilities; it’s an entirely different process, more akin to instantly recognizing an apple, not by analyzing its features or comparing it to a statistical construct of “apple.”)
When you request a mathematical result, the LLM doesn’t engage in reasoning. It simply navigates to the point in its model’s hyperspace where your prompt takes it and explores the surrounding area. Given the extensive amount of training text, it will immediately match your problem formulation with similar formulations, providing an answer that appears to mimic reasoning solely because the existing landscape around your prompt facilitates this.
A LLM operates more like a virtual reality environment for the entire body of human-created text. It doesn’t navigate the space independently; it merely renders what exists in different locations within it. If we were to label this as reasoning, it’s no more than reasoning by analogy or imitation. People are right to suspect LLMs do not reason, but I think the reason (pun intended) for that is not that they simply do some sort of statistical analysis. This "stochastic parrots" paradigm supported by Chomsky is actually blocking our understanding of LLMs. I also think that seeing them as formidable VR engines for textual knowledge clarifies why they are not the path to AGI. (There is also the embodiment problem which is not solvable by adding sensors and actuators, as people think, but for a different reason)

by marsven_422

0 subcomment

[dead]

by rakmo

1 subcomments

Is this hallucination, or is this actually quite human (albeit a specific type of human)? Think of slimy caricatures like a used car salesman, isn't this the exact type of underhandedness you'd expect?