The paper was https://openreview.net/forum?id=0ZnXGzLcOg and the problem flagged was "Two authors are omitted and one (Kyle Richardson) is added. This paper was published at ICLR 2024." I.e., for one cited paper, the author list was off and the venue was wrong. The citation appeared in the background section and was not fundamental to the paper's validity. So the citation was not fabricated, but it was incorrectly attributed (perhaps via an AI autocomplete).
I think there are some egregious papers in their dataset, and this error does make me pause to wonder how much of the rest of the paper used AI assistance. That said, the "single error" papers in the dataset seem similar to the one I checked: relatively harmless and minor errors (which would be immediately caught by a DOI checker), and so I have to assume some of these were included in the dataset mainly to amplify the author's product pitch. It succeeded.
There is already a problem with papers falsifying data, samples, etc.; LLMs being able to put out plausible papers is just going to make it worse.
On the bright side, maybe this will get the scientific community and science journalists to finally take reproducibility more seriously. I'd love to see future reporting that instead of saying "Research finds amazing chemical x which does y" you see "Researcher reproduces amazing results for chemical x which does y. First discovered by z".
> When reached for comment, the NeurIPS board shared the following statement: “The usage of LLMs in papers at AI conferences is rapidly evolving, and NeurIPS is actively monitoring developments. In previous years, we piloted policies regarding the use of LLMs, and in 2025, reviewers were instructed to flag hallucinations. Regarding the findings of this specific work, we emphasize that significantly more effort is required to determine the implications. Even if 1.1% of the papers have one or more incorrect references due to the use of LLMs, the content of the papers themselves are not necessarily invalidated. For example, authors may have given an LLM a partial description of a citation and asked the LLM to produce bibtex (a formatted reference). As always, NeurIPS is committed to evolving the review and authorship process to best ensure scientific rigor and to identify ways that LLMs can be used to enhance author and reviewer capabilities.”
1. Doxxing disguised as specific criticism: Publishing the names of authors and papers without prior private notification or independent verification is not how academic corrections work. It looks like a marketing stunt to generate buzz at the expense of researchers' reputations.
2. False Positives & Methodology: How does their tool distinguish between an actual AI "hallucination" and a simple human error (e.g., a typo in a year, a broken link, or a messy BibTeX entry)? Labeling human carelessness as "AI fabrication" is libelous.
3. The "Protection Racket" Vibe: The underlying message seems to be: "Buy our tool, or next time you might be on this list." It’s creating a problem (fear of public shaming) to sell the solution.
We should be extremely skeptical of a vendor using a prestigious conference as a billboard for their product by essentially publicly shaming participants without due process.
(If you're qualified to review papers, please email the program chair of your favorite conference and let them know -- they really need the help!)
As for my review, the review form has a textbox for a summary, a textbox for strengths, a textbox for weaknesses, and a textbox for overall thoughts. The review I received included one complete set of summary/strengths/weaknesses/closing thoughts in the summary textbox, another distinct set of summary/strengths/weaknesses/closing thoughts in the strengths, another complete and distinct review in the weaknesses, and a fourth complete review in the closing thoughts. Each of these four reviews was slightly different and contradicted the others.
The reviewer put my paper down as a weak reject, but also said "the pros greatly outweigh the cons."
They listed "innovative use of synthetic data" as a strength, and "reliance on synthetic data" as a weakness.
By using an LLM to fabricate citations, authors are moving away from this noble pursuit of knowledge built on the "shoulders of giants" and showing that, behind the curtain, output volume is what really matters in modern US research communities.
Most big tech PhD intern job postings have NeurIPS/ICML/ICLR/etc. first author paper as a de facto requirement to be considered. It's like getting your SAG card.
If you get one of these internships, it effectively doubles or triples your salary that year right away. You will make more in that summer than your PhD stipend. Plus you can now apply in future summers and the jobs will be easier to get. And it sets your career on a good path.
A conservative estimate of the discounted cash value of a student's first NeurIPS paper would certainly be five figures. It's potentially much higher depending on how you think about it, considering potential path dependent impacts on future career opportunities.
We should not be surprised to see cheating. Nonetheless, it's really bad for science that these attempts get through. I also expect some people did make legitimate mistakes letting AI touch their .bib.
If we grant that good carrots are hard to grow, what's the argument against leaning into the stick? Change university policies and processes so that getting caught fabricating data or submitting a paper with LLM hallucinations is a career ending event. Tip the expected value of unethical behaviours in favour of avoiding them. Maybe we can't change the odds of getting caught but we certainly can change the impact.
This would not be easy, but maybe it's more tractable than changing positive incentives.
>GPTZero finds 100 new hallucinations in NeurIPS 2025 accepted papers
And I'm left wondering if they mean 100 papers or 100 hallucinations
The subheading says
>GPTZero's analysis 4841 papers accepted by NeurIPS 2025 show there are at least 100 with confirmed hallucinations
Which accidentally a word, but seems to clarify that they do legitimately mean 100 papers.
A later heading says
>Table of 100 Hallucinated Citations in Published Across 53 NeurIPS Papers
Which suggests either the opposite, or that they chose a subset of their findings to point out a coincidentally similar number of incidents.
How many papers did they find hallucinations in? I'm still not certain. Is it 100, 53, or some other number altogether? Does their quality of scrutiny match the quality of their communication? If they did in fact find 100 hallucinations in 53 papers, would the inconsistency with their own claim that "papers accepted by NeurIPS 2025 show there are at least 100 with confirmed hallucinations" meet their own bar for a hallucination?
GPTZero of course knows this. "100 hallucinations across 53 papers at prestigious conference" hits different than "0.07% of citations had issues, compared to unknown baseline, in papers whose actual findings remain valid."
I guess GPTZero has such a tool. I'm confused why it isn't used more widely by paper authors and reviewers
When training a student, normally we expect a lack of knowledge early, and reward self-awareness, self-evaluation and self-disclosure of that.
But in the very first epoch of a model training run, when the model has all the ignorance of a dropped plate of spaghetti, we optimize the network to respond to information as anything from a typical human to an expert, without any base of understanding.
So the training practice for models is inherently extreme enforced “fake it until you make it”, to a degree far beyond any human context or culture.
(Regardless, humans need to verify, not to mention read, the sources they cite. But it will be nice when models can be trusted to accurately assess what they know/don't-know too.)
a) p-hacking and suppressing null results
b) hallucinations
c) falsifying data
Would be cool to see an analysis of this
Not great, but to be clear this is different from fabricating the whole paper or the authors inventing the citations. (In this case at least.)
Also: there were 15,000 submissions that were rejected at NeurIPS; it would be very interesting to see what % of those rejected were partially or fully AI generated/hallucinated. Are the ratios comparable?
I'm sure plenty of more nuanced facts are also entirely without basis.
At work I've built tools that automatically write technical certificates for wind parks.
I've written code automatically to solve problems I couldn't solve on my own. Complicated linear algebra stuff, which was always too hard for me.
I should have written my papers automatically too; at least my wife already writes her reports with ChatGPT.
Others are writing film scripts with these tools.
Good times.
Should be extremely easy for AI to successfully detect hallucinated references as they are semi-structured data with an easily verifiable ground truth.
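For what it's worth, that kind of check doesn't even need an LLM. Here's a minimal sketch against the public Crossref API; the title-similarity threshold and the author check are purely illustrative, and anything it flags should go to a human for verification rather than be treated as an automatic accusation:

```python
# Minimal sketch: verify a cited title/authors against the public Crossref API.
# The 0.9 similarity threshold is illustrative, not a calibrated cutoff.
import requests
from difflib import SequenceMatcher

def crossref_lookup(title: str) -> dict | None:
    """Return the best Crossref match for a cited title, or None."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": 1},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json()["message"]["items"]
    return items[0] if items else None

def looks_hallucinated(cited_title: str, cited_surnames: set[str]) -> bool:
    """Flag a citation whose title or authors don't match any indexed work."""
    match = crossref_lookup(cited_title)
    if match is None:
        return True
    matched_title = (match.get("title") or [""])[0]
    title_sim = SequenceMatcher(None, cited_title.lower(), matched_title.lower()).ratio()
    matched_surnames = {a.get("family", "").lower() for a in match.get("author", [])}
    authors_ok = bool(cited_surnames & matched_surnames) if cited_surnames else True
    return title_sim < 0.9 or not authors_ok

# Example usage: flag for human review, don't auto-accuse.
# print(looks_hallucinated("Sharpness-Aware Minimization for Efficiently Improving Generalization",
#                          {"foret"}))
```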
If I drop a loaded gun and it fires, killing someone, we don't go after the gun's manufacturer in most cases.
However, we’ll be left with AI written papers and no real way to determine if they’re based on reality or just a “stochastic mirror” (an approximate reflection of reality).
AI Overview: Based on the research, [Chen and N. Flammarion (2022)](https://gptzero.me/news/neurips/) investigate why Sharpness-Aware Minimization (SAM) generalizes better than SGD, focusing on optimization perspectives
The link is a link to the OP web page calling the "research" a hallucination.
But here's the thing: let's say you're an university or a research institution that wants to curtail it. You catch someone producing LLM slop, and you confirm it by analyzing their work and conducting internal interviews. You fire them. The fired researcher goes public saying that they were doing nothing of the sort and that this is a witch hunt. Their blog post makes it to the front page of HN, garnering tons of sympathy and prompting many angry calls to their ex-employer. It gets picked up by some mainstream outlets, too. It happened a bunch of times.
In contrast, there are basically no consequences to institutions that let it slide. No one is angrily calling the employers of the authors of these 100 NeurIPS papers, right? If anything, there's the plausible deniability of "oh, I only asked ChatGPT to reformat the citations, the rest of the paper is 100% legit, my bad".
I even know PIs who got fame and funding based on some research direction that was supposedly going to be revolutionary. Except all they had were preliminary results in which, if you squint at them from the right angle, you can envision some good result. But then the result never comes. That's why I say, "fake it, and never make it".
The best possible outcome is that these two purposes are disconflated, with follow-on consequences for the conferences and journals.
These clearly aren't being peer-reviewed, so there's no natural check on LLM usage (which is different than what we see in work published in journals).
Better detectors, like the article implies, won’t solve the problem, since AI will likely keep improving
It’s about the fact that our publishing workflows implicitly assume good faith manual verification, even as submission volume and AI assisted writing explode. That assumption just doesn’t hold anymore
A student initiative at Duke University has been working on what it might look like to address this at the publishing layer itself, by making references, review labor, and accountability explicit rather than implicit
There’s a short explainer video for their system: https://liberata.info/
It’s hard to argue that the current status quo will scale, so we need novel solutions like this.
The problem is consequences (lack of).
Doing this should get you barred from research. It won’t.
This says just as much about the humans involved.
But I saw it in Apple News, so MISSION ACCOMPLISHED!
220 is actually quite the deal. In fact, heavy usage means Anthropic loses money on you. Do you have any idea how much compute it costs to offer these kinds of services?
As we get more and more papers that may be citing information that was originally hallucinated, we have a major reliability issue here. What is worse, people who did not use AI in the first place will be caught in the crossfire, since they will end up referencing incorrect information.
There needs to be a serious amount of education done on what these tools can and cannot do and, importantly, where they fail. Too many people see these tools as magic, since that is what the big companies are pushing them as.
Other than that, we need to put in actual repercussions for publishing work created by an LLM without validating it (or just say you can't in the first place, but I guess that ship has sailed), or it will just keep happening. We can't just ignore it and hope it won't be a problem.
And yes, humans can make mistakes too. The difference is accountability and the ability to actually be unsure about something, so that you question yourself and validate.
When a reviewer is outgunned by the volume of generative slop, the structure of peer review collapses because it was designed for human-to-human accountability, not for verifying high-speed statistical mimicry. In these papers, the hallucinations are a dead giveaway of a total decoupling of intelligence from any underlying "self" or presence. The machine calculates a plausible-looking citation, and an exhausted reviewer fails to notice the "Soul" of the research is missing.
It feels like we’re entering a loop where the simulation is validated by the system, which then becomes the training data for the next generation of simulation. At that point, the human element of research isn't just obscured—it's rendered computationally irrelevant.
If we go back to Google, before its transformation into an AI powerhouse — as it gutted its own SERPs, shoving traditional blue links below AI-generated overlords that synthesize answers from the web’s underbelly, often leaving publishers starving for clicks in a zero-click apocalypse — what was happening?
The same kind of human “evaluators” were ranking pages. Pushing garbage forward. The same thing is happening with AI. As much as the human "evaluators" trained search engines to elevate clickbait, the very same humans now train large language models to mimic the judgment of those very same evaluators. A feedback loop of mediocrity — supervised by the... well, not the best among us. The machines still, as Stephen Wolfram wrote, for any given sequence, use the same probability method (e.g., “The cat sat on the...”), in which the model doesn’t just pick one word. It calculates a probability score for every single word in its vast vocabulary (e.g., “mat” = 40% chance, “floor” = 15%, “car” = 0.01%), and voilà! — you have a “creative” text: one of a gazillion mindlessly produced, soulless, garbage “vile bile” sludge emissions that pollute our collective brains and render us a bunch of idiots, ready to swallow any corporate poison sent our way.
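To make that probability step concrete, here is a toy sketch that samples a continuation of "The cat sat on the ..." from a hand-made distribution; the words and numbers are the illustrative figures from the comment above, not a real model's vocabulary or output:

```python
# Toy sketch of the next-token step: assign a probability to every word in a
# (tiny, made-up) vocabulary, then sample one continuation.
import random

def next_token(probs: dict[str, float]) -> str:
    """Sample one continuation from a toy next-token distribution."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return random.choices(words, weights=weights, k=1)[0]

toy_distribution = {"mat": 0.40, "floor": 0.15, "sofa": 0.30, "moon": 0.1499, "car": 0.0001}
print(next_token(toy_distribution))  # usually "mat" or "sofa"; very rarely "car"
```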
In my opinion, even worse: the corporates are pushing toward “safety” (likely from lawsuits), and the AI systems are trained to sell, soothe, and please — not to think, or enhance our collective experience.
No one cares about citations. They are hallucinated because they are required to be present for political reasons, even though they have no relevance.
This has almost nothing to do with AI, and everything to do with a journal not putting in the trivial effort (given how much it costs to get published by them) required to ensure subject integrity. Yeah, AI is the new garbage generator, but this problem isn't new; citation verification has been part of review ever since citations became a thing.
This would be a valuable research tool that uses AI without the hallucinations.
Many such cases of this. More than 100!
They claim to have custom detection for GPT-5, Gemini, and Claude. They're making that up!
Just ask authors to submit their bib file so we don't need to do OCR on the PDF. Flag the unknown citations and ask reviewers to verify their existence. Then contact authors and ban if they can't produce the cited work.
This is low hanging fruit here!
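A minimal sketch of that check, assuming authors upload their refs.bib alongside the PDF; it uses bibtexparser's v1 API and the doi.org resolver, and the flagging rules are illustrative (entries without a DOI would still need a reviewer to verify the title and venue by hand):

```python
# Sketch: take a submitted .bib file and flag entries a reviewer should verify.
# Assumes bibtexparser's v1 API (pip install bibtexparser requests).
import bibtexparser
import requests

def flag_suspect_entries(bib_text: str) -> list[tuple[str, str]]:
    """Return (citation key, reason) pairs for entries that need manual checking."""
    flagged = []
    for entry in bibtexparser.loads(bib_text).entries:
        key = entry.get("ID", "<no key>")
        doi = entry.get("doi")
        if not doi:
            flagged.append((key, "no DOI; verify title/venue by hand"))
            continue
        # A registered DOI redirects (3xx) from doi.org; an invented one returns 404.
        resp = requests.head(f"https://doi.org/{doi}", allow_redirects=False, timeout=10)
        if resp.status_code == 404:
            flagged.append((key, f"DOI {doi} does not resolve"))
    return flagged

if __name__ == "__main__":
    with open("refs.bib") as f:
        for key, reason in flag_suspect_entries(f.read()):
            print(key, "->", reason)
```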
Detecting slop where the authors vet citations is much harder. The big problem with all the review rules is they have no teeth. If it were up to me we'd review in the open, or at least like ICLR. Publish the list of known bad actors and let us look at the network. The current system is too protective of egregious errors like plagiarism. Authors can get detected in one conference, pull the paper, and submit to another, rolling the dice. We can't allow that to happen, and we should discourage people from associating with these con artists.
AI is certainly a problem in the world of science review, but it's far from the only one, and I'm not even convinced it's the biggest. The biggest is just that reviewers are lazy and/or not qualified to review the works they're assigned. It takes at least an hour to properly review a paper in your niche, and much more when it's outside it. We're overworked as it is, with 5+ works to review, not to mention all the time we have to spend reworking our own works that were rejected by the slot machine. We could do much better if we dropped this notion of conference/journal prestige and focused on the quality of the works and reviews.
Addressing those issues also addresses the AI issues because, frankly, *it doesn't matter if the whole work was done by AI; what matters is whether the work is real.*