Traditionally, sophisticated writing correlated with higher-quality research (or at least higher status/effort). This paper argues that post-LLM, we are seeing a flood of manuscripts that use complex, polished language but make substantively weaker scientific contributions.
They claim LLM adoption increases output by up to 89%, which is a massive productivity shock. If the cost of generating looks-like-science prose drops to near zero, the signal-to-noise ratio in peer review is going to crash. We are entering the era of the polished turd, and likely an even worse case of publish-and-perish [0].
Does anyone know of any writing on the network effects of the publishing system? What would happen if the actual value of the journals (what little they provide!) were to go away?
The death of scientific Twitter, and the failure to establish any successor, makes me worry that we won't be able to coalesce around a replacement system. Obviously preprints play a role, but we really need our scientific communities to engage with them more seriously.
We will have to find better ways to share and promote valuable research, before we all drown in the noise.
- For LLM-assisted output, the more complex the LLM writing, the less likely the paper is to be published. Eyeballing the figure: at WC=-30, both groups have similar chances of publication (~46%); at the upper end, WC=25, LLM-assisted papers are ~17% less likely to be published.
- LLM-assisted authors produced more preprints (+36%).
I wonder:
- What is the distribution of writing complexity?
- If the 36% excess of LLM-assisted preprints is concentrated at high complexity, does the ~17% publication deficit at WC=25 cancel out the excess, nullifying the net effect on published papers? (Rough check in the sketch below.) Even if it does, it still puts extra strain on the review process.
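A back-of-envelope sketch in Python, taking the eyeballed figures above at face value. Everything here is an assumption for illustration: output is normalized to 100 baseline preprints, the ~17% deficit is treated as absolute percentage points, and (worst case for LLM-assisted authors) every excess preprint is assumed to sit at WC=25:

```python
# Back-of-envelope: can the publication deficit at high writing
# complexity (WC) nullify the +36% excess in LLM-assisted preprints?
# All numbers are eyeballed from the paper's figure; the normalization
# to 100 baseline preprints is an assumption for illustration.

baseline_preprints = 100.0                 # non-LLM output, normalized
llm_preprints = 1.36 * baseline_preprints  # +36% more preprints

p_pub = 0.46                   # ~46% publication rate at WC=-30
p_pub_high_wc = p_pub - 0.17   # assuming the ~17% deficit is in
                               # absolute percentage points

# Worst case: every excess LLM-assisted preprint lands at the
# high-complexity end (WC=25); the rest publish at the base rate.
published_baseline = baseline_preprints * p_pub
published_llm = (baseline_preprints * p_pub
                 + (llm_preprints - baseline_preprints) * p_pub_high_wc)

print(f"baseline published:     {published_baseline:.1f}")  # 46.0
print(f"LLM-assisted published: {published_llm:.1f}")       # 56.4
```

Even under that worst-case assumption, the deficit doesn't nullify the excess: LLM-assisted authors still net ~10 extra published papers per 100 baseline preprints, while all 36 extra submissions hit the review pipeline anyway.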