Hacker News

HTML as an Accessible Format for Papers (2023)

260 points by el3ctron

by dginev

1 subcomments

Hi, an arXiv HTML Papers developer here.
As a very brief update: a larger update is pending.
You will spot many (many) issues with our current coverage and fidelity of the paper rendering. When they jump out at you, please report them to us. All reports from the last 2 years have landed on GitHub. We have made a bit of progress since, but there is (a lot) more low-hanging fruit to pick.
Project issues:
https://github.com/arXiv/html_feedback/issues/
The main bottleneck at the moment is developer time. And the main vehicle for improvements on the LaTeX side of things continues to be LaTeXML. Happy to field any questions.

by RandyOrion

1 subcomments

For arXiv papers, I prefer HTML format much more than PDF format.
Compared to PDF format, HTML format is much more accessible because of browsers. Basically I can reuse my browser extensions to do anything I like without hassle, like translation, note taking, sending texts to LLMs, and so on.
For now, arXiv offers two HTML services: the default one at https://arxiv.org/html/xxxx.xxxxx , and the alternative one at https://ar5iv.labs.arxiv.org/html/xxxx.xxxxx , where each 'x' is a placeholder for a digit.
The most glaring problem with the default HTML service is its coverage of papers. Sometimes it just doesn't work, e.g., https://arxiv.org/html/2505.06708 . The solution may be to switch to the alternative HTML service, e.g., https://ar5iv.labs.arxiv.org/html/2505.06708 .
Note that the alternative HTML service also has a coverage problem. Sometimes both HTML services fail, e.g., https://arxiv.org/abs/2511.22625 .
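The fallback order described above can be sketched in a few lines of Python. This is only an illustration: the function name `html_candidates` and the identifier pattern are assumptions made for the example, and the code builds the URLs without checking whether either service actually has the paper.

```python
import re

# Assumption: a new-style arXiv identifier looks like "2505.06708",
# optionally followed by a version suffix such as "v2".
_ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def html_candidates(arxiv_id: str) -> list[str]:
    """Return the URLs to try for one paper, in order of preference."""
    if not _ARXIV_ID.match(arxiv_id):
        raise ValueError(f"not a new-style arXiv identifier: {arxiv_id!r}")
    return [
        f"https://arxiv.org/html/{arxiv_id}",             # default HTML service
        f"https://ar5iv.labs.arxiv.org/html/{arxiv_id}",  # alternative HTML service
        f"https://arxiv.org/abs/{arxiv_id}",              # abstract page, last resort
    ]


for url in html_candidates("2505.06708"):
    print(url)
```

A caller would request each URL in turn and stop at the first one that returns a usable page; that network step is left out here so the sketch stays self-contained.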

by ComputerGuru

6 subcomments

If the Unicode consortium would spend less time and effort on emoji and more on making the most common/important mathematical symbols and notations available/renderable in plain text, maybe we could move past the (LA)TeX/PDF marriage. OpenType and TrueType now (edit: for well over a decade, actually) support the necessary conditional rendering required to perform complicated rendering operations to get sequences of Unicode code points to display in the way needed (theoretically, anyway) and with fallback missing-glyph-only font family substitution support available pretty much everywhere allowing you to seamlessly display symbols not in your primary font from a fallback asset (something like Noto, with every Unicode symbol supported by design, or math-specific fonts like Cambria Math or TeX Gyre, etc), there are no technical restrictions.
I’ve actually dug into this in the past and it was never a lack of technical ability that prevented them from even adding just proper superscript/subscript support before, but rather their opinion that this didn’t belong in the symbolic layer. But since emoji abuse/rely on ZWJ and modifiers left and right to display in one of a myriad of variations, there’s really no good reason not to allow the same, because 2 and the squared symbol are not semantically the same (so it’s not a design choice).
An interesting (complete) tangent is that Gemini 3 Pro is the only model I’ve tested (I do a lot of math-related stuff with LLMs) that absolutely will not under any circumstances respect (system/user) prompt requests to avoid inline math mode (aka LaTeX) in the output, regardless of whether I asked for a blanket ban on TeX/MathJax/etc or insisted that it use extended Unicode code points to substitute all math formula rendering (I primarily use LLMs via the TUI where I don’t have MathJax support, and as familiar as I once was with raw TeX mathematical notations and symbols, it’s still quite easy to misread unrendered raw output by missing something if you’re not careful). I shared my experiment and results here – Gemini 3 Pro would insist on even rendering single letter constants or variables as $k$ instead of just k (or k in markdown italics, etc) no matter how hard I asked it not to (which makes me think it may have been overfit against raw LaTeX papers, and is also an interesting argument in favor of the “VL LLMs are the more natural construct”): https://x.com/NeoSmart/status/1995582721327071367?s=20
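To make the superscript point concrete, here is a small Python sketch of what plain-text Unicode offers today. The helper name `to_superscript` is made up for the example. It relies only on the dedicated superscript code points that exist for the ten digits, and it shows that compatibility normalization (NFKC) folds a superscript two back into an ordinary 2.

```python
import unicodedata

# Dedicated superscript code points exist for the ten ASCII digits.
SUPERSCRIPT_DIGITS = str.maketrans("0123456789", "⁰¹²³⁴⁵⁶⁷⁸⁹")


def to_superscript(digits: str) -> str:
    """Map a string of ASCII digits to the Unicode superscript forms."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("this sketch only maps ASCII digits")
    return digits.translate(SUPERSCRIPT_DIGITS)


squared = "x" + to_superscript("2")
print(squared)

# NFKC treats the superscript as a mere formatting variant of "2",
# so "x squared" and "x2" become indistinguishable after normalization.
print(unicodedata.normalize("NFKC", squared))
```

Anything beyond this (nested exponents, arbitrary letters, fractions) has no such one-to-one mapping, which is where TeX-style markup still takes over.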

by ForceBru

2 subcomments

Is this new or somehow updated? HTML versions of papers have been available for several years now.
EDIT: indeed, it was introduced in 2023: https://blog.arxiv.org/2023/12/21/accessibility-update-arxiv...

by DominikPeters

1 subcomments

As an arXiv author who likes using complicated TeX constructions, the introduction of HTML conversion has increased my workload a lot trying to write fallback macros that render okay after conversion. The conversion is super slow and there is no way to faithfully simulate it locally. Still I think it's a great thing to do.

by ekjhgkejhgk

3 subcomments

I wish EPUB was more common for papers. I have no idea if there are any real difficulties with that, or just not enough demand.

by el3ctron

1 subcomments

Accessibility barriers in research are not new, but they are urgent. The message we have heard from our community is that arXiv can have the most impact in the shortest time by offering HTML papers alongside the existing PDF.

by Barbing

0 subcomment

>Did you know that 90% of submissions to arXiv are in TeX format, mostly LaTeX? That poses a unique accessibility challenge: to accurately convert from TeX—a very extensible language used in myriad unique ways by authors—to HTML, a language that is much more accessible to screen readers and text-to-speech software, screen magnifiers, and mobile devices.
Challenging. Good work!

by leobg

0 subcomment

It must have been around 1998. I was editor of our school’s newspaper. We were using Corel Draw. At some point, I proposed that we start using HTML instead. In the end, we decided against it, and the reasons were the same that you can read here in the comments now.

by zipy124

0 subcomment

The biggest issue with papers for me today is that they don't allow videos as anything other than supplemental materials to be downloaded, or a link to a web page that has them. I want to embed GIFs or videos in my papers directly!

by percentcer

2 subcomments

Dumb question, but what stops browsers from rendering TeX directly (aside from the work to implement it)? I assume it's more than just the rendering.

by sundarurfriend

0 subcomment

[Sept 2023] as per the Wayback Machine.

by sega_sai

1 subcomments

Unfortunately I didn't see the recommendation there on what can be done for old papers. I checked, and only my papers after 2022 have an HTML version. I wish they'd make some kind of 'try html' button for those.

by notorandit

0 subcomment

The problem is the viewer, not the format. We are talking about accessibility and scientific papers, where fancy animations and transitions are not core features.
LaTeX and TeX are the de facto standard in this context, and converting all existing documents is a lot of work and energy to spend for little gain, if any.

by jas39

1 subcomments

Pandoc can convert to SVG. It can then be inlined in HTML. Looks just like LaTeX, though copy/paste isn't very useful.

by nateroling

5 subcomments

Seeing the Gemini 3 capabilities, I can imagine a near future where file formats are effectively irrelevant.

by ashleyn

1 subcomments

Can't help but wonder if this was motivated in part by people feeding papers into LLMs for summary, search, or review. PDF is awful for LLMs. You're effectively pigeonholed into using (PAYING for) Adobe's proprietary app and models which barely hold a candle to Gemini or Claude. There are PDF-to-text converters, but they often munge up the formatting.

by billconan

5 subcomments

I don't think HTML is the right approach. HTML is better than PDF, but it is still a format for displaying/rendering.
The actual paper content format should be separated from its rendering.
I.e., it should contain the abstract, sections, equations, figures, citations, etc., but it shouldn't have font sizes, layout, etc.
The viewer platforms should then be able to style the content differently.
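A rough Python sketch of that separation, under loud assumptions: the `Paper` and `Section` classes and both renderers are invented for illustration and do not correspond to any existing paper format. The content model carries no fonts or layout, and each viewer decides how to present it.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass
class Section:
    title: str
    body: str


@dataclass
class Paper:
    # Content only: no font sizes, no layout.
    title: str
    abstract: str
    sections: list[Section] = field(default_factory=list)


def render_html(paper: Paper) -> str:
    """One possible viewer: emit bare HTML and leave styling to CSS."""
    parts = [f"<h1>{escape(paper.title)}</h1>", f"<p>{escape(paper.abstract)}</p>"]
    for section in paper.sections:
        parts.append(f"<h2>{escape(section.title)}</h2>")
        parts.append(f"<p>{escape(section.body)}</p>")
    return "\n".join(parts)


def render_text(paper: Paper) -> str:
    """Another viewer: plain text, e.g. for a terminal or a screen reader."""
    lines = [paper.title.upper(), "", paper.abstract]
    for number, section in enumerate(paper.sections, start=1):
        lines += ["", f"{number}. {section.title}", section.body]
    return "\n".join(lines)


paper = Paper(
    title="A & B",
    abstract="We compare A and B.",
    sections=[Section("Method", "We measure both.")],
)
print(render_html(paper))
print(render_text(paper))
```

The same `Paper` object feeds both renderers, so adding a third output (say, a PDF backend) would not require touching the content.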

by chr15m

0 subcomment

Wish I could upvote this harder. Thank you arXiv!

by constantcrying

0 subcomment

Reading this thread, many people do not seem to understand what the problem even is. What researchers writing papers want is a low-effort, high-flexibility way to write documents (nobody wants to write their paper in HTML). For a paper to be printed it needs to be in some printable format, like PDF. To provide accessibility and accommodate the changing ways papers are read, which is increasingly online, HTML is also a desirable output.
What is really needed is a markup language that can natively target both PDF and HTML. This is something Typst is working on, but I am not aware of any other project that either comes close to the features of LaTeX or supports both target formats.
To me this is the only reasonable way to address the accessibility and usability issues around papers. Have one markup, with sufficient accessibility features, which simultaneously targets HTML and PDF.

by teddy-smith

6 subcomments

It's extremely easy to convert HTML/CSS to a PDF with the print to PDF feature of the browser.
All papers should be in HTML/CSS or TeX, then simply converted to PDF.
Why are we even talking about this?

by _dain_

1 subcomments

Wasn't the World Wide Web invented at CERN specifically for sharing scientific papers? Why are we still using PDFs at all?

by cubefox

0 subcomment

This is not new; the title should say (2023). They have shipped the HTML feature with an "experimental" flag for two years now, but I don't know whether there is even any plan to move out of the experimental phase.
It's not much of an "experiment" if you don't plan to use some experimental data to improve things somehow.

by lalithaar

0 subcomment

I was reading through this article too, glad to have found it on here

by rootnod3

2 subcomments

Maybe unpopular, but papers should be in a Markdown flavor, to be determined. Just to have them more machine readable.

by vatsachak

4 subcomments

Why do we like HTML more than PDFs?
HTML rendering requires you to be connected to the internet, or to set up the images and MathJax locally. A PDF just works.
HTML obviously supports dynamic embedding, such as programs, much better, but people usually just post a github.io page with the paper.