- The FASTA format looks like:
    > title
    bases with optional newlines
    > title
    bases with optional newlines
    ...
The author is talking about removing the non-semantic optional newlines (hard wrapping), not all the newlines in the file. It makes a lot of sense that this would work: bacteria have many subsequences in common, but if you insert non-semantic newlines at effectively random offsets then compression tools will not be able to use the repetition effectively.
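A minimal sketch of that newline-stripping preprocessing (treating every non-header line as wrapped sequence, which is the usual FASTA layout; the script and file names are illustrative):

    import sys

    def unwrap_fasta(lines):
        # Join hard-wrapped sequence lines so each record's bases sit on one line.
        bases = []
        for line in lines:
            line = line.rstrip("\n")
            if line.startswith(">"):      # header: flush any buffered sequence first
                if bases:
                    yield "".join(bases) + "\n"
                    bases = []
                yield line + "\n"
            else:                         # wrapped sequence fragment: buffer it
                bases.append(line)
        if bases:
            yield "".join(bases) + "\n"

    # usage: python unwrap.py < wrapped.fa > unwrapped.fa
    if __name__ == "__main__":
        sys.stdout.writelines(unwrap_fasta(sys.stdin))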
by felixhandte
1 subcomments
- This is because Zstd's long-distance matcher looks for matching sequences of 64 bytes [0]. Because long matching sequences of the data will likely have the newlines inserted at different offsets in the run, this totally breaks Zstd's ability to find the long-distance match.
Ultimately, Zstd is a byte-oriented compressor that doesn't understand the semantics of the data it compresses. Improvements are certainly possible if you can recognize and separate that framing to recover a contiguous view of the underlying data.
[0] https://github.com/facebook/zstd/blob/v1.5.7/lib/compress/zs...
(I am one of the maintainers of Zstd.)
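If you want to flip the long-distance matcher on from code rather than the CLI, a rough sketch with the python-zstandard bindings (the parameter values and file names are placeholders):

    import zstandard

    # Roughly the library-level equivalent of `zstd -19 --long=27`:
    # a 128 MiB window with long-distance matching enabled.
    params = zstandard.ZstdCompressionParameters(
        compression_level=19,
        window_log=27,
        enable_ldm=True,
    )
    cctx = zstandard.ZstdCompressor(compression_params=params)

    with open("genomes.fa", "rb") as src, open("genomes.fa.zst", "wb") as dst:
        cctx.copy_stream(src, dst)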
Using larger-than-default window sizes has the drawback of requiring that the same --long=xx argument be passed during decompression, reducing compatibility somewhat.
Interesting. Any idea why this can't be stored in the metadata of the compressed file?
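As far as I understand, the window size itself is recorded in the frame header; the decompression-side flag exists to lift the decoder's default memory-safety limit, which is why it has to be repeated. The round-trip the parent describes, driving the CLI from Python with made-up file names:

    import subprocess

    # Compress with a 1 GiB window (2**30 bytes); plain `zstd -d` will refuse
    # such a frame unless the window/memory limit is raised the same way.
    subprocess.run(["zstd", "-19", "--long=30", "genomes.fa", "-o", "genomes.fa.zst"], check=True)
    subprocess.run(["zstd", "-d", "--long=30", "genomes.fa.zst", "-o", "roundtrip.fa"], check=True)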
by ashvardanian
2 subcomments
- Nice observation!
Took me a while to realize that Grace Blackwell refers to a person and not an Nvidia chip :)
I’ve worked with large genomic datasets on my own dime, and the default formats show their limits quickly. With FASTA, the first step for me is usually conversion: unzip headers from sequences, store them in Arrow-like tapes for CPU/GPU processing, and persist as Parquet when needed. It’s straightforward, but surprisingly underused in bioinformatics — most pipelines stick to plain text even when modern data tooling would make things much easier :(
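A rough sketch of that kind of conversion with pyarrow (file names, column names and the codec choice are illustrative, not the exact pipeline described above):

    import pyarrow as pa
    import pyarrow.parquet as pq

    def fasta_records(path):
        # Yield (header, sequence) pairs, joining hard-wrapped sequence lines.
        header, chunks = None, []
        with open(path) as fh:
            for line in fh:
                line = line.rstrip("\n")
                if line.startswith(">"):
                    if header is not None:
                        yield header, "".join(chunks)
                    header, chunks = line[1:], []
                else:
                    chunks.append(line)
            if header is not None:
                yield header, "".join(chunks)

    headers, seqs = zip(*fasta_records("genomes.fa"))
    table = pa.table({"header": list(headers), "sequence": list(seqs)})
    pq.write_table(table, "genomes.parquet", compression="zstd")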
- I've also noticed this. Zstandard doesn't pick up on some very common patterns
For me it was an increasing number (think of unix timestamps in a data logger that stores one entry per second, so you are just counting up until there's a gap in your data), in the article it's a fixed value every 60 bytes
Of course, our brains are exceedingly good at finding patterns (to the point where we often find phantom ones). I was just expecting some basic checks like "does it make sense to store the difference instead of the absolute value for some of these bytes here". Seeing as the difference is 0 between every 60th byte in the submitted article, that should fix both our issues
Bzip2 performed much better for me, but it's also incredibly slow. If it were only the compressor, that might be fine for many applications, but decompressing is also an exercise in patience, so I've moved to Zstandard as the standard thing to use
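The "store the difference instead of the absolute value" idea as a tiny sketch, on a synthetic run of per-second unix timestamps; the deltas collapse to a near-constant stream that any byte-oriented compressor handles easily:

    import struct
    import zstandard

    timestamps = list(range(1_700_000_000, 1_700_000_000 + 100_000))  # one entry per second

    raw    = struct.pack(f"<{len(timestamps)}q", *timestamps)
    deltas = [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]
    delta  = struct.pack(f"<{len(deltas)}q", *deltas)

    cctx = zstandard.ZstdCompressor(level=19)
    print(len(cctx.compress(raw)), len(cctx.compress(delta)))  # delta stream compresses far better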
by semiinfinitely
7 subcomments
- FASTA is a candidate for the stupidest file format ever invented and a testament to the massive gap in perceived vs actual programming ability of the average bioinformatician.
by leobuskin
1 subcomments
- What about a specialized dict for FASTA? Shouldn't it increase ZSTD compression significantly?
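Training one is cheap to try (a sketch with python-zstandard and made-up sample files), though dictionaries mostly pay off for lots of small inputs rather than one multi-gigabyte stream, so the gain on a big FASTA may be modest:

    import zstandard

    # Train a ~112 KiB dictionary on sample records, then compress new ones with it.
    samples = [open(p, "rb").read() for p in ("rec1.fa", "rec2.fa", "rec3.fa")]
    dict_data = zstandard.train_dictionary(112 * 1024, samples)

    cctx = zstandard.ZstdCompressor(level=19, dict_data=dict_data)
    compressed = cctx.compress(open("rec4.fa", "rb").read())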
- When you know you're going to be compressing files of a particular structure, it's often very beneficial to tweak compression algorithm parameters. In one case when dealing with CSV data, I was able to find an LZMA2 compression level, dictionary size and compression mode that yielded a massive speedup, used 1/100th the memory and, surprisingly, even gave better compression ratios, probably from the smaller dictionary size. That's in comparison to the library's default settings.
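For reference, that sort of tuning with Python's stdlib lzma looks roughly like this; the specific numbers are placeholders, not the settings the parent comment found:

    import lzma

    # Custom LZMA2 filter chain: low preset, small dictionary, fast match mode.
    filters = [{
        "id": lzma.FILTER_LZMA2,
        "preset": 1,
        "dict_size": 1 << 20,   # 1 MiB instead of the preset's default
        "mode": lzma.MODE_FAST,
    }]

    with open("data.csv", "rb") as fh:
        compressed = lzma.compress(fh.read(), format=lzma.FORMAT_XZ, filters=filters)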
- Damn surely you stop using ASCII formats before your dataset gets to 2 TB??
by totalperspectiv
2 subcomments
- Removing the wrapping newline from the FASTA/FASTQ convention also dramatically improves parsing perf when you don't have to do as much lookahead to find record ends.
- > I speculated that this poor performance might be caused by the newline bytes (0x0A) punctuating every 60 characters of sequence, breaking the hashes used for long range pattern matching.
If the linefeeds were treated specially and not allowed to break the hashes, you would get similar results without pre-filtering and post-filtering. It occurs to me that this strategy is so obvious that there must be some reason it won't work.
- To me the most interesting thing here isn't that you can compress something better by removing randomly-distributed semantically-meaningless information. It's why zstd --long does so much better than gzip once you do, while the default does worse than gzip.
What lessons can we take from this?
- There's some discussion here about DNA-specific compression algorithms.
I thought I'd raise yesterday's HN discussion on 'The unreasonable effectiveness of modern sort algorithms' https://news.ycombinator.com/item?id=45208828
That blog post isn't about DNA per se, but it is about sorting data when you know there are only 4 distinct values. I guess DNA has 5 - A, T, G, C, and N (the unknown base) - but there's a huge space of DNA-specific compression research that outperforms ZSTD.
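Even the most naive domain-specific trick, packing the four bases into 2 bits each (ignoring N and the other IUPAC codes for brevity), quarters the size before any general-purpose compressor runs; a hedged sketch:

    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

    def pack_2bit(seq):
        # Pack an ACGT string into 2 bits per base; N/other codes are not handled.
        out, acc, nbits = bytearray(), 0, 0
        for base in seq:
            acc = (acc << 2) | CODE[base]
            nbits += 2
            if nbits == 8:
                out.append(acc)
                acc, nbits = 0, 0
        if nbits:
            out.append(acc << (8 - nbits))  # left-align the final partial byte
        return bytes(out)

    print(len(pack_2bit("ACGT" * 1000)))  # 1000 bytes for 4000 bases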
- What's the current way to accessibly process my 23andme raw data? It was generated a decade ago, and SNPedia and Promethease seem abandoned, so what's the alternative if there is one, and if there is none, how did we arrive here?
- This might in general be a good preprocessing step: check for punctuation repeating at fixed intervals, remove it, and restore it after decompression.
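A sketch of that filter idea; it only round-trips when the interval is strictly regular (real FASTA records end in a shorter final line, so a production version would have to record the exceptions):

    def strip_fixed_interval(data: bytes, byte: int, stride: int) -> bytes:
        # Drop `byte` wherever it occurs exactly every stride + 1 positions.
        out = bytearray()
        for i, b in enumerate(data):
            if b == byte and (i + 1) % (stride + 1) == 0:
                continue
            out.append(b)
        return bytes(out)

    def restore_fixed_interval(data: bytes, byte: int, stride: int) -> bytes:
        # Re-insert `byte` after every `stride` payload bytes.
        out = bytearray()
        for i, b in enumerate(data, 1):
            out.append(b)
            if i % stride == 0:
                out.append(byte)
        return bytes(out)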
by FL33TW00D
1 subcomments
- Looking forward to the relegation of FASTQ and FASTA to the depths of hell where they belong. Incredibly inefficient and poorly designed formats.
- I've explored alternatives to FASTA and FASTQ, but in most cases I found that simply not storing sequence data is the best option of all. If I do have to store it, columnar formats with compression are usually the best alternative when considering all of my constraints.
- As someone with an idle interest in data compression, is it possible to download the original dataset somewhere to play around with? Or rather something like a 20 GB subset of it.
- https://github.com/meel-hd/DNA
by nickdothutton
0 subcomments
- How can we represent data or algos such that such optimisations become more obvious?
by Kim_Bruning
2 subcomments
- Now I'm wondering why this works. DNA clearly has some interesting redundancy strategies. (It might also depend on the genome?)