Hacker News

Show HN: Data Engineering Book – An open source, community-driven guide

246 points by xx123122

by fudged71

0 subcomment

Thank you so much for this book! I'm finding the translation is very high quality.
I am a complete novice in training LLMs, and have been trying to train a novel architecture for Python code generation, using Apple Silicon.
I've been a bit frustrated, to be honest, that the data tools don't seem to have any focus on code; their modalities are generic text and images. For synthetic data generation I would love to use EBNF-constrained outputs, but SGLang is not available on macOS. So I feel a bit stuck: downloading a large corpus of Python code, running into APFS issues, sharding, custom classifying, custom cleaning, custom mixing, etc. Maybe I've missed a tool, but I'm surprised there aren't pre-tagged, pre-categorized, pre-filtered datasets for code where I can just tune the curriculum/filters to feed into training.

by esafak

1 subcomments

I'd have titled the submission 'Data Engineering for LLMs...' as it is focused on that.

by hliyan

1 subcomments

I'm not sure whether this is an artefact of translation, but things like this don't inspire confidence:
> The "Modern Data Stack" (MDS) is a hot concept in data engineering in recent years, referring to a cloud-native, modular, decoupled combination of data infrastructure
https://github.com/datascale-ai/data_engineering_book/blob/m...
Later parts are better and more to the point though: https://github.com/datascale-ai/data_engineering_book/blob/m...
Edit: perhaps I judged too early. The RAG section isn't bad either: https://github.com/datascale-ai/data_engineering_book/blob/m...

by cpard

0 subcomment

It's important for a book covering an emerging field (data eng for LLMs) to mention emerging categories related to it, such as storage formats purpose-built for the full ML lifecycle.
Lance[1] (the format, not just LanceDB) is a great example, where you have columnar storage optimized for both analytical operations and vector workloads together with built-in versioning for dataset iteration.
Plus (very important) random access, which is important for stuff like sampling and efficient filtering during curation but also for working with multimodal data, e.g. videos.
Lance is not alone: Vortex[2] is another one, Nimble[3] from Meta is yet another, and I might be missing a few more.
[1] https://github.com/lance-format/lance [2] https://vortex.dev [3] https://github.com/facebookincubator/nimble

by joshuaissac

2 subcomments

English version: https://github.com/datascale-ai/data_engineering_book/blob/m...

by osamabinladen

2 subcomments

this is great and i bookmarked it so i can read it later. i’m just curious though, was the readme written by chatgpt? i can’t tell if i’m paranoid thinking everything is written by chatgpt

by baalimago

0 subcomment

> "Data is the new oil, but only if you know how to refine it."
Oil[0] is fairly useless without being refined as well. Perhaps: "Data is the new oil, you need to refine it"?
[0]: https://en.wikipedia.org/wiki/Petroleum

by 13pixels

2 subcomments

The 'Vector DB vs Keyword Search' section caught my eye. In your testing for RAG pipelines, where do you draw the line?
We've found keyword search (BM25) often beats semantic search for specific entity names/IDs, while vectors win on concepts. Do you cover hybrid search patterns/re-ranking in the book? That seems to be where most production systems end up.
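For readers unfamiliar with the hybrid pattern mentioned above, here is a minimal sketch of one common way to merge a keyword ranking with a vector ranking: reciprocal rank fusion (RRF). This is not taken from the book; the document IDs are made up, the function name is hypothetical, and k=60 is just the constant commonly used with RRF. It assumes each retriever has already returned its own ranked list of IDs.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list.

    Each document scores 1 / (k + rank) per list it appears in, so a
    document ranked well by both retrievers rises to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results: BM25 nails the exact SKU, vectors favour the concept.
bm25_hits = ["doc_sku_4471", "doc_pricing", "doc_returns"]
vector_hits = ["doc_returns", "doc_sku_4471", "doc_shipping"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

RRF only needs ranks, not comparable scores, which is why it is a convenient first step before adding a heavier re-ranking model.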

by guillem_lefait

1 subcomments

The figures in the different chapters are in English (this is not the case for the image in README_en.md).

by alexott

1 subcomments

Parquet alone is not enough for modern data engineering. Delta and Iceberg should be on the list.

by xx123122

0 subcomment

[dead]

by dvrp

0 subcomment

If you are interested in (2026-)internet-scale data engineering challenges (e.g. processing tens to hundreds of petabytes) and pre-training/mid-training/post-training scale challenges, please send me an email at d+data@krea.ai!

by rafavargascom

3 subcomments

Thank you
How is it possible that a Chinese publication gets to the top of HN?

by MUSTANG303

0 subcomment

[dead]