FRESH

Hacker News

Home

I made my own Git

378 points by TonyStr

by nasretdinov

7 subcomments

Nice work! On a complete tangent, Git is the only SCM known to me that supports recursive merge strategy [1] (instead of the regular 3-way merge), which essentially always remembers resolved conflicts without you needing to do anything. This is a very underrated feature of Git and somehow people still manage to choose rebase over it. If you ever get to implementing merges, please make sure you have a mechanism for remembering the conflict resolution history :).
[1] https://stackoverflow.com/questions/55998614/merge-made-by-r...

by darkryder

3 subcomments

Great writeup! It's always fun to learn the details of the tools we use daily.
For others, I highly recommend Git from the Bottom Up[1]. It is a very well-written piece on internal data structures and does a great job of demystifying the opaque git commands that most beginners blindly follow. Best thing you'll learn in 20ish minutes.
1. https://jwiegley.github.io/git-from-the-bottom-up/

by teiferer

5 subcomments

If you ever wonder how coding agents know how to plan things etc, this is the kind of article they get this training from.
Ends up being circular if the author used LLM help for this writeup though there are no obvious signs of that.

by gkbrk

0 subcomment

CodeCrafters has an amazing "Build your own Git" [1] tutorial too. Jon Gjengset has a nice video [2] doing this challenge live with Rust.
[1]: https://app.codecrafters.io/courses/git/overview
[2]: https://www.youtube.com/watch?v=u0VotuGzD_w

by brendoncarroll

2 subcomments

Me too. Version control is great, it should get more use outside of software.
https://github.com/gotvc/got
Notable differences: E2E encryption, parallel imports (Got will light up all your cores), and a data structure that supports large files and directories.

by p4bl0

3 subcomments

Nice post :). It made me think of ugit: DIY Git in Python [1] which is still by far my favorite of this kind of posts. It really goes deep into Git internals while managing to stay easy to follow along the way.
[1] https://www.leshenko.net/p/ugit/

by sluongng

0 subcomment

Zstd dictionary compression is essentially how Meta's Mercurial fork (Sapling VCS) stores blobs https://sapling-scm.com/docs/dev/internals/zstdelta. The source code is available in GitHub if folks want to study the tradeoffs vs git delta-compressed packfiles.
I think theoratically, Git delta-compression is still a lot more optimized for smaller repos. But for bigger repos where sharding storaged is required, path-based delta dictionary compression does much better. Git recently (in the last 1 year) got something called "path-walk" which is fairly similar though.

by sublinear

0 subcomment

> If I were to do this again, I would probably use a well-defined language like yaml or json to store object information.
I know this is only meant to be an educational project, but please avoid yaml (especially for anything generated). It may be a superset of json, but that should strongly suggest that json is enough.
I am aware I'm making a decade old complaint now, but we already have such an absurd mess with every tool that decided to prefer yaml (docker/k8s, swagger, etc.) and it never got any better. Let's not make that mistake again.
People just learned to cope or avoid yaml where they can, and luckily these are such widely used tools that we have plenty of boilerplate examples to cheat from. A new tool lacking docs or examples that only accepts yaml would be anywhere from mildly frustrating to borderline unusable.

by oldestofsports

2 subcomments

Nice job, great article!
I had a go at it as well a while back, I call it "shit" https://github.com/emanueldonalds/shit

by temporallobe

1 subcomments

Reminds me of when I tried to invent a SPA framework. So much hidden complexity I hadn’t thought of and I found myself going down rabbit holes that I am sure the creators of React and Angular went down. Git seems to be like this and I am often reminded of how impressive it is at hiding underlying complexity.

by igorw

1 subcomments

Random but y'all might enjoy. Git client in PHP, supports reading packfiles, reftables, diff via LCS. Written by hand.
https://github.com/igorwwwwwwwwwwwwwwwwwwww/gipht-horse

by sneela

1 subcomments

> If you want to look at the code, it's available on github.
Why not tvc-hub :P
Jokes aside, great write up!

by KolmogorovComp

0 subcomment

It’s really a shame git storage use files as the unit for storage. That’s what makes it improper for usage with many of small files, or large files.
Content-based chunking like Xethub uses really should become the default. It’s not like it’s new either, rsync is based on it.
https://huggingface.co/blog/xethub-joins-hf

by h1fra

0 subcomment

Learning git internals was definitely the moment it became clear to me how efficient and smart git is.
And this way of versionning can be reused in other fields, as soon as have some kind of graph of data that can be modified independently but read all together then it makes sense.

0 subcomment

by kgeist

3 subcomments

>The hardest part about this project was actually just parsing.
How about using sqlite for this? Then you wouldn't need to parse anything, just read/update tables. Fast indexing out of the box, too.

by eru

1 subcomments

> These objects are also compressed to save space, so writing to and reading from .git/objects/ will always involve running a compression algoritm. Git uses zlib to compress objects, but looking at competitors, zstd seemed more promising:
That's a weird thing to put so close to the start. Compression is about the least interesting aspect of Git's design.

by astinashler

2 subcomments

Does this git include empty folder? I always annoy that it's not track empty folder.

by heckelson

1 subcomments

gentle reminder to set your website's `<title>` to something descriptive :)

by mg794613

0 subcomment

"Though I suck at it, my go-to language for side-projects is always Rust"
Hmm, dont be so hard on yourself!
proceeds to call ls from rust
Ok nevermind, although I dont think rust is the issue here.
(Tony I'm joking, thanks for the article)

by bryan2

0 subcomment

Ftr you can make repos with sha256 now.
I wonder if signing sha-1 mitigates the threat of using an outdated hash.

by athrowaway3z

1 subcomments

I do wonder if the compression step makes sense at this layer instead of the filesystem layer.

by direwolf20

0 subcomment

Cool. When you reimplement something, it forces you to see the fractal complexity of it.

by jrockway

2 subcomments

sha256 is a very slow algorithm, even with hardware acceleration. BLAKE3 would probably make a noticeable performance difference.
Some reading from 2021: https://jolynch.github.io/posts/use_fast_data_algorithms/
It is really hard to describe how slow sha256 is. Go sha256 some big files. Do you think it's disk IO that's making it take so long? It's not, you have a super fast SSD. It's sha256 that's slow.

by holoduke

0 subcomment

I wonder if in the near future there will be no tools anymore in the sense we know it. you will maybe describe the tool you need and its created on the fly.

by prakhar1144

0 subcomment

I was also playing around with the ".git" directory - ended up writing:
"What's inside .git ?" - https://prakharpratyush.com/blog/7/

by lasgawe

0 subcomment

nice work! This is one of the best ways to deeply learn something, reinvent the wheel yourself.

by ofou

0 subcomment

btw, you can change the hashing algorithm in git easily

by smangold

0 subcomment

Tony nice work!

by b1temy

1 subcomments

Nice work, it's always interesting to see how one would design their own VCS from scratch, and see if they fall into problems existing implementations fell into in the past and if the same solution was naturally reached.
The `tvc ls` command seems to always recompute the hash for every non-ignored file in the directory and its children. Based on the description in the blog post, it seems the same/similar thing is happening during commits as well. I imagine such an operation would become expensive in a giant monorepo with many many files, and perhaps a few large binary files thrown in.
I'm not sure how git handles it (if it even does, but I'm sure it must). Perhaps it caches the hash somewhere in the `.git`directory, and only updates it if it senses the file hash changed (Hm... If it can't detect this by re-hashing the file and comparing it with a known value, perhaps by the timestamp the file was last edited?).
> Git uses SHA-1, which is an old and cryptographically broken algorithm. This doesn't actually matter to me though, since I'll only be using hashes to identify files by their content; not to protect any secrets
This _should_ matter to you in any case, even if it is "just to identify files". If hash collisions (See: SHAttered, dating back to 2017) were to occur, an attacker could, for example, have two scripts uploaded in a repository, one a clean benign script, and another malicious script with the same hash, perhaps hidden away in some deeply nested directory, and a user pulling the script might see the benign script but actually pull in the malicious script. In practice, I don't think this attack has ever happened in git, even with SHA-1. Interestingly, it seems that git itself is considering switching to SHA-256 as of a few months ago https://lwn.net/Articles/1042172/
I've not personally heard of the process of hashing to also be known as digesting, though I don't doubt that it is the case. I've mostly familiar of the resulting hash being referred to as the message digest. Perhaps it's to differentiate between the verb 'hash' (the process of hashing) with the output 'hash' (the ` result of hashing). And naming the function `sha256::try_digest`makes it more explicit that it is returning the hash/digest. But it is a bit of a reach, perhaps that are just synonyms to be used interchangeably as you said.
On a tangent, why were TOML files not considered at the end? I've no skin in the game and don't really mind either way, but I'm just curious since I often see Rust developers gravitate to that over YAML or JSON, presumably because it is what Cargo uses for its manifest.
--
Also, obligatory mention of jujutsu/jj since it seems to always be mentioned when talking of a VCS in HN.

by quijoteuniv

0 subcomment

Now … if you reinvent Linux you are closer to be compared to LT

by smekta

0 subcomment

...with blackjacks, and hookers

by jonny_eh

0 subcomment

Why introduce yet another ignore file? Can you have it read .gitignore if .tvcignore is missing?

by black_13

0 subcomment

[dead]