- Repo with demo video and benchmark:
https://github.com/microsoft/BitNet
"...It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption..."
https://arxiv.org/abs/2402.17764
by ilrwbwrkhv
4 subcomments
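For context, a minimal NumPy sketch of the absmean weight ternarization the b1.58 paper describes; the function name, tensor shapes, and epsilon are illustrative choices, not the paper's reference code.

```python
import numpy as np

def ternarize_absmean(W: np.ndarray, eps: float = 1e-8):
    """Quantize a weight matrix to {-1, 0, +1} with a single per-tensor
    scale, following the absmean scheme described in the b1.58 paper.
    Illustrative sketch only, not the reference implementation."""
    gamma = np.abs(W).mean()                          # absmean scale
    Wq = np.clip(np.round(W / (gamma + eps)), -1, 1)  # round, then clamp to ternary
    return Wq.astype(np.int8), gamma                  # gamma rescales matmul outputs

# Example: ternarize a random full-precision layer
W = np.random.randn(256, 512).astype(np.float32)
Wq, gamma = ternarize_absmean(W)
assert set(np.unique(Wq)).issubset({-1, 0, 1})
```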
- This will happen more and more. This is why Nvidia is rushing to entrench CUDA as a software-level lock-in; otherwise their stock will go the way of Zoom.
by zamadatix
2 subcomments
- "Parameter count" is the "GHz" of AI models: the number you're most likely to see but least likely to need. All of the models compared (in the table on the huggingface link) are 1-2 billion parameters but the models range in actual size by more than a factor of 10.
- I think almost all the free LLMs (not AI) that you find on hf can 'run on CPUs'.
The claim here seems to be that it runs usefully fast on CPU.
It's hard to judge how strong that claim is, because we don't know how fast this model runs on a GPU at all:
> Absent from the list of supported chips are GPUs [...]
And TFA doesn't really quantify anything; it just offers:
> Perhaps more impressively, BitNet b1.58 2B4T is speedier than other models of its size — in some cases, twice the speed — while using a fraction of the memory.
The model they link to is just over 1 GB in size, and there are plenty of existing 1-2 GB models that are quite serviceable on even a mildly modern CPU-only rig (rough numbers sketched below).
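A rough way to sanity-check "usefully fast on CPU": single-stream decoding is usually memory-bandwidth bound, so an optimistic ceiling on tokens/s is roughly memory bandwidth divided by the bytes streamed per token (about the model size). The bandwidth figures below are ballpark assumptions, not measurements.

```python
# Ceiling estimate: tokens/s ≈ memory bandwidth / bytes streamed per token.
# Bandwidth numbers are rough assumptions for illustration only.
model_size_gb = 1.1  # roughly the size of the linked ~1 GB checkpoint
for label, bw_gb_s in [("dual-channel DDR4", 40), ("dual-channel DDR5", 75), ("Apple M2 (unified)", 100)]:
    print(f"{label:>20}: ~{bw_gb_s / model_size_gb:.0f} tok/s upper bound")
```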
- This is over a year old. The sky did not fall, and everyone did not switch to this in spite of the "advantages". If you look into why, you'll see that it does, in fact, affect the quality metrics, some more than others, and there is no silver bullet.
- The pricing war will continue its race to rock bottom.
- Why do they call it "1-bit" if it uses ternary {-1, 0, 1}? Am I missing something?
by nodesocket
1 subcomment
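A note on the naming: a weight drawn from {-1, 0, +1} carries log2(3) ≈ 1.58 bits of information, which is why the paper calls the model "b1.58"; "1-bit LLM" is the umbrella term carried over from the original BitNet paper, which used binary {-1, +1} weights.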
- There are projects working on distributed LLM inference, such as exo[1]. If they can fully crack the distribution problem and get good performance, it's a game changer. Instead of spending insane amounts on Nvidia GPUs, you could just deploy commodity clusters of AMD EPYC servers with tons of memory, NVMe disks, and 40G or 100G networking, which is vastly less expensive. Goodbye Nvidia AI moat.
[1] https://github.com/exo-explore/exo
by justanotheratom
2 subcomments
- Super cool. Imagine specialized hardware for running these.
- Is there a library to distill bigger models into BitNet?
by instagraham
1 subcomment
- > it’s openly available under an MIT license and can run on CPUs, including Apple’s M2.
Weird comparison? The M2 already runs 7-13 GB Llama and Mistral models with relative ease.
M-series Macs and MacBooks are so ubiquitous that perhaps we're forgetting how weak the average CPU (think i3 or i5) can be.
by 1970-01-01
0 subcomments
- ..and eventually the Skynet Funding Bill was passed.