Chances are you have heard of or tried llama.cpp this year. Running a large language model locally, especially on a Mac, is possible thanks to GGML, the C/C++ ML tensor library that llama.cpp uses.
CPU vs GPU
In order to keep up with the demands of ML applications, developers gravitate towards high-powered GPUs for training and inference. That obviously unlocks a lot in the creation of these models, but it also leaves a gap for the regular developer with just a Mac.
To put it in numbers: on HuggingFace a GPU instance costs roughly 1 dollar per hour, while a CPU instance costs roughly 2 cents per hour, if not free.
I’ve been able to run a 13B GGML Llama model comfortably on a CPU-upgraded HuggingFace Space, and a 7B GGML model will run on the free CPU tier!
Feel free to fork my HuggingFace llama-cpp-python server and swap in the models you want to use. I have a CUDA-powered Space as well, where I’ve been able to get a 33B model running on an A100, and the performance keeps improving.
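To make that concrete, here is a minimal sketch of querying a GGML model through llama-cpp-python, roughly what such a Space does behind the scenes. The model path is just a placeholder for whichever quantized file you download.

```python
# Minimal llama-cpp-python sketch; the model path is a placeholder
# for whatever quantized GGML file you choose.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-13b.ggmlv3.q4_0.bin",  # swap in your model
    n_ctx=2048,    # context window
    n_threads=4,   # CPU threads to use
)

output = llm(
    "Q: What is GGML? A:",
    max_tokens=128,
    stop=["Q:"],
)
print(output["choices"][0]["text"])
```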
Quantization
In order to run an LLM on your computer, it needs to be quantized: the weights are compressed, and the memory requirements drop to the point where the model fits within your computer’s memory limits. For example, 16GB of RAM can likely run a 7B model comfortably.
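As a rough back-of-the-envelope check (weights only, ignoring context and KV cache overhead, so treat the numbers as approximate):

```python
# Rough memory estimate for model weights only (ignores KV cache and overhead).
def approx_weights_gb(n_params_billions: float, bits_per_weight: float) -> float:
    return n_params_billions * 1e9 * (bits_per_weight / 8) / (1024 ** 3)

print(round(approx_weights_gb(7, 16), 1))    # fp16 7B             -> ~13.0 GB
print(round(approx_weights_gb(7, 4.5), 1))   # ~4-bit quantized 7B  -> ~3.7 GB
print(round(approx_weights_gb(13, 4.5), 1))  # ~4-bit quantized 13B -> ~6.8 GB
```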
The natural question is whether quality degrades with this compression, and this is where benchmarking and perplexity testing come into play. From my research the quality change is minimal; for example, a 30B quantized model will still greatly outperform a 13B un-quantized one.
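If perplexity is new to you, it is just the exponentiated average negative log-probability the model assigns to each token of a test set; lower is better, and a quantized model is judged by how little that number rises. Here is a toy illustration (not the llama.cpp implementation):

```python
import math

# Perplexity from per-token log-probabilities (natural log); lower is better.
def perplexity(token_logprobs: list[float]) -> float:
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Toy example: a model that gives every token probability 0.25 has perplexity 4.
print(perplexity([math.log(0.25)] * 10))  # -> 4.0
```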
I’ve been working on a pull request for the lm-eval library, which houses the standard LLM benchmark suite. There are plenty of other ways to benchmark a GGML model, including within llama.cpp itself (the Jeopardy benchmark, for example).
The process of converting a model to GGML is straightforward, but discovering models is a unique challenge that Tom Jobbins (The Bloke) has taken on. I typically go through his list to discover new models, and the number of them keeps growing!
Apple GPU support (Metal)
The llama.cpp project has continued to incorporate GPU support, and as of a few months ago Apple Metal Performance Shaders (MPS) support was added. Running locally already has its advantages, but the GPU now makes it orders of magnitude faster.
Metal is now supported by default (no flag required). You will want to include the -ngl flag to set how many layers to offload to the GPU.
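In llama-cpp-python the equivalent knob is n_gpu_layers; the right count depends on the model and your machine, so treat the value below as illustrative:

```python
# Same setup as before, but offloading transformer layers to the GPU;
# n_gpu_layers is the Python-side equivalent of llama.cpp's -ngl flag.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-13b.ggmlv3.q4_0.bin",  # placeholder path
    n_ctx=2048,
    n_gpu_layers=32,  # number of layers to offload; tune to your GPU memory
)
```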
GGML all the things
Check out the GGML GitHub repo to contribute and to see the other projects currently incorporating GGML.
You’ll notice many open-source LLMs (not just Llama) are included there. The pattern you will notice is a new “.cpp” project similar to llama.cpp; for example, Meta’s “Segment Anything” recently got a GGML version, sam.cpp.
It’s worth noting that a lot of the optimizations, like Metal support, are currently Llama-only. This may change over time; the library recently shifted to a new file format (GGUF).
There are so many other ML models that could benefit from GGML! MusicGen, for example, would be great. Thanks to Georgi Gerganov and the team for making ML more accessible.