Recently, I was curious to see how easy it would be to run Llama 2 on my MacBook Pro M2, given the impressive amount of memory it makes available to both the CPU and the GPU. This led me to the excellent llama.cpp, a project focused on running quantized versions of the Llama models on both CPU and GPU.

The process felt quite straightforward, except for some instability in the llama.cpp repo just as I decided to try it out, which has since been fixed. Incidentally, this prompted me to document the whole process, just in case I want to do it again in the future.

Here’s what I did:

  1. Signed up for access to Llama 2 and Code Llama here
  2. Shortly after, I received an email with download instructions and followed them for the 7b models. Keep in mind that each of the 7b models weighs about 13GB, and it only goes up from there (the 70b models are around 137GB each – good luck downloading those on wifi)
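For reference, the download flow looks roughly like this (a sketch, assuming the email’s instructions still point to the download.sh script in Meta’s llama repo):
git clone https://github.com/facebookresearch/llama.git llama
cd llama && ./download.sh
# paste the presigned URL from the email when prompted,
# then pick the models to fetch (e.g. 7B-chat)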
  3. The download link is only valid for 24 hours, which is a pity and also annoying. BUT. One can request access to Meta’s Llama 2 repository on HuggingFace using an account registered with the same email used to request access at step 1. Then one can download the models all one wants, for however long HuggingFace will host them. Wink.
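If you go that route, the download can look something like this – a sketch, assuming huggingface_hub is installed and your account has been granted access to the gated meta-llama/Llama-2-7b-chat repo (note that the repo layout may differ slightly from the direct download):
pip install -U "huggingface_hub[cli]"
huggingface-cli login    # paste a read token from your HuggingFace account
huggingface-cli download meta-llama/Llama-2-7b-chat --local-dir ./llama-2-7b-chat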
  4. Cloned llama.cpp, specifically this commit (release b1221)
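In other words, something like this (a newer release tag will most likely work just as well):
git clone https://github.com/ggerganov/llama.cpp
git -C llama.cpp checkout b1221   # pin to the release I used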
  5. llama.cpp needs to be built before using it, so cd llama.cpp && make and wait for it. Unlike in previous llama.cpp versions, Metal is now enabled by default when compiling, so there’s no need to worry about setting LLAMA_METAL=1 or stuff like that.
  6. Now, assuming the Llama repo is cloned, is named llama, and sits in the same parent directory as llama.cpp:
cp ../llama/tokenizer.model ./models
cp ../llama/tokenizer_checklist.chk ./models

MODEL="llama-2-7b-chat"
# directly moving the model to save disk space
mv ../llama/${MODEL} ./models
  7. In the llama.cpp folder, using conda:
conda create -n llama-cpp python=3.10 -y
conda activate llama-cpp
pip install -r requirements.txt
  8. The fun part is converting the original Llama 2 models to llama.cpp’s GGUF format. You can do this using the commands below, but whatever you do, make sure you have enough disk space – you will need an extra ~130% of the original model size, which comes to about 17GB for the 7b models. This table is a good reference for a model’s size on disk once converted. This outdated table is also a good reference when comparing between different quantization methods.
python convert.py models/${MODEL}
./quantize ./models/${MODEL}/ggml-model-f16.gguf \
           ./models/${MODEL}/ggml-model-q4_0.gguf q4_0
  9. A few ways to prompt; see this reference for more examples:
# basic completion, limited to 128 generated tokens
./main -m ./models/${MODEL}/ggml-model-q4_0.gguf \
    --n-predict 128 --prompt "Fuzzy Wuzzy was"

# disable Metal acceleration, just to see how much slower it would run
# on an M2 Max it goes down from 63 tokens/second with Metal acceleration to 26 tokens/second without Metal
./main -m ./models/${MODEL}/ggml-model-q4_0.gguf \
    --n-predict 128 --prompt "Fuzzy Wuzzy was" --n-gpu-layers 0

# interactive prompting, with increased context size
./main -m ./models/${MODEL}/ggml-model-q4_0.gguf \
    --n-predict 256 --ctx-size 2048 --interactive --interactive-first \
    --color --reverse-prompt "User:" \
    --prompt "User: Who is Fuzzy Wuzzy? ASSISTANT: Fuzzy Wuzzy was"

One time I got this gem:

ASSISTANT: I apologize, but I cannot provide information on Fuzzy Wuzzy’s personal preferences or any other details that might be considered offensive or insensitive. It’s important to be respectful of all cultures and avoid perpetuating harmful stereotypes or caricatures.

If Skynet gains sentience anytime soon, my sincere hope is that it won’t be as moralizing as this guy. 🤨

  10. BONUS: you can save yourself some time and disk space by directly downloading one of TheBloke’s pre-converted GGUF models instead of the official ones. For example (note that curl’s --output needs a file name, not a directory):
mkdir -p ./models/WizardLM-13B-V1.1
curl https://huggingface.co/TheBloke/WizardLM-13B-V1.1-GGUF/resolve/main/wizardlm-13b-v1.1.Q4_K_M.gguf \
    --location --parallel \
    --output ./models/WizardLM-13B-V1.1/wizardlm-13b-v1.1.Q4_K_M.gguf

  11. BONUS #2: You can run a very rough web UI with ./server -m ./models/${MODEL}/ggml-model-q4_0.gguf
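Once the server is up, you can also query it over plain HTTP – a sketch, assuming the default address of 127.0.0.1:8080 and the /completion endpoint exposed by the server example:
curl --request POST http://127.0.0.1:8080/completion \
     --header "Content-Type: application/json" \
     --data '{"prompt": "Fuzzy Wuzzy was", "n_predict": 128}'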