Base Language Models
This is a non-exhaustive list of large language models that I find interesting. I’ve just started building it, so it’s very much incomplete. It contains only base models for now; I’m not sure if/how I’ll include any finetunes.
Criteria for including a model:
- It needs to be interesting to me personally, not necessarily beat whatever Llama finetune sat on top of the Open LLM Leaderboard by 0.1% at the time of its release.
- It needs to be available for commercial usage (or be really, really cool, like BakLLaVA). I need to be able to use these models to generate code/ideas/whatever that may or may not generate revenue. That excludes models like Orca-2, which can only be used for “non-commercial, non-revenue generating, research purposes”.
- I need to be able to run it locally, on a Mac M2 Max. This means either smaller models (13B tops), or something that runs in llama.cpp.
I’ll be referencing some comparisons as well:
- LMSYS Chatbot Arena, specifically its Elo ratings: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
- Open LLM Leaderboard, although it’s just so messy: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
- /u/WolframRavenwolf’s threads are gold: https://www.reddit.com/user/WolframRavenwolf/
Phi-2
Phi-2 is a 2.7 billion parameter Transformer-based model designed for tasks involving common sense reasoning, language understanding, logical reasoning, and Python code generation. It has been trained on a blend of NLP synthetic texts, filtered web data, and teaching-focused datasets. Plus, it can go head to head with other models that have less than 13 billion parameters.
- Interesting because: It’s tiny and seems quite capable! This means it can be further (and cheaply) improved by finetuning, and it will run on a lot of devices. Seems great for experimenting, too (see the quick sketch after the links below).
- Released on: 2023-12-12.
- Parameters: 2.7B
- Context size: 2k
- License: MIT License (used to be non-commercial but they changed it, kudos to MS for that)
- Blog: https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/
- Model weights: https://huggingface.co/microsoft/phi-2
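For reference, here’s roughly what local experimentation looks like with Hugging Face transformers. This is only a minimal sketch: the model id is the official one linked above, but the dtype, device settings, and prompt are my own placeholder choices, not Microsoft’s guidance.

```python
# Minimal sketch: load Phi-2 with transformers and generate a short completion.
# device_map="auto" requires the accelerate package; swap for .to("cpu") / .to("mps") if preferred.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/phi-2")
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-2", torch_dtype=torch.float16, device_map="auto"
)

inputs = tok("def fizzbuzz(n):", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```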
Mamba-3B-SlimPJ
Mamba-3B-SlimPJ is a state-space model developed as a collaboration between Together AI & Cartesia AI; it rivals the best Transformer architectures while offering linear scaling in sequence length and fast inference. It has been trained on 600B tokens from the SlimPajama dataset.
It matches the quality of other high-performing 3B parameter models but requires 17% fewer FLOPs. Its performance is validated against models like the BTLM-3B-8K and even some 7B Transformer models.
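To make the “linear scaling” point concrete, here’s a toy diagonal state-space recurrence in PyTorch. This is only a conceptual sketch of the SSM idea, not Mamba’s actual selective-scan kernel; all names, shapes, and parameter values are made up for illustration.

```python
import torch

def diagonal_ssm_scan(x, a, b, c):
    """Toy diagonal state-space recurrence (conceptual sketch, not Mamba's selective scan).
    h_t = a * h_{t-1} + b * x_t ;  y_t = <c, h_t>.  Cost is linear in sequence length,
    with a fixed-size hidden state instead of an attention matrix over all pairs."""
    h = torch.zeros_like(a)                 # hidden state, shape (state_dim,)
    ys = []
    for x_t in x:                           # one cheap update per token
        h = a * h + b * x_t
        ys.append(torch.dot(c, h))
    return torch.stack(ys)

state_dim = 16
y = diagonal_ssm_scan(torch.randn(32),            # 32-token toy input
                      torch.rand(state_dim) * 0.9, # decays < 1 keep the state stable
                      torch.randn(state_dim),
                      torch.randn(state_dim))
```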
- Interesting because: Small, and does almost as well as StableLM-3B-4E1T which was trained on 7 times more tokens.
- Released on: 2023-12-12
- Parameters: 2.8 Billion
- Context size: 2k
- License: Apache 2.0
- Blog: https://www.together.ai/blog/mamba-3b-slimpj
- Model weights: https://huggingface.co/state-spaces/mamba-2.8b-slimpj
- GitHub repo: https://github.com/state-spaces/mamba
DeciLM-7B
DeciLM-7B is a 7 billion parameter language model positioned as offering high throughput and accuracy within its parameter class. The model is licensed under Apache 2.0 and has achieved an average score of 61.55 on the Open LLM Leaderboard.
Its architecture utilizes a feature called variable Grouped Query Attention, which is an evolution from Multi-Query Attention, designed to offer a more efficient balance between computational speed and model accuracy. This architectural decision was facilitated by an AutoNAC-driven design process.
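As a rough illustration of the grouped-query attention idea, here’s a PyTorch sketch. It is not DeciLM’s implementation, and it fixes the number of KV heads (DeciLM’s “variable” GQA varies it per layer); head counts and sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_q_heads=32, n_kv_heads=4):
    # Many query heads share a few key/value heads, so the KV cache shrinks by
    # n_q_heads / n_kv_heads while compute stays close to full multi-head attention.
    b, t, _ = q.shape
    d_head = q.shape[-1] // n_q_heads
    q = q.view(b, t, n_q_heads, d_head).transpose(1, 2)
    k = k.view(b, t, n_kv_heads, d_head).transpose(1, 2)
    v = v.view(b, t, n_kv_heads, d_head).transpose(1, 2)
    group = n_q_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)   # each KV head serves `group` query heads
    v = v.repeat_interleave(group, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2).reshape(b, t, n_q_heads * d_head)

q = torch.randn(1, 16, 32 * 64)             # 32 query heads of size 64
kv = torch.randn(1, 16, 4 * 64)             # only 4 key/value heads
out = grouped_query_attention(q, kv, kv)    # (1, 16, 2048)
```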
The DeciLM-7B-instruct finetune does worse than Mistral-7B-Instruct-v0.2 on /u/WolframRavenwolf’s tests.
- Interesting because: They claim it’s better than Mistral 7B, which is already an achievement. Seems like it’s faster, too.
- Released on: 2023-12-12
- Parameters: 7.04 Billion
- Context size: 8k
- License: Apache-2.0
- Blog: https://deci.ai/blog/introducing-decilm-7b-the-fastest-and-most-accurate-7b-large-language-model-to-date/
- Model weights:
- Base: https://huggingface.co/Deci/DeciLM-7B
- Instruct, finetuned using SlimOrca: https://huggingface.co/Deci/DeciLM-7B-instruct
- Finetuning script: https://colab.research.google.com/drive/1zsDQdlj-ry8MuP5E4qmpFeVf3_hRI1EL?usp=sharing
StripedHyena-7B
A model introduced by Together Research, showcasing a new architecture designed for long context and for better training and inference performance than the traditional Transformer architecture. It combines multi-head, grouped-query attention and gated convolutions arranged in Hyena blocks. The model is suited to processing long prompts and has been trained on sequences of up to 32k.
StripedHyena is the first model competitive with leading open-source Transformers for both short and long-context tasks. It introduces a hybrid architecture that includes state-space models for constant memory decoding and achieves low latency, faster decoding, and higher throughput. It also improves upon training and inference scaling laws compared to optimized Transformer models. Additionally, the model has been optimized with new model grafting techniques and trained on a mix of the RedPajama dataset and longer-context data.
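For intuition, here’s a toy gated causal convolution block in PyTorch, in the spirit of the gated-convolution part of a Hyena block. It is not the StripedHyena architecture (which uses long implicit convolutions and a state-space formulation for decoding); layer sizes and the specific gating are my own simplifications.

```python
import torch
import torch.nn as nn

class GatedCausalConv(nn.Module):
    """Toy gated causal convolution: a depthwise conv branch multiplied by a learned gate."""
    def __init__(self, d_model, kernel_size=128):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)        # value and gate branches
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size - 1, groups=d_model)  # depthwise; causal via pad+trim
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                                     # x: (batch, seq, d_model)
        v, g = self.in_proj(x).chunk(2, dim=-1)
        v = self.conv(v.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)  # trim to stay causal
        return self.out_proj(v * torch.sigmoid(g))            # elementwise gate

x = torch.randn(2, 1024, 256)
y = GatedCausalConv(256)(x)   # (2, 1024, 256); cost grows linearly with sequence length
```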
- Interesting because: Seems to work well and it’s not based on the mighty Transformer. Huge context size.
- Released on: 2023-12-08
- Parameters: 7.65 Billion
- Context size: 128k (?)
- License: Apache-2.0
- Blog: https://www.together.ai/blog/stripedhyena-7b
- Model weights:
- Minimal implementation: https://github.com/togethercomputer/stripedhyena
Mistral-7B
Mistral 7B is a powerful and efficient language model known for outperforming more sizable models on various benchmarks. It has specialized attention mechanisms for improved performance and inference speed.
Mistral 7B can handle longer sequences at a smaller cost due to its unique Sliding Window Attention (SWA) and Grouped-query attention (GQA), which allows for faster inference times. Interestingly, it performs almost as well as much larger models, effectively demonstrating a better cost/performance balance. SWA is cool because it means that while Mistral only looks at the last 4k tokens of context, each of those tokens looked at the 4k before it and so on. See the discussion here.
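Here’s a small sketch of what that window looks like as an attention mask (my own illustration of the idea, not Mistral’s code); the window size and sequence length are toy values.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean attention mask: position i may attend to positions j with i - window < j <= i."""
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(8, window=3)
print(mask.int())
# Each token sees itself and the 2 tokens before it.  Stacked over k layers, information
# can propagate from roughly k * (window - 1) positions back, which is why the effective
# context reach is much larger than the window itself.
```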
- Interesting because: Ever since its release, it has been the small model to beat
- Released on: 2023-09-27
- Parameters: 7.3 billion
- Context size: 8k
- License: Apache 2.0
- Blog: https://mistral.ai/news/announcing-mistral-7b/
- Model weights:
- Base: Mistral-7B-v0.1
- Instruct: Mistral-7B-Instruct-v0.1
Eagle-7B
I’ve been following RWKV for quite some time (a large context size coupled with fast inference tends to get one’s attention), and Eagle-7B, their latest release, looks very nice. It’s a 7.52 billion parameter model based on the RWKV-v5 architecture, meaning it’s a linear transformer with significantly lower inference costs compared to traditional transformers.
It has been trained on 1.1 trillion tokens across more than 100 languages and aims to provide strong multi-lingual performance and reasonable English performance, approaching the benchmark set by other larger models.
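To see why decoding cost stays flat, here’s a generic linear-attention-style recurrence. It is not the actual RWKV-v5 update rule (which uses learned per-channel decays and a different normalization); it’s just a toy sketch of the “fixed-size state instead of a growing KV cache” idea, with made-up shapes.

```python
import torch

def linear_attention_decode(q, k, v, decay=0.99):
    """Toy linear-attention recurrence: per-token cost is O(d^2),
    independent of how many tokens came before."""
    T, d = q.shape
    state = torch.zeros(d, d)                    # fixed-size state instead of a growing KV cache
    outs = []
    for t in range(T):
        state = decay * state + torch.outer(k[t], v[t])
        outs.append(q[t] @ state)
    return torch.stack(outs)

out = linear_attention_decode(torch.randn(16, 8), torch.randn(16, 8), torch.randn(16, 8))
```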
- Interesting because: Its computation cost scales linearly and not quadratically! Also, good multi-lingual performance
- Released on: 2024-01-29
- Parameters: 7.52 billion
- Context size: Large ’n Confusing, see this too
- License: Apache 2.0
- Blog: https://blog.rwkv.com/p/eagle-7b-soaring-past-transformers
- Model weights: https://huggingface.co/RWKV/v5-Eagle-7B
- Gradio Demo: https://huggingface.co/spaces/BlinkDL/RWKV-Gradio-2
- PIP package: https://pypi.org/project/rwkv/
- RWKV.cpp: https://github.com/saharNooby/rwkv.cpp
Mixtral 8x7B
A high-quality sparse mixture-of-experts model that outperforms Llama 2 70B and GPT-3.5 on several benchmarks, with capabilities in handling multiple languages and performing well in code generation and instruction following.
Its ability to handle an extensive 32k-token context positions it well for complex tasks that require processing large inputs. Mixtral is reported to excel not only in language tasks across English, French, Italian, German, and Spanish, but also in specialized domains such as code generation.
Mistral used both supervised fine-tuning and direct preference optimization for the instruction-tuned variant, which achieves noteworthy results on MT-Bench, indicating its capacity for understanding and executing detailed instructions with precision.
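Here’s a toy top-2 mixture-of-experts layer to illustrate the routing idea (the layer sizes and the naive per-expert loop are mine, not Mixtral’s implementation): total parameters scale with all 8 experts (hence the ~46.7B that sits in memory), while each token only runs through 2 of them (hence the ~12.9B worth of compute per token).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Toy sparse mixture-of-experts layer: 8 expert MLPs, each token routed to 2 of them."""
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                                    # x: (tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                 # mix the 2 selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):                          # only 2 of 8 experts run per token
            for e in range(len(self.experts)):
                sel = idx[:, k] == e
                if sel.any():
                    out[sel] += weights[sel, k].unsqueeze(-1) * self.experts[e](x[sel])
        return out

y = Top2MoE()(torch.randn(10, 64))    # (10, 64)
```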
- Interesting because: It’s as fast as a 13B model, even though it takes 3+ times as much memory. Highest-ranked open model on LMSYS’ Chatbot Arena, on par with GPT-3.5
- Released on: 2023-12-11
- Parameters: 46.7B total (memory), 12.9B used per token (speed)
- Context size: 32k
- License: Apache 2.0
- Blog: https://mistral.ai/news/mixtral-of-experts/
- Model weights:
- /u/WolframRavenwolf tests for Mixtral-8x7B-Instruct-v0.1: https://www.reddit.com/r/LocalLLaMA/comments/18gz54r/llm_comparisontest_mixtral8x7b_mistral_decilm/
- In-depth performance comparisons between Gemini Pro, GPT 3.5 Turbo, GPT 4 Turbo, and Mixtral. Mixtral seems to do a bit worse than 3.5 Turbo.
Yi
The Yi series models, developed by 01.AI, are next-generation, open-source, bilingual large language models (LLMs).
- Interesting because: They’re quite strong (although not as strong as Mixtral), and they offer a 200k context variant.
- Released on: 2023-11-02 to 2023-11-23
- Parameters: 6B and 34B respectively
- Context size: 4k for all except the -200k base models which have a 200k context size
- License: Yi Series Models Community License Agreement 2.1 (custom), with models being fully open for academic research and (apparently) free for commercial use once you apply for it.
- Model weights:
- Web: https://01.ai
- GitHub repo: https://github.com/01-ai/Yi
Multimodals (heh)
BakLLaVA-1
BakLLaVA-1 is a text generation model created by SkunkworksAI. It is based on the Mistral 7B model and enhanced with the LLaVA 1.5 architecture for multimodal capabilities. The model is designed to outperform Llama 2 13B on several benchmarks. It is open-source but uses certain non-commercially permissive data which the creators plan to address in an upcoming release (hello BakLLaVA-2!), which will utilize a larger, commercially viable dataset and a novel architecture.
- Interesting because: It can “see”! Plus it’s based on Mistral 7B so I expect it to be comparable to LLaVA 13B
- Parameters: 7.3 billion
- License: Well. They say it’s Apache 2.0, but since it’s based on the LLaVA dataset it effectively used to be non-commercial; the LLaVA folks have since changed their license, though, because life is change and nothing stays the same on the Internet
- Blog: https://twitter.com/skunkworks_ai/status/1713372586225156392
- Model weights: https://huggingface.co/SkunkworksAI/BakLLaVA-1
- Model Repository on GitHub: https://github.com/SkunkworksAI/BakLLaVA
- Running locally with llama.cpp (a llama-cpp-python sketch follows the links below):
- GGUF files for llama.cpp, including mmproj-model-f16: https://huggingface.co/mys/ggml_bakllava-1
- GGUF conversion instructions: https://github.com/ggerganov/llama.cpp/tree/master/examples/llava
- Inference tutorial: https://advanced-stack.com/resources/multi-modalities-inference-using-mistral-ai-llava-bakllava-and-llama-cpp.html
- Run in the llama.cpp server: ./server -m models/BakLLaVA-1/ggml-model-q5_1.gguf -c 8192 --mmproj models/BakLLaVA-1/mmproj-model-f16.gguf
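Alternatively, llama-cpp-python ships a multimodal chat handler for LLaVA-style models. The sketch below follows its documented LLaVA 1.5 example; the class name and message format come from the llama-cpp-python docs, while the file paths, context size, and prompt are assumptions worth double-checking against the current docs for your version.

```python
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Paths point at the GGUF files from the mys/ggml_bakllava-1 repo linked above.
chat_handler = Llava15ChatHandler(clip_model_path="models/BakLLaVA-1/mmproj-model-f16.gguf")
llm = Llama(
    model_path="models/BakLLaVA-1/ggml-model-q5_1.gguf",
    chat_handler=chat_handler,
    n_ctx=4096,   # leave room for the image embeddings plus the prompt
)

result = llm.create_chat_completion(messages=[
    {"role": "system", "content": "You are an assistant that describes images accurately."},
    {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "file:///path/to/image.png"}},
        {"type": "text", "text": "What is in this picture?"},
    ]},
])
print(result["choices"][0]["message"]["content"])
```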
Yi-VL
Yi Visual Language (Yi-VL) model is an open-source, multimodal model capable of content comprehension, recognition, and multi-round conversations about images in both English and Chinese. It’s the multimodal version of the Yi Large Language Model (LLM) series. Available in 6B and 34B(!) flavors.
- Interesting because: The first open-weights 34B vision language model available!
- Released on: 2024-01-21
- Parameters: 6 and 34 billion
- Context size: 4k (since it’s based on Yi-Chat)
- License: Yi Series Models Community License Agreement 2.1 (custom), with models being fully open for academic research and (apparently) free for commercial use once you apply for it.
- Model weights:
- Web: https://01.ai
- GitHub repo: https://github.com/01-ai/Yi
LLaVA-1.6
LLaVA-1.6 is a large multimodal model (LMM) with improvements in reasoning, optical character recognition (OCR), and world knowledge over its predecessor, LLaVA-1.5. It’s designed for high-resolution visual inputs and has been optimized for better visual reasoning, conversation, and efficient deployment with SGLang.
It offers state-of-the-art (SoTA) performance on several benchmarks, even surpassing commercial models like Gemini Pro in some tests. Four models are available, two of them based on Vicuna, one on Mistral (does this make BakLLaVA-1 obsolete? 🤔), and one on Nous-Hermes-2-Yi :).
- Interesting because: Second 34B vision language model :). Also, it appears to do better than Yi-VL in tests.
- Released on: 2024-01-30
- Parameters: 7, 13, and 34 billion
- Context size: 8k for Mistral, 4k for Nous-Hermes-2-Yi, no idea for Vicuna
- License: Apache-2.0
- Blog: https://llava-vl.github.io/blog/2024-01-30-llava-1-6/
- Model weights:
- WIP pull request for llama.cpp: https://github.com/ggerganov/llama.cpp/pull/5267
Watchlist
Models I haven’t had the time to look at/process, but sound interesting.
Cohere’s Command R & Command R+
Strong models. Non-commercial license. 35B and 104B parameters respectively. Released on 11 March 2024 and 04 April 2024 respectively. 128k-token context window. They do well on the Chatbot Arena.
- Prompt format: https://docs.cohere.com/docs/prompting-command-r
- Command R: https://huggingface.co/CohereForAI/c4ai-command-r-v01
- Command R Blog: https://txt.cohere.com/command-r/
- Command R GGUF Quantizations: https://huggingface.co/andrewcanis/c4ai-command-r-v01-GGUF
- Command R GGUF Quantizations Reddit Thread: https://www.reddit.com/r/LocalLLaMA/comments/1bft5qd/commandr_35b_open_weights_model_has_ggufllamacpp/
- Command R+: https://huggingface.co/CohereForAI/c4ai-command-r-plus
- Command R+ Blog: https://txt.cohere.com/command-r-plus-microsoft-azure/
- Llama.cpp commit adding support for Command R: https://github.com/ggerganov/llama.cpp/commit/12247f4c69a173b9482f68aaa174ec37fc909ccf
- llama-cpp-python issue: https://github.com/abetlen/llama-cpp-python/issues/1279
Starling-LM-7B-Beta
Strong 7B model. Placed just under Claude-2.1 in the LMSYS Chatbot Arena, which is crazy.
Mi(s|x)tral 8x22B
Initially released as nothing but a magnet link (and most people won’t be able to run this locally due to its size).
- Blog post: https://mistral.ai/news/mixtral-8x22b/
- Instruct: https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1
- Base: https://huggingface.co/mistralai/Mixtral-8x22B-v0.1
Qwen 1.5
Better than Qwen 1 (although it will still revert to Chinese randomly in multi-turn dialogues). Works in llama.cpp.
- Blog post: https://qwenlm.github.io/blog/qwen1.5/
- GitHub: https://github.com/QwenLM/Qwen1.5
- 32B: https://huggingface.co/Qwen/Qwen1.5-32B
- 32B-Chat GGUF: https://huggingface.co/Qwen/Qwen1.5-32B-Chat-GGUF
- License: https://huggingface.co/Qwen/Qwen1.5-32B/blob/main/LICENSE
XGen-MM-phi3-mini-instruct-r-v1
Multimodal, small, non-commercial
Yi 1.5
https://huggingface.co/collections/01-ai/yi-15-2024-05-663f3ecab5f815a3eaca7ca8
Mistral 0.3
Function calling!
https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3
Qwen 2
New Mixtral-sized MoE.
Qwen/Qwen2-72B & Qwen/Qwen2-57B-A14B
Codestral Mamba 7B
7B, Mamba2 architecture, Apache 2 license.