Vlad's Awesome Generative AI Compendium

Base Language Models

This is a non-exhaustive list of large language models that I find interesting. I’ve just started building it so it’s very much incomplete. Contains only base models for now, not sure if/how I’ll include any finetunes.

Criteria for including a model:

It needs to be interesting to me personally, not necessarily beat by 0.1% whatever llama finetune was on top of OpenLLM leaderboard at the time of its release.
It needs to be available for commercial usage (or be really really cool, like BakLLaVA). I need to be able to use them to generate code/ideas/whatever that may or may not generate revenue. That excludes models like Orca-2 which can only be used for “non-commercial, non-revenue generating, research purposes”.
I need to be able to run it locally, on a Mac M2 Max. This means either smaller models (13B tops), or something that runs in llama.cpp.

I’ll be referencing some comparisons as well:

LMSYS Chatbot Arena, specifically its ELO ratings: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
Open LLM Leaderboard, although it’s just so messy: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
/u/WolframRavenwolf’s threads are gold: https://www.reddit.com/user/WolframRavenwolf/

Phi-2

Phi-2 is a 2.7 billion parameter Transformer-based model designed for tasks involving common sense reasoning, language understanding, logical reasoning, and Python code generation. It has been trained on a blend of NLP synthetic texts, filtered web data, and teaching-focused datasets. Plus, it can go head to head with other models that have less than 13 billion parameters.

Interesting because: It’s tiny and seems quite capable! This means it can be further (cheaply) improved by finetuning, and will run on a lot of devices. Seems great for experimenting as well
Released on: 2023-12-12.
Parameters: 2.7B
Context size: 2k
License: MIT License (used to be non-commercial but they changed it, kudos to MS for that)
Blog: https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/
Model weights: https://huggingface.co/microsoft/phi-2

Mamba-3B-SlimPJ

Mamba-3B-SlimPJ is a state-space model developed as a collaboration between Together AI & Cartesia AI, rivaling the best Transformer architectures with linear scaling in sequence length and fast inference. It has been trained on 600B tokens from the SlimPajama dataset.

It matches the quality of other high-performing 3B parameter models but requires 17% fewer FLOPs. Its performance is validated against models like the BTLM-3B-8K and even some 7B Transformer models.

Interesting because: Small, and does almost as well as StableLM-3B-4E1T which was trained on 7 times more tokens.
Released on: 2023-12-12
Parameters: 2.8 Billion
Context size: 2k
License: Apache 2.0
Blog: https://www.together.ai/blog/mamba-3b-slimpj
Model weights: https://huggingface.co/state-spaces/mamba-2.8b-slimpj
GitHub repo: https://github.com/state-spaces/mamba

DeciLM-7B

DeciLM-7B is a language model with a capacity of 7 billion parameters, which is positioned as having high throughput and accuracy within its parameter class. The model is licensed under Apache 2.0 and has achieved an average score of 61.55 on the Open LLM Leaderboard.

Its architecture utilizes a feature called variable Grouped Query Attention, which is an evolution from Multi-Query Attention, designed to offer a more efficient balance between computational speed and model accuracy. This architectural decision was facilitated by an AutoNAC-driven design process.

The DeciLM-7B-instruct finetune does worse than Mistral-7B-Instruct-v0.2 on /u/WolframRavenwolf’s tests.

Interesting because: They claim it’s better than Mistral 7B, which is already an achievement. Seems like it’s faster, too.
Released on: 2023-12-12
Parameters: 7.04 Billion
Context size: 8k
License: Apache-2.0
Blog: https://deci.ai/blog/introducing-decilm-7b-the-fastest-and-most-accurate-7b-large-language-model-to-date/
Model weights:
- Base: https://huggingface.co/Deci/DeciLM-7B
- Instruct, finetuned using SlimOrca: https://huggingface.co/Deci/DeciLM-7B-instruct
Finetuning script: https://colab.research.google.com/drive/1zsDQdlj-ry8MuP5E4qmpFeVf3_hRI1EL?usp=sharing

StripedHyena-7B

A model introduced by Together Research, showcasing a new architecture designed for long context, improved training, and inference performance over the traditional Transformer architecture. It combines multi-head, grouped-query attention and gated convolutions arranged in Hyena blocks. This model is suitable for processing long prompts and has been trained on sequences of up to 32k.

StripedHyena is the first model competitive with leading open-source Transformers for both short and long-context tasks. It introduces a hybrid architecture that includes state-space models for constant memory decoding and achieves low latency, faster decoding, and higher throughput. It also improves upon training and inference scaling laws compared to optimized Transformer models. Additionally, the model has been optimized with new model grafting techniques and trained on a mix of the RedPajama dataset and longer-context data.

Interesting because: Seems to work well and it’s not based on the mighty Transformer. Huge context size.
Released on: 2023-12-08
Parameters: 7.65 Billion
Context size: 128k (?)
License: Apache-2.0
Blog: https://www.together.ai/blog/stripedhyena-7b
Model weights:
- Base: https://huggingface.co/togethercomputer/StripedHyena-Hessian-7B
- Instruct: https://huggingface.co/togethercomputer/StripedHyena-Nous-7B
Minimal implementation: https://github.com/togethercomputer/stripedhyena

Mistral-7B

Mistral 7B is a powerful and efficient language model known for outperforming more sizable models on various benchmarks. It has specialized attention mechanisms for improved performance and inference speed.

Mistral 7B can handle longer sequences at a smaller cost due to its unique Sliding Window Attention (SWA) and Grouped-query attention (GQA), which allows for faster inference times. Interestingly, it performs almost as well as much larger models, effectively demonstrating a better cost/performance balance. SWA is cool because it means that while Mistral only looks at the last 4k tokens of context, each of those tokens looked at the 4k before it and so on. See the discussion here.

Interesting because: Ever since it was released it became the small model to beat
Released on: 2023-09-27
Parameters: 7.3 billion
Context size: 8k
License: Apache 2.0
Blog: https://mistral.ai/news/announcing-mistral-7b/
Model weights:
- Base: Mistral-7B-v0.1
- Instruct: Mistral-7B-Instruct-v0.1

Eagle-7B

I’ve been following RWKV for quite some time (large context size coupled with fast inference tend to get one’s attention) and Eagle-7B, their latest release, looks very nice.It’s a 7.52 billion parameter model based on the RWKV-v5 architecture, meaning it’s a linear transformer with significantly lower inference costs compared to traditional transformers.

It has been trained on 1.1 trillion tokens across more than 100 languages and aims to provide strong multi-lingual performance and reasonable English performance, approaching the benchmark set by other larger models.

Interesting because: Its computation cost scales linearly and not quadratically! Also, good multi-lingual performance
Released on: 2024-01-29
Parameters: 7.52 billion
Context size: Large ’n Confusing, see this too
License: Apache 2.0
Blog: https://blog.rwkv.com/p/eagle-7b-soaring-past-transformers
Model weights: https://huggingface.co/RWKV/v5-Eagle-7B
Gradio Demo: https://huggingface.co/spaces/BlinkDL/RWKV-Gradio-2
PIP package: https://pypi.org/project/rwkv/
RWKV.cpp: https://github.com/saharNooby/rwkv.cpp

Mixtral 8x7B

A high-quality sparse mixture of experts model that outperforms Llama 2 70B and GPT3.5 on several benchmarks, with capabilities in handling multiple languages and performing well in code generation and instruction following.

The model’s ability to handle an extensive context of 32k tokens positions it for handling complex tasks that require processing large inputs. Mixtral is reported to excel not only in language tasks across English, French, Italian, German, and Spanish but also in specialized domains such as code generation.

Mistral used both supervised fine-tuning and direct preference optimization for the instruction-tuned variant, which achieves noteworthy results on MT-Bench, indicating its capacity for understanding and executing detailed instructions with precision.

Interesting because: It’s as fast as a 13B model, even though it takes 3+ times as much memory. Highest ranked open model on LMSYS’ ChatBot Arena, on par with GPT-3.5
Released on: 2023-12-11
Parameters: 46.7B total (memory), 12.9B used per token (speed)
Context size: 32k
License: Apache 2.0
Blog: https://mistral.ai/news/mixtral-of-experts/
Model weights:
- Base: https://huggingface.co/mistralai/Mixtral-8x7B-v0.1
- Instruct: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
/u/WolframRavenwolf tests for Mixtral-8x7B-Instruct-v0.1: https://www.reddit.com/r/LocalLLaMA/comments/18gz54r/llm_comparisontest_mixtral8x7b_mistral_decilm/
In-depth performance comparisons between Gemini Pro, GPT 3.5 Turbo, GPT 4 Turbo, and Mixtral. Mixtral seems to do a bit worse than 3.5 Turbo.
- https://hub.zenoml.com/report/2773/Gemini%20Mathematics
- https://hub.zenoml.com/report/2575/Gemini%20BBH
- Full research here: https://arxiv.org/abs/2312.11444

Yi

The Yi series models, developed by 01.AI, are a next-generation open-source bilingual large language models (LLMs).

Interesting because: They’re quite strong (although not as strong as Mixtral), and they offer a 200k context variant.
Released on: 2023-11-02 to 23
Parameters: 6B and 34B respectively
Context size: 4k for all except the -200k base models which have a 200k context size
License: Yi Series Models Community License Agreement 2.1 (custom), with models being fully open for academic research and (apparently) free for commercial use once you apply for it.
Model weights:
Web: https://01.ai
GitHub repo: https://github.com/01-ai/Yi

Multimodals (heh)

BakLLaVA-1

BakLLaVA-1 is a text generation model created by SkunkworksAI. It is based on the Mistral 7B model and enhanced with the LLaVA 1.5 architecture for multimodal capabilities. The model is designed to outperform Llama 2 13B on several benchmarks. It is open-source but uses certain non-commercially permissive data which the creators plan to address in an upcoming release (hello BakLLaVA-2!), which will utilize a larger, commercially viable dataset and a novel architecture.

Interesting because: It can “see”! Plus it’s based on Mistral 7B so I expect it to be comparable to LLaVA 13B
Parameters: 7.3 billion
License: Well. They say it’s Apache 2.0, but since it’s based on the LLaVA dataset this means used to be not commercial, but the LLaVa people changed it in the meantime because life is change and nothing stays the same on the Internet
Blog: https://twitter.com/skunkworks_ai/status/1713372586225156392
Model weights: https://huggingface.co/SkunkworksAI/BakLLaVA-1
Model Repository on GitHub: https://github.com/SkunkworksAI/BakLLaVA
Running locally with llama.cpp
- GGUF files for llama.cpp including mmproj-model-f16: https://huggingface.co/mys/ggml_bakllava-1
- GGUF conversion instructions here: https://github.com/ggerganov/llama.cpp/tree/master/examples/llava
- Inference tutorial: https://advanced-stack.com/resources/multi-modalities-inference-using-mistral-ai-llava-bakllava-and-llama-cpp.html
- Run in llama.cpp server: ./server -m models/BakLLaVA-1/ggml-model-q5_1.gguf -c 8192 --mmproj models/BakLLaVA-1/mmproj-model-f16.gguf

Yi-VL

Yi Visual Language (Yi-VL) model is an open-source, multimodal model capable of content comprehension, recognition, and multi-round conversations about images in both English and Chinese. It’s the multimodal version of the Yi Large Language Model (LLM) series. Available in 6B and 34B(!) flavors.

Interesting because: The first open-weights 34B vision language model available!
Released on: 2024-01-21
Parameters: 6 and 34 billion
Context size: 4k (since it’s based on Yi-Chat)
License: Yi Series Models Community License Agreement 2.1 (custom), with models being fully open for academic research and (apparently) free for commercial use once you apply for it.
Model weights:
- https://huggingface.co/01-ai/Yi-VL-6B
- https://huggingface.co/01-ai/Yi-VL-34B
Web: https://01.ai
GitHub repo: https://github.com/01-ai/Yi

LLaVA-1.6

LLaVA-1.6 is a large multimodal model (LMM) with improvements in reasoning, optical character recognition (OCR), and world knowledge over its predecessor, LLaVA-1.5. It’s designed for high-resolution visual inputs and has been optimized for better visual reasoning, conversation, and efficient deployment with SGLang.

It offers state-of-the-art (SoTA) performance on several benchmarks, even surpassing commercial models like Gemini Pro in some tests. Four models are available, two of them based on Vicuna, one on Mistral (does this make BakLLaVA-1 obsolete? 🤔), and one on Nous-Hermes-2-Yi :).

Interesting because: Second 34B vision language model :). Also, it appears to do better than Yi-VL in tests.
Released on: 2024-01-30
Parameters: 7, 13, and 34 billion
Context size: 8k for Mistral, 4k for Nous-Hermes-2-Yi, no idea for Vicuna
License: Apache-2.0
Blog: https://llava-vl.github.io/blog/2024-01-30-llava-1-6/
Model weights:
- Mistral-7B: https://huggingface.co/liuhaotian/llava-v1.6-mistral-7b
- Nous-Hermes-2-Yi-34B: https://huggingface.co/liuhaotian/llava-v1.6-34b
- Vicuna:
  - https://huggingface.co/liuhaotian/llava-v1.6-vicuna-7b
  - https://huggingface.co/liuhaotian/llava-v1.6-vicuna-13b
WIP pull request for llama.cpp: https://github.com/ggerganov/llama.cpp/pull/5267

Watchlist

Models I haven’t had the time to look at/process, but sound interesting.

Cohere’s Command R & Command R+

Strong models. Non-commercial license. 35B and 102B parameters respectively. Released on 11 March 2024 and 04 April 2024 respectively. 128k tokens context window. They do well on Chatbot Arena

Prompt format: https://docs.cohere.com/docs/prompting-command-r
Command R: https://huggingface.co/CohereForAI/c4ai-command-r-v01
Command R Blog: https://txt.cohere.com/command-r/
Command R GGUF Quantizations: https://huggingface.co/andrewcanis/c4ai-command-r-v01-GGUF
Command R GGUF Quantizations Reddit Thread: https://www.reddit.com/r/LocalLLaMA/comments/1bft5qd/commandr_35b_open_weights_model_has_ggufllamacpp/
Command R+: https://huggingface.co/CohereForAI/c4ai-command-r-plus
Command R+ Blog: https://txt.cohere.com/command-r-plus-microsoft-azure/
Llama.cpp commit adding support for Command R: https://github.com/ggerganov/llama.cpp/commit/12247f4c69a173b9482f68aaa174ec37fc909ccf
llama-cpp-python issue: https://github.com/abetlen/llama-cpp-python/issues/1279

Starling-LM-7B-Beta

Strong 7B model. Placed just under Claude-2.1 in the LMSYS Chatbot Arena, which is crazy.

https://huggingface.co/Nexusflow/Starling-LM-7B-beta

Mi(s|x)tral 8x22B

No info yet, except a magnet link (and the fact that most people won’t be able to run this locally due to its size)

Blog post: https://mistral.ai/news/mixtral-8x22b/
Instruct: https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1
Base: https://huggingface.co/mistralai/Mixtral-8x22B-v0.1

Qwen 1.5

Better than Qwen 1 (although it will still revert to Chinese randomly in multi-turn dialogues). Works in llama.cpp.

Blog post: https://qwenlm.github.io/blog/qwen1.5/
GitHub: https://github.com/QwenLM/Qwen1.5
32B: https://huggingface.co/Qwen/Qwen1.5-32B
32B-Chat GGUF: https://huggingface.co/Qwen/Qwen1.5-32B-Chat-GGUF
License: https://huggingface.co/Qwen/Qwen1.5-32B/blob/main/LICENSE

XGen-MM-phi3-mini-instruct-r-v1

Multimodal, small, non-commercial

https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-r-v1

Yi 1.5

https://huggingface.co/collections/01-ai/yi-15-2024-05-663f3ecab5f815a3eaca7ca8

Mistral 0.3

Function calling!

https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3

Qwen 2

New Mistral-sized MOE.

Qwen/Qwen2-72B & Qwen/Qwen2-57B-A14B

Codestral Mamba 7B

7B, Mamba2 architecture, Apache 2 license.

Base Language Models#

Phi-2#

Mamba-3B-SlimPJ#

DeciLM-7B#

StripedHyena-7B#

Mistral-7B#

Eagle-7B#

Mixtral 8x7B#

Yi#

Multimodals (heh)#

BakLLaVA-1#

Yi-VL#

LLaVA-1.6#

Watchlist#

Cohere’s Command R & Command R+#

Starling-LM-7B-Beta#

Mi(s|x)tral 8x22B#

Qwen 1.5#

XGen-MM-phi3-mini-instruct-r-v1#

Yi 1.5#

Mistral 0.3#

Qwen 2#

Codestral Mamba 7B#