Base Language Models

This is a non-exhaustive list of large language models that I find interesting. I’ve just started building it, so it’s very much incomplete. It contains only base models for now; not sure if/how I’ll include any finetunes.

Criteria for including a model:

  • It needs to be interesting to me personally; it doesn’t have to beat, by 0.1%, whatever Llama finetune topped the OpenLLM leaderboard at the time of its release.
  • It needs to be available for commercial usage (or be really really cool, like BakLLaVA). I need to be able to use it to generate code/ideas/whatever that may or may not generate revenue. That excludes models like Orca-2, which can only be used for “non-commercial, non-revenue generating, research purposes”.
  • I need to be able to run it locally, on a Mac M2 Max. This means either smaller models (13B tops), or something that runs in llama.cpp.

I’ll be referencing some comparisons as well:


Phi-2 is a 2.7 billion parameter Transformer-based model designed for tasks involving common sense reasoning, language understanding, logical reasoning, and Python code generation. It has been trained on a blend of NLP synthetic texts, filtered web data, and teaching-focused datasets. Despite its size, it goes head to head with other models of under 13 billion parameters.


Mamba-3B-SlimPJ is a state-space model developed as a collaboration between Together AI & Cartesia AI, rivaling the best Transformer architectures with linear scaling in sequence length and fast inference. It has been trained on 600B tokens from the SlimPajama dataset.

It matches the quality of other high-performing 3B parameter models but requires 17% fewer FLOPs. Its performance is validated against models like the BTLM-3B-8K and even some 7B Transformer models.
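The linear-scaling claim comes from the state-space formulation: instead of attending over the whole history, the model carries a fixed-size hidden state forward one step at a time. A minimal scalar sketch of that recurrence (coefficients and the diagonal/scalar simplification are illustrative, not Mamba’s actual parameterization):

```python
def ssm_scan(x, a, b, c):
    """Minimal diagonal state-space recurrence:
    h[t] = a * h[t-1] + b * x[t],  y[t] = c * h[t].
    One pass over the sequence: O(len(x)) time, O(1) state,
    which is why decoding cost doesn't grow with context length."""
    h, ys = 0.0, []
    for xt in x:
        h = a * h + b * xt  # fixed-size state, updated in place
        ys.append(c * h)
    return ys
```

With `a = 0.5`, an input impulse decays geometrically through the state, so `ssm_scan([1.0, 0.0, 0.0], 0.5, 1.0, 2.0)` yields a halving output sequence.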


DeciLM-7B is a 7-billion-parameter language model positioned as offering high throughput and accuracy for its parameter class. The model is licensed under Apache 2.0 and has achieved an average score of 61.55 on the Open LLM Leaderboard.

Its architecture uses a feature called variable Grouped-Query Attention, an evolution of Multi-Query Attention designed to strike a more efficient balance between computational speed and model accuracy. This architectural decision came out of an AutoNAC-driven design process.
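For intuition: grouped-query attention sits between multi-head attention (every query head has its own K/V head) and multi-query attention (all query heads share one), letting several query heads share a K/V head. A toy NumPy sketch, with illustrative shapes and no relation to DeciLM’s actual implementation:

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """Toy GQA: n_heads query heads share n_kv_heads K/V heads.
    q: (n_heads, seq, d); k, v: (n_kv_heads, seq, d)."""
    n_heads, seq, d = q.shape
    group = n_heads // n_kv_heads  # query heads per shared K/V head
    out = np.empty_like(q)
    for h in range(n_heads):
        kv = h // group  # which shared K/V head this query head reads
        scores = q[h] @ k[kv].T / np.sqrt(d)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax rows
        out[h] = weights @ v[kv]
    return out
```

Setting `n_kv_heads=1` degenerates to multi-query attention; `n_kv_heads == n_heads` is standard multi-head. The “variable” part of DeciLM’s scheme, as I understand it, is that the group size differs per layer.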

The DeciLM-7B-instruct finetune does worse than Mistral-7B-Instruct-v0.2 on /u/WolframRavenwolf’s tests.


StripedHyena, introduced by Together Research, showcases a new architecture designed for long context, with improved training and inference performance over the traditional Transformer architecture. It combines multi-head, grouped-query attention with gated convolutions arranged in Hyena blocks. The model is suited to processing long prompts and has been trained on sequences of up to 32k tokens.

StripedHyena is the first model competitive with leading open-source Transformers for both short and long-context tasks. It introduces a hybrid architecture that includes state-space models for constant memory decoding and achieves low latency, faster decoding, and higher throughput. It also improves upon training and inference scaling laws compared to optimized Transformer models. Additionally, the model has been optimized with new model grafting techniques and trained on a mix of the RedPajama dataset and longer-context data.


Mistral 7B is a powerful and efficient language model known for outperforming more sizable models on various benchmarks. It has specialized attention mechanisms for improved performance and inference speed.

Mistral 7B can handle longer sequences at a smaller cost thanks to its Sliding Window Attention (SWA) and grouped-query attention (GQA), which allow for faster inference. Interestingly, it performs almost as well as much larger models, demonstrating a better cost/performance balance. SWA is cool because while Mistral only attends to the last 4k tokens of context directly, each of those tokens attended to the 4k before it, and so on. See the discussion here.
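That stacking effect is easy to quantify: with a window of W tokens and L layers, information can propagate back roughly W·L tokens. A tiny sketch (the 4k window is from the model description; the layer count below is just an example value):

```python
def swa_receptive_field(pos, window, n_layers):
    """Earliest position that can still influence token `pos` after
    n_layers of sliding-window attention: each layer lets information
    hop back up to `window` tokens, so reach grows linearly with depth."""
    return max(0, pos - window * n_layers)

def sliding_window_mask(seq_len, window):
    """True where query i may attend key j: causal, and within the
    last `window` positions (inclusive of i itself)."""
    return [[max(0, i - window + 1) <= j <= i for j in range(seq_len)]
            for i in range(seq_len)]
```

So with a 4096-token window, even a couple of layers already see well past the raw window, and a deep stack covers a very long context indirectly.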


I’ve been following RWKV for quite some time (a large context size coupled with fast inference tends to get one’s attention), and Eagle-7B, their latest release, looks very nice. It’s a 7.52 billion parameter model based on the RWKV-v5 architecture, meaning it’s a linear transformer with significantly lower inference costs than traditional transformers.

It has been trained on 1.1 trillion tokens across more than 100 languages and aims to provide strong multi-lingual performance and reasonable English performance, approaching the benchmark set by other larger models.

Mixtral 8x7B

A high-quality sparse mixture-of-experts model that outperforms Llama 2 70B and GPT-3.5 on several benchmarks, handles multiple languages, and performs well at code generation and instruction following.

The model’s ability to handle an extensive context of 32k tokens positions it for handling complex tasks that require processing large inputs. Mixtral is reported to excel not only in language tasks across English, French, Italian, German, and Spanish but also in specialized domains such as code generation.

Mistral used both supervised fine-tuning and direct preference optimization for the instruction-tuned variant, which achieves noteworthy results on MT-Bench, indicating its capacity for understanding and executing detailed instructions with precision.
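A sparse MoE layer only activates a few experts per token, which is how Mixtral gets large-model quality at a fraction of the per-token compute (it routes each token to 2 of its 8 experts). A toy sketch of top-2 routing; the shapes, gate, and expert functions are illustrative, not Mixtral’s actual code:

```python
import numpy as np

def moe_top2(x, gate_w, experts):
    """Toy sparse MoE layer: route each token to its top-2 experts and
    mix their outputs with softmax-renormalized gate weights.
    x: (tokens, d); gate_w: (d, n_experts); experts: list of callables."""
    logits = x @ gate_w                         # router scores per token
    top2 = np.argsort(logits, axis=-1)[:, -2:]  # indices of the 2 best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        picked = logits[t, top2[t]]
        w = np.exp(picked - picked.max())
        w /= w.sum()                            # softmax over the 2 picked only
        for weight, e in zip(w, top2[t]):
            out[t] += weight * experts[e](x[t])  # only 2 experts ever run
    return out
```

The key point is the last loop: per token, only two expert networks execute, so active parameters per token stay far below the model’s total parameter count.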


The Yi series models, developed by 01.AI, are next-generation open-source bilingual large language models (LLMs).

Multimodals (heh)


BakLLaVA-1 is a text generation model created by SkunkworksAI. It is based on the Mistral 7B model and enhanced with the LLaVA 1.5 architecture for multimodal capabilities. The model is designed to outperform Llama 2 13B on several benchmarks. It is open-source but uses certain non-commercially permissive data, which the creators plan to address in an upcoming release (hello, BakLLaVA-2!) that will use a larger, commercially viable dataset and a novel architecture.


Yi Visual Language (Yi-VL) model is an open-source, multimodal model capable of content comprehension, recognition, and multi-round conversations about images in both English and Chinese. It’s the multimodal version of the Yi Large Language Model (LLM) series. Available in 6B and 34B(!) flavors.


LLaVA-1.6 is a large multimodal model (LMM) with improvements in reasoning, optical character recognition (OCR), and world knowledge over its predecessor, LLaVA-1.5. It’s designed for high-resolution visual inputs and has been optimized for better visual reasoning, conversation, and efficient deployment with SGLang.

It offers state-of-the-art (SoTA) performance on several benchmarks, even surpassing commercial models like Gemini Pro in some tests. Four models are available, two of them based on Vicuna, one on Mistral (does this make BakLLaVA-1 obsolete? 🤔), and one on Nous-Hermes-2-Yi :).


Models I haven’t had the time to look at/process yet, but that sound interesting.

Cohere’s Command R & Command R+

Strong models with a non-commercial license. 35B and 104B parameters, released on 11 March 2024 and 4 April 2024 respectively. 128k-token context window. They do well on the Chatbot Arena.


Strong 7B model. Placed just under Claude-2.1 in the LMSYS Chatbot Arena, which is crazy.

Mi(s|x)tral 8x22B

No info yet, except a magnet link (and the fact that most people won’t be able to run this locally due to its size).

Qwen 1.5

Better than Qwen 1 (although it will still revert to Chinese randomly in multi-turn dialogues). Works in llama.cpp.


Multimodal, small, non-commercial

Yi 1.5

Mistral 0.3

Function calling!

Qwen 2

New Mistral-sized MoE.

Qwen/Qwen2-72B & Qwen/Qwen2-57B-A14B