August 2, 2025

Small Language Models (SLM): Better Results with Less Computing Power

Small language models are changing the AI landscape by delivering great results with far fewer resources. Large language models pack hundreds of billions or even trillions of parameters, while SLMs work with just 1 million to 10 billion parameters. This huge size difference makes SLMs perfect for environments where resources are limited.

Phi-3-mini stands out as one of the newest SLMs with just 3.8 billion parameters. This compact model matches much bigger models like Mixtral 8x7B and GPT-3.5, scoring 69% on MMLU and 8.38 on MT-Bench. SLMs need less memory and computing power, which makes them ideal for edge devices and mobile apps. They reach this level of performance through techniques such as knowledge distillation, pruning, and quantization. DistilBERT shows the payoff well – it’s 40% smaller and 60% faster than BERT while keeping 97% of its language understanding capabilities.

This piece will get into what makes small language models tick, how they stack up against larger models, and why they matter for practical AI deployment. We’ll look at the best open source small language models available now and explore their strengths and limitations in real-world applications.

What Are Small Language Models and Why They Matter

Small Language Models (SLMs) mark a significant shift in AI development. These models prioritize efficiency and specialized performance over sheer size. The compact neural networks can process, understand, and generate natural language while using far fewer resources than larger models.

SLM parameter range: 1M to 10B

Language models learn through internal variables called parameters – weights and biases. SLMs typically use 1 million to 10 billion parameters. Large language models (LLMs) dwarf these numbers with hundreds of billions or even trillions of parameters.
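To make the term concrete, here is a tiny sketch of how parameters are counted for a single layer; the layer size is an arbitrary illustration, not taken from any specific SLM.

```python
import torch.nn as nn

# Illustrative layer size, not from any particular model
layer = nn.Linear(4096, 4096)           # one feed-forward projection
weights = layer.weight.numel()          # 4096 * 4096 = 16,777,216 weights
biases = layer.bias.numel()             # 4,096 biases
print(f"{weights + biases:,} parameters in a single layer")
```

Stacking dozens of such layers, plus attention projections and embedding tables, is how parameter counts climb into the billions.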

Several models showcase this range:

  • Phi-3 Mini runs on 3.8 billion parameters
  • Gemma models work with 2 billion and 7 billion parameters
  • Mistral’s compact model uses 7.3 billion parameters
  • Meta’s Llama 3.2 comes with 1 billion and 3 billion parameter versions

These sizes have practical consequences. Models with about 8 billion parameters can be trained on a single consumer-grade NVIDIA RTX 4090 GPU with 24GB of memory. Major LLM providers like Amazon and Microsoft commonly offer 7B models as their standard options.

Transformer-based architecture in SLMs

SLMs use the same transformer-based neural network architecture that powers bigger models. This architecture now forms the foundation of modern natural language processing and serves as the building block for models like GPT.

The transformer architecture works through several key parts:

  • Encoders convert input text into numerical representations called embeddings
  • A self-attention mechanism helps the model focus on relevant words regardless of their position
  • Decoders use these mechanisms to generate statistically probable output sequences

Self-attention helps SLMs grasp context. They can figure out whether “Paris” refers to the city or a person named Paris, which makes them work surprisingly well despite their smaller size.
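The sketch below shows the core self-attention computation in PyTorch; the tensor sizes are illustrative assumptions, and production models add multiple heads, masking, and learned projections inside much larger layers.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of token embeddings.

    x:             (seq_len, d_model) token embeddings
    w_q, w_k, w_v: (d_model, d_head) projection matrices
    """
    q = x @ w_q                                # queries
    k = x @ w_k                                # keys
    v = x @ w_v                                # values
    scores = q @ k.T / (q.shape[-1] ** 0.5)    # how strongly each token attends to every other
    weights = F.softmax(scores, dim=-1)        # attention weights sum to 1 per token
    return weights @ v                         # context-aware representations

# Illustrative sizes: 6 tokens, 64-dim embeddings, 16-dim attention head
x = torch.randn(6, 64)
w_q, w_k, w_v = (torch.randn(64, 16) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)         # shape: (6, 16)
```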

SLMs vs small LLMs: terminology clarification

The industry hasn’t settled on consistent terms. Some experts challenge the phrase “Small Language Model” since a billion parameters isn’t exactly “small” by normal computing standards. Others suggest “Small Large Language Model,” though that sounds awkward.

The real difference often comes down to how these models train:

  • Models optimized for specific domains usually count as SLMs
  • Scaled-down versions of general-purpose models might be called small LLMs

One researcher points out that these models “are small only when matched against the large ones”. This relative comparison explains why different sources might group models differently.

“Small Language Model” (SLM) has become the standard term for models with fewer than 10 billion parameters. Some research papers include models up to 13 billion parameters as special cases.

SLMs matter beyond just being smaller. Their efficient design lets them run on resource-constrained hardware such as edge devices and mobile phones. This opens up AI integration possibilities that weren’t practical before because of hardware limits or connectivity requirements.

How Small Language Models Are Built Efficiently

Building efficient small language models requires specialized techniques that cut computing requirements while preserving capability. Right now, four main engineering approaches help turn bigger models into smaller, more efficient ones.

Knowledge distillation from teacher to student models

Knowledge distillation passes expertise from a bigger pre-trained model (the “teacher”) to a compact model (the “student”). The student model learns from the teacher’s refined insights instead of processing huge amounts of raw data directly. This helps the student capture the teacher’s core abilities without all the complexity.

The student model learns to copy both the teacher’s final outputs and, in some setups, its intermediate representations. Distillation uses a special loss function that measures how closely the student’s probability distribution matches the teacher’s. This method works remarkably well – one example showed a distilled model that outperformed previous baselines while cutting training time by about 70% and using 25% fewer parameters.
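A minimal sketch of a distillation loss of this kind, using temperature-scaled soft targets; the temperature, weighting, and tensor sizes are illustrative assumptions, not values from any cited experiment.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine a soft loss (match the teacher's probabilities) with the usual hard loss.

    T     softens both distributions so the student sees the teacher's full ranking
    alpha balances imitating the teacher against fitting the true labels
    """
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                  # standard temperature-squared scaling
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Illustrative batch: 4 examples, a vocabulary of 10 classes
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```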

Pruning redundant weights and neurons

Pruning finds and removes the less important connections or neurons in the network. This simple process cuts out extra parts that don’t help the model’s performance much. Taking out these unnecessary pieces makes the model run faster and use resources better.

There are two main ways to prune:

  • Weight pruning: Takes out individual weights that don’t contribute much to outputs
  • Structured pruning: Removes whole layers, neurons, or channels

Structured width pruning works especially well on MLP layers, which account for more than 50% of a model’s parameters, and can substantially shrink model size while keeping outputs coherent. Models are usually fine-tuned after pruning to recover any lost accuracy.
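A minimal sketch of both pruning styles using PyTorch’s built-in torch.nn.utils.prune utilities; the layer size and pruning amounts are illustrative, and real pipelines typically fine-tune afterwards to recover accuracy.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Illustrative MLP layer; real SLM layers are far larger
layer = nn.Linear(512, 2048)

# Weight pruning: zero out the 40% of individual weights with the smallest magnitude
prune.l1_unstructured(layer, name="weight", amount=0.4)

# Structured pruning: remove whole output neurons (rows of the weight matrix) by L2 norm
prune.ln_structured(layer, name="weight", amount=0.25, n=2, dim=0)

# Fold the pruning masks into the weight tensor permanently
prune.remove(layer, "weight")

sparsity = (layer.weight == 0).float().mean().item()
print(f"fraction of zeroed weights: {sparsity:.2f}")
```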

Quantization: 8-bit vs 32-bit precision

Quantization cuts down precision by changing high-precision data (usually 32-bit floating point or FP32) to simpler formats like 8-bit integers (INT8). This maps the many possible FP32 values to a much smaller set in the lower-precision format.

Converting FP32 to INT8 means finding the best way to map the original values onto just 256 possible values in 8-bit form. This can be done in two ways:

  • Symmetric quantization: Maps values evenly around zero
  • Asymmetric quantization: Maps based on minimum and maximum values

The benefits add up quickly – quantized models need less storage, use less power, and perform matrix math faster with integer arithmetic. This lets them run on simple devices that support only integer operations.
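A minimal numpy sketch of the two mapping schemes; the scale and zero-point formulas follow the standard affine INT8 quantization recipe, and the weight matrix is a random placeholder.

```python
import numpy as np

def quantize_symmetric(x):
    """Map FP32 values to INT8 symmetrically around zero."""
    scale = np.abs(x).max() / 127.0                         # largest magnitude maps to +/-127
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def quantize_asymmetric(x):
    """Map FP32 values to UINT8 using the min/max range and a zero-point offset."""
    scale = (x.max() - x.min()) / 255.0
    zero_point = np.round(-x.min() / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

weights = np.random.randn(4, 4).astype(np.float32)

q_sym, s = quantize_symmetric(weights)
dequant = q_sym.astype(np.float32) * s                      # dequantize to check the rounding error
print("max symmetric error:", np.abs(weights - dequant).max())

q_asym, s, zp = quantize_asymmetric(weights)
dequant = (q_asym.astype(np.float32) - zp) * s
print("max asymmetric error:", np.abs(weights - dequant).max())
```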

Low-rank factorization for matrix simplification

Low-rank factorization breaks down big weight matrices into smaller, simpler ones while keeping their key functions. This creates compact versions with fewer parameters, which cuts down on calculations and makes complex matrix operations easier.

The usual method uses Singular Value Decomposition (SVD) to approximate matrices with fewer parameters. Standard SVD just tries to rebuild the original matrix without thinking about which parameters matter most for accuracy. Newer approaches like Fisher-Weighted SVD look at Fisher information to weigh parameter importance, so models stay accurate even with more compression.
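A minimal sketch of plain SVD-based factorization (the unweighted baseline, not the Fisher-weighted variant); the matrix size and rank are illustrative.

```python
import numpy as np

def low_rank_factorize(W, rank):
    """Approximate W (m x n) as the product of two thin matrices A (m x r) and B (r x n)."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]          # absorb the singular values into the left factor
    B = Vt[:rank, :]
    return A, B

# Illustrative 1024 x 1024 weight matrix factorized at rank 64
W = np.random.randn(1024, 1024)
A, B = low_rank_factorize(W, rank=64)

original_params = W.size                  # 1,048,576
compressed_params = A.size + B.size       # 131,072 -> 8x fewer parameters
error = np.linalg.norm(W - A @ B) / np.linalg.norm(W)
# A random matrix compresses poorly; trained weight matrices have faster-decaying spectra
print(original_params, compressed_params, round(error, 3))
```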

Recent advances include Learning to Low-Rank Compress (LLRC), which learns masks to pick singular values during compression. Using only 3,000 calibration documents, this works better than other methods across compression rates for common-sense reasoning and question-answering tasks.

Best Small Language Models in 2024

Small language models evolved rapidly in 2024. Several now deliver exceptional capabilities at a fraction of the computational cost of much larger models.

Phi-3 Mini (3.8B) for reasoning and code generation

Microsoft’s Phi-3 Mini represents a breakthrough in small language model design with just 3.8 billion parameters. This lightweight powerhouse excels at reasoning tasks, especially mathematics and logical problem-solving. The model achieves 69.7% accuracy on the MMLU benchmark, which reflects robust language understanding. It supports context lengths up to 128K tokens, making it well suited to processing large documents or code bases. The model outperforms many models twice its size, especially on reasoning and code generation tasks.
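As a rough illustration of running an SLM like this locally, here is a sketch using the Hugging Face transformers library; the checkpoint name microsoft/Phi-3-mini-4k-instruct and the generation settings are assumptions on our part, and older transformers releases may additionally require trust_remote_code=True.

```python
# pip install transformers accelerate torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"   # assumed Hugging Face checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Explain step by step why 97 is a prime number."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=200)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```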

Gemma 2B and 7B for multilingual tasks

Google DeepMind’s Gemma models excel at multilingual applications across computing environments of all types. They come in 2 billion and 7 billion parameter configurations and deliver state-of-the-art performance for their size class. Gemma 2B combines a compact footprint with strong conversational AI capabilities, while the 7B variant offers more language processing power and still runs on standard hardware.

Llama 3.2-1B for mobile and edge devices

Meta’s Llama 3.2-1B marks a major advance for on-device AI applications. The model supports a large 128K token context length and creates tailored, privacy-focused experiences where data stays on the device. Llama 3.2-1B performs well at summarization, instruction following, and rewriting tasks while running locally. Optimized implementations on Arm-powered mobile devices deliver 5x improvement in prompt processing and 3x improvement in token generation.

Mistral 7B and 8B for general-purpose NLP

Mistral AI’s models have become skilled performers on general NLP tasks. Mistral 7B outperforms Llama 2 13B across all benchmarks, which shows exceptional efficiency. The model’s architecture uses grouped-query attention and sliding-window attention to process longer sequences faster. Mixtral 8x7B employs a sparse mixture-of-experts approach that activates 12.9B parameters per token despite having 46.7B total parameters. Both models work with English and other languages including French, Italian, German, and Spanish.
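To illustrate the sparse mixture-of-experts idea (why only about 12.9B of 46.7B parameters are active per token), here is a toy top-2 routing layer in PyTorch; the dimensions, expert count, and expert structure are illustrative and far smaller than Mixtral’s actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Route each token to its top-2 experts; the other experts stay idle for that token."""
    def __init__(self, d_model=64, d_hidden=256, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                   # x: (n_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        weights, idx = gate.topk(self.top_k, dim=-1)        # top-2 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                       # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

moe = SparseMoE()
tokens = torch.randn(10, 64)
print(moe(tokens).shape)   # (10, 64): all 8 experts exist, but each token used only 2
```

Each token’s output is a weighted mix of just two expert MLPs, so the compute per token scales with the active experts rather than the full parameter count.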

SmolLM2-1.7B for education and math tasks

HuggingFace’s SmolLM2-1.7B shows that even smaller models can excel in specialized domains. Trained on 11 trillion tokens, the model shows better instruction following, knowledge, reasoning, and mathematics than its predecessor. The math-specialized variant solves complex mathematical problems impressively, which makes it a great tool for educational applications that need step-by-step explanations.
