Creating a 2M Parameter Thinking LLM from Scratch Using Python
The world of language models has often been associated with massive, resource-intensive systems like GPT-4 or PaLM, boasting billions of parameters and trained on enormous datasets. However, a new wave of small yet capable language models is gaining attention, inspired by reasoning-focused systems such as OpenAI's o3 and DeepSeek-R1. With careful design, a model with fewer than 2 million parameters can still show surprising competence at step-by-step reasoning and simple text-generation tasks.
In this blog post, we’ll explore how you can build a 2M parameter “thinking” LLM entirely from scratch, using Python and open-source tools — without requiring a supercomputer.

Why Build a Small Language Model?
While massive LLMs capture headlines, small models come with their own powerful advantages:
- ✅ Fast Training – Small models can be trained in hours instead of days.
- ✅ Lower Costs – You don’t need expensive GPUs or cloud infrastructure.
- ✅ Deploy Anywhere – Perfect for edge devices, mobile apps, or offline use.
- ✅ Customizable – You can fine-tune for specific use cases without large datasets.
Most importantly, they offer an excellent learning opportunity. You’ll understand the inner workings of transformers, attention mechanisms, and training dynamics — all while building a usable AI system.
The Core Idea: Transformers at a Small Scale
Even with just 2 million parameters, a well-designed transformer can learn how to generate, complete, and even reason through text. These compact models use the same building blocks as their larger cousins:
- Token and Position Embeddings: Turning text into a form the model can understand.
- Self-Attention Mechanisms: Allowing the model to “focus” on relevant words.
- Layer Norms and Feedforward Layers: Stabilizing training and transforming each token's representation.
- Final Prediction Head: Producing the next-word predictions or outputs.
The key is to scale down wisely — using smaller embeddings, fewer attention heads, and shallower layers — while still retaining the core structure.
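To make this concrete, here is a minimal sketch of such a scaled-down decoder-only model in PyTorch. The specific sizes (a 4,096-token vocabulary, 128-dimensional embeddings, 4 heads, 4 layers) are illustrative assumptions, not the only workable configuration.

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """One scaled-down transformer decoder block (illustrative sizes)."""
    def __init__(self, d_model=128, n_heads=4, d_ff=256, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, causal_mask):
        # Self-attention with a causal mask so each token only "sees" the past
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out
        # Position-wise feedforward layer
        x = x + self.ff(self.norm2(x))
        return x

class TinyLM(nn.Module):
    """Token + position embeddings -> a few blocks -> final prediction head."""
    def __init__(self, vocab_size=4096, d_model=128, n_layers=4, max_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList([TinyBlock(d_model) for _ in range(n_layers)])
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        # True above the diagonal = future positions are masked out
        mask = torch.triu(torch.ones(T, T, device=idx.device, dtype=torch.bool), diagonal=1)
        for block in self.blocks:
            x = block(x, mask)
        return self.head(self.norm(x))  # next-token logits
```

With these assumed sizes, the full model lands at roughly 1.6 million parameters, comfortably under the budget.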
How to Keep It Under 2 Million Parameters
To stay within the 2M parameter limit, everything must be optimized:
- Use a smaller vocabulary size, especially if focusing on a specific domain.
- Limit the number of layers and attention heads.
- Choose compact embedding dimensions.
- Reduce the size of the internal feedforward layers.
Even with these constraints, your model can still learn to reason, generate text, and perform tasks if trained properly.
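Because the budget is tight, it helps to estimate the parameter count before training. The helper below is a rough back-of-the-envelope sketch for a hypothetical configuration; tying the output head to the token embedding (a common trick) removes the single largest duplicate cost.

```python
def count_params(vocab_size=4096, d_model=128, n_layers=4, d_ff=256,
                 max_len=256, tie_embeddings=True):
    """Rough parameter estimate for a small decoder-only transformer."""
    emb = vocab_size * d_model + max_len * d_model        # token + position embeddings
    attn = 4 * d_model * d_model + 4 * d_model            # Q, K, V, output projections (+ biases)
    ff = 2 * d_model * d_ff + d_model + d_ff              # two feedforward linear layers (+ biases)
    norms = 2 * 2 * d_model                               # two LayerNorms per block
    per_block = attn + ff + norms
    head = 0 if tie_embeddings else vocab_size * d_model  # final prediction head
    final_norm = 2 * d_model
    return emb + n_layers * per_block + head + final_norm

if __name__ == "__main__":
    total = count_params()
    print(f"Estimated parameters: {total:,}")  # roughly 1.1M with tied embeddings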
Data: The Most Crucial Ingredient
What your model learns depends heavily on what it’s fed. Here’s what to keep in mind:
- Focus on high-quality, domain-specific data.
- Use clean, well-structured examples for reasoning.
- Chain-of-thought (CoT) style training helps the model learn to “think” step-by-step. For example:
Q: If you have 3 apples and give away 1, how many are left?
A: You started with 3. You gave away 1. That leaves 2.
Answer: 2.
Even small models can learn this kind of logic — if trained on thousands of similar examples.
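As a concrete illustration, only a little code is needed to flatten question/reasoning/answer triples into plain-text training samples in this Q/A/Answer style. The field names and the end-of-example marker below are assumptions for this sketch, not a fixed standard.

```python
# Hypothetical chain-of-thought examples; a real dataset would contain thousands.
cot_examples = [
    {
        "question": "If you have 3 apples and give away 1, how many are left?",
        "reasoning": "You started with 3. You gave away 1. That leaves 2.",
        "answer": "2",
    },
    {
        "question": "What is 4 + 5?",
        "reasoning": "4 plus 5 equals 9.",
        "answer": "9",
    },
]

def format_example(ex):
    """Flatten one example into a single training string with explicit markers."""
    return (
        f"Q: {ex['question']}\n"
        f"A: {ex['reasoning']}\n"
        f"Answer: {ex['answer']}\n"
        "<|end|>\n"  # assumed end-of-example token added to the vocabulary
    )

corpus = "".join(format_example(ex) for ex in cot_examples)
print(corpus)
```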