{"id":137,"date":"2025-08-20T06:16:05","date_gmt":"2025-08-20T06:16:05","guid":{"rendered":"https:\/\/hattussa.com\/blog\/?p=137"},"modified":"2025-12-16T13:05:52","modified_gmt":"2025-12-16T13:05:52","slug":"from-scratch-2m-parameter-llm-in-python","status":"publish","type":"post","link":"https:\/\/hattussa.com\/blog\/from-scratch-2m-parameter-llm-in-python\/","title":{"rendered":"From Scratch: 2M-Parameter LLM in Python"},"content":{"rendered":"<section class=\"section-2 service-top\">\n<div class=\"container\" style=\"align-items: start;\">\n<p>  <!-- Left Sidebar --><\/p>\n<div class=\"sidebar left-sidebar\">\n<div class=\"toc-title\">Table of contents<\/div>\n<ul class=\"toc-list\" id=\"toc\">\n<li data-target=\"section1\">Creating a 2M Parameter Thinking LLM from Scratch Using Python<\/li>\n<li data-target=\"section2\">Why Build a Small Language Model?<\/li>\n<li data-target=\"section3\">The Core Idea: Transformers at a Small Scale<\/li>\n<li data-target=\"section4\">How to Keep It Under 2 Million Parameters<\/li>\n<li data-target=\"section5\">Data: The Most Crucial Ingredient<\/li>\n<\/ul><\/div>\n<p>  <!-- Main Content --><\/p>\n<div class=\"content-blog\">\n<section id=\"section1\">\n<h1>Creating a 2M Parameter Thinking LLM from Scratch Using Python<\/h1>\n<p>The world of language models has often been associated with massive, resource-intensive systems like GPT-4 or PaLM, boasting billions of parameters and trained on enormous datasets. 
However, a new wave of <strong>small yet capable language models<\/strong> is gaining attention \u2014 compact models inspired by reasoning-focused systems such as <strong>OpenAI&#8217;s o3<\/strong> and <strong>DeepSeek-R1<\/strong>. Scaled down to as few as 2 million parameters, such models can still demonstrate surprising competence on simple reasoning and text-generation tasks.<\/p>\n<p>In this blog post, we&#8217;ll explore how you can build a <strong>2M parameter &#8220;thinking&#8221; LLM<\/strong> entirely from scratch, using Python and open-source tools \u2014 without requiring a supercomputer.<\/p>\n<p>  <img decoding=\"async\" src=\"https:\/\/hattussa.com\/assets\/images\/blog\/blog-4.webp\" alt=\"2M-Parameter LLM in Python\" class=\"img-fluid\" title=\"2M-Parameter LLM in Python\" width=\"100%\" height=\"auto\"\/><br \/>\n    <\/section>\n<section id=\"section2\">\n<h2>Why Build a Small Language Model?<\/h2>\n<p>While massive LLMs capture headlines, small models come with their own powerful advantages:<\/p>\n<ul>\n<li>\u2705 <strong>Fast Training<\/strong> \u2013 Small models can be trained in hours instead of days.<\/li>\n<li>\u2705 <strong>Lower Costs<\/strong> \u2013 You don\u2019t need expensive GPUs or cloud infrastructure.<\/li>\n<li>\u2705 <strong>Deploy Anywhere<\/strong> \u2013 Perfect for edge devices, mobile apps, or offline use.<\/li>\n<li>\u2705 <strong>Customizable<\/strong> \u2013 You can fine-tune for specific use cases without large datasets.<\/li>\n<\/ul>\n<p>Most importantly, they offer an excellent <strong>learning opportunity<\/strong>. You\u2019ll understand the inner workings of transformers, attention mechanisms, and training dynamics \u2014 all while building a usable AI system.<\/p>\n<\/section>\n<section id=\"section3\">\n<h2>The Core Idea: Transformers at a Small Scale<\/h2>\n<p>Even with just 2 million parameters, a well-designed transformer can learn how to generate, complete, and even reason through text. 
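At the heart of any such transformer is scaled dot-product attention. As a plain-Python sketch with toy 2-dimensional vectors (real models use a tensor library and learned projection matrices, so the sizes and inputs here are purely illustrative):

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    # Multiply an n-by-k list-of-lists by a k-by-m list-of-lists.
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q Kᵀ / sqrt(d)) V
    d = len(Q[0])
    Kt = [list(col) for col in zip(*K)]  # transpose K
    scores = matmul(Q, Kt)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V)

# Two toy tokens whose queries and keys match one-to-one:
Q = K = [[10.0, 0.0], [0.0, 10.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
out = attention(Q, K, V)
print(out)  # each token attends almost entirely to itself
```

The attention weights are a probability distribution over the input positions, which is exactly how the model "focuses" on relevant words.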
These compact models use the same building blocks as their larger cousins:<\/p>\n<ul>\n<li><strong>Token and Position Embeddings<\/strong>: Turning text into a form the model can understand.<\/li>\n<li><strong>Self-Attention Mechanisms<\/strong>: Allowing the model to &#8220;focus&#8221; on relevant words.<\/li>\n<li><strong>Layer Norms and Feedforward Layers<\/strong>: For smoother learning and generalization.<\/li>\n<li><strong>Final Prediction Head<\/strong>: Producing the next-word predictions or outputs.<\/li>\n<\/ul>\n<p>The key is to <strong>scale down<\/strong> wisely \u2014 using smaller embeddings, fewer attention heads, and shallower layers \u2014 while still retaining the core structure.<\/p>\n<\/section>\n<section id=\"section4\">\n<h2>How to Keep It Under 2 Million Parameters<\/h2>\n<p>To stay within the 2M parameter limit, everything must be optimized:<\/p>\n<ul>\n<li>Use a <strong>smaller vocabulary size<\/strong>, especially if focusing on a specific domain.<\/li>\n<li>Limit the number of <strong>layers and attention heads<\/strong>.<\/li>\n<li>Choose compact <strong>embedding dimensions<\/strong>.<\/li>\n<li>Reduce the size of the <strong>internal feedforward layers<\/strong>.<\/li>\n<\/ul>\n<p>Even with these constraints, your model can still <strong>learn to reason<\/strong>, generate text, and perform tasks if trained properly.<\/p>\n<\/section>\n<section id=\"section5\">\n<h2>Data: The Most Crucial Ingredient<\/h2>\n<p>What your model learns depends heavily on what it\u2019s fed. 
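To see how those scaling choices translate into a concrete budget, here is a back-of-the-envelope parameter count for one hypothetical configuration (every size below is an assumption for illustration, not a prescription, and bias terms are ignored for brevity):

```python
# Back-of-the-envelope parameter count for a tiny decoder-only transformer.
# All config values are illustrative assumptions.

def transformer_params(vocab_size, d_model, n_layers, d_ff, max_seq_len,
                       tied_head=True):
    embed = vocab_size * d_model   # token embedding table
    pos = max_seq_len * d_model    # learned positional embeddings
    attn = 4 * d_model * d_model   # Q, K, V, and output projections
    ffn = 2 * d_model * d_ff       # two feedforward weight matrices
    norms = 4 * d_model            # two LayerNorms (scale + shift each)
    per_layer = attn + ffn + norms
    # Tying the output head to the embedding table saves vocab_size*d_model.
    head = 0 if tied_head else vocab_size * d_model
    return embed + pos + n_layers * per_layer + head

total = transformer_params(vocab_size=8192, d_model=128, n_layers=4,
                           d_ff=256, max_seq_len=256)
print(f"{total:,}")  # 1,607,680 -- comfortably under the 2M budget
```

Note that the embedding table dominates the budget, which is why shrinking the vocabulary is the first lever listed above.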
Here\u2019s what to keep in mind:<\/p>\n<ul>\n<li>Focus on <strong>high-quality, domain-specific data<\/strong>.<\/li>\n<li>Use <strong>clean, well-structured examples<\/strong> for reasoning.<\/li>\n<li><strong>Chain-of-thought (CoT)<\/strong> style training helps the model learn to \u201cthink\u201d step-by-step.<\/li>\n<\/ul>\n<blockquote>\n<p><strong>Q:<\/strong> If you have 3 apples and give away 1, how many are left?<br \/>\n      <strong>A:<\/strong> You started with 3. You gave away 1. That leaves 2.<br \/>\n      Answer: 2.<\/p>\n<\/blockquote>\n<p>Even small models can learn this kind of logic \u2014 if trained on thousands of similar examples.<\/p>\n<\/section><\/div>\n<\/div>\n<\/section>\n","protected":false},"excerpt":{"rendered":"<p>The world of language models has often been associated with massive, resource-intensive systems like GPT-4 or PaLM, boasting billions of parameters and trained on enormous datasets. However, a new wave of <strong>small yet capable language models<\/strong> is gaining attention \u2014 compact models inspired by reasoning-focused systems such as <strong>OpenAI&#8217;s o3<\/strong> and <strong>DeepSeek-R1<\/strong>. 
Scaled down to as few as 2 million parameters, such models can still demonstrate surprising competence on simple reasoning and text-generation tasks.<\/p>\n","protected":false},"author":1,"featured_media":138,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-137","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/hattussa.com\/blog\/wp-json\/wp\/v2\/posts\/137","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/hattussa.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/hattussa.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/hattussa.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/hattussa.com\/blog\/wp-json\/wp\/v2\/comments?post=137"}],"version-history":[{"count":4,"href":"https:\/\/hattussa.com\/blog\/wp-json\/wp\/v2\/posts\/137\/revisions"}],"predecessor-version":[{"id":325,"href":"https:\/\/hattussa.com\/blog\/wp-json\/wp\/v2\/posts\/137\/revisions\/325"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/hattussa.com\/blog\/wp-json\/wp\/v2\/media\/138"}],"wp:attachment":[{"href":"https:\/\/hattussa.com\/blog\/wp-json\/wp\/v2\/media?parent=137"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/hattussa.com\/blog\/wp-json\/wp\/v2\/categories?post=137"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/hattussa.com\/blog\/wp-json\/wp\/v2\/tags?post=137"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}