Build your own ChatGPT from scratch in C++
Most of us use LLMs every day, but we treat them as a magical API black box. When you start diving deeper, you realize it's really just matrix multiplications and a handful of simple operations.
Torchless is a lightweight inference engine built from scratch that runs LLMs locally on CPU. Everything is implemented in pure C++, so you can see exactly how a modern LLM runs, from the initial user prompt to next-token prediction.
The engine starts by converting the Hugging Face model into one compact binary file. That file holds the configuration, the vocabulary, the merge rules, and every tensor the model needs. When you launch the program, the binary is memory-mapped straight into your process.
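The memory-mapping step can be sketched with the POSIX `mmap` API. This is a minimal illustration, not Torchless's actual loader; the `MappedFile` struct and `map_model` function are hypothetical names, and a real loader would also parse the header, vocabulary, and tensor offsets out of the mapped bytes.

```cpp
// Minimal sketch of memory-mapping a model binary on POSIX systems.
// The page cache backs the mapping, so tensors are read lazily on access.
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct MappedFile {
    const uint8_t* data = nullptr;
    size_t size = 0;
};

MappedFile map_model(const char* path) {
    MappedFile m;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return m;
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        void* p = mmap(nullptr, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            m.data = static_cast<const uint8_t*>(p);
            m.size = static_cast<size_t>(st.st_size);
        }
    }
    close(fd);  // the mapping remains valid after the descriptor is closed
    return m;
}
```

Mapping instead of reading means the OS pages weights in on demand and can share them across processes, which is why startup feels instant even for multi-gigabyte models.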
From there, the tokenizer turns your prompt into integer IDs using byte-pair encoding. This is where the model stops dealing with text and switches to numbers. Once you have the token list, you feed the IDs into the transformer one by one. The engine keeps a single hidden vector that represents the current state of the model. Every layer reads that vector, transforms it, and hands it off to the next.
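The core of byte-pair encoding is a greedy merge loop: split the input into single characters, then repeatedly fuse the adjacent pair with the best (lowest) learned merge rank. This toy sketch, with a made-up merge table rather than a real model's, shows the loop's shape:

```cpp
// Toy byte-pair encoding: merge the lowest-ranked adjacent pair until
// no learned merge applies. Real tokenizers work on bytes and map the
// resulting pieces to integer IDs via the vocabulary.
#include <cassert>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

std::vector<std::string> bpe(
    const std::string& word,
    const std::map<std::pair<std::string, std::string>, int>& ranks) {
    std::vector<std::string> toks;
    for (char c : word) toks.push_back(std::string(1, c));
    while (toks.size() > 1) {
        int best = std::numeric_limits<int>::max();
        size_t pos = 0;
        for (size_t i = 0; i + 1 < toks.size(); ++i) {
            auto it = ranks.find({toks[i], toks[i + 1]});
            if (it != ranks.end() && it->second < best) {
                best = it->second;
                pos = i;
            }
        }
        if (best == std::numeric_limits<int>::max()) break;  // nothing left to merge
        toks[pos] += toks[pos + 1];
        toks.erase(toks.begin() + pos + 1);
    }
    return toks;
}
```

With merges `("l","o")` then `("lo","w")`, the word "low" collapses from three characters to one token in two passes.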
Attention is where the sequence comes into play. Each new token produces a query, key, and value. The keys and values are appended to a cache, which means the model never recomputes the past. With every step, the query looks back over that history and pulls out whatever information matters for predicting the next word.
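A single decode step of cached attention can be sketched as follows. This is a toy single-head version with made-up struct and function names, not the engine's real multi-head code, but the mechanics are the same: append the new key/value, score the query against every cached key, softmax the scores, and blend the cached values.

```cpp
// One single-head attention step over a growing KV cache.
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

struct KVCache {
    std::vector<std::vector<float>> keys, values;
};

std::vector<float> attend(const std::vector<float>& q,
                          const std::vector<float>& k,
                          const std::vector<float>& v,
                          KVCache& cache) {
    cache.keys.push_back(k);    // the past is never recomputed:
    cache.values.push_back(v);  // each step only appends one k/v pair
    size_t T = cache.keys.size(), d = q.size();
    std::vector<float> scores(T);
    float scale = 1.0f / std::sqrt((float)d), maxs = -1e30f;
    for (size_t t = 0; t < T; ++t) {
        float s = 0;
        for (size_t i = 0; i < d; ++i) s += q[i] * cache.keys[t][i];
        scores[t] = s * scale;
        maxs = std::max(maxs, scores[t]);
    }
    float sum = 0;
    for (auto& s : scores) { s = std::exp(s - maxs); sum += s; }  // stable softmax
    std::vector<float> out(v.size(), 0.0f);
    for (size_t t = 0; t < T; ++t)
        for (size_t i = 0; i < out.size(); ++i)
            out[i] += (scores[t] / sum) * cache.values[t][i];
    return out;
}
```

Because the cache only grows, each step costs work proportional to the tokens seen so far instead of re-running attention over the whole sequence.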
After the attention pass, the MLP processes the result, stretching the hidden vector into a larger space and collapsing it back down. Once the last layer finishes, the output vector is projected over the entire vocabulary. That single vector becomes a list of scores for every possible next token. Pick the highest one, or sample from the probability distribution, and you get the next word in your generated text.
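The final scores-to-token step is just a softmax plus a selection rule. A minimal sketch (function names are illustrative) of greedy decoding and the probability distribution it skips:

```cpp
// Turn final-layer logits into a next token: softmax gives the
// probability distribution; argmax is the greedy choice.
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<float> softmax(const std::vector<float>& logits) {
    float m = logits[0];
    for (float x : logits) m = std::max(m, x);  // subtract max for stability
    std::vector<float> p(logits.size());
    float sum = 0;
    for (size_t i = 0; i < p.size(); ++i) {
        p[i] = std::exp(logits[i] - m);
        sum += p[i];
    }
    for (auto& x : p) x /= sum;
    return p;
}

size_t greedy_pick(const std::vector<float>& logits) {
    size_t best = 0;
    for (size_t i = 1; i < logits.size(); ++i)
        if (logits[i] > logits[best]) best = i;
    return best;
}
```

Greedy decoding always takes the top score; sampling instead draws from the softmax distribution, often after temperature or top-k filtering, to get more varied text.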
This project was a serious undertaking, but incredibly rewarding as a learning experience. I encourage people to take a deeper look!
submitted by /u/Sweet_Ladder_8807