
A History of llama.cpp

Apr 13, 2026 - llms, dossier

On February 24, 2023, Meta published LLaMA as a gated research preview, access restricted to approved academics and labs. Within a week, the weights had leaked onto 4chan and spread across BitTorrent. The models rivaled OpenAI's GPT-3, but running them came with a severe limitation: they required datacenter GPUs. For the average developer, they were still untouchable — the Python ecosystem demanded tens of gigabytes of VRAM just to load the weights.

Georgi Gerganov, a Bulgarian developer known for his whisper.cpp project, saw this barrier and dismantled it. On March 10, 2023, he committed the first lines of llama.cpp to GitHub. Abandoning Python entirely, he ported the LLaMA architecture to pure C and C++, creating an inference engine that ran entirely on the CPU. It had no dependencies and required no specialized drivers. With a MacBook, a Windows laptop, or a Raspberry Pi, users could run state-of-the-art language models locally.

The engine behind this breakthrough was ggml, a custom tensor library Gerganov built specifically for edge inference. Unlike industry-standard frameworks such as TensorFlow or PyTorch, which were optimized for training on massive GPU clusters, ggml was optimized for forward-pass inference on CPUs. It used Apple's Accelerate framework and ARM NEON intrinsics, extracting more performance from consumer hardware. LLaMA models previously requiring an A100 GPU could now generate text at conversational speeds on an M1 MacBook Air.

The project quickly gained contributors and stars on GitHub. But the real revolution came with the introduction of quantization.

Running a 7-billion parameter model in 16-bit float still required 14 gigabytes of RAM. For many users, this remained a bottleneck. Integer quantization came next. Contributors found they could compress model weights from 16-bit floats down to 4-bit integers with minimal loss in generation quality, cutting memory requirements by nearly 75 percent — enough to bring frontier models within reach on a cheap laptop.

Early 4-bit integer quantization schemes relied on naive uniform quantization. The algorithm divided a tensor into blocks of 32 weights, extracted a single 16-bit floating-point scale per block, and rounded the remaining values to 4-bit integers. This linear mapping reduced memory footprints but severely degraded perplexity: the uniform spacing wasted precision on outlier weights, because a single large value in a block stretched the scale and crushed the other 31 values into a handful of levels, visibly degrading output quality.
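The arithmetic is easy to see in miniature. The sketch below follows the spirit of ggml's early Q4_0 scheme but simplifies its storage: the real format uses an fp16 scale and packs two 4-bit values per byte, bringing a block of 32 weights down to 18 bytes, or 4.5 bits per weight.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// One block: 32 weights stored as a single scale plus 32 4-bit integers.
// (The real Q4_0 uses an fp16 scale and packs two nibbles per byte;
// this sketch keeps a float scale and one value per byte for clarity.)
constexpr int QK = 32;

struct BlockQ4 {
    float   scale;      // per-block scale d
    uint8_t quants[QK]; // 4-bit values in [0, 15], offset by 8
};

BlockQ4 quantize_block(const float* x) {
    // Find the value with the largest magnitude in the block.
    float amax = 0.0f, maxv = 0.0f;
    for (int i = 0; i < QK; i++) {
        if (std::fabs(x[i]) > amax) { amax = std::fabs(x[i]); maxv = x[i]; }
    }
    // Map [-max, +max] linearly onto the 16 integer levels.
    const float d  = maxv / -8.0f;
    const float id = d != 0.0f ? 1.0f / d : 0.0f;

    BlockQ4 b{d, {}};
    for (int i = 0; i < QK; i++) {
        int q = (int)std::round(x[i] * id) + 8;
        b.quants[i] = (uint8_t)std::clamp(q, 0, 15);
    }
    return b;
}

float dequantize(const BlockQ4& b, int i) {
    return b.scale * ((int)b.quants[i] - 8);
}
```

One bad outlier is enough to ruin a block: if a single weight is ten times larger than its neighbors, the scale stretches to cover it and the other 31 values collapse onto just a few of the 16 levels.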

Contributors introduced K-quants to solve this precision loss through hierarchical, mixed-precision quantization. The scheme grouped weights into super-blocks of 256 parameters, subdividing these into 16-weight sub-blocks. The algorithm computed a 16-bit float scale and minimum value for the entire super-block, then stored highly compressed 6-bit or 8-bit scale multipliers for each sub-block. This hierarchical scaling let the engine assign varying bit-depths across architectural components. K-quants minimized divergence from the uncompressed baseline by isolating precision loss to dense layers while preserving accuracy in attention routing layers. A 70-billion parameter model could fit within 48 gigabytes of RAM with output quality nearly indistinguishable from the 16-bit original.
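The layout is easier to picture as a struct. The sketch below is illustrative rather than a copy of the real structs: actual K-quant variants differ in block geometry, and the sub-block multipliers are bit-packed rather than stored one per byte.

```cpp
#include <cstdint>

// Illustrative super-block layout in the spirit of the K-quants.
constexpr int QK_K = 256;        // weights per super-block
constexpr int NSUB = QK_K / 16;  // sixteen 16-weight sub-blocks

struct SuperBlock {
    uint16_t d;                // fp16 scale for the whole super-block
    uint16_t dmin;             // fp16 minimum for the whole super-block
    uint8_t  scales[NSUB];     // per-sub-block scale multipliers (6-bit in practice)
    uint8_t  mins[NSUB];       // per-sub-block minimum multipliers
    uint8_t  quants[QK_K / 2]; // 256 4-bit weights, two per byte
};

// Dequantization of a 4-bit value q in sub-block j:
//   x ~= fp16(d) * scales[j] * q  -  fp16(dmin) * mins[j]
// Two fp16 values anchor the whole super-block; the per-sub-block
// multipliers are small integers, so the metadata overhead stays tiny
// while each group of 16 weights still gets its own effective scale.
```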

The quantization scheme gave rise to the ggml file format, which became the de facto standard for distributing local language models. But as the ecosystem grew, the original format showed its limitations. It carried almost no metadata: every time the community added support for a new model architecture, the format had to be tweaked, often breaking compatibility with older files. Its rigid "magic number" versioning meant that any hyperparameter change broke the parser and forced users to redownload weights.

In August 2023, the community executed a necessary but painful transition. They introduced GGUF, the GPT-Generated Unified Format. GGUF decoupled the model architecture from the file structure. It stored all hyperparameters and metadata in a flexible key-value format directly within the file. This meant llama.cpp could support new models without breaking older files. The transition caused temporary chaos as users converted ggml models to GGUF, but it laid the foundation for the architectural explosion that followed. GGUF embedded tokenization dictionaries, special token IDs, and chat templates directly into the binary file. Previously, developers bundled auxiliary JSON files alongside weights to ensure correct input formatting. Consolidating these into the binary eliminated the mismatch errors that had plagued early local deployments. The key-value structure also supported heterogeneous tensor types within the same file, allowing developers to quantize the massive multi-head attention layers while keeping the more sensitive layer-normalization weights at higher precision.
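The self-describing layout is visible right at the top of the file. A minimal reader for the fixed GGUF header, based on the published format layout and with error handling trimmed, might look like this:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// The fixed little-endian GGUF header. Everything after it is a list of
// typed key-value pairs, followed by the tensor descriptors.
struct GgufHeader {
    uint32_t magic;        // 'GGUF' == 0x46554747
    uint32_t version;
    uint64_t tensor_count;
    uint64_t metadata_kv_count;
};

static std::string read_string(FILE* f) {
    // GGUF strings are a uint64 length followed by raw UTF-8 bytes.
    uint64_t len = 0;
    fread(&len, sizeof(len), 1, f);
    std::string s(len, '\0');
    fread(s.data(), 1, len, f);
    return s;
}

int main(int argc, char** argv) {
    if (argc < 2) return 1;
    FILE* f = fopen(argv[1], "rb");
    if (!f) return 1;

    GgufHeader h{};
    fread(&h.magic, sizeof(h.magic), 1, f);
    fread(&h.version, sizeof(h.version), 1, f);
    fread(&h.tensor_count, sizeof(h.tensor_count), 1, f);
    fread(&h.metadata_kv_count, sizeof(h.metadata_kv_count), 1, f);

    if (h.magic != 0x46554747) { fprintf(stderr, "not a GGUF file\n"); return 1; }
    printf("GGUF v%u: %llu tensors, %llu metadata keys\n",
           h.version,
           (unsigned long long)h.tensor_count,
           (unsigned long long)h.metadata_kv_count);

    // A metadata key such as "general.architecture" tells the loader which
    // model family the file contains: supporting a new architecture means
    // new key values, not a new file format.
    std::string first_key = read_string(f);
    printf("first key: %s\n", first_key.c_str());
    fclose(f);
    return 0;
}
```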

Attention then turned to hardware acceleration. The community bolted hardware-specific backends onto the ggml core: CUDA for NVIDIA, Metal for Apple, ROCm for AMD, and Vulkan for cross-platform support. This multi-backend approach was powerful but messy. Early iterations relied on a simplistic bump allocator within a monolithic context. Developers manually defined scratch buffers for each transformer layer to prevent intermediate computations from exhausting system RAM. As users began spreading inference across heterogeneous hardware configurations, the lack of a centralized memory manager led to brittle, backend-specific VRAM allocation logic.

To solve this, the core team introduced the ggml-backend API in early 2024. This isolated the tensor mathematics from physical device memory management. Instead of allocating memory eagerly, the engine adopted a two-phase approach. It built a directed acyclic graph representing the sequence of operations for a batch of tokens, then analyzed it to determine the exact lifecycle of intermediate tensors. The graph allocator shrank the required compute buffer size by up to an order of magnitude, overlapping the memory addresses of tensors whose lifetimes did not logically coincide. This dynamic allocation approach proved necessary for processing massive context windows without instantly triggering out-of-memory errors.
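A toy version of the planning phase shows why the savings are large. The sketch below uses invented names and simple first-fit placement (the real ggml allocator is more sophisticated): it assigns offsets in a single compute buffer and reuses the bytes of any tensor whose last consumer has already executed.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

struct Tensor {
    size_t size;       // bytes required
    int    first_use;  // node that produces the tensor
    int    last_use;   // last node that reads it
    size_t offset = 0; // assigned below
};

size_t plan(std::vector<Tensor>& ts) {
    struct Live { size_t offset, size; int last_use; };
    std::vector<Live> live; // active allocations, kept sorted by offset
    size_t high_water = 0;

    // Process tensors in the order the graph produces them.
    std::sort(ts.begin(), ts.end(),
              [](const Tensor& a, const Tensor& b) { return a.first_use < b.first_use; });

    for (auto& t : ts) {
        // Retire allocations whose last reader has already executed.
        live.erase(std::remove_if(live.begin(), live.end(),
                   [&](const Live& l) { return l.last_use < t.first_use; }),
                   live.end());

        // First-fit: find a gap between live allocations big enough for t.
        size_t cursor = 0;
        for (const Live& l : live) {
            if (t.size <= l.offset - cursor) break;
            cursor = std::max(cursor, l.offset + l.size);
        }
        t.offset = cursor;
        high_water = std::max(high_water, cursor + t.size);

        live.push_back({t.offset, t.size, t.last_use});
        std::sort(live.begin(), live.end(),
                  [](const Live& a, const Live& b) { return a.offset < b.offset; });
    }
    return high_water; // required compute-buffer size
}

int main() {
    //                    size  first last
    std::vector<Tensor> ts = {{1024, 0, 2}, {2048, 1, 3}, {1024, 3, 4}};
    size_t need = plan(ts);
    // The third tensor reuses the first one's bytes: 3 KiB suffice, not 4.
    printf("buffer: %zu bytes; offsets: %zu %zu %zu\n",
           need, ts[0].offset, ts[1].offset, ts[2].offset);
}
```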

To handle multi-device execution, ggml-backend introduced a graph scheduler. When a user split a model between a CPU and a discrete GPU, the scheduler traversed the computation graph and dynamically assigned nodes to specific devices based on hardware capabilities and where the weight tensors physically resided. If an operation required data from a different device, the scheduler injected explicit tensor-copy nodes, handling PCIe transfers synchronously. This decoupling of graph definition from execution allowed llama.cpp to execute complex topologies across fragmented mixes of hardware without modifying the mathematical kernels.
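In sketch form, that pass is a single walk over a topologically ordered node list. The types and names below are invented for illustration and are not the ggml-backend API:

```cpp
#include <cstdio>
#include <string>
#include <vector>

enum class Device { CPU, GPU };

struct Node {
    std::string name;
    Device      device;      // where this node's output lives
    std::vector<int> inputs; // indices of producer nodes
};

// Whenever an edge crosses devices, splice in an explicit copy node so the
// consumer always reads from its own device's memory.
std::vector<Node> insert_copies(std::vector<Node> graph) {
    std::vector<Node> out;
    std::vector<int> remap(graph.size()); // old index -> index in `out`

    for (size_t i = 0; i < graph.size(); i++) {
        Node n = graph[i];
        for (int& in : n.inputs) {
            int src = remap[in];
            if (out[src].device != n.device) {
                // Edge crosses devices: add an explicit transfer node.
                out.push_back({"copy(" + out[src].name + ")", n.device, {src}});
                src = (int)out.size() - 1;
            }
            in = src;
        }
        remap[i] = (int)out.size();
        out.push_back(std::move(n));
    }
    return out;
}

int main() {
    // embed on CPU feeding matmul on GPU: one PCIe copy gets injected.
    auto g = insert_copies({{"embed", Device::CPU, {}},
                            {"matmul", Device::GPU, {0}}});
    for (const auto& n : g)
        printf("%s on %s\n", n.name.c_str(),
               n.device == Device::GPU ? "GPU" : "CPU");
}
```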

The release of Meta's Llama 3 in April 2024 stressed the engine's architectural assumptions. Llama 3 introduced a 128,000-token vocabulary and a complex new tokenizer based on tiktoken, abandoning the SentencePiece implementation used in previous generations. The existing tokenization code choked on the large dictionary. Maintainers responded with a rewrite of the vocabulary parsing logic, implementing parallelized string matching and caching to handle the inflated token space without degrading ingestion speeds. Furthermore, Llama 3's reliance on highly specific rotary positional embeddings (RoPE) for extending its context window required a complete overhaul of the frequency scaling algorithms within the ggml math library. The community raced to implement dynamic scaling factors, allowing users to stretch the model's context far beyond its native training length without losing coherence.
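The scaling itself is a small amount of math. Here is a sketch of the rotary frequencies with simple linear position interpolation, one of several schemes llama.cpp exposes through its RoPE-scaling options:

```cpp
#include <cmath>
#include <vector>

// Rotary embedding angles with linear position scaling ("position
// interpolation"): multiplying positions by freq_scale < 1 compresses a
// long context back into the position range the model was trained on.
std::vector<float> rope_angles(int pos, int head_dim,
                               float freq_base  = 10000.0f,
                               float freq_scale = 1.0f /* 1/extension factor */) {
    std::vector<float> angles(head_dim / 2);
    const float p = pos * freq_scale; // squeeze positions into trained range
    for (int i = 0; i < head_dim / 2; i++) {
        // theta_i = base^(-2i/d): low dims rotate quickly, high dims slowly
        const float theta = std::pow(freq_base, -2.0f * i / head_dim);
        angles[i] = p * theta;
    }
    return angles;
}

// Each angle rotates one pair of channels (x0, x1):
//   x0' = x0 * cos(angle) - x1 * sin(angle)
//   x1' = x0 * sin(angle) + x1 * cos(angle)
```

With freq_scale = 0.5, a model trained on 8,192 positions sees token 16,000 as position 8,000, which is why the simplest form of context extension needs no retraining at all, only a loss of positional resolution.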

By mid-2024, llama.cpp had become the underlying engine for a vast ecosystem of local AI tools. Applications like LM Studio, Ollama, and Faraday relied on the llama.cpp server implementation to power their slick user interfaces. The project added full OpenAI API compatibility, allowing developers to swap cloud models for local ones simply by pointing the API base URL at localhost.

To handle massive context windows, the team ported Flash Attention into the C++ and compute-shader backends. Standard attention scales quadratically with sequence length; computing attention for a massive sequence would instantly exhaust consumer VRAM. The llama.cpp developers translated this highly optimized mathematics into raw Metal Shading Language and Vulkan compute shaders. The port used aggressive tiling. It loaded small blocks of the Query, Key, and Value matrices from high-capacity but slow VRAM into the GPU's ultra-fast, on-chip SRAM. The engine computed the attention scores incrementally within the SRAM and wrote only the final output back to main memory.
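The trick that makes tiling work is the online softmax: partial results are rescaled whenever a new maximum score appears, so the full score matrix never needs to exist. A scalar CPU sketch for a single query (nothing like the real shader code, but the same recurrence) looks like this:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Attention for one query, streamed over K/V in fixed-size blocks. Only one
// block plus a small accumulator is live at a time; the full n x n score
// matrix is never materialized.
void attention_one_query(const float* q,              // [d]
                         const std::vector<float>& K, // [n * d], row-major
                         const std::vector<float>& V, // [n * d]
                         int n, int d, int block, float* out /* [d] */) {
    float m = -INFINITY; // running max of scores
    float l = 0.0f;      // running softmax denominator
    std::vector<float> acc(d, 0.0f);

    for (int b0 = 0; b0 < n; b0 += block) {
        const int b1 = std::min(b0 + block, n);
        for (int j = b0; j < b1; j++) {
            float s = 0.0f;                      // s = q . k_j / sqrt(d)
            for (int t = 0; t < d; t++) s += q[t] * K[j * d + t];
            s /= std::sqrt((float)d);

            // Rescale the running sums when a new maximum appears, keeping
            // the softmax numerically stable without a second pass.
            const float m_new = std::max(m, s);
            const float c = std::exp(m - m_new); // correction for old terms
            const float p = std::exp(s - m_new); // weight of the new term
            for (int t = 0; t < d; t++)
                acc[t] = acc[t] * c + p * V[j * d + t];
            l = l * c + p;
            m = m_new;
        }
    }
    for (int t = 0; t < d; t++) out[t] = acc[t] / l;
}
```

On a GPU, each K/V block lives in on-chip SRAM while the recurrence runs; only `acc`, `m`, and `l` survive between blocks, which is exactly why memory use stays flat as the context grows.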

As the ecosystem matured in late 2024, the demands placed on local models shifted from simple chat interfaces to complex, agentic workflows. Developers needed guaranteed structural formats like JSON to pipe the output of llama.cpp directly into traditional software pipelines. To meet this demand, contributors engineered a grammar-based sampling subsystem. Instead of merely calculating the most probable next token, the engine introduced a deterministic state machine that evaluated each potential token against a user-defined formal grammar. If a high-probability token violated the syntactic rules of the requested JSON schema, the sampler forced its logit to negative infinity, selecting a structurally valid alternative. Guaranteed output formatting transformed llama.cpp from a standalone tool into a reliable backend for programmatic AI workflows, giving local models the kind of reliability previously reserved for closed-source API providers.
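Stripped of the grammar machinery itself, the core of the sampler is a mask over the logits. In llama.cpp the rules come from GBNF grammar files; the sketch below replaces that machinery with a stand-in predicate:

```cpp
#include <cmath>
#include <cstdint>
#include <functional>
#include <vector>

using TokenId = int32_t;

// Force the logit of every grammar-violating token to -inf, then pick the
// best survivor. `accepts` stands in for the real grammar automaton.
TokenId pick_constrained(std::vector<float>& logits,
                         const std::function<bool(TokenId)>& accepts) {
    TokenId best = -1;
    for (TokenId t = 0; t < (TokenId)logits.size(); t++) {
        if (!accepts(t)) {
            // Token would break the grammar: make it unselectable.
            logits[t] = -INFINITY;
        } else if (best < 0 || logits[t] > logits[best]) {
            best = t; // greedy over the surviving tokens
        }
    }
    return best; // a real sampler applies temperature/top-p to the masked logits
}
```

Because the mask runs before sampling, every decoding strategy downstream, greedy or stochastic, can only ever emit tokens the grammar allows.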

State-space models like Mamba and RWKV, which discarded the attention mechanism entirely in favor of recurrent linear operations, forced ggml-backend to introduce dedicated operator types for sequential state updates. The developers wrote custom kernels that kept recurrent state vectors resident in GPU cache, reducing the memory bandwidth pressure that had traditionally degraded recurrent networks.
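The primitive these architectures need is small but alien to a transformer-oriented engine: a sequential state update rather than a batched matrix product. A toy per-channel linear recurrence (not Mamba's actual selective scan) shows the shape of the operator:

```cpp
#include <vector>

// Per-channel linear state update, h_t = a * h_{t-1} + b * x_t, applied in
// sequence. Cost per token is O(d), independent of context length, and the
// only state carried forward is the vector h.
void linear_recurrence(std::vector<float>& h,          // [d] persistent state
                       const float* a, const float* b, // [d] decay / input gain
                       const float* x, int seq_len, int d,
                       float* y /* [seq_len * d] outputs */) {
    for (int t = 0; t < seq_len; t++) {
        for (int c = 0; c < d; c++) {
            h[c] = a[c] * h[c] + b[c] * x[t * d + c];
            y[t * d + c] = h[c];
        }
    }
}
```

The serial dependence on `h` is what the custom kernels had to respect: the channel loop parallelizes freely, but the time loop does not, so keeping `h` resident in cache is the whole game.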

Vision-language models introduced a separate challenge: the engine needed to execute dense convolution operations alongside the standard transformer layers in the same pass. The graph scheduler was overhauled to handle multiple disjoint sub-graphs, letting the image encoder process visual patches while the language model awaited the projected embeddings. The same approach extended to audio transcription, handling varied input modalities within a single inference graph.

Mixture-of-experts architectures presented a memory layout problem. Loading all experts into VRAM instantly exhausted consumer GPUs. The core team built an expert paging system: the allocator tracked activation frequency, pinned the most-used expert weights in VRAM, and offloaded the rest to system RAM. When the gating network selected an offloaded expert, the scheduler injected a PCIe transfer node to pull the weight tensor into a temporary compute buffer before the matrix multiplication.
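A toy version of the pinning policy illustrates the idea; the names here are invented and are not llama.cpp's API:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Expert {
    uint64_t hits    = 0;     // how often the router selected this expert
    bool     in_vram = false;
};

// Rank experts by how often the gating network has chosen them, then pin
// the hottest ones into the available VRAM slots. Everything else stays in
// system RAM and is transferred over PCIe on demand when selected.
void repin(std::vector<Expert>& experts, size_t vram_slots) {
    std::vector<size_t> order(experts.size());
    for (size_t i = 0; i < order.size(); i++) order[i] = i;
    std::sort(order.begin(), order.end(), [&](size_t a, size_t b) {
        return experts[a].hits > experts[b].hits;
    });
    for (size_t r = 0; r < order.size(); r++)
        experts[order[r]].in_vram = r < vram_slots;
}
```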

By early 2026, llama.cpp had solidified its position as the leading local inference engine. Continuous integration now tested builds across dozens of hardware configurations, from Huawei's Ascend NPUs through the ACL Graph backend to Qualcomm's Hexagon DSPs. New operations like element-wise unary functions and continuous repeat operations were added to the Hexagon backend to support Qwen's DeltaNet linear attention layers. Optimized support for extremely low-bit quantization pushed the boundaries of edge deployment even further. Memory-reorder optimizations for the SYCL backend achieved significant speedups on Intel Arc GPUs, proof that the engine was continually adapting to new hardware architectures. File descriptor-based model loading was merged, and continuous batching was refined for high-throughput concurrency.

The architectural philosophies of llama.cpp created distinct performance profiles compared to datacenter-focused inference servers. It demonstrated unmatched resilience in out-of-memory scenarios. Datacenter engines required the entire model graph and KV cache to fit strictly within GPU VRAM before initializing; if the limits were exceeded, the server crashed. llama.cpp, by contrast, split layers between VRAM and system RAM without crashing. In a benchmark loading a massive model on a single consumer GPU, llama.cpp offloaded as many layers as fit into VRAM and ran the remainder on the CPU. While the token generation rate dropped, the process completed successfully. This graceful degradation made llama.cpp the only viable engine for running frontier models on standard workstations.

Three years after Gerganov's first commit, a developer could run a frontier model on hardware they owned. That had not been true before, and llama.cpp was the primary reason it became true.