Running LLMs locally: a practical primer for a single laptop
You can run a capable open-weight model on one laptop today. Here is what actually determines whether it works — memory, quantization, tooling — and honest expectations for each.
Running a language model on your own machine used to be a research-lab exercise. It is now a weekend project. The reason is not a single breakthrough but the maturing of open-weight models and the tooling around them. This primer skips the hype and explains what actually governs whether local inference works for you: memory, quantization, tooling, and a realistic sense of what you gain and give up.
Why run locally at all
Three honest motivations, in roughly the order they hold up:
- Privacy and control. Nothing leaves your machine. For sensitive notes, drafts, client documents, or internal code, that is a real advantage no cloud setting fully replicates. There is no data-retention policy to read because there is no data leaving.
- Cost at steady volume. If you make many calls and already own the hardware, marginal cost is roughly your electricity. For bursty or low volume, a hosted API is usually cheaper once you count your time setting things up.
- Learning and tinkering. Running a model locally demystifies it. You see the trade-offs directly — memory, speed, quality — instead of behind an API that hides them.
What you should not expect: the ceiling of the largest hosted models. Local models have closed much of the gap for everyday tasks, but the hardest reasoning and the longest-context work still favor frontier hosted systems. Going in with that expectation is the difference between being impressed and being disappointed.
The one number that matters most: memory
The single biggest constraint is how much memory the model's weights occupy. As a rule of thumb, the footprint is roughly the parameter count times the bytes stored per parameter, so a 7-billion-parameter model at 16 bits (2 bytes) per weight needs about 14 GB for the weights alone. That is more than many laptops can spare. The trick that makes laptops viable is quantization: storing weights at lower precision (for example 4 bits instead of 16), which shrinks the footprint roughly fourfold.
What this means in practice, in plain terms:
- A small model (a few billion parameters) quantized to 4-bit fits comfortably in the memory of a modern laptop and runs at usable speed.
- A mid-size model (roughly the 7–14B range), quantized to 4-bit, is the sweet spot for many laptops with adequate unified or GPU memory. By the rule of thumb above, that is about 3.5 to 7 GB of weights.
- Larger models are possible but get slow or simply will not fit without serious hardware.
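To make the rule of thumb concrete, here is a minimal sketch of the arithmetic. It counts weights only, and the `headroom_gb` allowance for context, runtime buffers, and the operating system is an assumed placeholder rather than a measured figure. Real 4-bit formats also store per-block scale data, so their effective bits per weight land somewhat above 4.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    return params_billions * 1e9 * bytes_per_weight / 1e9


def fits(params_billions: float, bits_per_weight: float,
         available_gb: float, headroom_gb: float = 2.0) -> bool:
    """Rough check: weights plus an assumed headroom must fit in available memory.

    headroom_gb is a hypothetical allowance, not a measured value.
    """
    needed = weight_footprint_gb(params_billions, bits_per_weight) + headroom_gb
    return needed <= available_gb


if __name__ == "__main__":
    # Model sizes in billions of parameters, checked against a 16 GB machine.
    for size in (3, 7, 14, 70):
        fp16 = weight_footprint_gb(size, 16)
        q4 = weight_footprint_gb(size, 4)
        print(f"{size:>3}B  16-bit: {fp16:6.1f} GB   4-bit: {q4:5.1f} GB   "
              f"fits in 16 GB at 4-bit: {fits(size, 4, 16)}")
```

For a 7B model this gives about 14 GB at 16 bits and about 3.5 GB at 4 bits, which is the difference between not loading and loading comfortably on a 16 GB machine.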
On Apple Silicon, unified memory is shared between CPU and GPU, which is why those machines punch above their weight for local inference — the GPU can address a large pool of memory. On a typical Windows or Linux machine, the GPU's dedicated memory is usually the limiting factor, and a model that exceeds it either spills to slower system memory or fails to load.
Quantization: the trade-off in one section
Lower precision means smaller and faster, at some cost to quality. The good news is that the loss from 16-bit down to around 4-bit is, for most everyday tasks, smaller than people fear — often barely noticeable in normal use. Push much below 4-bit and degradation becomes obvious: the model starts to lose coherence, miss instructions, or repeat itself.
The practical advice is a simple rule: start at a 4-bit quantization of the largest model that fits, and only move up in precision if you can measure a quality problem on your own tasks. Most people never need to. Chasing higher precision "just in case" usually buys nothing but a slower model and a bigger memory bill.
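If it helps to see why error grows as the bit width shrinks, the following is a deliberately simplified sketch: symmetric quantization with one shared scale per block of weights, run on synthetic values. It is an illustration of the idea only. The block size, the synthetic weights, and the function names are assumptions, and the formats used by real tools are more elaborate than this.

```python
from __future__ import annotations

import random


def quantize_block(values: list[float], bits: int) -> tuple[float, list[int]]:
    """Symmetric quantization of one block: a shared scale plus small integers."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit, 7 for 4-bit, 1 for 2-bit
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale: float, codes: list[int]) -> list[float]:
    """Recover approximate values from the stored scale and integer codes."""
    return [scale * c for c in codes]


def mean_abs_error(values: list[float], bits: int, block_size: int = 32) -> float:
    """Average round-trip error when quantizing values block by block."""
    total = 0.0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        scale, codes = quantize_block(block, bits)
        restored = dequantize_block(scale, codes)
        total += sum(abs(a - b) for a, b in zip(block, restored))
    return total / len(values)


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-in for real weights; the distribution is an assumption.
    weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]
    for bits in (8, 4, 3, 2):
        print(f"{bits}-bit  mean abs error: {mean_abs_error(weights, bits):.6f}")
```

Round-trip error on weights is not the same thing as output quality, so treat this as intuition for the trend, not a substitute for testing a quantized model on your own tasks.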
The tooling, briefly
You do not need to compile anything from scratch. Two mature, well-documented options dominate:
- llama.cpp — a lean, fast inference engine that runs quantized models efficiently across CPU and GPU on every major platform. It is the foundation many other tools build on, and worth knowing about even if you use a wrapper.
- Ollama — a friendlier layer that handles downloading, quantization formats, and a local server with a few commands. For most people starting out, this is the path of least resistance.
Models themselves come from open hubs such as Hugging Face, where open-weight releases are published with their licenses attached. Read the license before any non-personal use — "open weights" does not always mean "free for commercial use," and the terms vary more than people assume.
A realistic first run
A sane starting sequence, without committing to specific versions that age out:
- Install Ollama (or build llama.cpp if you want more control).
- Pull a small, well-regarded open-weight model in a 4-bit quantization.
- Ask it the kind of question you actually care about — not a trivia test. Watch the tokens-per-second and judge whether the speed is usable for your workflow.
- If quality falls short, step up the model size before stepping up precision. If speed falls short, step down in size.
- Once something feels usable, save the exact model and settings. Reproducibility is half the battle.
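The measuring and saving steps above can be sketched against Ollama's local HTTP API. This assumes an Ollama server is already running on its default port and that you have pulled a model. The model tag, prompt, option values, and output filename below are placeholders chosen for illustration, not recommendations.

```python
from __future__ import annotations

import json
import urllib.error
import urllib.request

# Ollama's default local endpoint; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def generate(model: str, prompt: str, options: dict | None = None) -> dict:
    """Send one non-streaming generation request to a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if options:
        payload["options"] = options
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def tokens_per_second(result: dict) -> float:
    """Generation speed from the response's timing fields (nanoseconds)."""
    return result["eval_count"] / (result["eval_duration"] / 1e9)


def main() -> None:
    model = "your-model-tag"  # placeholder: use a model you have already pulled
    prompt = "Summarize the trade-offs of 4-bit quantization in three sentences."
    options = {"temperature": 0.2, "num_ctx": 4096, "seed": 1}  # illustrative values

    try:
        for run in range(1, 4):  # repeat so later runs can be compared to the first
            result = generate(model, prompt, options)
            print(f"run {run}: {tokens_per_second(result):.1f} tokens/s, "
                  f"load {result.get('load_duration', 0) / 1e9:.2f} s")
    except urllib.error.URLError as exc:
        print(f"Could not reach Ollama at {OLLAMA_URL}: {exc}")
        return

    # Record exactly what was run so the result can be reproduced later.
    with open("local_llm_settings.json", "w", encoding="utf-8") as fh:
        json.dump({"model": model, "prompt": prompt, "options": options}, fh, indent=2)


if __name__ == "__main__":
    main()
```

Swap in a prompt that resembles your real work before drawing conclusions. The saved JSON file is just one simple way to keep the model and settings together; any note that records them exactly would do.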
Where local inference quietly breaks
Three failure modes to expect so they do not surprise you:
- Context length costs memory too. Long inputs consume memory on top of the weights. A model that loads fine on a short prompt can still run out of room on a long document, and the failure can look like a crash rather than a clear "out of memory."
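- Throughput is not latency. Tokens per second tells you how fast text streams out; it does not tell you how long you wait for the first token or for the complete answer. A model can feel fast on a short reply and crawl on a long one, because total time grows with both the prompt length and the output length. Always measure on the input and output lengths you will actually use, not on a one-line greeting.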
- The first run is the slow run. Initial load, and sometimes the first generation, includes setup the later ones skip. Judge speed on the second and third runs.
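The first failure mode is easy to estimate ahead of time. In a standard transformer, the memory that grows with context is the key/value cache: two tensors per layer, each holding one vector per attention head for every token. The sketch below assumes a 16-bit cache, and the layer count, head count, and head dimension are hypothetical values chosen for illustration; check your model's published configuration for the real ones.

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """Approximate key/value cache size in gigabytes (10^9 bytes).

    Keys and values are both cached for every layer, each holding
    n_kv_heads * head_dim values per token of context.
    """
    values = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return values * bytes_per_value / 1e9


if __name__ == "__main__":
    # Hypothetical architecture, chosen only for illustration.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    for context in (2_048, 8_192, 32_768):
        size = kv_cache_gb(context_tokens=context, **shape)
        print(f"{context:>6} tokens: {size:.2f} GB on top of the weights")
```

With this hypothetical shape, 8,192 tokens of context costs roughly 1.1 GB and 32,768 tokens roughly 4.3 GB, which is how a model that loads fine on a short prompt can run out of room on a long document.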
When not to bother
Local inference is not always the right answer, and it is worth being honest about that. If your volume is low and sporadic, a hosted API is cheaper and faster to set up. If you need the absolute top of capability or very long context, hosted frontier models still lead. And if you would spend more time maintaining the setup than using it, the cloud is doing you a favor. Local makes sense when privacy, steady volume, or curiosity tips the balance — not as a default.
A quick word on security
Running locally removes the risk of your prompts and documents crossing the network, but it does not remove all risk. Download models and tools from reputable sources, mind the licenses, and remember that a local model can still produce wrong or unsafe output. "Local" describes where it runs, not how much you should trust what it says.
The takeaway
Local LLMs are no longer exotic. The decision comes down to three honest questions: does the model fit in memory, is the quantized quality good enough for your task, and is the speed usable for your workflow? Answer those on your own machine with your own prompts, and you will know within an afternoon whether local inference belongs in your toolkit — far more reliably than any benchmark or blog post could tell you, this one included.
