Mixture-of-experts models, explained simply

Mixture-of-experts lets a model be huge yet cheap to run by using only a slice of itself per input. Here is the idea, plainly, and why it matters.

Models · 2026-04-11 13:35 KST · Lead Editor · 7 min read

You may have seen a model described as having an enormous parameter count while somehow being relatively cheap and fast to run. On the face of it that seems contradictory, since more parameters usually mean more compute. The resolution is an architecture called mixture-of-experts, or MoE, and it has become one of the most important ideas in how large models are built. The concept is simpler than the name suggests, and understanding it clears up a real puzzle: how a model can be "huge" and "efficient" at the same time. This piece explains it in plain language.

The problem MoE solves

Start with a basic tension. Making a model more capable usually means making it bigger — more parameters, more capacity to learn. But every parameter a model uses costs compute and memory each time you run it. In an ordinary model, all the parameters are engaged for every input. So scaling up capability scales up the cost of every single request in lockstep. Past a certain point, that becomes prohibitively expensive.

The question MoE asks is: do we really need the entire model active for every input? Most inputs call on only some of what the model knows. A question about poetry and a question about chemistry do not obviously need the exact same machinery. What if the model could be enormous in total, holding a great deal of specialized capacity, but switch on only the part relevant to each particular input?

That is the whole idea. MoE decouples the total size of a model from the amount of it used per request.

The core idea: many experts, a few used at a time

In a mixture-of-experts model, certain layers (typically the feed-forward blocks) are split into many parallel sub-networks called experts. Instead of one big block that processes every input, you have a set of smaller blocks sitting side by side. For any given piece of input (in practice, each token), only a few of these experts are activated; the rest stay dormant for that input.

So the model contains a large number of experts in total, which is where the big total parameter count comes from, but only a small handful do work on any particular token. The total capacity is huge; the active capacity per input is modest. This is exactly why such a model can advertise a very large parameter count while running at a cost closer to that of a much smaller dense model.
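To make the gap between total and active capacity concrete, here is a minimal arithmetic sketch. The function name, its parameters, and every number in it are illustrative assumptions rather than figures from any real model; "shared" parameters stand in for everything each token passes through regardless of routing.

```python
def moe_param_counts(num_experts, experts_per_token, params_per_expert, shared_params):
    """Return (total, active) parameter counts for a toy MoE model.

    shared_params covers the parts every token passes through
    (for example attention, embeddings, and the router itself).
    All names and numbers here are hypothetical.
    """
    if not 1 <= experts_per_token <= num_experts:
        raise ValueError("experts_per_token must be between 1 and num_experts")
    total = shared_params + num_experts * params_per_expert
    active = shared_params + experts_per_token * params_per_expert
    return total, active


# Made-up numbers, chosen only to keep the arithmetic easy to follow.
total, active = moe_param_counts(
    num_experts=64,
    experts_per_token=2,
    params_per_expert=1_000_000_000,
    shared_params=4_000_000_000,
)
print(f"total:  {total:,}")   # total:  68,000,000,000
print(f"active: {active:,}")  # active: 6,000,000,000
```

In this toy setup the model holds 68 billion parameters but touches only 6 billion of them for any one token.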

A note on the word "experts": it is easy to imagine each one as a tidy specialist — this one handles math, that one handles French. Reality is messier. The experts specialize in ways that emerge from training and rarely map onto clean human categories. They are better thought of as different learned pathways the model can route through, not labeled departments.

The router: deciding who handles what

If only a few experts run per input, something has to decide which few. That something is a small component usually called the router or gating network. For each piece of input, the router looks at it and picks the experts best suited to handle it, sending the input to those and skipping the rest.

The router is itself learned during training, alongside everything else. Nobody hand-assigns inputs to experts. The model discovers, over the course of training, a useful way to divide work — which experts should handle which kinds of input — and the router encodes that learned routing decision. When you send a request, the router is quietly making these choices token by token, directing each through its chosen experts.
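The routing step can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not how any particular model implements it: the router scores are hard-coded here (in a real model they come from a small learned layer applied to the token), the experts are toy functions, and the choice to renormalize the weights of the selected experts is one common option among several. The names `route_token` and `moe_layer` are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw router scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_scores, top_k):
    """Pick the top_k experts for one token and renormalize their weights.

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    probs = softmax(router_scores)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    mass = sum(probs[i] for i in chosen)
    return [(i, probs[i] / mass) for i in chosen]


def moe_layer(token, router_scores, experts, top_k=2):
    """Run only the chosen experts and blend their outputs by router weight."""
    output = [0.0] * len(token)
    for index, weight in route_token(router_scores, top_k):
        expert_out = experts[index](token)  # unchosen experts are never called
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output


# Four toy "experts": each just scales the token vector by a constant.
experts = [lambda x, k=k: [k * v for v in x] for k in (1.0, 2.0, 3.0, 4.0)]

# Hard-coded scores for illustration; a trained router would compute these.
scores = [0.1, 2.0, -1.0, 2.0]
print(route_token(scores, top_k=2))
print(moe_layer([1.0, 1.0], scores, experts))
```

The point to notice is in `moe_layer`: the loop visits only the selected experts, so the compute spent on a token scales with `top_k`, not with the number of experts that exist.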

This routing is the heart of the design, and it is also the trickiest part to get right, which leads to the trade-offs below.

Why this is worth the trouble

The payoff of MoE is a better deal on the central trade-off of model building: capability versus cost.

Because total capacity and per-input compute are decoupled, an MoE model can hold far more knowledge and skill than a dense model of comparable running cost. You get something close to the capability of a very large model at something closer to the running cost of a much smaller one. In a world where the expense of running models at scale is a real constraint, that is a powerful lever. It is a large part of why MoE designs have become common in capable, efficient models — they offer a way to keep scaling capacity without scaling the cost of every request to match.

The trade-offs and difficulties

MoE is not free efficiency. It introduces its own complications, and understanding them keeps your mental model honest.

Routing has to be balanced. If the router learns to over-favor a few popular experts, the rest are underused — wasted capacity — and the busy ones become bottlenecks. Training has to actively encourage a healthy spread of work across experts, and getting this balance right is genuinely hard.

Memory cost stays high even if compute drops. Only a few experts run per input, so the compute per request is modest. But all those experts still have to be loaded and available, so the memory footprint reflects the full, large model. MoE saves on computation more than on the resources needed to host the model.

Training is more complex. Coordinating many experts and a router, keeping the routing balanced, and getting the pieces to work together smoothly is harder than training a single dense model. The efficiency at run time is bought with added complexity at build time.
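One way training can "encourage a healthy spread" is an extra penalty term added to the training loss. The sketch below follows one common formulation, modeled on the auxiliary loss used in Switch Transformer-style routing: multiply, for each expert, the fraction of tokens sent to it by the average probability the router gave it, sum over experts, and scale by the number of experts. It assumes top-1 routing and omits the small coefficient that normally scales this term; the function name and the example data are hypothetical.

```python
def load_balance_loss(expert_choice, router_probs, num_experts):
    """Auxiliary loss that is smallest when work is spread evenly.

    expert_choice: the expert index each token was sent to (top-1 routing).
    router_probs:  one list of router probabilities over experts per token.
    """
    num_tokens = len(expert_choice)
    # f[i]: fraction of tokens actually dispatched to expert i.
    f = [expert_choice.count(i) / num_tokens for i in range(num_experts)]
    # p[i]: average router probability assigned to expert i.
    p = [sum(row[i] for row in router_probs) / num_tokens for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


# Evenly spread routing across two experts.
balanced = load_balance_loss([0, 1, 0, 1], [[0.5, 0.5]] * 4, num_experts=2)

# Collapsed routing: every token goes to expert 0.
collapsed = load_balance_loss([0, 0, 0, 0], [[0.9, 0.1]] * 4, num_experts=2)

print(balanced, collapsed)
```

The balanced case scores lower than the collapsed one, so minimizing this term alongside the main loss nudges the router away from piling work onto a few experts.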

So MoE is a sophisticated trade, not a strict upgrade. It buys cheaper inference and greater capacity at the price of trickier training and a large memory requirement.

How this affects you as a user

Most of this is invisible when you actually use a model: you send text and get text back, and the routing happens silently underneath. But the architecture explains a few things worth recognizing. It is the reason a model can quote an eye-popping total parameter count while being surprisingly affordable to run, which matters when you are comparing models by size and cost. It is why a quoted "total parameters" figure and an "active parameters per input" figure can differ dramatically for the same model, and why the active number is usually the better guide to compute cost per request, while the total number still determines how much memory is needed to host the model. And it is a reminder that headline size numbers need context: with MoE in the picture, two models with the same total parameter count can have very different real-world economics.
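A back-of-the-envelope comparison shows how two models with the same headline size can differ in practice. The sketch below rests on loud assumptions: both models are hypothetical, memory counts the weights only at 2 bytes per parameter, and compute uses the common rule of thumb of roughly 2 floating-point operations per active parameter per token for a forward pass. Real deployments involve more than this.

```python
def running_cost(total_params, active_params, bytes_per_param=2):
    """Rough hosting and compute figures for one model.

    Memory counts weights only; compute assumes about 2 floating-point
    operations per active parameter per token (a rule of thumb).
    """
    weight_memory_gb = total_params * bytes_per_param / 1e9
    gflops_per_token = 2 * active_params / 1e9
    return weight_memory_gb, gflops_per_token


# Two hypothetical models with the same total parameter count.
models = {
    "dense": (70_000_000_000, 70_000_000_000),  # every parameter is active
    "moe": (70_000_000_000, 10_000_000_000),    # only a slice is active
}
for name, (total, active) in models.items():
    memory, compute = running_cost(total, active)
    print(f"{name}: {memory:.0f} GB of weights, {compute:.0f} GFLOPs per token")
```

Under these assumptions both models need the same memory for their weights, but the MoE model does about a seventh of the arithmetic per token.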

The takeaway

Mixture-of-experts is the answer to a simple question: must the whole model run for every input? By splitting parts of the model into many parallel experts and using a learned router to activate only a few per input, MoE decouples a model's total size from the cost of using it. That lets a model be enormous in capacity yet relatively cheap to run, which is why so many capable, efficient models are built this way. The catch is that the experts must be kept evenly used, the full model must still fit in memory, and training is more involved. You will rarely interact with any of this directly, but it explains the otherwise puzzling combination of "gigantic" and "affordable," and it is the reason total parameter counts alone can be misleading.

