What Are Stems? Why Production Teams Need Separated Tracks, Not Mixed Masters

When I started in music licensing, I spent a lot of time explaining to advertising producers why the track they'd licensed wasn't actually usable in the cut they'd received from the editor. The file was there. The music was in it. But it was a stereo mixdown — a single WAV with every instrument baked together — and the edit required pulling the vocal hit to land three beats before the pack shot, not on it.

That problem has a name: they had a master, and they needed stems. Getting to grips with the difference between the two isn't just an audio nerd issue. It determines whether post-production on a given project takes half a day or four days.

What a stem actually is

A stem is an isolated audio layer — one discrete element of a piece of music exported as its own file, at the same sample rate and bit depth as the master, fully synchronised to it. A typical music stem package for an advertising track might include:

Drums and percussion — the rhythmic foundation, kick, snare, hi-hats, any percussion loops
Bass — sub-bass and bass guitar or bass synth isolated from everything else
Melody — the lead instrument or lead vocal line
Harmony/pads — chords, ambient layers, supporting instrumentation
FX/atmospheric — risers, transitions, ear candy that doesn't belong in any of the above categories

Stack all of those together and you get the master. The key point: the master is the output. The stems are the working materials that produced it.
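To make "stack them together" concrete, here is a minimal sketch of summing time-aligned stems sample by sample. The stem names and sample values are made up for illustration, and it assumes no processing on the master bus; a real master that has been through a limiter or other mastering chain will not match the raw sum of its stems exactly.

```python
# Sketch: summing time-aligned stems sample by sample rebuilds the mix.
# Stem names and sample values are illustrative, not real audio.

def sum_stems(stems):
    """Add equal-length stems together, one sample at a time."""
    lengths = {len(samples) for samples in stems.values()}
    if len(lengths) != 1:
        raise ValueError("stems must all be the same length to stay in sync")
    return [sum(frame) for frame in zip(*stems.values())]

stems = {
    "drums":   [0.5, 0.0, -0.25, 0.0],
    "bass":    [0.25, 0.25, 0.0, -0.25],
    "melody":  [0.0, 0.125, 0.25, 0.125],
    "harmony": [0.125, 0.125, 0.125, 0.125],
}

master = sum_stems(stems)
print(master)
```

The length check is the code version of "fully synchronised": if one stem is shorter or offset, the layers no longer line up and the sum is not the track you licensed.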

In traditional music production, the composer or producer keeps those working files. They may export stems as a deliverable if you ask — and pay — for them. In library music, this is often called a "stems license" or "production version," and it carries a premium over a sync-and-master license precisely because you're getting control.

Why the mixed master is almost never enough

Advertising editors and post-production teams routinely need to do things with music that are impossible — or at minimum destructive — without access to stems:

Timing the music to picture

An edit seldom runs at the length the composer wrote for. A 45-second cut becomes a 30-second version late in the production process. With a stereo master, your options for shortening are limited: hard cuts (which audibly break the music), speed changes that drag the pitch with them (which sound as bad as you'd expect), or asking the composer for a 30-second re-edit. With stems, an experienced audio editor can restructure the arrangement around the picture in a fraction of the time a re-edit would take.

Ducking under dialogue and VO

When you have a voice-over running over music, you need the music to sit under it. With a stereo master, you duck the whole thing. With stems, you can reduce the melody and harmony stems, which occupy the midrange frequencies where speech also lives, while keeping the drums and bass at a fuller level. The result is music that still sounds like music under the voice, rather than a muffled background wash.
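As a rough sketch of the difference, the snippet below applies a gain in decibels per stem before summing. The function names, the sample values and the -12dB figure are all assumptions chosen for illustration; in a real session this would be automation on the stem tracks, timed to the voice-over.

```python
# Sketch: duck only the midrange stems under a voice-over.
# The -12 dB figure and the sample values are illustrative assumptions.

def db_to_gain(db):
    """Convert a level change in decibels to a linear amplitude factor."""
    return 10 ** (db / 20)

def mix(stems, gains_db):
    """Sum stems after applying a per-stem gain in dB (default 0 dB)."""
    scaled = [
        [sample * db_to_gain(gains_db.get(name, 0.0)) for sample in samples]
        for name, samples in stems.items()
    ]
    return [sum(frame) for frame in zip(*scaled)]

stems = {
    "drums":   [0.5, 0.0],
    "bass":    [0.25, 0.25],
    "melody":  [0.5, 0.25],
    "harmony": [0.25, 0.25],
}

# Stereo-master approach: everything comes down together.
whole_mix_ducked = mix(stems, {name: -12.0 for name in stems})

# Stem approach: only melody and harmony come down; drums and bass stay put.
selective_duck = mix(stems, {"melody": -12.0, "harmony": -12.0})

print(whole_mix_ducked, selective_duck)
```

In the selective version the drums and bass keep their full level, which is why the bed still reads as music rather than a quieter copy of the whole track.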

Reacting to last-minute version changes

Consider this scenario: a 30-second insurance brand spot, three rounds of client amends, final version signed off on a Thursday evening. Post-production was working to the 30-second master. Late Thursday, the client comes back asking for a 15-second cutdown for pre-roll. Without stems, the editor either cuts brutally or calls the composer at 9pm. With stems, the 15-second version is built from the same materials as the 30-second — same drum hit, same melody entrance, same feel — just restructured. The difference in that workflow is typically two hours versus a full production day.

The bit depth and sample rate question

This is worth stating clearly because it gets confused in briefs all the time: stems should match the session format. If you're working in Pro Tools at 24-bit/48kHz — which is standard for broadcast delivery in the UK — your stems should be 24-bit/48kHz WAV files. Not 16-bit. Not 44.1kHz MP3s of the stems.

Why does this matter? Because stems often go through additional processing once they're in the editor's session: EQ, compression, automation, reverb sends. Each stage adds a little rounding error, and a lossy format such as MP3 has already thrown information away before that processing starts, so the degradation compounds. A 24-bit WAV gives you roughly 144dB of theoretical dynamic range to work within; a 16-bit file gives you roughly 96dB. For most delivery formats that difference is inaudible, but when you're stacking multiple stems and applying processing, you want headroom. We deliver everything at 24-bit/48kHz by default because that's what professional sessions are built around.
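Both points can be checked in a few lines. The sketch below computes the theoretical dynamic range of linear PCM from the bit depth, and reads a WAV header with Python's standard wave module to confirm a stem matches the session format. The helper names and the 24-bit/48kHz defaults are assumptions for illustration; note that the wave module only reads uncompressed PCM WAV files, so it will raise an error on other formats.

```python
import math
import os
import tempfile
import wave

def dynamic_range_db(bits):
    """Theoretical dynamic range of linear PCM: 20 * log10(2 ** bits)."""
    return 20 * math.log10(2 ** bits)

def wav_format(path):
    """Return (sample_rate_hz, bit_depth) read from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getsampwidth() * 8

def matches_session(path, rate_hz=48000, bits=24):
    """True if the file's sample rate and bit depth match the session."""
    return wav_format(path) == (rate_hz, bits)

print(round(dynamic_range_db(24), 1), round(dynamic_range_db(16), 1))
# prints: 144.5 96.3

# Demo: write a short silent 24-bit/48kHz stereo file, then check it.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "spotname_drums_48k24b.wav")
    with wave.open(path, "wb") as out:
        out.setnchannels(2)
        out.setsampwidth(3)       # 3 bytes per sample = 24-bit
        out.setframerate(48000)
        out.writeframes(b"\x00" * 6 * 480)  # 480 silent stereo frames
    demo_format = wav_format(path)
    demo_ok = matches_session(path)

print(demo_format, demo_ok)
```

Running a check like this over a delivered stem folder before import is a cheap way to catch a stray 16-bit or 44.1kHz file before it lands in the session.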

What "no stems" actually costs a production

The cost is rarely the license fee. It's the revision cycle. If you have a mixed master and the editor needs the music to behave differently, you go back to the composer or the library. That process typically takes 24 to 72 hours: you raise the request, the composer schedules it, they rework the arrangement, they re-export, you receive and conform. On a two-week production schedule, a turnaround at the long end of that range eats roughly half a working week.

We're not saying every project needs a full stem set. A 30-second online-only spot with a fixed lock that runs once and is never revised might be perfectly well served by a stereo master. But if there's any chance the edit will change — and in advertising, the edit almost always changes — stems are the insurance policy that costs almost nothing relative to the production budget but saves an outsized amount of calendar time.

Where stems come from in an AI generation workflow

The interesting thing about generating music through a model rather than commissioning a human composer is that stem separation doesn't need to be an afterthought or an add-on. Our generation pipeline produces stems natively — the model outputs individual layers rather than a mixed master, because polyphonic synthesis at the track level is how the architecture works. The melody stem, harmony stem, rhythm stem, and atmospheric layer each emerge from separate decoder paths.

That doesn't mean the problem of stem generation is solved. Harmony isolation in particular is genuinely hard — chords and pads bleed into melodic frequency ranges, and separating them cleanly requires more than frequency analysis. But the key point is that a model-first workflow doesn't have the same structural reason a human composer has to deliver a master and keep the stems: there's no proprietary session to protect, no Pro Tools session to hand over. Generation and stem delivery can be the same operation.

A note on stem labelling

One thing that causes real confusion on delivery: stem file names. We've seen stem sets delivered as Track01.wav, Track02.wav, Track03.wav with no indication of what's in each. By the time the editor discovers that Track02 is actually the percussion bus and Track03 is the atmospheric layer, they've spent twenty minutes auditioning files individually.

Standard practice: name stems by element and instrument family, and include the sample rate and bit depth in the filename. For example: spotname_drums_48k24b.wav, spotname_melody_48k24b.wav, spotname_harmony_48k24b.wav. This sounds obvious, but the number of stem deliveries that arrive without any of this would surprise you. When we package stems for delivery, the naming convention is baked into the export, so the editor can drop files into their session without auditioning anything to understand what they have.
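A naming convention like this is easy to enforce in code. The sketch below builds and parses filenames in the spotname_element_48k24b.wav pattern shown above. The function names and the exact regular expression are assumptions for illustration, not a description of our export code, and the sketch only handles whole-kHz sample rates.

```python
import re

# Pattern for names like "spotname_drums_48k24b.wav" (illustrative).
STEM_NAME = re.compile(
    r"^(?P<spot>[a-z0-9]+)_(?P<element>[a-z]+)_(?P<khz>\d+)k(?P<bits>\d+)b\.wav$"
)

def stem_filename(spot, element, rate_hz=48000, bits=24):
    """Build a stem filename carrying element, sample rate and bit depth."""
    if rate_hz % 1000:
        raise ValueError("this sketch only formats whole-kHz sample rates")
    return f"{spot}_{element}_{rate_hz // 1000}k{bits}b.wav"

def parse_stem_filename(name):
    """Return the fields encoded in a stem filename, or None if it doesn't fit."""
    match = STEM_NAME.match(name)
    if match is None:
        return None
    return {
        "spot": match["spot"],
        "element": match["element"],
        "rate_hz": int(match["khz"]) * 1000,
        "bits": int(match["bits"]),
    }

for element in ("drums", "melody", "harmony"):
    print(stem_filename("spotname", element))

print(parse_stem_filename("spotname_drums_48k24b.wav"))
print(parse_stem_filename("Track02.wav"))
```

A name like Track02.wav fails to parse, which is exactly the signal you want: the file tells the editor nothing, so it gets flagged before anyone has to audition it.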

The stem is the atom of modern production audio. Everything downstream — adaptive game music, podcast ducking, ad recuts, broadcast versioning — depends on having those layers separated before you need them, not after.