There is a version of this conversation that sounds pedantic: "always use WAV, MP3 is bad." That framing misses the actual problem. MP3 at 320 kbps sounds fine on first listen. The issue is not the first listen. The issue is what happens when that audio goes through three rounds of conform, recompress, and export in a broadcast post pipeline — and what your editor discovers at two in the morning before the air date.
I work on the audio generation side at Mozrat AI. We think carefully about output format because the stems we produce land in professional editing timelines, and format decisions made at generation time have consequences downstream that aren't always obvious until they become someone's emergency.
What MP3 encoding actually does to audio data
MP3 is a perceptual codec. The encoder analyses a signal and discards frequency content that psychoacoustic models predict you won't hear: transient detail masked by louder simultaneous sounds, high-frequency content above roughly 16 kHz at lower bitrates, spatial information in the stereo field. The result is a smaller file that sounds perceptually similar to the original — not identical.
The key word is perceptual. The discarded data is gone. You cannot recover it by converting back to WAV. An MP3-to-WAV conversion gives you an uncompressed file that contains the post-compression signal — same fidelity ceiling, larger file size.
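To put rough numbers on "same fidelity ceiling, larger file size", here is a minimal sketch of the size arithmetic. The function names are illustrative, and the figures cover the audio payload only (no container headers or metadata), assuming constant-bitrate MP3.

```python
def pcm_bytes(seconds: int, sample_rate: int = 48_000,
              bit_depth: int = 24, channels: int = 2) -> int:
    """Size of uncompressed linear PCM audio data, in bytes."""
    return seconds * sample_rate * (bit_depth // 8) * channels


def cbr_bytes(seconds: int, kbps: int) -> int:
    """Approximate size of a constant-bitrate lossy stream, in bytes."""
    return seconds * kbps * 1000 // 8


if __name__ == "__main__":
    wav = pcm_bytes(30)        # 30 s of 24-bit/48kHz stereo
    mp3 = cbr_bytes(30, 320)   # 30 s of 320 kbps MP3
    print(f"WAV payload: {wav:,} bytes")
    print(f"MP3 payload: {mp3:,} bytes")
    print(f"ratio: {wav / mp3:.1f}x")
```

Decoding the MP3 back to WAV restores the larger byte count but none of the discarded signal: the extra bytes describe the post-compression audio at higher precision, nothing more.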
What gets discarded matters in post-production contexts:
- Transient smear. Percussive attacks — snare hits, pizzicato strings, plucked guitar — lose definition at the leading edge. When you're cutting to picture, the transient is often what you're locking to. Smeared transients create cut points that feel slightly wrong without the editor immediately knowing why.
- High-frequency artefacts. MP3 encoding introduces pre-echo (often described as pre-ringing) and ringing artefacts around sharp transients, particularly audible in the 8–16 kHz range. They're subtle in isolation. They become audible when the track is EQ'd or when it is played back in a different room or on different monitors.
- Stereo imaging degradation. Joint stereo encoding, used by most MP3 encoders at mid bitrates, processes mid and side channels separately and can collapse spatial width in frequency bands it deems "similar enough." On dialogue-under-music mixes, this changes how the music sits in the stereo field relative to dialogue.
Generation compression versus delivery compression
Here is the specific failure mode we see when teams deliver MP3 music stems into professional post workflows.
The track goes in as MP3. The editor works with it in an Avid Media Composer or Adobe Premiere timeline. The project gets exported as an AAC master for the agency review cut: that is compression generation two. The agency sends notes. The editor conforms the sequence and re-exports with the revised cut, which is generation three. The final deliverable goes through a broadcast encode, a YouTube transcode, or a streaming platform master, which makes generation four or five.
Each generation of lossy encoding does not start from the original signal. It starts from whatever came out of the previous encode, so codec artefacts accumulate. MP3 pre-echo and ringing artefacts land in frequency bands that the next-generation codec (typically AAC) treats as genuine signal and encodes, imperfectly. What started as a subtle shimmer on a hi-hat at generation one is an audible metallic ring by generation four.
This is not theoretical. Broadcast post facilities routinely work in lossless formats for exactly this reason. The 1980s-era rule "never dub an analogue tape more than three generations" has a direct digital-era equivalent: never chain lossy codecs.
The 24-bit/48kHz baseline and why it matters for stems specifically
Professional post-production works at 24-bit/48kHz as a floor. Not 16-bit/44.1kHz (CD quality), not 24-bit/44.1kHz. The 48kHz sample rate is the broadcast and film standard; 44.1kHz is the CD and consumer streaming standard. Using 44.1kHz audio in a 48kHz project requires sample rate conversion (SRC). Even good SRC is an extra processing step, and weaker algorithms introduce aliasing or filter ringing near the Nyquist frequency.
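One way to see why 44.1kHz to 48kHz conversion is real resampling rather than a trivial adjustment is to reduce the ratio between the two rates. The sketch below uses illustrative function names and only does the ratio arithmetic; it does not perform any resampling.

```python
from fractions import Fraction


def src_ratio(source_hz: int, target_hz: int) -> Fraction:
    """Output samples produced per input sample, as a reduced fraction."""
    return Fraction(target_hz, source_hz)


def resampled_length(num_samples: int, source_hz: int, target_hz: int) -> int:
    """Number of samples after conversion (rounded to the nearest sample)."""
    return round(num_samples * src_ratio(source_hz, target_hz))


if __name__ == "__main__":
    ratio = src_ratio(44_100, 48_000)
    print(ratio)  # 160/147
    # A 30-second spot at 44.1kHz, converted to 48kHz
    print(resampled_length(30 * 44_100, 44_100, 48_000))
```

Because the ratio reduces to 160/147 rather than a whole number, almost every output sample falls between input samples and has to be interpolated through a filter. That filter is where SRC algorithms differ in quality.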
The bit depth argument for stems is different from the argument for a final master. A stem is a single isolated layer, not a full mix, and the dynamic range of 24-bit matters during processing. When you apply EQ, compression, or volume automation in a 32-bit floating-point processing environment, the processing itself has headroom, but the noise floor of a 16-bit source is already baked in. A quiet section of a 16-bit source sitting around -48 dBFS is described by only about eight bits of resolution, and any gain or compression applied afterwards raises its quantisation noise along with the signal. The same passage at 24-bit still has about sixteen bits.
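The relationship between signal level and remaining resolution can be sketched numerically. This assumes ideal linear PCM, where each bit contributes about 6.02 dB of dynamic range; the function names are illustrative, and real converters and dither change the exact figures.

```python
import math

DB_PER_BIT = 20 * math.log10(2)  # about 6.02 dB per bit of linear PCM


def dynamic_range_db(bit_depth: int) -> float:
    """Theoretical dynamic range of ideal linear PCM at this bit depth."""
    return bit_depth * DB_PER_BIT


def effective_bits(bit_depth: int, level_dbfs: float) -> float:
    """Bits left to describe a signal peaking at level_dbfs (0 or below)."""
    if level_dbfs > 0:
        raise ValueError("level_dbfs must be at or below 0 dBFS")
    return max(0.0, bit_depth + level_dbfs / DB_PER_BIT)


if __name__ == "__main__":
    for depth in (16, 24):
        print(f"{depth}-bit: range {dynamic_range_db(depth):.1f} dB, "
              f"bits left at -48 dBFS: {effective_bits(depth, -48):.1f}")
```

Under these assumptions a passage at -48 dBFS keeps roughly 8 bits in a 16-bit file and roughly 16 bits in a 24-bit file, which is the margin that matters once the stem is boosted or compressed.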
We deliver Mozrat AI stems as 24-bit/48kHz WAV. That is not a premium option; it is the default. The reason is straightforward: we don't know what post-production environment the stems are going into, and 24/48 WAV is the one format that creates no conversion problem regardless of target delivery spec.
When MP3 is actually acceptable
We're not saying MP3 is categorically unusable in a professional context. The cases where it works:
- Preview and reference only. Sending an MP3 for client approval before committing to a full stem set is completely normal. The problem occurs when the approved MP3 gets handed directly to the editor as the working asset.
- Single-generation delivery to a lossy endpoint. If you are producing a podcast episode that will be encoded to MP3 128 kbps for distribution and there is no intermediate conform step, starting from a 256 kbps or 320 kbps MP3 source adds minimal artefact accumulation. This is a narrow case.
- Archive copies of final mixes — not stems, not session assets. Once you have the lossless master, compressed copies for storage or distribution are fine.
The line is: MP3 for distribution endpoints, WAV for working assets. When music enters a post-production timeline as a working asset, it should be WAV.
A specific scenario: the broadcast deadline
Consider a 30-second TV commercial with a 10-day post schedule. The agency brief calls for an emotional underscore — minimal piano, sustained pad, no percussion. Mozrat AI generates the track and delivers four stems: piano, pad, bass, textural ambience, all as 24-bit/48kHz WAV.
On day seven, the client asks for a version with the piano dropped down in the final five seconds to land harder on the VO. Because the editor has the piano stem separately, this is a two-minute automation change. The conform is clean; the stems are lossless; there are no artefacts introduced by the edit.
Now run the same scenario with the music delivered as a single MP3 mix. The editor cannot isolate the piano. "Dropping the piano" becomes either a surgical EQ notch (which affects everything in the mix at piano frequencies, including the pad and ambience) or a full music replacement. A two-minute change becomes a half-day negotiation with the composer about a revision, which, on day seven of a ten-day schedule, is where campaign timelines start to collapse.
The format question and the stem question are related. Both are about preserving optionality in post. WAV stems give the editor lossless source audio and layer-level control. A single MP3 mix gives the editor neither.
What to check before accepting a music delivery
If you receive music assets from any source and you're putting them into a professional post pipeline, check:
- File format. WAV or AIFF lossless, not MP3 or AAC. A file renamed to .wav that was previously MP3 is still MP3 inside, so check the actual codec and sample format in your DAW's info panel or in a tool like MediaInfo.
- Sample rate. 48kHz for broadcast/film projects. 44.1kHz for music-only delivery. Mismatch requires SRC; know what SRC algorithm your NLE uses.
- Bit depth. 24-bit for working assets. 16-bit is acceptable only for final delivery to endpoints that require it (some broadcast specs still call for 16-bit PCM).
- Stem count and labelling. Confirm stems are genuinely isolated layers, not submixes or stems-of-stems with bleed from adjacent layers.
These checks take thirty seconds and prevent the two-in-the-morning artefact discovery. The format decision happens before the creative decision. Get it wrong and the creative work becomes harder regardless of how good the music is.
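The format, sample rate, and bit depth checks can also be scripted. The following is a minimal sketch that reads the header of a WAV file directly. The function names and default targets are illustrative, it does not handle AIFF or RF64 files, and it cannot detect an MP3 that was decoded to PCM and re-saved as WAV, since such a file has a perfectly valid PCM header.

```python
import struct

PCM, IEEE_FLOAT, EXTENSIBLE = 0x0001, 0x0003, 0xFFFE


def read_wav_format(path):
    """Return (format_tag, channels, sample_rate, bits) from the fmt chunk."""
    with open(path, "rb") as f:
        head = f.read(12)
        if len(head) < 12 or head[:4] != b"RIFF" or head[8:12] != b"WAVE":
            raise ValueError("not a RIFF/WAVE container")
        while True:
            chunk_head = f.read(8)
            if len(chunk_head) < 8:
                raise ValueError("no fmt chunk found")
            chunk_id, size = struct.unpack("<4sI", chunk_head)
            if chunk_id == b"fmt ":
                fmt = f.read(size)
                if len(fmt) < 16:
                    raise ValueError("truncated fmt chunk")
                tag, channels, rate, _, _, bits = struct.unpack(
                    "<HHIIHH", fmt[:16])
                # Extensible headers keep the real codec tag in the
                # first two bytes of the SubFormat field.
                if tag == EXTENSIBLE and len(fmt) >= 26:
                    tag = struct.unpack("<H", fmt[24:26])[0]
                return tag, channels, rate, bits
            f.seek(size + (size & 1), 1)  # chunks are padded to even length


def check_delivery(path, want_rate=48_000, want_bits=24):
    """Return a list of problems; an empty list means the header looks fine."""
    try:
        tag, _channels, rate, bits = read_wav_format(path)
    except ValueError as err:
        return [str(err)]
    problems = []
    if tag not in (PCM, IEEE_FLOAT):
        problems.append(f"codec tag 0x{tag:04X} is not linear PCM or float")
    if rate != want_rate:
        problems.append(f"sample rate is {rate} Hz, expected {want_rate} Hz")
    if bits < want_bits:
        problems.append(f"bit depth is {bits}, expected at least {want_bits}")
    return problems
```

Calling `check_delivery("piano_stem.wav")` (a hypothetical filename) returns an empty list when the header reports 48kHz and at least 24-bit linear PCM or float, and a list of readable problems otherwise. A renamed MP3 fails at the first step because it has no RIFF/WAVE header.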
At Mozrat AI, every output goes through a lossless quality gate before delivery. Not because it is difficult — it is not — but because the downstream consequences of getting it wrong fall on editors and sound designers who have no control over what they receive. The format is the foundation. Build it right.