How We Isolate the Harmony Stem: A Technical Walk-Through

Priya Nair

When we talk about stem separation in music production contexts, the conversation usually focuses on the four canonical stems: vocals, drums, bass, other. This framing comes from the demixing research literature, where "other" is a catch-all for everything that isn't neatly categorisable. For production use, "other" is useless. Production teams need the harmonic content — the chords, pads, sustained textures — separated from both the melodic lead and the rhythmic content, as a controllable, clean layer.

Harmony isolation is harder than melody or rhythm isolation. Here's why, and here's what we've built to address it.

Why harmonic content doesn't separate cleanly under spectral analysis

Melody instruments and percussion have spectral properties that make them tractable for separation. A lead melody typically occupies a relatively narrow frequency band, moves in single-note sequences with distinct onset events, and has a characteristic harmonic series that repeats consistently with each note. Percussion — kick, snare, hi-hat — has short, broadband transients that are temporally isolated. Neither of these is clean in practice, but they have features that standard spectral masking approaches can latch onto.

Harmonic content — chords played by piano, pad synthesisers, strings, choir voicings, sustained brass — doesn't have those properties. It shares frequency space with almost everything else in the mix:

  • The fundamental and lower harmonics of a chord overlap with bass content in the 80–250 Hz range.
  • The upper harmonics of chord voicings overlap with the harmonic series of melodic instruments in the 500 Hz–4 kHz range.
  • Sustained pad textures have slow amplitude envelopes that look spectrally similar to room tone and reverb tails.
  • Polyphonic chord changes produce onset patterns that differ from single-note melodic motion but are not as temporally crisp as percussion transients.

A naive spectral mask built to preserve harmonic content will also preserve bass overtones and melody fundamentals. The result is a "harmony stem" with significant bleed from adjacent sources. In post-production, once the stems are processed separately and recombined, that bleed can produce audible comb filtering.
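To make the overlap concrete, here is a toy sketch of a naive mask that keeps anything near a chord partial. The arrangement (a C major triad on C3, E3, G3 with bass on C2 and melody on G4), the partial counts, and the 5 Hz tolerance are all assumptions chosen for illustration; this is not how a real separation model builds its mask.

```python
def midi_to_hz(note):
    """Equal-tempered frequency for a MIDI note number (A4 = 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)


def partials(note, count):
    """First `count` harmonic partials of a note, in Hz."""
    f0 = midi_to_hz(note)
    return [f0 * k for k in range(1, count + 1)]


def passes_mask(freq, mask_centres, tolerance_hz=5.0):
    """Naive binary mask: keep any frequency close to a chord partial."""
    return any(abs(freq - centre) <= tolerance_hz for centre in mask_centres)


# Hypothetical arrangement: C major triad (C3, E3, G3), bass on C2, melody on G4.
chord_notes = [48, 52, 55]
mask_centres = [f for note in chord_notes for f in partials(note, 8)]

# Which partials of the *other* sources does the harmony mask let through?
bass_bleed = [round(f, 1) for f in partials(36, 6) if passes_mask(f, mask_centres)]
melody_bleed = [round(f, 1) for f in partials(67, 3) if passes_mask(f, mask_centres)]

print("bass partials passed by the harmony mask:", bass_bleed)
print("melody partials passed by the harmony mask:", melody_bleed)
```

Because the bass and melody notes are themselves chord tones, their partials land on or next to the chord's own partials, so a frequency-only mask cannot tell them apart.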

Harmonic bleed and why it matters in practice

The specific production failure mode: a sound designer in a Wwise project brings in the harmony stem and pulls it down under a tense cutscene. The harmony stem contains low-frequency bleed from the bass line. When the sound designer applies a high-pass filter at 150 Hz to clean up the low end, the filter also attenuates the fundamental frequencies of the chord voicings, which sit in the same range. The stem goes thin. The designer can't tell whether this is a property of the harmony writing or of the isolation quality.
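A quick calculation shows how much a 150 Hz high-pass takes out of low chord tones. The filter type and order are assumptions (the scenario above only specifies the cutoff); the sketch uses the textbook magnitude response of a second-order Butterworth high-pass, and the chord tones listed are illustrative.

```python
import math


def butterworth_highpass_db(freq_hz, cutoff_hz, order=2):
    """Magnitude response of an ideal Butterworth high-pass filter, in dB.

    |H(f)|^2 = 1 / (1 + (cutoff / f)^(2 * order))
    """
    ratio = (cutoff_hz / freq_hz) ** (2 * order)
    return -10.0 * math.log10(1.0 + ratio)


# Illustrative chord tones (equal temperament, rounded): C2, C3, G3, E4.
chord_tones = [("C2", 65.41), ("C3", 130.81), ("G3", 196.00), ("E4", 329.63)]

for name, freq in chord_tones:
    loss = butterworth_highpass_db(freq, cutoff_hz=150.0)
    print(f"{name} {freq:7.2f} Hz: {loss:6.1f} dB")
```

Under these assumptions the filter cuts the lowest chord fundamentals by well over 3 dB while barely touching the mid-register tones, so bass bleed and genuine chord roots are attenuated together.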

Clean harmony isolation means the harmonic content is attributable to the harmonic source, not to bleed from adjacent stems. When a post-production engineer EQs a clean harmony stem, they can be confident they're working with the actual harmonic material.

Our architecture: conditional generation with stem conditioning

The key insight in our approach is that we do not isolate harmony by separating it from a mixed signal. We generate each stem as a separate output conditioned on the shared musical context. This is a fundamentally different framing from demixing.

Demixing takes a mixed audio signal and attempts to invert the mixing process — to separate what was combined. This is mathematically ill-posed when sources share frequency space, which is why "other" in standard demixing models is so noisy.

Our model generates stems directly during the synthesis process, conditioned on a shared harmonic-rhythmic substrate. The harmony stem is generated with knowledge of what the melody stem contains, and that conditioning is used to avoid placing content in frequency regions already occupied by the melody. The stems are not independent draws; they are jointly generated so that each one competes as little as possible with the others for the same spectral space when they are mixed.

In architectural terms: the generation passes for each stem share a common conditioning representation derived from the brief embedding and the global musical structure. Each stem-specific decoder has access to the outputs of all other stem decoders from the previous generation step, via cross-attention. The harmony decoder, when deciding what to place in the 2–4 kHz range, attends to what the melody decoder has already placed there and suppresses content that would create masking conflict.
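The cross-attention step can be illustrated with a minimal scaled dot-product attention in plain Python. This is a sketch of the general technique, not the production model: the stem names, the 4-dimensional toy vectors, and the absence of learned query/key/value projections are all simplifying assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, keys, values):
    """Scaled dot-product attention of one query over other stems' states."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, context


# Hypothetical previous-step states from the other stem decoders (toy vectors).
other_stems = {
    "melody": [0.9, 0.1, 0.0, 0.8],
    "bass": [0.0, 0.9, 0.1, 0.0],
    "rhythm": [0.1, 0.0, 0.9, 0.1],
}
# Toy query from the harmony decoder, standing in for
# "what is already present in the 2 to 4 kHz region?"
harmony_query = [1.0, 0.0, 0.0, 1.0]

names = list(other_stems)
states = [other_stems[name] for name in names]
weights, context = cross_attend(harmony_query, states, states)

for name, weight in zip(names, weights):
    print(f"attention on {name}: {weight:.3f}")
```

In this toy setup the melody state aligns best with the harmony query and so receives the largest weight; the resulting context vector is what a decoder could then use to decide how much content to suppress.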

Handling polyphonic chord voicings

The hardest specific case within harmony isolation is wide voicings in dense orchestral or hybrid electronic textures. A chord voiced across four octaves (say, a C major chord with bass C at 65 Hz, mid-register E and G at 330 Hz and 392 Hz, and upper-octave doublings above 1 kHz) occupies much of the useful audio spectrum simultaneously. Keeping those voicings coherent as a single stem while cleanly separating them from co-occurring melody notes at similar pitches requires the model to track harmonic identity across the whole frequency spectrum, not just within one spectral region.
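The frequencies quoted for this voicing follow from the standard equal-temperament formula, f = 440 * 2^((n - 69) / 12) for MIDI note n. The short sketch below reproduces them; the specific upper doublings (C6 and E6) are an assumed example of notes "above 1 kHz", since the text does not name them.

```python
def midi_to_hz(note, a4_hz=440.0):
    """Equal-tempered frequency of a MIDI note number (A4 = note 69)."""
    return a4_hz * 2.0 ** ((note - 69) / 12.0)


# Bass C, mid-register E and G, and two assumed upper-octave doublings.
voicing = {"C2": 36, "E4": 64, "G4": 67, "C6": 84, "E6": 88}

for name, note in voicing.items():
    print(f"{name}: {midi_to_hz(note):8.2f} Hz")
```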

We condition the harmony decoder on explicit chord-symbol representations derived from the brief's harmonic intent. The model learns to track which spectral content "belongs" to a chord versus which content belongs to a melody note — even when they're at the same fundamental frequency — by attending to the temporal and timbral context that distinguishes sustained chord tones from melodic passing notes.

This doesn't work perfectly for every case. Dense mid-tempo passages where a melody instrument is playing arpeggiated chord tones — effectively turning a melodic line into a harmonic statement — create genuine ambiguity that the model resolves with a probabilistic assignment, and the boundary between melody and harmony stems in those passages is imperfect. This is a known limitation of our current architecture, not something we've fully solved.

Frequency masking and the EQ headroom test

One practical test we use internally for stem quality: apply a +6 dB narrow-band EQ boost at five frequency points across each stem and listen for bleed artefacts from adjacent stems. A clean harmony stem, boosted at 500 Hz, should sound like an EQ'd chord pad — not like an EQ'd chord pad plus ghost melody notes. A clean melody stem, boosted at 2 kHz, should emphasise the lead instrument, not pull up pad harmonics.

We run this test at stem output QA. Stems that fail the EQ headroom test — where a boost at any frequency reveals significant content that doesn't belong to that stem's source — are flagged for model refinement. The test is simple and audibly interpretable, which matters when you're trying to track model behaviour across training iterations.
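The boost itself is easy to reproduce outside a DAW. The sketch below builds a +6 dB peaking filter from the widely used Audio EQ Cookbook biquad formulas and checks its gain at the centre frequency. The sample rate, the Q of 4, and three of the five test frequencies are assumptions (only 500 Hz and 2 kHz are named above), and the listening step is still manual.

```python
import cmath
import math


def peaking_eq_coeffs(f0_hz, gain_db, q, sample_rate):
    """Biquad peaking-EQ coefficients (Audio EQ Cookbook form), with a0 = 1."""
    a_lin = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * f0_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha / a_lin
    b = [(1.0 + alpha * a_lin) / a0, -2.0 * math.cos(w0) / a0, (1.0 - alpha * a_lin) / a0]
    a = [1.0, -2.0 * math.cos(w0) / a0, (1.0 - alpha / a_lin) / a0]
    return b, a


def gain_db_at(b, a, freq_hz, sample_rate):
    """Filter magnitude in dB at one frequency, evaluated on the unit circle."""
    z1 = cmath.exp(-1j * 2.0 * math.pi * freq_hz / sample_rate)
    h = (b[0] + b[1] * z1 + b[2] * z1 * z1) / (a[0] + a[1] * z1 + a[2] * z1 * z1)
    return 20.0 * math.log10(abs(h))


def apply_biquad(samples, b, a):
    """Direct-form I filtering of a list of float samples."""
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in samples:
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


SAMPLE_RATE = 48000  # assumed
TEST_POINTS_HZ = [125, 500, 1000, 2000, 4000]  # assumed set of five points

for f0 in TEST_POINTS_HZ:
    b, a = peaking_eq_coeffs(f0, gain_db=6.0, q=4.0, sample_rate=SAMPLE_RATE)
    print(f"{f0} Hz boost: {gain_db_at(b, a, f0, SAMPLE_RATE):.2f} dB at centre")
```

Running `apply_biquad` over a stem's samples with each coefficient set gives the boosted versions to audition for ghost notes or pad harmonics.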

What this means for post-production workflows

A clean harmony stem with real EQ headroom changes several post-production tasks:

  • Adaptive music in game audio: The harmony stem can be brought in and out of Wwise state-based mixes without pulling up unwanted bass or melody content. A tension build that adds harmonic density is achievable without complicating the percussion and melody balance.
  • Music-under-dialogue mixing: Pulling the harmony stem down independently allows a sound mixer to thin the mid-range under a VO without touching the rhythmic or melodic character of the track.
  • Key and chord change in post: Some post tools (iZotope RX, Melodyne with polyphonic mode) can process a clean harmony stem to adjust chord voicings. This only works if the stem is clean — a bleed-heavy stem produces artefacts during pitch processing that are more audible than the original mix.

The goal is stems you can actually work with, not stems that technically exist as separate files. The distinction matters when the asset has to go through a professional post pipeline and carry its production value all the way to the final delivery format.