Game Audio

Adaptive Audio for a 40-Hour RPG: What We Learned Delivering Stems at Scale

Sundar Arvind
Layered audio stems branching and responding to game state changes

When a game studio came to us with a brief for a 40-hour RPG, the scope was immediately unlike anything we'd handled before. Not one score — an entire music system. Eighty distinct zones, each requiring a full adaptive music set covering exploration, low-tension, combat encounter, and boss combat states. Twelve ambient environmental states that needed to layer cleanly over the zone music without frequency collision. A total brief that, when we worked through the maths, required approximately 1,400 individual stem files delivered at 48 kHz, 24-bit WAV.

We said yes. Here is an honest account of what that actually involved, what broke, and what we'd do differently.

Brief architecture: why zone-level briefs don't scale

The studio's audio director initially wrote zone-level briefs — one brief per zone describing the biome, the narrative function, and the emotional arc. This is how you'd brief a composer, and it's a reasonable starting point. It does not scale to 80 zones in a generative audio context, because zone-level briefs don't capture the adaptive state relationships within each zone.

A zone brief that says "ancient forest, melancholic, orchestral" gives us enough to generate exploration-state music. It tells us nothing about how the combat state should relate to the exploration state melodically and harmonically — whether they should share a leitmotif, differ only in rhythmic density, or represent a distinct tonal departure. If we generate those states without a shared musical DNA specification, the transitions between states in Wwise feel discontinuous. A player moving from exploration to combat hears a genre shift, not a musical escalation.

We rebuilt the brief structure into a three-layer hierarchy: zone identity (biome, era, emotional register), state relationships (what musical elements persist across adaptive states, what changes), and stem separation requirements (which stems are required to carry the adaptive transitions — usually rhythm and harmony, with melody held constant across exploration and combat tiers). This added two weeks to the production schedule but produced stem sets that behaved correctly in the Wwise implementation without manual fixes.

Consistency at scale: the drift problem

Generating 80 zones in sequence produced a subtle but real problem: tonal drift. The first twenty zones had a clear shared musical language — the model's conditioning from the overarching world brief was strongly represented. By zone 60, that conditioning had been diluted by zone-to-zone specificity in the briefs, and the output started to diverge from the established world sound in ways that weren't intentional.

We caught this during a mid-production listening review when the audio director noticed that three zones in the northern region had a melodic character that felt more Mediterranean than the cold Nordic aesthetic established for that part of the world. The notes were correct, the instrumentation was correct, but something in the modal choices had drifted.

The fix was a world-anchor brief that we prepended to every zone brief during generation — a fixed reference block that described the game's overarching musical identity, key centre preferences, and forbidden timbral elements. This kept zone-to-zone output within a coherent musical universe while still allowing zone-specific differentiation. It added roughly 8% to inference time per generation due to the longer prompt, which at 1,400 files was not trivial, but the alternative was manual correction on a significant fraction of the output.

The stem count question: four is not always enough

Our standard delivery is four stems: melody, harmony, rhythm, and mix (a clean full mix for situations where the full track is needed without state-based layering). For this project, the Wwise implementation required six stems in some zones: melody, harmony, rhythm, percussion (separated from pitched rhythm content), bass (separated from full mix), and mix. The distinction between rhythm and percussion, and between bass and full mix, turned out to matter in the studio's middleware implementation because the audio programmer was using stem switching — not blending — for the combat transitions.

In a stem-switching context, each stem that persists across states needs to be isolated enough to carry its full role independently. A rhythm stem that contains bass content becomes problematic when the bass stem is also present in the low-intensity state but the rhythm stem is present in the high-intensity state — the bass appears, disappears, and partially reappears in a way that sounds like a mixing error rather than an intentional state change.

We added the percussion and bass stems for the 23 zones that used stem switching. This is now part of our RPG brief template: ask at brief intake whether the implementation uses stem blending or stem switching, because the stem separation requirements are fundamentally different.

Delivery logistics: naming conventions, versioning, and handoff

1,400 files delivered without a rigorous naming convention becomes an operational nightmare in a game audio pipeline. We had agreed on a naming pattern at project start, but the studio's implementation team had evolved their internal Wwise bank structure during development, and our file names didn't map cleanly to their bank naming convention by the time we were 60% through delivery.

We ended up writing a remapping script that translated our generation naming scheme into the Wwise bank structure, which the audio programmer reviewed and approved before we ran the full conversion pass. The lesson is obvious in retrospect: align naming conventions to the Wwise bank structure at brief intake, not halfway through delivery.

Versioning was the other operational gap. A zone score that needs revision after delivery — because a narrative beat changed, or the director's tone reference shifted — needs to be clearly tagged as v2 in a way that the implementation team can identify and swap in the Wwise project without manual cross-checking. We now deliver all files with a generation hash in the filename and a manifest JSON that maps each file to its brief version, so version tracking is explicit rather than implicit in the filename.

What we'd do differently

Three things. First, require a Wwise bank structure document at project kickoff — not after. The implementation architecture determines the stem separation requirements, and those need to be in the brief before any generation starts.

Second, build the world-anchor brief into the intake process as a standard deliverable. For any project covering more than 20 zones or more than 30 minutes of content, world-level musical coherence doesn't emerge from zone-level briefs alone. It needs explicit specification.

Third, run a mid-point listening review at 40–50% completion, not at 80%. By 80%, the cost of fixing drift is significant; at 50%, the correction is a brief adjustment and a regeneration pass on the second half. The production schedule needs to accommodate that review as a planned gate, not as an optional quality check that gets cut when the timeline compresses.

The project shipped. The audio director was happy with the output. But happy with the output is a lower bar than we want — we want the production process to be smooth enough that the team isn't spending engineering time on remapping scripts and mid-production brief restructures. That's what we're building toward.