Latency in generative audio is not a single number — it is a pipeline with at least five distinct stages, each contributing to the delay between "submit brief" and "playable stems in DAW." We've spent the last quarter instrumenting every stage of our generation path and benchmarking against what production teams actually need. This post shares where those numbers sit today, where we've made gains, and where the hard constraints still are.
Defining the latency measurement problem
Most published latency figures for generative audio systems are cherry-picked under optimal conditions: short output durations, low-concurrency infrastructure, and a clock that starts at model inference rather than at API call initiation. This is not what a production team experiences. For practical purposes, end-to-end latency needs to be measured from the moment the brief is submitted to the moment the first playable stem file is available for download or streaming, including tokenisation, queuing, inference, audio rendering, stem separation, and delivery.
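One way to capture that submit-to-delivery window is to record a timestamp as each stage finishes and take the differences. The sketch below is illustrative only: the `StageTimer` class and the stage names passed to `mark` are hypothetical, not part of our instrumentation, and the injectable `clock` argument exists only so the timer can be exercised without real waiting.

```python
import time
from typing import Callable, Dict, List, Tuple


class StageTimer:
    """Records wall-clock marks from brief submission onwards (hypothetical helper)."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self._clock = clock
        # The first mark is taken at construction, i.e. at brief submission,
        # not at inference start.
        self._marks: List[Tuple[str, float]] = [("submitted", clock())]

    def mark(self, stage: str) -> None:
        """Record that `stage` has just finished."""
        self._marks.append((stage, self._clock()))

    def stage_durations(self) -> Dict[str, float]:
        """Seconds spent in each stage, keyed by the stage that ended."""
        durations: Dict[str, float] = {}
        for (_, previous), (name, now) in zip(self._marks, self._marks[1:]):
            durations[name] = now - previous
        return durations

    def end_to_end(self) -> float:
        """Seconds from submission to the most recent mark."""
        return self._marks[-1][1] - self._marks[0][1]
```

In use, the timer would be created when the API call arrives and `mark` called as each stage hands off to the next, so that `end_to_end()` includes queue wait and delivery rather than inference alone.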
We measure across four tiers based on output duration: 15-second cuts, 30-second cuts, 60-second cuts, and 90-second cuts. Each tier has different inference cost and different practical use cases — 15-second cuts are relevant for ad tags and short-form social content; 60–90-second cuts are the standard for broadcast advertising and game ambience loops.
Current benchmark numbers by output tier
At median load (not peak, not zero-load), our current end-to-end latency figures across the four tiers are as follows. These are P50 numbers — half of generations complete faster, half slower.
For 15-second four-stem output: median 8.4 seconds. For 30-second four-stem output: median 17.2 seconds. For 60-second four-stem output: median 34.9 seconds. For 90-second four-stem output: median 51.6 seconds. P95 figures run roughly 2.3× the P50, so a 60-second generation that completes in about 35 seconds at the median has a P95 of about 80 seconds: roughly one generation in twenty takes that long or longer.
The relationship is roughly linear with output duration, which tells us that inference time is the dominant cost component and that we haven't hit any non-linear scaling problems yet in the 15–90 second range.
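The linearity claim can be checked directly from the four published P50 figures. The sketch below computes latency per second of output for each tier and fits an ordinary least-squares line through the four points; the function names are ours for illustration, and four points are only enough for a sanity check, not a statistical result.

```python
# P50 end-to-end latency in seconds, keyed by output duration in seconds.
P50_BY_DURATION = {15: 8.4, 30: 17.2, 60: 34.9, 90: 51.6}


def latency_per_output_second(p50_by_duration):
    """Seconds of latency per second of generated audio, per tier."""
    return {d: latency / d for d, latency in p50_by_duration.items()}


def least_squares_line(points):
    """Return (slope, intercept) of the ordinary least-squares fit."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


ratios = latency_per_output_second(P50_BY_DURATION)
slope, intercept = least_squares_line(list(P50_BY_DURATION.items()))
for duration, ratio in ratios.items():
    print(f"{duration:>3} s output: {ratio:.3f} s of latency per output second")
print(f"fit: latency = {slope:.3f} * duration + {intercept:.2f}")
```

The per-tier ratios all fall between 0.56 and 0.58, and the fitted slope is about 0.58 seconds of latency per second of output, which is what a roughly linear relationship looks like on this data.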
Where time is actually spent
The pipeline breakdown for a 60-second, four-stem generation at current P50 is approximately: brief embedding and tokenisation — 0.8 seconds; generation queue wait — 3.1 seconds; model inference — 24.2 seconds; audio rendering and stem separation — 5.4 seconds; packaging and delivery endpoint — 1.4 seconds.
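As a quick consistency check, the five stage figures sum to the 34.9-second P50 reported for the 60-second tier. The short sketch below does that sum and expresses each stage as a share of the total; the dictionary keys are simply the stage labels used above.

```python
# P50 stage timings in seconds for a 60-second, four-stem generation.
STAGES_60S_P50 = {
    "brief embedding and tokenisation": 0.8,
    "generation queue wait": 3.1,
    "model inference": 24.2,
    "audio rendering and stem separation": 5.4,
    "packaging and delivery endpoint": 1.4,
}


def stage_shares(stages):
    """Return (total_seconds, {stage: fraction_of_total})."""
    total = sum(stages.values())
    return total, {name: seconds / total for name, seconds in stages.items()}


total, shares = stage_shares(STAGES_60S_P50)
for name, share in shares.items():
    print(f"{name:<38} {STAGES_60S_P50[name]:>5.1f} s  {share:6.1%}")
print(f"{'total':<38} {total:>5.1f} s")
```

Model inference comes out at roughly 69% of the end-to-end figure, with rendering and stem separation a distant second at about 15%.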
Two things are notable here. First, the queue wait (3.1 seconds) is our highest-variance stage — it can drop to near zero at off-peak hours and spike to 25+ seconds during high-concurrency periods. Second, audio rendering and stem separation (5.4 seconds) is larger than we'd like. The inference stage generates a compact latent representation; rendering that to 24-bit WAV stems at 48 kHz requires a separate decode pass that we haven't yet parallelised effectively with the inference pass.
The real-time threshold and where we're not there yet
"Real-time" in audio production contexts has a specific meaning: generation completes in less time than the output duration. A 30-second generation completing in 17 seconds meets this threshold; that generation is faster than the output it produces. By this definition, we currently meet the real-time threshold for 15, 30, and 60-second outputs under median load. We do not meet it for 90-second outputs (51 seconds generation time versus 90 seconds output), and we do not meet it for any tier at P95.
This matters most for interactive use cases — live game audio adaptation where stems need to respond to game state within a fixed frame budget, or broadcast workflows where a director needs to audition and approve a cut before the session ends. For those use cases, the current numbers are good enough for 15–30 second cuts and require pre-generation caching strategies for anything longer.
Infrastructure changes that have moved the needle
Two changes in the past six months have improved our P50 inference time meaningfully. The first is speculative decoding applied to the generation pass, which reduced inference time for 60-second outputs by approximately 22% without measurable quality degradation. The tradeoff is higher GPU memory consumption, which constrains how aggressively we can scale concurrent generation at a given infrastructure cost.
The second change is a streaming delivery endpoint that begins sending stem data before generation is fully complete — the melody stem is rendered and delivered while harmony and rhythm are still being generated. This doesn't improve the time-to-full-stems metric, but it does improve the time-to-first-audible-stem from 34.9 seconds to approximately 11 seconds for 60-second output. For production workflows where the sound designer wants to start working with the melody immediately, this is a meaningful practical improvement even if the total latency is the same.
What we're working on next
The rendering and stem separation stage is our clearest near-term optimisation target. Moving stem separation to run in parallel with the tail of the inference pass, rather than sequentially after it, should reduce that 5.4-second component by roughly half. We're also investigating whether the queue wait can be reduced through better load prediction and pre-warming of inference instances ahead of anticipated peak periods.
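To put the rendering target in context, the sketch below projects the 60-second P50 if the rendering and stem separation stage shrinks by half and every other stage stays where it is. The 0.5 reduction is the stated target, not a measured result, and the helper name is illustrative.

```python
# P50 stage timings in seconds for a 60-second, four-stem generation.
STAGES_60S_P50 = {
    "brief embedding and tokenisation": 0.8,
    "generation queue wait": 3.1,
    "model inference": 24.2,
    "audio rendering and stem separation": 5.4,
    "packaging and delivery endpoint": 1.4,
}


def project_total(stages, stage, fraction_removed):
    """Return (new_total, seconds_saved) if `stage` shrinks by `fraction_removed`."""
    if not 0.0 <= fraction_removed <= 1.0:
        raise ValueError("fraction_removed must be between 0 and 1")
    saved = stages[stage] * fraction_removed
    return sum(stages.values()) - saved, saved


new_total, saved = project_total(
    STAGES_60S_P50, "audio rendering and stem separation", 0.5
)
print(f"projected P50: {new_total:.1f} s (saves {saved:.1f} s)")
```

Under those assumptions the 60-second P50 would move from 34.9 to about 32.2 seconds, a saving of 2.7 seconds, which is useful but small next to the 24.2 seconds spent in inference.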
The honest constraint is that inference time scales with model size, and model size correlates with output quality at our current architecture. We're not willing to trade quality for latency — a generation that completes in 10 seconds but sounds like a library cut is worse than one that completes in 35 seconds and sounds like a professional score. The benchmarks above reflect the point on that tradeoff curve we've chosen to operate at today. We'll publish updated numbers when the rendering parallelisation work ships.