The quality of what a generative music system produces is bounded by the quality of its input. This isn't a profound observation — it's the same constraint that applies to briefing a human composer. But the type of input that works for a generative model is different from what you'd send to a composer, and understanding that difference is what separates outputs you can actually use from outputs you need to regenerate four times before getting something acceptable.
We've processed a lot of briefs at this point, and the ones that produce tight first-generation results share specific characteristics. The ones that produce something generic — technically competent music that could soundtrack anything and therefore soundtracks nothing in particular — share a different set of characteristics. This is what we've learned about the difference.
What a model does with a brief
Before getting into what to write, it helps to understand what the model is actually doing with the text. Our generation architecture translates brief inputs into conditioning vectors that steer the decoder at the polyphonic synthesis level. Mood tags, tempo values, instrumentation cues, and harmonic descriptors each influence different aspects of the output — the rhythm stem emerges from different conditioning pathways than the harmonic stem, for instance.
What this means practically: a brief that gives a strong conditioning signal in every relevant dimension produces more constrained, on-target output. A brief that's sparse in some dimensions forces the model to fall back on its priors for those dimensions: essentially, what music in this general category tends to sound like in the training corpus. "Upbeat, corporate" is a very sparse brief. The model will generate something that sounds like upbeat corporate music in aggregate, because that's what it learned to associate with those tokens. The output will be fine and completely forgettable.
Specific conditioning beats vague conditioning. This is the core principle everything else follows from.
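One way to make the sparse-versus-specific distinction concrete is to treat a brief as a set of named dimensions and list the ones left empty. The sketch below is illustrative only: the `Brief` fields and the `unspecified_dimensions` helper are hypothetical names invented for this article, not the generation system's actual input schema.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class Brief:
    """Hypothetical brief structure; field names are illustrative only."""
    duration_s: Optional[float] = None
    bpm: Optional[float] = None
    key: Optional[str] = None
    include: list[str] = field(default_factory=list)
    exclude: list[str] = field(default_factory=list)
    structure: Optional[str] = None
    mood_tags: list[str] = field(default_factory=list)


def unspecified_dimensions(brief: Brief) -> list[str]:
    """Return the names of the dimensions the brief leaves empty."""
    empty_values = (None, [], "")
    return [f.name for f in fields(brief) if getattr(brief, f.name) in empty_values]


# "Upbeat, corporate" fills in mood tags and nothing else.
sparse = Brief(mood_tags=["upbeat", "corporate"])
print(unspecified_dimensions(sparse))
# ['duration_s', 'bpm', 'key', 'include', 'exclude', 'structure']
```

Every name in that printed list is a dimension where the output would be decided by the model's priors rather than by the brief.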
The three inputs that do the most work
Tempo as a hard number, not a feeling
Writing "fast" or "energetic" for tempo is almost useless. Fast means 120 BPM to one person and 160 BPM to another. More importantly, a 20 BPM difference in tempo produces fundamentally different music rhythmically — the subdivision feel changes, the appropriate instrumentation changes, the relationship between the beat and the harmonic rhythm changes.
Give a BPM. If you don't know the exact number, give a range: "118-124 BPM." If the tempo needs to match a cut, count it or use a tap-tempo tool. A 30-second spot with 72 picture cuts won't feel right with music at 90 BPM if the visual rhythm is built around 128. These are not interchangeable.
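If the tempo has to follow the picture edit, the number can be estimated from the cut timestamps instead of by ear. This is a minimal sketch under stated assumptions: the function names are hypothetical, the cuts are assumed to be roughly evenly spaced, and `beats_per_cut` (how many beats pass between cuts) is something the editor has to decide, not something the code can infer.

```python
from statistics import median


def bpm_from_cut_times(cut_times_s: list[float], beats_per_cut: float = 1.0) -> float:
    """Estimate tempo from picture-cut timestamps given in seconds.

    Uses the median gap between consecutive cuts so that one unusually
    long or short shot does not skew the estimate.
    """
    if len(cut_times_s) < 2:
        raise ValueError("need at least two cut times")
    gaps = [later - earlier for earlier, later in zip(cut_times_s, cut_times_s[1:])]
    return 60.0 * beats_per_cut / median(gaps)


def bpm_range(center: float, tolerance: float = 3.0) -> str:
    """Format a tempo as a range a brief can carry, e.g. '118-124 BPM'."""
    return f"{round(center - tolerance)}-{round(center + tolerance)} BPM"


# Cuts every half second, with one held shot in the middle.
cuts = [0.0, 0.5, 1.0, 2.5, 3.0]
print(bpm_from_cut_times(cuts))  # 120.0
print(bpm_range(121))            # 118-124 BPM
```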
Instrumentation exclusions, not just inclusions
Most briefs list what instruments they want. "Acoustic guitar, light percussion, piano." This is useful, but what's often more useful is what you don't want. "No synth pads. No brass. No vocal elements." Exclusions constrain the harmonic and textural space aggressively, which tends to produce a more distinctive sound than inclusions alone.
This is particularly important for stems. If you're generating for a game that has a specific sonic identity — string quartet plus electronics, no drums, no vocals — you need those exclusions specified explicitly. The model will try to produce something broadly appealing in the genre unless you close off that space. Closing it off via exclusion language is one of the more reliable steering tools available in brief writing.
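A small helper can keep inclusion and exclusion lists honest before they go into the brief text. The sketch below is a hypothetical convenience function, not part of any generation API; it renders the exclusion phrasing used above and refuses a list that asks for and excludes the same instrument.

```python
def instrumentation_clause(include: list[str], exclude: list[str]) -> str:
    """Render inclusions and exclusions as brief text.

    Raises ValueError if an instrument is both requested and excluded,
    since that gives contradictory guidance.
    """
    conflicts = {i.lower() for i in include} & {e.lower() for e in exclude}
    if conflicts:
        raise ValueError(f"both included and excluded: {sorted(conflicts)}")
    parts = []
    if include:
        parts.append(", ".join(include) + ".")
    parts.extend(f"No {item}." for item in exclude)
    return " ".join(parts)


print(instrumentation_clause(["String quartet", "electronics"], ["drums", "vocals"]))
# String quartet, electronics. No drums. No vocals.
```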
A structural directive for the arc
Music in production contexts almost always has a structural function: it needs to build, resolve, stay flat, or shift at a specific moment. "Builds from sparse to full in the second half" is actionable. "Starts with just piano, percussion enters at 20 seconds, melody resolves at 28 seconds" is more actionable still.
For adaptive game music, structural directives might reference loop point design: "must loop cleanly at 32 bars, with a separate 8-bar intro section." For advertising, they might reference picture edit timing: "sustained hit at 28 seconds to match product reveal." These structural anchors are significant conditioning inputs — they directly shape the rhythmic and harmonic phrasing of the output.
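Structural directives mix bars and seconds, so it helps to convert between the two before writing them down. The sketch below shows the arithmetic; the function names are hypothetical, and the 120 BPM tempo and 4/4 meter in the usage lines are assumed values for illustration, not figures from any particular project.

```python
def bars_to_seconds(bars: float, bpm: float, beats_per_bar: int = 4) -> float:
    """Length in seconds of a number of bars at a given tempo and meter."""
    return bars * beats_per_bar * 60.0 / bpm


def seconds_to_bar_beat(t_s: float, bpm: float, beats_per_bar: int = 4) -> tuple[int, float]:
    """Return (bar, beat), both 1-indexed, for a time measured from bar 1, beat 1."""
    total_beats = t_s * bpm / 60.0
    bar = int(total_beats // beats_per_bar) + 1
    beat = total_beats % beats_per_bar + 1
    return bar, beat


# An 8-bar intro followed by a 32-bar loop, assuming 120 BPM in 4/4.
print(bars_to_seconds(8, 120))   # 16.0
print(bars_to_seconds(32, 120))  # 64.0

# Where a hit at 28 seconds falls at the same assumed tempo.
print(seconds_to_bar_beat(28, 120))  # (15, 1.0)
```

Knowing that the 28-second hit lands on a downbeat at one tempo and mid-bar at another is a good reason to fix the tempo and the structural anchor together.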
The reference track — and what to take from it
Reference tracks are one of the most powerful brief inputs and one of the most commonly misused. The mistake is to provide a reference and treat it as a complete style directive: "make it sound like this." What actually helps is dissecting which aspects of the reference you want to carry across.
Useful things to extract from a reference: tempo, key or harmonic character (major with added 7ths, minor pentatonic, modal), production approach (dry and close versus ambient and reverb-heavy), specific timbral qualities (the attack transient on the percussion, the type of bass tone), energy arc (where the track peaks relative to its total length).
Not useful: "the vibe." "The feeling." "Like this but different." These have no operational meaning in brief terms. They require a human-to-human creative conversation to decode, which is precisely the interaction that a generation workflow is trying to remove from the critical path.
What doesn't need to be in a brief
Longer isn't better. Briefs that pad out emotional description in multiple paragraphs — "we want the audience to feel the warmth of a summer evening but also the bittersweetness of time passing, something contemplative but not sad, hopeful but grounded" — don't produce better output than briefs that say "G major, 88 BPM, fingerpicked acoustic guitar, no percussion, 45 seconds, resolves on tonic." The emotional narrative in the first example is harder to condition from than the concrete parameters in the second.
We're not saying the emotional intent doesn't matter — the emotional goal is the point of the whole exercise. But translate it into parameters before you write the brief. What tempo carries that emotion? What key? What instrumentation is doing the emotional work? The translation step is yours to do; the model isn't equipped to run that step for you reliably from free-text emotional description alone.
A worked example
Here's a brief that came in for a food brand Instagram campaign — a 15-second spot for a premium olive oil, cutting together close-up pouring and cooking shots, final frame on the label:
"Mediterranean feel, acoustic, warm. 15 seconds."
And here's the same brief rewritten to condition effectively:
"15 seconds. 96 BPM. Nylon-string classical guitar, sole instrument. E major. No percussion, no bass, no synth elements. Sparse fingerpicking in first 8 seconds building to fuller chord work in second half. Resolves cleanly on bar 8. No reverb tail beyond 0.3 seconds — dry, close recording character."
The second brief takes 45 seconds longer to write and produces output that needs one revision pass instead of four. The delta in effort is front-loaded — it's in the brief, not in the regeneration loop.
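When a brief pins a duration, a tempo, and a bar number at once, it is worth checking that the three agree. The rewritten brief above does not state a time signature, so the meters tried below are assumptions, and `bars_in_duration` is a hypothetical helper written for this check.

```python
def bars_in_duration(duration_s: float, bpm: float, beats_per_bar: int) -> float:
    """Number of bars that fit in a duration at a given tempo and meter."""
    return duration_s * bpm / 60.0 / beats_per_bar


# 15 seconds at 96 BPM is 24 beats; how many bars depends on the meter.
for beats_per_bar in (3, 4):
    print(beats_per_bar, bars_in_duration(15, 96, beats_per_bar))
# 3 8.0
# 4 6.0
```

Fifteen seconds at 96 BPM is exactly eight bars in 3/4 but only six in 4/4, so a directive like "resolves on bar 8" fits the spot only under the first reading. Adding the meter to the brief removes that ambiguity.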
When briefs have gaps we can't fill
There are cases where even a well-structured brief doesn't produce what the production needs, and it's worth being honest about when that is. If the brief requires a very specific melodic motif (a jingle, a signature phrase, a theme that needs to carry across multiple pieces), generation from a text brief alone isn't going to nail that. Melodic identity is hard to specify textually; arriving at it is an iterative process of listening and adjusting. Brief-based generation works well for establishing sonic character, feel, texture, and structure. It works less well for prescribed melodic content that needs to be exactly this tune.
For game audio where the brief includes stem layer requirements for Wwise or FMOD integration, specify those requirements in the brief explicitly, including how many layers you need and what functional role each one serves. The generation system needs to know that you want, say, four separable stems rather than just a coherent piece of music in order to produce output that's useful for the middleware workflow.
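Stem requirements are easy to state precisely if each layer is written down with its role. The sketch below is a hypothetical way to build that sentence for a brief; `StemSpec`, `stem_clause`, and the four example layers are illustrative assumptions, and nothing here calls Wwise or FMOD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StemSpec:
    """One requested stem: a short name plus its functional role in the mix."""
    name: str
    role: str


def stem_clause(stems: list[StemSpec]) -> str:
    """Render the stem requirements as a single sentence for a brief."""
    if not stems:
        raise ValueError("at least one stem is required")
    names = [s.name for s in stems]
    if len(set(names)) != len(names):
        raise ValueError("stem names must be unique")
    listed = "; ".join(f"{s.name} ({s.role})" for s in stems)
    return f"Deliver {len(stems)} separable stems: {listed}."


# Example layers only; real roles depend on the game's adaptive music design.
stems = [
    StemSpec("rhythm", "base pulse, always playing"),
    StemSpec("harmony", "sustained bed"),
    StemSpec("melody", "enters at higher intensity"),
    StemSpec("texture", "ambient layer for exploration"),
]
print(stem_clause(stems))
```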
Brief quality is the variable you control most directly in the generation workflow. The time you spend tightening a brief is recovered many times over in the generation loop.