The hidden architecture of generative video is sound
When we talk about generative filmmaking, the focus is almost always on the visuals. But the frame is only half the story. A beautiful generated landscape can easily fall flat if the audio feels generic or empty. In my recent projects, I've used AI audio tools in my post-production pipeline to build custom narration, ambient cues, and original scores, and to speed up the edit.
Narration on the timeline: my experience with the Mad Man and ari campaigns
I integrated synthesized voice narration into two client deliverables: the Mad Man commercial and the Introducing ari brand reveal film. The voice quality was clean, but getting it client-ready was not a one-click process. It took dozens of generation cycles to dial in the right pacing, pauses, and tone.

For the Mad Man commercial, the brand wanted a voice with a deep, confident register. The initial generations were far too flat. Instead of just regenerating the same voice, I used ElevenLabs' Voice Design tool to create entirely new custom voices from scratch, experimenting with different ages, accents, and gender balances until I found the right tone. Once the voice identity was locked, I spent hours adjusting the stability and similarity sliders, regenerating individual lines more than 30 times, and adjusting punctuation (adding ellipses or hyphens, for example) to force natural breathing points. Once it sat correctly in the mix with the music and sound effects, the client approved the spot without realizing the voice was AI-generated.

For the ari project, I ran a similar workflow. The AI narration was the foundation of the edit, so I could pace the visual morphs exactly to the vocal cadence. It took multiple passes to find the right tone, but being able to iterate quickly saved a lot of time during the rough-cut phase.
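The punctuation pass described above for the Mad Man narration can be sketched in a few lines of Python. This is only an illustration of the idea, not the tooling used on either project: the function name `pause_variants`, the choice of pause marks, and the sample line are all hypothetical. It produces the text variants to try and does not call any voice synthesis service.

```python
def pause_variants(line: str, marks=("...", " -")) -> list[str]:
    """Return copies of `line` with one comma swapped for a stronger pause mark.

    Hypothetical helper: each variant changes exactly one comma, so the effect
    of a single pause can be judged in isolation when the line is regenerated.
    """
    variants = []
    for i, ch in enumerate(line):
        if ch == ",":
            for mark in marks:
                # Replace only the comma at position i, keep the rest untouched.
                variants.append(line[:i] + mark + line[i + 1:])
    return variants


# Sample line is made up for illustration, not taken from a client script.
for variant in pause_variants("Bold, confident, timeless."):
    print(variant)
# Bold... confident, timeless.
# Bold - confident, timeless.
# Bold, confident... timeless.
# Bold, confident - timeless.
```

Changing one pause at a time keeps each regeneration comparable to the last, which matters when a single line may take dozens of attempts.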
Iterating for emotional nuance
The biggest challenge is emotional range. Because the software generates audio based on statistical patterns, getting a specific word to land with a quiet emphasis or a subtle pitch shift takes a lot of attempts. Even then, the read can feel too flat for longer stories. On the ari project, the voice worked well for the informational parts, but it couldn't deliver the final line with the warmth I wanted. I went through about 40 generations for that single sentence, adjusting settings and trying different models. It taught me that while AI audio works well for explainers and tech showcases, a human voice actor is still irreplaceable for emotional brand stories.

Matching a specific client voice also remains a challenge. A voice clone gets close, but it drifts over longer scripts. You have to mix it carefully and regenerate specific lines to keep the voice consistent.
Sound design prototyping: sketching textures in Resolve
For the ari campaign, I experimented with generative sound effect tools to quickly prototype the sound design. I needed abstract, high-tech swooshes and digital hums to match the logo transitions. Instead of hunting through libraries, I generated quick sound textures to block out the timing directly in the DaVinci Resolve timeline. While these sounds worked well as timing markers and placeholders, they needed a lot of work in post. Many of them had digital noise or lacked the deep punch you need for client work. In the final mix, I layered them with high-quality library samples. The real value was speed: I could lay down the rhythm instantly, then refine the details manually.
Generative music and dynamic pacing
Finding the right track to set the edit pace can be a bottleneck. I use generative music tools to create custom music tracks that match the exact BPM and mood I want, locking the edit structure early in the project. Unlike standard temp tracks that get replaced, I often keep and mix these AI-generated tracks directly into the final delivery. While some AI music can feel repetitive, adjusting the structure and layering it with traditional sound design allows it to stand as a permanent part of the score, offering a custom fit that traditional library tracks can't match.
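The link between a track's BPM and the edit structure comes down to simple arithmetic: one beat lasts 60 / BPM seconds, and multiplying by the timeline frame rate gives the frame each beat lands on. A minimal sketch, assuming a constant tempo and a track that starts on frame 0; the function name `beat_frames` and the example values (120 BPM, 24 fps) are illustrative and not taken from the projects.

```python
def beat_frames(bpm: float, fps: float, beats: int) -> list[int]:
    """Frame number of each of the first `beats` beats, rounded to the nearest frame.

    Assumes a constant tempo and that beat 0 falls on frame 0.
    """
    # One beat lasts 60 / bpm seconds; times fps gives frames per beat.
    return [round(i * 60 * fps / bpm) for i in range(beats)]


# Illustrative values: a 120 BPM track on a 24 fps timeline.
print(beat_frames(120, 24, 4))  # [0, 12, 24, 36]
```

At tempos that don't divide evenly into the frame rate, the rounded frames drift by up to half a frame per beat, so cuts placed this way are a starting grid to adjust by ear rather than a final answer.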
A hybrid audio workflow for high-end post-production
Across both the Mad Man and ari campaigns, the core lesson was that AI audio tools work best as assistants, not as a replacement for the whole process. The AI handles the time-consuming first drafts, while human direction handles the final mix, precision timing, and emotional resonance. The AI narration sets the rhythm and pacing of the edit, which gives the client a sense of the flow early on. Then we refine the audio landscape in Resolve, layering the sound effects and replacing temp elements with polished, high-fidelity library assets and, where the project calls for it, licensed scores. This workflow lets me edit twice as fast in the rough-cut phase without hurting the quality or safety of the final delivery. The AI sets the foundation, but the human editor is what makes the project work.
The updated post-production stack
Where AI audio fits in the stack:

- **Voiceover design**: Creating custom voices to test pacing against visual drafts and establish a unique brand tone.
- **Sound design placeholders**: Generating quick effects to lock keyframe timing in motion graphics.
- **Generative music integration**: Creating custom music tracks to set the pace and mood, mixing them directly into the final score.

Where human work is still required:

- **Emotional voice acting**: Working with voice actors for brand stories that need real feeling.
- **Final sound mixing**: Layering clean sound libraries and mastering the final mix to build a professional soundstage.
- **Live instrumentation or custom score refinement**: Commissioning original compositions or human performance for projects requiring complex orchestrations.

