Open Access Publications from the University of California

UC Berkeley

UC Berkeley Electronic Theses and Dissertations

Tools for Creating Audio Stories

  • Author(s): Rubin, Steven Surmacz
  • Advisor(s): Agrawala, Maneesh
  • et al.

Audio stories are an engaging form of communication that combines speech and music into compelling narratives. One common production pipeline for creating audio stories involves three main steps: recording speech, editing speech, and editing music. Existing audio recording and editing tools force the story producer to manipulate speech and music tracks via tedious, low-level waveform editing. In contrast, we present tools for each phase of the production pipeline that analyze the audio content of speech and music and thereby allow the producer to work at a higher semantic level.

Well-performed audio narrations are a hallmark of captivating podcasts, explainer videos, radio stories, and movie trailers. To record these narrations, professional voiceover actors follow guidelines that describe how to use low-level vocal components (volume, pitch, timbre, and tempo) to deliver performances that emphasize important words while maintaining variety, flow, and diction. Yet, these techniques are not well known outside the professional voiceover community, especially among hobbyist producers looking to create their own narrations. We present Narration Coach, an interface that assists novice users in recording scripted narrations. As a user records her narration, our system synchronizes the takes to her script, provides text feedback about how well she is meeting the expert voiceover guidelines, and resynthesizes her recordings to help her hear how she can speak better. In a pilot study, users recorded higher quality narrations using Narration Coach than using Adobe Audition, a traditional digital audio workstation (DAW).

Once the producer has captured speech content by recording narrations or interviews, she faces challenges in logging, navigating, and editing the speech. We present a speech editing interface that addresses these challenges. Key features include a transcript-based speech editing tool that automatically propagates edits in the transcript text to the corresponding speech track, and tools that help the producer maintain natural speech cadences by manipulating breaths and pauses. We used this interface to create audio stories from a variety of raw speech sources, including scripted narratives, interviews, and political speeches. Informal feedback from first-time users suggests that our tool is easy to learn and greatly facilitates the process of editing raw speech footage into a story.
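
To make the idea of transcript-based editing concrete, the following is a minimal sketch of how an edit to the transcript text could be propagated to the speech track. It assumes each transcript word already carries start and end times from some alignment step; the `Word` type, the `segments_for_edit` function, and the sample timings are hypothetical names and values chosen for illustration, not the interface described in the dissertation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Word:
    """One transcript word with its (assumed) aligned times in seconds."""
    text: str
    start: float
    end: float


def segments_for_edit(words: List[Word], kept: List[int]) -> List[Tuple[float, float]]:
    """Map an edited transcript to the audio segments that realize it.

    `kept` lists the indices of the original words in their edited order.
    Runs of consecutive indices are merged into one (start, end) segment,
    so untouched stretches of speech are never cut.
    """
    segments: List[Tuple[float, float]] = []
    run_start = None  # index of the first word in the current run
    prev = None       # index of the most recent word in the current run
    for i in kept:
        if prev is not None and i == prev + 1:
            prev = i  # the word follows its original neighbor: extend the run
            continue
        if prev is not None:
            # The run was interrupted by a cut or a reordering: close it.
            segments.append((words[run_start].start, words[prev].end))
        run_start = i
        prev = i
    if prev is not None:
        segments.append((words[run_start].start, words[prev].end))
    return segments


# Hypothetical aligned transcript; deleting "um" yields two audio segments.
words = [
    Word("we", 0.0, 0.2),
    Word("um", 0.2, 0.5),
    Word("present", 0.5, 1.0),
    Word("tools", 1.0, 1.4),
]
cut_list = segments_for_edit(words, [0, 2, 3])
```

Concatenating the returned segments in order would give the edited speech track. A practical tool would also need to handle breaths, pauses, and crossfades at the cut points, which this sketch leaves out.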

After the producer edits the speech, she often adds a musical score to the story. We develop an algorithmic framework based on music analysis and dynamic programming optimization that enables automated methods for adding music to audio stories: looping, musical underlays, and emotionally relevant scores. The producer may have a short clip of music that she wants to use in the score; our looping tool allows her to seamlessly extend the clip to her desired length. The producer often uses musical underlays to emphasize key moments in spoken content and give listeners time to reflect on the speech. In a musical underlay, the music fades in to full volume at an emphasis point in the speech. Then the music plays solo for several seconds while the speech pauses. Finally, the music fades out as the speech resumes. At the beginning of the solo, the music often changes in some significant way (e.g., a melody enters or the tempo quickens). Our musical underlay tool automatically finds good candidates for underlays in music tracks, aligns them with the speech, and adjusts their dynamics. Full musical scores often reflect the emotions of the speech throughout the story. We present a system for re-sequencing music tracks to generate emotionally relevant music scores for audio stories. The producer provides a speech track and music tracks, and our system gathers emotion labels on the speech through hand-labeling, crowdsourcing, and automatic methods. Evaluations of our looping, underlay, and score generation tools suggest that they can produce high-quality musical scores.
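
As an illustration of how dynamic programming can extend a short clip to a desired length, the sketch below picks a sequence of beats that minimizes the total cost of the transitions between them. It is one plausible formulation under stated assumptions, not necessarily the one used in the dissertation: the `resequence` function and the transition-cost matrix are hypothetical, with `cost[i][j]` standing for how audible it would be to play beat `j` immediately after beat `i` (zero for the natural next beat, infinity for a disallowed jump).

```python
import math
from typing import List, Sequence


def resequence(cost: Sequence[Sequence[float]], length: int,
               start: int, end: int) -> List[int]:
    """Return the cheapest sequence of `length` beats from `start` to `end`.

    Viterbi-style dynamic program: best[j] is the lowest total transition
    cost of any sequence built so far that ends on beat j.
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    n = len(cost)
    best = [math.inf] * n
    best[start] = 0.0
    back: List[List[int]] = []  # back[t][j] = predecessor of beat j at step t + 1

    for _ in range(length - 1):
        nxt = [math.inf] * n
        ptr = [-1] * n
        for j in range(n):
            for i in range(n):
                c = best[i] + cost[i][j]
                if c < nxt[j]:
                    nxt[j] = c
                    ptr[j] = i
        back.append(ptr)
        best = nxt

    if math.isinf(best[end]):
        raise ValueError("no beat sequence of the requested length exists")

    # Follow the back-pointers from the final beat to recover the sequence.
    path = [end]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path
```

With four beats where the only cheap jump goes from beat 3 back to beat 1, asking for seven beats from beat 0 to beat 3 would loop through beats 1 to 3 twice. The run time is proportional to the requested length times the square of the number of beats, which is manageable for short clips.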

Combined, our tools augment the traditional audio story production pipeline by allowing the producer to create stories using high-level rather than low-level operations on audio clips. Ultimately, we hope that our tools enable the producer to devote more time to storytelling and less time to tedious audio recording and editing.
