In search of the Holy Grail of audio and video editing

Imagine editing audio and video by manipulating text in a corresponding transcript; that day is almost here


In "A better way to capture team meetings," I explored Google's Hangouts on Air as a tool for annotating a videoconference. If you're the live or post-facto scribe for a meeting, it's a great way to contextualize your summary with pointers to the relevant parts of the discussion. Scrubbing around in the video and capturing "Get video at current time" links couldn't be much easier. But video is an opaque data type that we can't easily scan or search.

Text, on the other hand, is eminently scannable and searchable. Given a high-quality transcript with embedded timecodes that align individual words to corresponding points in the video, you could use that transcript as an interface to an audio editor. That's something I've long imagined, but won't experience anytime soon. Right?

Google's autogenerated captions can't (yet) deliver that Holy Grail. But over the weekend, at the I Annotate conference (disclosure: my company, Hypothesis, is the organizer and host), I realized that a textual interface to audio and video editing is closer than I thought.

Laurian Gridinoc, a computational linguist, Knight-Mozilla OpenNews Fellow, and developer at Hyperaud.io, showed me some inspiring uses of technology his company has developed. Check out this BBC prototype. It divides the browser into two panes. On the left you play Elizabeth Klett's Librivox recording of Charlotte Brontë's "Jane Eyre" as audio with synchronized text. You can select phrases, sentences, or paragraphs; copy them into the right pane; play the newly remixed audio (with synchronized text); and export the remix. 

If you haven't done much audio editing, you may regard this as merely interesting. But I've done a lot of audio editing, and to me it's astonishing, wonderful, and transformative. Here's a similar example in the video domain: Al Jazeera's Palestine Remix serves up a selection of short documentaries. You play them as transcript-synced video and drag paragraphs from one or more sources to create a transcript-synced video remix that you can play, add effects to, and share.

This magic depends on heavy lifting upstream in the data preparation phase. The transcripts are exhaustively annotated with per-word timecodes. It's a service you can buy nowadays (for example, from Speechmatics for 6 cents per minute), but isn't yet cheap enough to warrant casual use -- say, to transcribe a team meeting. Moore's Law will bring such use within reach before long.

How's the transcription quality? We'll soon see. I launched a Speechmatics transcription of my interview with Ward Cunningham when I began writing this column; we'll find out how it went when I'm done.

The upstream heavy lifting enables word-accurate playback of audio or video controlled by selection of text in the transcript. But when you edit the transcript you create a new requirement: for downstream heavy lifting that realigns a new text with an edited soundtrack. That's part of what Laurian Gridinoc is working on.

It's no surprise that the BBC and Al Jazeera are early adopters of this technology. Transcribed and edited audio/video is their product. Most of us do something else for a living. But all of us could benefit from the same capabilities.

Tomorrow, for example, one member of my team is going to show another how to build and deploy a new instance of our service. I need to see that happen. I also need to digest what they do and remix it into documentation that others can use to reproduce the deployment. Because I'll be traveling tomorrow I won't be able to watch live, so I've asked them to capture the on-screen action, with narration, as a video. It's great to be able to do that easily, and I'm willing to invest the hour it'll take to watch the whole thing.

But others won't want to, and they shouldn't have to. It should be easy for me to condense that video into a tight demo that highlights the essential steps. Even if I boil it down to 15 minutes, I shouldn't assume others will want to invest 15 minutes. They could scan a transcript in no time flat, determine if the presentation addresses their questions, and use the text as an index that aids both first-time viewing and subsequent reference use.

I've always expected all that. Now it feels imminent. How did the experiment turn out? Here's a sample of the transcribed text:

He pioneered a transformative new approach to making software supported business processes transparently understandable both to developers and to users.

Here's a fragment of the corresponding annotation:    

{
  "duration": "0.250",
  "confidence": "0.995",
  "name": "He",
  "time": "12.268"
},
{
  "duration": "0.590",
  "confidence": "0.995",
  "name": "pioneered",
  "time": "12.518"
},
{
  "duration": "0.040",
  "confidence": "0.995",
  "name": "a",
  "time": "13.107"
},
{
  "duration": "0.840",
  "confidence": "0.995",
  "name": "transformative",
  "time": "13.148"
},
{
  "duration": "0.160",
  "confidence": "0.995",
  "name": "new",
  "time": "13.988"
},
{
  "duration": "0.500",
  "confidence": "0.995",
  "name": "approach",
  "time": "14.148"
},
{
  "duration": "0.090",
  "confidence": "0.995",
  "name": "to",
  "time": "14.648"
},
{
  "duration": "0.370",
  "confidence": "0.995",
  "name": "making",
  "time": "14.738"
},
{
  "duration": "0.510",
  "confidence": "0.995",
  "name": "software",
  "time": "15.107"
},
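To see why this per-word annotation is the key enabler, consider what a transcript-driven editor has to do when you select a phrase: look up the first word's start time and the last word's end time, and play (or cut) exactly that span. Here's a minimal sketch in Python, using the field names from the fragment above ("name", "time", "duration"); the function and its behavior are my illustration, not part of any vendor's actual API.

```python
# Per-word annotations as shown in the transcript fragment above.
words = [
    {"duration": "0.250", "confidence": "0.995", "name": "He", "time": "12.268"},
    {"duration": "0.590", "confidence": "0.995", "name": "pioneered", "time": "12.518"},
    {"duration": "0.040", "confidence": "0.995", "name": "a", "time": "13.107"},
    {"duration": "0.840", "confidence": "0.995", "name": "transformative", "time": "13.148"},
]

def span_for(words, phrase):
    """Return (start, end) in seconds for the first occurrence of phrase,
    or None if the phrase isn't found. The span runs from the first word's
    start time to the last word's start time plus its duration."""
    tokens = phrase.split()
    names = [w["name"] for w in words]
    for i in range(len(names) - len(tokens) + 1):
        if names[i:i + len(tokens)] == tokens:
            start = float(words[i]["time"])
            last = words[i + len(tokens) - 1]
            end = float(last["time"]) + float(last["duration"])
            return start, round(end, 3)
    return None

print(span_for(words, "pioneered a transformative"))  # (12.518, 13.988)
```

Selecting text in the transcript pane reduces to exactly this lookup; the remix pane is then just an ordered list of such spans, which a player steps through in sequence.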

Yep. We're getting there.
