For anyone interested in sound and sound recordings


Speech to Text

This group is for anyone interested in speech-to-text and voice recognition systems. Technologies that can convert the spoken word into text are growing in number, from voice-activated commands for smartphones to the conversion of digital speech archives into readable and searchable text.

The particular focus of this group is the conversion of speech archives. It is interested in the competing technologies, cost-effective models for archives and libraries, and the uses of transcribed speech collections (audio and video) for researchers.

Members: 12
Latest Activity: Sep 12


The Comment section is for general information and conversation about speech-to-text matters. The Discussion Forum will be used to list specific packages and to discuss issues relating to these.

Discussion Forum


Started by Luke McKernan Jan 17. 0 Replies


Started by Mari King. Last reply by Mari King May 29, 2013. 2 Replies

Palaver + VoxForge

Started by Luke McKernan May 16, 2013. 0 Replies

CONTENTUS - Next Generation Multimedia Libraries

Started by Mari King. Last reply by Mari King May 7, 2013. 1 Reply

BBC Snippets

Started by Mari King Apr 10, 2013. 0 Replies


Started by Luke McKernan Mar 9, 2013. 0 Replies


Started by Luke McKernan Mar 6, 2013. 0 Replies


Started by Richard Ranft Mar 5, 2013. 0 Replies

Comment Wall


Comment by Luke McKernan on September 12, 2014 at 8:10

TV monitoring service is fair use, judge rules
Last year, Fox News sued a media-monitoring service called TVEyes, which allows its clients to search for and watch clips of TV and radio stations.

Fox lawyers argued the service violated copyright law and should be shut down. In a ruling published yesterday, US District Judge Alvin Hellerstein disagreed, finding that TVEyes' core services are a transformative fair use.

It's a significant digital-age fair use ruling, one that's especially important for people and organizations who want to comment on or criticize news coverage.

TVEyes constantly records more than 1,400 television and radio stations, using closed captions and speech-to-text technology to make a comprehensive and searchable database for its subscribers, who generally pay $500 per month for the service. The company has more than 2,200 subscribers, including the White House, 100 members of Congress, and the Department of Defense, as well as big news organizations like Bloomberg, Reuters, ABC, and the Associated Press.

The service is used by a wide range of clients who want to keep an eye on the media, from police departments seeking to know how widely a public safety announcement has been disseminated, to members of Congress who want to know what's being said about them.

It's also—perhaps not coincidentally—used by media critics, including those who keep an eye on Fox News. For instance, Media Matters for America has used TVEyes to analyze Fox News' Benghazi-flavored coverage of Hillary Clinton, as well as what it calls the network's "selective outrage" over gay rights.

One common use for TVEyes is to let users search for a keyword to find out when a term was mentioned in the news, then view a video clip that starts 14 seconds before the keyword is mentioned, and goes on for up to 10 minutes. Most clips are shorter than two minutes.

Users can also download and save the clips, and share them via social media or email. TVEyes subscribers all agree to only use downloaded clips for "internal purposes" like review, analysis, or research.
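The clip-window behaviour described above (a 14-second lead-in before the keyword, capped at 10 minutes) can be sketched in a few lines. The function name, timestamp format, and the broadcast-end parameter are our own illustration, not TVEyes' actual API:

```python
# Sketch of the clip-window logic described above: a clip starts
# 14 seconds before the keyword hit and runs for at most 10 minutes.
# Names and signature are illustrative, not TVEyes' real interface.

LEAD_IN_SECONDS = 14
MAX_CLIP_SECONDS = 10 * 60

def clip_window(hit_time: float, broadcast_end: float) -> tuple[float, float]:
    """Return (start, end) offsets in seconds for a keyword hit."""
    start = max(0.0, hit_time - LEAD_IN_SECONDS)
    end = min(broadcast_end, start + MAX_CLIP_SECONDS)
    return start, end

# A keyword mentioned 5 seconds into a half-hour broadcast:
print(clip_window(5.0, 1800.0))     # (0.0, 600.0)
# A mention 20 minutes in:
print(clip_window(1200.0, 1800.0))  # (1186.0, 1786.0)
```

Most real clips, the article notes, end up shorter than two minutes; the cap only matters for long monologues.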

In Fox's view, those products all compete unfairly with its own TV clip licensing, which is done through ITN Source; that company maintains a keyword-searchable library of 80,000 Fox News videos. Through ITN Source, Fox News has made about $2 million in licensing fees.

Read more

Comment by Luke McKernan on August 7, 2014 at 12:42

MIT Scientists Figured Out How to Eavesdrop Using a Crisp Packet

In a scenario straight out of "Enhance, enhance!", MIT scientists have figured out that the tiny vibrations on ordinary objects like a crisp packet, glass of water or even a plant can be reconstructed into intelligible speech. All it takes is a camera and a snappy algorithm.

Sound waves, after all, are just disturbances in the air. When sound hits something light and delicate like a crisp packet, the object will vibrate ever so slightly. Now, you've probably noticed that house plants and crisp packets do not sway and shake when you have a conversation. To capture movements as small as a tenth of a micrometre (or five thousandths of a pixel), the team tracked the colour of single pixels over time. Here's how it works, as explained by a press release from MIT:

Suppose, for instance, that an image has a clear boundary between two regions: Everything on one side of the boundary is blue; everything on the other is red. But at the boundary itself, the camera's sensor receives both red and blue light, so it averages them out to produce purple. If, over successive frames of video, the blue region encroaches into the red region — even less than the width of a pixel — the purple will grow slightly bluer. That colour shift contains information about the degree of encroachment.
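The boundary-averaging idea in the quoted passage can be simulated with a toy model. The linear colour blend and the vibration amplitude below are our own simplifying assumptions, not MIT's actual sensor model:

```python
import numpy as np

# Toy model of the boundary-pixel averaging described above: a pixel
# straddling a red/blue edge records a blend weighted by how far the
# blue region has encroached. A linear blend is our simplification.

RED = np.array([1.0, 0.0, 0.0])
BLUE = np.array([0.0, 0.0, 1.0])

def boundary_pixel(blue_fraction: float) -> np.ndarray:
    """Colour of a pixel that is `blue_fraction` covered by the blue region."""
    return (1 - blue_fraction) * RED + blue_fraction * BLUE

# A vibration moving the edge by 1/200 of a pixel per frame shows up
# as a tiny but trackable shift in the blue channel over time:
frames = [boundary_pixel(0.5 + 0.005 * np.sin(t)) for t in range(4)]
print([round(float(f[2]), 4) for f in frames])
```

In this idealised model the blue channel recovers the encroachment exactly; the real work is in doing this robustly across a whole image in the presence of sensor noise.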

At first, the team used high-speed cameras shooting 2,000 to 6,000 frames per second through soundproof glass. In this case, the camera is shooting faster than the frequency of audible sound. As you can hear in the video above, speech recovered from a vibrating plant is fairly understandable.

But the coolest part is that the team was able to extract sound from ordinary 60 frame per second video cameras, by exploiting a technical quirk. The camera's sensor captures images by scanning horizontally, so certain parts of the image are actually recorded slightly after others. The rolling shutter sensor quirk let the team reconstruct audio even from video that was shot at rates slower than the frequency of sound. It's definitely fuzzier than with a high-speed camera, but one might still identify the number of speakers.
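The rolling-shutter trick works because each scanline is exposed at a slightly different moment, so the effective audio sampling rate is roughly the frame rate times the number of rows, not the frame rate alone. A back-of-the-envelope calculation (the 1080-row figure is our illustrative assumption, not from the paper, and real readout gaps lower the usable rate):

```python
# Back-of-the-envelope: a rolling shutter records each scanline at a
# slightly different time, so it can act like one audio sample per row.
# The 1080-row figure is an illustrative assumption, not from the paper,
# and gaps between frame readouts reduce the rate achievable in practice.

def effective_sample_rate(fps: float, rows: int) -> float:
    return fps * rows

global_shutter = effective_sample_rate(60, 1)      # whole frame at once
rolling_shutter = effective_sample_rate(60, 1080)  # one sample per row

# 60 samples/s can only represent frequencies below 30 Hz (Nyquist),
# far under speech; tens of kilohertz comfortably covers it.
print(global_shutter, rolling_shutter)
```

That ceiling is why 60 fps footage yields fuzzier but still partly intelligible audio, while the 2,000–6,000 fps cameras recover speech directly.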

The researchers are presenting their work at the computer graphics conference Siggraph this month. We can think of a few other people *cough* who might be interested.

Comment by Luke McKernan on June 16, 2014 at 8:02

Google Glass Offers Disabled People Access to a Bigger World

The photo is a blur. A wide swath of blue – the photographer’s torso or maybe someone else’s – spreads across the left half of the image. A dark square and rectangles of brown, like the open flaps of a cardboard box, fill the right.

As a picture, it’s unremarkable, an image taken apparently at random and perhaps by mistake. But for the photographer, it’s nothing short of momentous.

Ashley Lasanta has cerebral palsy, and for the first time in her 23 years, she was able to snap – and then share – a photograph, all without the use of her hands.

"It was awesome," she says. "I take pictures of just about anything."

The device she used wasn’t a traditional camera. It was Google Glass, the thumb-sized computer that's worn like a pair of glasses. With just a tilt or a nod of the head and a few spoken phrases, Lasanta can record videos, send emails, browse the web far faster than before, play games and, thanks to the wealth of recipes online, hang out in the kitchen and help with cooking ...

[more text]

Comment by Luke McKernan on May 28, 2014 at 8:01

What could be a big leap forward in the application of speech-to-text is being demonstrated by Microsoft, who are testing live video translation for Skype. Demonstration video here:

Comment by Luke McKernan on February 17, 2014 at 8:10

OK Google

Article on Google and voice recognition.

"There can be little doubt that, just like Microsoft thinks touch is the future of computing, Google seems to believe voice will be the user interface of the future..."

Comment by Luke McKernan on January 26, 2014 at 18:38

Interesting piece on using YouTube's automatic transcription feature to generate rough transcripts from which to produce more accurate records.

Dirty, Fast, and Free Audio Transcription with YouTube

Five years ago, I wrote about how I transcribe audio with Amazon's Mechanical Turk, splitting interviews into small segments and distributing the work among dozens of anonymous people. It ended up as one of my most popular posts ever, continuing to draw traffic and comments every day.

Lately, I've been toying with a free, fast way to generate machine transcriptions: repurposing YouTube's automatic captions feature.

How It Works

Every time you upload a video, YouTube tries to generate a caption file. If there's audible speech, you can grab a subtitle file within a few minutes of uploading the video.

But how's the quality? Pretty mediocre! It's about as good as you'd expect from a free machine-generated transcript. The caption files have no punctuation between sentences, speakers aren't broken out separately, and errors are very common.

But if you're transcribing interviews, it's often easier to edit a flawed transcript than to start from scratch. And YouTube provides a solid interface for editing your transcript alongside the audio and getting the results in plaintext.
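If you'd rather pull the caption file out and edit it elsewhere, one simple approach is to flatten it to plaintext yourself. The sketch below assumes the WebVTT subtitle format YouTube can export; it is our own illustration, and real auto-caption files carry extra headers and cue settings it ignores:

```python
import re

# Sketch: flatten a WebVTT caption file into rough plaintext for
# hand-editing. Illustrative only; real YouTube auto-caption files
# include extra headers (Kind:, Language:) and inline cue markup
# that this minimal version does not handle.

TIMESTAMP = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{3} --> ")

def vtt_to_text(vtt: str) -> str:
    lines = []
    for line in vtt.splitlines():
        line = line.strip()
        if not line or line == "WEBVTT" or TIMESTAMP.match(line):
            continue
        lines.append(line)
    return " ".join(lines)

sample = """WEBVTT

00:00:00.000 --> 00:00:02.500
five years ago i wrote about

00:00:02.500 --> 00:00:05.000
transcribing audio with mechanical turk
"""
print(vtt_to_text(sample))
```

The output is exactly the kind of unpunctuated, speaker-less text described above: a starting point for editing, not a finished transcript.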

I used TunesToTube, a free service for uploading MP3s to YouTube, to upload the first 15 minutes of our New Disruptors interview, with permission from Glenn Fleishman.

It took about 30 seconds for TunesToTube to generate the 15-minute-long video, three seconds to upload it, and about a minute for the video to be viewable on my account.

It takes a bit more time for YouTube to generate the audio transcriptions. Testing in the middle of a weekday, it took about six minutes to transcribe a two-minute video, and around 30 minutes for the 15-minute video. Fortunately, there's nothing you need to do while it processes. Just upload and wait.

I ran a number of familiar film monologues through YouTube's transcription engine, and the results vary from solid to laughably bad. I've posted the videos below with the automatic transcription and their actual text.

As you'd expect, it works best with clearly enunciated spoken word. Soft words over background music, like in the Breakfast Club clip, fall apart pretty quickly. But some, like Independence Day, aren't terrible.

[See full article for examples]

Comment by Luke McKernan on October 21, 2013 at 17:09

Speaker Diarization Boosts Automatic Speaker Recognition In Audio Recordings

An important goal in spoken-language-systems research is speaker diarization - computationally determining how many speakers feature in a recording and which of them speaks when.

To date, the best diarization systems have used supervised machine learning; they're trained on sample recordings that a human has indexed, indicating which speaker enters when. In a new paper, MIT researchers show how they can improve speaker diarization so that it can automatically annotate audio or video recordings without supervision: No prior indexing is necessary.

They also discuss a new, compact way to represent the differences between individual speakers' voices, which could be of use in other spoken-language computational tasks.

"You can know something about the identity of a person from the sound of their voice, so this technology is keying in to that type of information," says Jim Glass, a senior research scientist at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and head of its Spoken Language Systems Group. "In fact, this technology could work in any language. It's insensitive to that."

To create a sonic portrait of a single speaker, Glass explains, a computer system will generally have to analyze more than 2,000 different acoustic features; many of those may correspond to familiar consonants and vowels, but many may not. To characterize each of those features, the system might need about 60 variables, which describe properties such as the strength of the acoustic signal in different frequency bands.

[Image caption: The new algorithm, which determines who speaks when in audio recordings, represents every second of speech as a point in a three-dimensional space. In an iterative process, it then groups the points together, associating each group with a single speaker.]

E pluribus tres

The result is that for every second of a recording, a diarization system would have to search a space with 120,000 dimensions, which would be prohibitively time-consuming. In prior work, Najim Dehak, a research scientist in the Spoken Language Systems Group and one of the new paper's co-authors, had demonstrated a technique for reducing the number of variables required to describe the acoustic signature of a particular speaker, dubbed the i-vector ...
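The iterative grouping of per-second points described above can be illustrated with a deliberately simple clustering sketch. This is our toy stand-in, not the MIT system: real diarization uses i-vectors and probabilistic clustering rather than a fixed distance threshold:

```python
import numpy as np

# Toy illustration of grouping per-second speech vectors by speaker:
# a point joins the nearest existing group if it is close enough,
# otherwise it starts a new group. Our stand-in sketch, not MIT's
# method (which uses i-vectors and probabilistic clustering).

def diarize(points: np.ndarray, threshold: float = 1.0) -> list[int]:
    """Assign each per-second vector a speaker-group index."""
    centroids: list[np.ndarray] = []
    counts: list[int] = []
    labels: list[int] = []
    for p in points:
        p = p.astype(float)
        if centroids:
            dists = [float(np.linalg.norm(p - c)) for c in centroids]
            best = int(np.argmin(dists))
            if dists[best] <= threshold:
                # running-mean update of the matched group's centroid
                counts[best] += 1
                centroids[best] += (p - centroids[best]) / counts[best]
                labels.append(best)
                continue
        centroids.append(p.copy())
        counts.append(1)
        labels.append(len(centroids) - 1)
    return labels

# Two well-separated synthetic "speakers", five seconds each:
rng = np.random.default_rng(0)
a = rng.normal([0, 0, 0], 0.1, size=(5, 3))
b = rng.normal([5, 5, 5], 0.1, size=(5, 3))
print(diarize(np.vstack([a, b])))
```

The unsupervised step the researchers describe amounts to doing this kind of grouping without any human-indexed training recordings, in the full reduced i-vector space rather than this toy 3-D one.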

Comment by Luke McKernan on August 22, 2013 at 8:12

Nuance’s “talking ads” speak their first words – in Swedish

In April, Nuance Communications decided to bring its speech recognition and natural language understanding technology to a new industry: advertising. It created what it called the Voice Ad, a marketing medium that allows mobile users to speak to a digital advertisement on their phones and receive an answer back.

Nuance teamed up with Millennial Media, Jumptap (which are set to merge) and AdMarvel to bring these new interactive ads to market, but it turns out that a European ad network beat all of them to the punch. Widespace is debuting the first Nuance-powered voice ad in the apps of two Swedish media companies: Nordic daily newspaper Expressen and television programming guide Tv24.

There are no details yet on what form the ads will take, but in general Nuance’s voice ads are supposed to be self-contained brand-specific versions of a virtual assistant like Siri. Users can interact with the ads by asking them plain-speech questions. Nuance’s language servers in the cloud interpret the question and provide the appropriate response either via text, video or spoken word.

Nuance and its partners are encouraging brands and their ad agencies to use their traditional spokespeople as the voice blueprint for the ads. So, for instance, if you’re selling insurance and the face of your TV ads is Morgan Freeman, Freeman’s pre-recorded voice can answer your questions within the ad.

Comment by Luke McKernan on August 8, 2013 at 17:26

The BL's Opening up Speech Archives project concluded last week. We will continue to investigate the best ways in which to assimilate speech-to-text technologies into our discovery systems. A project web page has an overview of the work undertaken. A full report is in preparation.

Comment by Luke McKernan on August 5, 2013 at 10:01

CyberAlert Launches Nationwide Radio Monitoring Service

CyberAlert, the all-in-one media monitoring and measurement company, announced today the launch of a comprehensive radio monitoring service for public relations and marketing.

CyberAlert Radio monitors more than 250 news and talk radio stations in the Top 50 U.S. markets. The monitoring covers all local and national news along with local and syndicated talk shows. Using advanced speech-to-text technology, the new radio monitoring service identifies radio clips based on key words specified by CyberAlert’s clients and delivers the text of the radio broadcast. Clients can also order high-quality downloadable audio files of broadcasts from most radio markets.

CyberAlert’s radio monitoring service can be ordered as a stand-alone service or in an integrated package with online, TV news and social media monitoring.

“Radio is the missing component in many media monitoring services, yet it is a channel that heavily impacts public opinion,” said William J. Comcowich, CEO of CyberAlert. “With the addition of radio to our media monitoring services, CyberAlert now truly is the all-media service covering the full scope of online news, broadcast news and social media. Our clients now have the benefit of a fully-integrated and low-cost monitoring and measurement service so they won’t miss a mention, no matter where it occurs.”

Like its other media monitoring services, CyberAlert Radio offers customized keyword searches and delivers all clips overnight to the client's email. CyberAlert also stores each client's radio monitoring clips in an online digital clip archive, with unlimited storage and a full-featured dashboard for clip management.

More information on radio monitoring and CyberAlert's online and social media monitoring and measurement services is available on the company's website.

About CyberAlert:

Founded in 1999 as one of the very first SaaS and cloud computing services, CyberAlert is a worldwide news monitoring, broadcast monitoring, social media monitoring and media measurement service. Its CyberAlert® 5.0 worldwide online news monitoring service monitors 55,000+ online news sources each day in 250+ languages in 191 countries. The company's TV broadcast monitoring service monitors the closed caption text and video feed of over 2,100 news programs on over 600 TV stations in all 210 markets in the United States. CyberAlert's radio monitoring service covers more than 250 radio stations in the top 50 U.S. markets. For social media monitoring, CyberAlert monitors over 75 million blogs worldwide, 100,000 Web message boards and UseNet news groups, and over 200 video sharing sites like YouTube, as well as Twitter and Facebook, for consumer insight about companies, products, key issues and trends. CyberAlert offers a no-risk 14-day free trial for most media monitoring services.



