Playback

For anyone interested in sound and sound recordings

Speech to Text

This group is for anyone interested in speech-to-text and voice recognition systems. Technologies that can convert the spoken word into text are growing in number, from voice-activated commands for smartphones to the conversion of digital speech archives into readable and searchable text.

The particular focus of this group is the conversion of speech archives. It is interested in the competing technologies, cost-effective models for archives and libraries, and the uses of transcribed speech collections (audio and video) for researchers.

Members: 12
Latest Activity: yesterday

GROUP MATTERS

The Comment section is for general information and conversation about speech-to-text matters. The Discussion Forum will be used to list specific packages and to discuss issues relating to these.

Discussion Forum

Interviewy

Started by Luke McKernan Jan 17. 0 Replies

iMIRACLE by AT&T WATSON

Started by Mari King. Last reply by Mari King May 29, 2013. 2 Replies

Palaver + VoxForge

Started by Luke McKernan May 16, 2013. 0 Replies

CONTENTUS - Next Generation Multimedia Libraries

Started by Mari King. Last reply by Mari King May 7, 2013. 1 Reply

BBC Snippets

Started by Mari King Apr 10, 2013. 0 Replies

DARPA

Started by Luke McKernan Mar 9, 2013. 0 Replies

Luxid

Started by Luke McKernan Mar 6, 2013. 0 Replies

AVAtech

Started by Richard Ranft Mar 5, 2013. 0 Replies

Comment Wall

Comment by Luke McKernan yesterday

Great video Mari. We're getting nearer to whatever it is we want to get nearer to.

Comment by Mari King yesterday

Behind the Mic: The Science of Talking with Computers

Language. Easy for humans to understand (most of the time), but not so easy for computers. This is a short film about speech recognition, language understanding, neural nets, and using our voices to communicate with the technology around us.

Comment by Luke McKernan on September 12, 2014 at 8:10

TV monitoring service is fair use, judge rules

http://arstechnica.com/tech-policy/2014/09/tv-monitoring-service-is...

Last year, Fox News sued a media-monitoring service called TVEyes, which allows its clients to search for and watch clips of TV and radio stations.

Fox lawyers argued the service violated copyright law and should be shut down. In a ruling published yesterday, US District Judge Alvin Hellerstein disagreed, finding that TVEyes' core services are a transformative fair use.

It's a significant digital-age fair use ruling, one that's especially important for people and organizations who want to comment on or criticize news coverage.

TVEyes constantly records more than 1,400 television and radio stations, using closed captions and speech-to-text technology to make a comprehensive and searchable database for its subscribers, who generally pay $500 per month for the service. The company has more than 2,200 subscribers, including the White House, 100 members of Congress, the Department of Defense, as well as big news organizations like Bloomberg, Reuters, ABC, and the Associated Press.

The service is used by a wide range of clients who want to keep an eye on the media, from police departments seeking to know how widely a public safety announcement has been disseminated, to members of Congress who want to know what's being said about them.

It's also—perhaps not coincidentally—used by media critics, including those who keep an eye on Fox News. For instance, Media Matters for America has used TVEyes to analyze Fox News' Benghazi-flavored coverage of Hillary Clinton, as well as what it calls the network's "selective outrage" over gay rights.

One common use for TVEyes is to let users search for a keyword to find out when a term was mentioned in the news, then view a video clip that starts 14 seconds before the keyword is mentioned, and goes on for up to 10 minutes. Most clips are shorter than two minutes.

Users can also download and save the clips, and share them via social media or email. TVEyes subscribers all agree to only use downloaded clips for "internal purposes" like review, analysis, or research.

In Fox's view, those products all compete unfairly with its own TV clip licensing, which is done through ITN Source; that company maintains a keyword-searchable library of 80,000 Fox News videos. Through ITN Source, Fox News has made about $2 million in licensing fees.

Read more

Comment by Luke McKernan on August 7, 2014 at 12:42

MIT Scientists Figured Out How to Eavesdrop Using a Crisp Packet

http://www.gizmodo.co.uk/2014/08/mit-scientists-figured-out-how-to-...

In a scenario straight out of "Enhance, enhance!", MIT scientists have figured out that the tiny vibrations on ordinary objects like a crisp packet, glass of water or even a plant can be reconstructed into intelligible speech. All it takes is a camera and a snappy algorithm.

Sound waves, after all, are just disturbances in the air. When sound hits something light and delicate like a crisp packet, the object will vibrate ever so slightly. Now, you've probably noticed that house plants and crisp packets do not sway and shake when you have a conversation. To capture movements as small as a tenth of a micrometre (or five thousandths of a pixel), the team tracked the colour of single pixels over time. Here's how it works, as explained by a press release from MIT:

Suppose, for instance, that an image has a clear boundary between two regions: Everything on one side of the boundary is blue; everything on the other is red. But at the boundary itself, the camera's sensor receives both red and blue light, so it averages them out to produce purple. If, over successive frames of video, the blue region encroaches into the red region — even less than the width of a pixel — the purple will grow slightly bluer. That colour shift contains information about the degree of encroachment.

At first, the team used high-speed cameras shooting 2,000 to 6,000 frames per second through soundproof glass. In this case, the camera is shooting faster than the frequency of audible sound. As you can hear in the video above, speech recovered from a vibrating plant is fairly understandable.

But the coolest part is that the team was able to extract sound from ordinary 60 frame per second video cameras, by exploiting a technical quirk. The camera's sensor captures images by scanning horizontally, so certain parts of the image are actually recorded slightly after others. The rolling shutter sensor quirk let the team reconstruct audio even from video that was shot at rates slower than the frequency of sound. It's definitely fuzzier than with a high-speed camera, but one might still identify the number of speakers.
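Not the researchers' actual pipeline, but a minimal Python sketch of the core idea, assuming a high frame-rate clip already loaded as a NumPy array: average the colour of a small patch in each frame (sub-pixel motion at a colour boundary nudges that average), then band-pass the fluctuation and treat it as audio. The patch location and filter settings here are invented for illustration.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def recover_vibration(frames, fps, rows=slice(100, 110), cols=slice(200, 210)):
    """Rough vibration signal from a stack of video frames.

    frames: array of shape (n_frames, height, width, 3), values 0-255.
    Returns a normalised 1-D signal sampled at the video frame rate.
    """
    # Mean colour of a small patch per frame: motion far smaller than a
    # pixel still shifts this average slightly at a colour boundary.
    signal = frames[:, rows, cols, :].mean(axis=(1, 2, 3))
    signal = signal - signal.mean()  # remove the static background level

    # Band-pass to roughly the speech range; needs a high-speed camera
    # (e.g. the 2,000-6,000 fps mentioned above) for fps/2 to cover it.
    nyquist = fps / 2.0
    b, a = butter(3, [80.0 / nyquist, min(3000.0, nyquist - 1) / nyquist],
                  btype="band")
    signal = filtfilt(b, a, signal)
    return signal / np.abs(signal).max()

# e.g.: scipy.io.wavfile.write("recovered.wav", int(fps),
#        recover_vibration(frames, fps).astype(np.float32))
```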

The researchers are presenting their work at the computer graphics conference Siggraph this month. We can think of a few other people *cough* who might be interested.

Comment by Luke McKernan on June 16, 2014 at 8:02

Google Glass Offers Disabled People Access to a Bigger World

http://www.usnews.com/news/stem-solutions/articles/2014/06/10/googl...

The photo is a blur. A wide swath of blue – the photographer’s torso or maybe someone else’s – spreads across the left half of the image. A dark square and rectangles of brown, like the open flaps of a cardboard box, fill the right.

As a picture, it’s unremarkable, an image taken apparently at random and perhaps by mistake. But for the photographer, it’s nothing short of momentous.

Ashley Lasanta has cerebral palsy, and for the first time in her 23 years, she was able to snap – and then share – a photograph, all without the use of her hands.

"It was awesome," she says. "I take pictures of just about anything."

The device she used wasn’t a traditional camera. It was Google Glass, the thumb-sized computer that's worn like a pair of glasses. With just a tilt or a nod of the head and a few spoken phrases, Lasanta can record videos, send emails, browse the web far faster than before, play games and, thanks to the wealth of recipes online, hang out in the kitchen and help with cooking ...

[more text]

Comment by Luke McKernan on May 28, 2014 at 8:01

What could be a big leap forward in the application of speech-to-text is being demonstrated by Microsoft, who are testing live video translation for Skype. Demonstration video here:

http://qz.com/214106/watch-skype-translate-a-video-conversation-in-...

Comment by Luke McKernan on February 17, 2014 at 8:10

OK Google

http://techcrunch.com/2014/02/16/ok-google/

Article on Google and voice recognition.

"There can be little doubt that, just like Microsoft thinks touch is the future of computing, Google seems to believe voice will be the user interface of the future..."

Comment by Luke McKernan on January 26, 2014 at 18:38

Interesting piece on using YouTube's automatic captions feature to generate rough transcripts from which to produce more accurate records.

Dirty, Fast, and Free Audio Transcription with YouTube

http://waxy.org/2014/01/dirty_fast_and_free_audio_transcription_wit...

Five years ago, I wrote about how I transcribe audio with Amazon's Mechanical Turk, splitting interviews into small segments and distributing the work among dozens of anonymous people. It ended up as one of my most popular posts ever, continuing to draw traffic and comments every day.

Lately, I've been toying with a free, fast way to generate machine transcriptions: repurposing YouTube's automatic captions feature.

How It Works

Every time you upload a video, YouTube tries to generate a caption file. If there's audible speech, you can grab a subtitle file within a few minutes of uploading the video.

But how's the quality? Pretty mediocre! It's about as good as you'd expect from a free machine-generated transcript. The caption files have no punctuation between sentences, speakers aren't broken out separately, and errors are very common.

But if you're transcribing interviews, it's often easier to edit a flawed transcript than to start from scratch. And YouTube provides a solid interface for editing your transcript against the audio and getting the results in plain text.

I used TunesToTube, a free service for uploading MP3s to YouTube, to upload the first 15 minutes of our New Disruptors interview, with permission from Glenn Fleishman.

It took about 30 seconds for TunesToTube to generate the 15-minute-long video, three seconds to upload it, and about a minute for the video to be viewable on my account.

It takes a bit more time for YouTube to generate the audio transcriptions. Testing in the middle of a weekday, it took about six minutes to transcribe a two-minute video, and around 30 minutes for the 15-minute video. Fortunately, there's nothing you need to do while it processes. Just upload and wait.

I ran a number of familiar film monologues through YouTube's transcription engine, and the results vary from solid to laughably bad. I've posted the videos below with the automatic transcription and their actual text.

As you'd expect, it works best with clearly enunciated spoken word. Soft dialogue over background music, like in the Breakfast Club clip, falls apart pretty quickly. But some, like the Independence Day clip, aren't terrible.

[See full article for examples]
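Not from the article, but if you try this workflow, the caption file YouTube hands back (.srt format) still carries cue numbers and timestamps. A short Python sketch for flattening one into a plain editable block of text (file names are hypothetical):

```python
import re
import sys

# Matches SRT timestamp lines, e.g. "00:01:02,500 --> 00:01:05,000"
TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")

def srt_to_text(path):
    """Strip cue numbers and timestamps from an .srt caption file,
    returning the caption text as one block for hand-editing."""
    kept = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.isdigit() or TIMESTAMP.match(line):
                continue  # skip blanks, cue numbers, and timestamps
            kept.append(line)
    return " ".join(kept)

if __name__ == "__main__":
    print(srt_to_text(sys.argv[1]))  # e.g. python srt_to_text.py captions.srt
```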

Comment by Luke McKernan on October 21, 2013 at 17:09

Speaker Diarization Boosts Automatic Speaker Recognition In Audio Recordings

http://www.science20.com/news_articles/speaker_diarization_boosts_a...

An important goal in spoken-language-systems research is speaker diarization - computationally determining how many speakers feature in a recording and which of them speaks when.

To date, the best diarization systems have used supervised machine learning; they're trained on sample recordings that a human has indexed, indicating which speaker enters when. In a new paper, MIT researchers show how they can improve speaker diarization so that it can automatically annotate audio or video recordings without supervision: No prior indexing is necessary.

They also discuss a compact way to represent the differences between individual speakers' voices, which could be of use in other spoken-language computational tasks.

"You can know something about the identity of a person from the sound of their voice, so this technology is keying in to that type of information," says Jim Glass, a senior research scientist at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) and head of its Spoken Language Systems Group. "In fact, this technology could work in any language. It's insensitive to that."

To create a sonic portrait of a single speaker, Glass explains, a computer system will generally have to analyze more than 2,000 different acoustic features; many of those may correspond to familiar consonants and vowels, but many may not. To characterize each of those features, the system might need about 60 variables, which describe properties such as the strength of the acoustic signal in different frequency bands.

The new algorithm represents every second of speech as a point in a three-dimensional space. In an iterative process, it then groups the points together, associating each group with a single speaker.

E pluribus tres

The result is that for every second of a recording, a diarization system would have to search a space with 120,000 dimensions (2,000 features times 60 variables each), which would be prohibitively time-consuming. In prior work, Najim Dehak, a research scientist in the Spoken Language Systems Group and one of the new paper's co-authors, had demonstrated a technique for reducing the number of variables required to describe the acoustic signature of a particular speaker, dubbed the i-vector ...
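The clustering step is easy to caricature. A toy Python sketch, nothing like the real system's scale: it assumes each second of speech has already been reduced to a three-dimensional vector, and that the number of speakers is known in advance (the MIT work infers it automatically):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Fake data: 60 seconds of speech from three speakers, each second
# summarised as a 3-D point (a stand-in for a reduced i-vector).
speaker_centres = rng.normal(scale=5.0, size=(3, 3))
who_spoke = rng.integers(0, 3, size=60)          # true speaker per second
seconds = speaker_centres[who_spoke] + rng.normal(scale=0.5, size=(60, 3))

# Group the seconds into clusters; each cluster ~ one speaker, which
# yields a rough "who speaks when" labelling with no training data.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(seconds)
print(labels)  # e.g. [2 0 0 1 2 ...] -- cluster index for each second
```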

 

Members (12)
