
PocketSphinx is a service that performs speech recognition (also referred to as automatic speech recognition (ASR), CSR, voice recognition, or voice-to-text).

It was ported to Android by the cmusphinx team at SourceForge.

This repo aims to turn it into an offline eyes-free ASR service for Android to fill in some of the gaps left by the com.google.voicesearch package.


Requirements

This project aims to use PocketSphinx to meet the following requirements:

  • Offline speech recognition
  • Run the service on a pre-recorded audio file (see the sketch below)
    • Preferred formats: Ogg, MP3
  • Eyes-free speech recognition
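
As a rough sketch of the pre-recorded-file requirement, the caller below hands an audio file to a background transcription service via an Intent. The action string, extra key, and class name are hypothetical placeholders; the project has not yet published an Intent API for this.

```java
import android.content.Context;
import android.content.Intent;
import android.net.Uri;

public class TranscriptionClient {

    // Hypothetical constants -- the real service would publish its own
    // action and extra names once the file-transcription API is defined.
    public static final String ACTION_TRANSCRIBE_FILE =
            "edu.cmu.pocketsphinx.action.TRANSCRIBE_FILE";
    public static final String EXTRA_AUDIO_URI = "audio_uri";

    /** Asks the (hypothetical) background service to transcribe one file. */
    public static void transcribe(Context context, Uri audioFile) {
        Intent intent = new Intent(ACTION_TRANSCRIBE_FILE);
        intent.putExtra(EXTRA_AUDIO_URI, audioFile.toString());
        context.startService(intent);
    }
}
```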

Design Goals

Design the service in such a way that

  • a developer can write logic to check whether Wi-Fi is active, and only transfer data when on Wi-Fi (see the sketch below)
  • a developer can write logic to schedule the service, for example to run overnight
  • a developer can provide the logic to record via the handset, a paired headset's microphone, or Bluetooth
  • audio is saved to the device so that the user doesn't "lose" their thoughts, and they can re-listen to their audio to correct the transcription
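
A minimal sketch of the first goal, using the stock ConnectivityManager API; the class name WifiGate and where the check is called from are up to the developer. It requires the ACCESS_NETWORK_STATE permission in the manifest.

```java
import android.content.Context;
import android.net.ConnectivityManager;
import android.net.NetworkInfo;

public class WifiGate {

    /** Returns true only when an active Wi-Fi connection is up.
     *  Requires the ACCESS_NETWORK_STATE permission. */
    public static boolean isOnWifi(Context context) {
        ConnectivityManager cm = (ConnectivityManager)
                context.getSystemService(Context.CONNECTIVITY_SERVICE);
        NetworkInfo wifi = cm.getNetworkInfo(ConnectivityManager.TYPE_WIFI);
        return wifi != null && wifi.isConnected();
    }
}
```

The service can then guard any upload with `if (WifiGate.isOnWifi(this)) { ... }` and queue the audio locally otherwise.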

Use cases

Transcribe dictation / create transcripts for podcasts / create subtitles for videos

  • Audio file to text (sample result: 00:00:00-->00:00:04 ; {exactly, eggs acting, exacting})
  • A boolean to run it on the device, or to send it to a Sphinx server/cloud elsewhere (the service can contact an external server that accepts registration of new Sphinx servers running a Sphinx web service; more servers = more load balancing and less bandwidth)
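
The sample result above maps naturally onto a subtitle cue. Below is a sketch of formatting one time-coded hypothesis set as an .srt cue; packing the n-best alternatives into braces inside the cue text is an assumption, since .srt has no standard slot for them.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class SrtCue {

    /** Formats milliseconds as the SRT timestamp HH:MM:SS,mmm. */
    static String timestamp(long ms) {
        return String.format(Locale.US, "%02d:%02d:%02d,%03d",
                ms / 3600000, (ms / 60000) % 60, (ms / 1000) % 60, ms % 1000);
    }

    /** One numbered cue whose text is the brace-delimited n-best list. */
    public static String cue(int index, long startMs, long endMs,
                             List<String> hypotheses) {
        StringBuilder sb = new StringBuilder();
        sb.append(index).append('\n')
          .append(timestamp(startMs)).append(" --> ")
          .append(timestamp(endMs)).append('\n')
          .append('{');
        for (int i = 0; i < hypotheses.size(); i++) {
            if (i > 0) sb.append(", ");
            sb.append(hypotheses.get(i));
        }
        return sb.append("}\n\n").toString();
    }

    public static void main(String[] args) {
        // Reproduces the sample result above as an .srt cue.
        System.out.print(cue(1, 0, 4000,
                Arrays.asList("exactly", "eggs acting", "exacting")));
    }
}
```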

General Eyes-Free Offline Speech Recognition

  • Register PocketSphinx as a service that responds to android.speech.RecognizerIntent so that users can make it the default in the preferences (i.e. if they have no data connection on their Android, or they are generally not online); see the sketch below
  • Create an open Intent for other developers to call PocketSphinx
  • Function very similarly to com.google.android.voicesearch, except with no UI and no button, and allow a boolean to control whether it stops "listening" on silence or on a user action {back button, screen tap, gesture, top-to-bottom swipe, etc.}
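
On the caller side, the standard RecognizerIntent flow already gives other developers an open entry point; once PocketSphinx registers for ACTION_RECOGNIZE_SPEECH, the same code would reach it with no Google-specific dependency. A minimal sketch (the request code is arbitrary):

```java
import android.app.Activity;
import android.content.Intent;
import android.speech.RecognizerIntent;
import java.util.ArrayList;

public class DictationActivity extends Activity {
    private static final int REQUEST_SPEECH = 1; // arbitrary request code

    /** Fires the standard recognition Intent; any registered
     *  recognizer -- Google's or PocketSphinx -- can handle it. */
    void startRecognition() {
        Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
        intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
        startActivityForResult(intent, REQUEST_SPEECH);
    }

    @Override
    protected void onActivityResult(int requestCode, int resultCode, Intent data) {
        if (requestCode == REQUEST_SPEECH && resultCode == RESULT_OK) {
            ArrayList<String> matches =
                    data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS);
            // matches holds the n-best hypotheses, best first
        }
        super.onActivityResult(requestCode, resultCode, data);
    }
}
```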

Additional Requirement for Transcription of Audio Files

Audio file processing should allow for

  • a boolean splitOnSilence
    • splits on a conservative threshold of over 40 ms of silence (-25 dB); see the sketch after this list
    • will offer high precision
    • Use case: each split will be a coherent utterance that is useful for the general public
  • a boolean splitOnProsodicPhrase
    • splits on a number of prosodic cues {glottalization, lengthened vowel, pitch, breath, hesitation, etc.}
    • experimental
    • will offer high recall, most likely by over-splitting
    • Use case: each split will be a prosodic phrase, which is useful for linguists or speech developers but not for the general public
  • By default splitOnSilence should be true, to minimize long-running processes and allow for maximal concurrency by sending off multiple blocks of speech to be transcribed
  • Pre-process by creating an annotation file for the audio (development of an Android audio -> SRT time codes tool is happening here)
    • Format: .srt, WebVTT, or .sbv (YouTube format)
    • Reasoning: if the time annotation is provided in the .srt format, it will allow re-use of the code for other developers' purposes, including
      • displaying subtitles, and re-syncing subtitles in a video player if they are out of sync
      • displaying transcripts of podcasts while playing them in a music player
    • Dependencies:
      • encoding from MP3 to another format
      • detecting silence
    • Sphinx considerations
      • This sort of preprocessing step is probably already implemented in PocketSphinx; it's just a question of finding it.
      • This step is also implemented somewhere in the LIUM tools.
      • The MARF project has some libraries for audio analysis; it is not yet clear how complete they are or which goals have been realized. MARF is an open-source research platform and a collection of voice/sound/speech/text and natural language processing (NLP) algorithms written in Java, arranged into a modular and extensible framework that facilitates the addition of new algorithms.
      • If the silence detection is not easy to find, consider implementing a lightweight solution using the Java Sound API (see the sketch below).
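
A sketch of that lightweight fallback, in plain Java so it can run on or off the device. The 40 ms frame length and -25 dB threshold come from the splitOnSilence defaults above; interpreting the threshold as dBFS of frame RMS, and assuming 16-bit mono PCM input, are both assumptions to verify against the real audio pipeline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SilenceScanner {

    /** True when the frame's RMS energy falls below thresholdDb (dBFS). */
    static boolean isSilent(short[] frame, double thresholdDb) {
        double sumSquares = 0;
        for (short s : frame) {
            double sample = s / 32768.0;           // normalize 16-bit PCM to [-1, 1]
            sumSquares += sample * sample;
        }
        double rms = Math.sqrt(sumSquares / frame.length);
        double db = 20 * Math.log10(rms + 1e-10);  // epsilon guards log(0)
        return db < thresholdDb;
    }

    /** Indices of silent 40 ms frames in a 16-bit mono PCM buffer. */
    public static List<Integer> silentFrames(short[] pcm, int sampleRate) {
        int frameLen = sampleRate * 40 / 1000;     // 40 ms per frame
        List<Integer> silent = new ArrayList<Integer>();
        for (int i = 0; i + frameLen <= pcm.length; i += frameLen) {
            short[] frame = Arrays.copyOfRange(pcm, i, i + frameLen);
            if (isSilent(frame, -25.0)) {
                silent.add(i / frameLen);
            }
        }
        return silent;
    }
}
```

Runs of consecutive silent frames longer than the threshold then become the split points; each block of speech between two runs can be sent off for transcription independently.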
