Google implemented the Web Speech API (for both speech recognition and synthesis) in Chrome, and developers can use it. As far as I know, the same Google speech recognition technology is what YouTube uses to generate closed captions on some videos. You may be able to find code that interacts with it.
The data flow would probably be:
video file => extract and convert the audio => send it to Google's speech API => get the text back => write it into an SRT file.
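To make the first and last steps of that flow concrete, here is a minimal Python sketch. It assumes ffmpeg is installed for the audio extraction and that you already have timed text segments from whatever recognizer you end up using. The function names (`ffmpeg_extract_command`, `srt_timestamp`, `to_srt`) and the 16 kHz mono default are my own choices for illustration, not part of any Google API.

```python
def ffmpeg_extract_command(video_path, audio_path, sample_rate=16000):
    """Build (but do not run) an ffmpeg command that drops the video
    stream and writes mono audio at the given sample rate.
    The 16 kHz mono default is an assumption, not a documented requirement."""
    return [
        "ffmpeg", "-i", video_path,
        "-vn",                    # ignore the video stream
        "-ac", "1",               # downmix to one audio channel
        "-ar", str(sample_rate),  # resample the audio
        audio_path,
    ]


def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments):
    """Turn (start_seconds, end_seconds, text) tuples into SRT text.
    Each cue is: a counter, a time range, the text, then a blank line."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)
```

You would run the command with something like `subprocess.run(ffmpeg_extract_command("talk.mp4", "talk.flac"), check=True)` and write the result of `to_srt(...)` to a `.srt` file. Note that a recognizer that only returns plain text gives you no timings, so you would probably have to split the audio into chunks yourself and use the chunk boundaries as the start and end times.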
EDIT: There doesn't seem to be an official API page other than the W3C spec, so here are some more links:
- http://www.sitepoint.com/experimenting-web-speech-api/
- http://www.smashingmagazine.com/2014/12/05/enhancing-ux-with-the-web-speech-api/
These examples are about using the API from inside Chrome, but you can also query Google's online speech recognition engine directly. For instance, Jasper, a speech-recognizing personal assistant for the Raspberry Pi, lets you choose Google as its speech recognition engine.