On Transcription Software

Interviews are a great way to create original content, but they aren't easy to run and publish. You need great questions, a flow, and a clear through line. If the interview is in an audio or video format, you also need timestamps, notes, and transcriptions.

Transcriptions are especially important. As a host, even if you publish the interview as-is, you'll need transcripts to improve your SEO. If you need to go through the whole interview to produce an article, it's way faster to transcribe the video directly and work from there.

The average human's typing speed is somewhere between 40 and 75 words per minute. Speaking is roughly twice as fast, from 110 to 200 words per minute, but it's nowhere near as quick as reading: between 200 and 450 wpm! This is why I always prefer reading to listening when I have a choice, especially if I need to take notes or work with audio material.

The problem is that transcription services aren't cheap. According to Google, a professional transcriptionist charges $90-180 per audio hour. Automated transcription services are cheaper, but they still cost at least $12 per hour.

I have an assignment at the moment where I need to write an article from a 50-minute video interview. Transcribing it and taking notes manually would take me about two hours. I'm paid $25 per hour, so if I were to pay for a transcription service, I would lose 50% of my paycheck. I'm also paid by the number of words I publish, so it's in my best interest to write faster (the more words per hour, the bigger my hourly rate), but 50% is not an acceptable loss ratio, in my opinion.
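
As a rough sanity check on that two-hour estimate, here is a small sketch that combines the speaking and typing speeds quoted above with the 50-minute length of the interview. It assumes you type continuously and never pause or rewind the recording, which is optimistic, so treat the numbers as a lower bound rather than a measurement.

```python
# Speeds quoted earlier in this post, in words per minute (low, high).
TYPING_WPM = (40, 75)
SPEAKING_WPM = (110, 200)


def typing_minutes(audio_minutes, speaking_wpm, typing_wpm):
    """Minutes needed to type out audio spoken at speaking_wpm."""
    words = audio_minutes * speaking_wpm
    return words / typing_wpm


audio = 50  # length of the interview, in minutes

# Best case: slow speaker, fast typist. Worst case: the opposite.
best = typing_minutes(audio, SPEAKING_WPM[0], TYPING_WPM[1])
worst = typing_minutes(audio, SPEAKING_WPM[1], TYPING_WPM[0])

print(f"best case:  {best:.0f} min")
print(f"worst case: {worst:.0f} min")
```

Under those assumptions the range runs from a bit over an hour to more than four hours, so two hours for transcription plus notes sits on the hopeful side.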

I have two solutions. I can either build my own transcription engine, or I can use a low-level transcription API.

Building my own engine would mean taking something like Mozilla's DeepSpeech, which is built on TensorFlow, and feeding it data. Having studied the basics of machine learning in college, I know that training your own models is not a trivial task. I might try it later, but for now, I need something I can use quickly. Hence my decision to go for a low-level transcription API.

After some brief research, I settled on Google Cloud Speech-to-Text's API. You can try a demo on the landing page, so I know it's accurate enough for my use case. According to Google, the error rate is 5%, which is the standard for most speech-to-text models out there. The best part is the pricing: $0.024 per minute, or $1.44 per hour. You only pay for what you use, and you get one hour free every month.

In other words, I can build my own local transcription service at roughly a tenth of the cost most service providers ask for, and I can opt out at any time. Once the algorithm has done the bulk of the work, I can use a free tool like oTranscribe to correct the last few mistakes.
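
To put the prices side by side for my 50-minute interview, here is a minimal sketch using only the rates quoted in this post. The function names are made up for illustration, and it assumes every provider bills in proportion to the audio length, with no minimum charge and no rounding up, which real billing may not follow.

```python
AUDIO_MINUTES = 50  # length of the interview

# Rates quoted in this post.
PROFESSIONAL_PER_HOUR = (90, 180)  # dollars per audio hour (low, high)
SERVICE_PER_HOUR = 12              # cheapest automated service
API_PER_MINUTE = 0.024             # Google Cloud Speech-to-Text
API_FREE_MINUTES = 60              # free allowance each month


def hourly_cost(audio_minutes, dollars_per_hour):
    """Cost at an hourly rate, assuming proportional billing."""
    return audio_minutes * dollars_per_hour / 60


def api_cost(audio_minutes, free_minutes_left=0):
    """Cost of the API once any remaining free minutes are used up."""
    billable = max(0, audio_minutes - free_minutes_left)
    return billable * API_PER_MINUTE


pro_low = hourly_cost(AUDIO_MINUTES, PROFESSIONAL_PER_HOUR[0])
pro_high = hourly_cost(AUDIO_MINUTES, PROFESSIONAL_PER_HOUR[1])
service = hourly_cost(AUDIO_MINUTES, SERVICE_PER_HOUR)
api_paid = api_cost(AUDIO_MINUTES)
api_free = api_cost(AUDIO_MINUTES, API_FREE_MINUTES)

print(f"professional:      ${pro_low:.2f} to ${pro_high:.2f}")
print(f"automated service: ${service:.2f}")
print(f"API, no free time: ${api_paid:.2f}")
print(f"API, free hour:    ${api_free:.2f}")
print(f"API vs service:    {api_paid / service:.0%}")
```

With these numbers the API comes out at about 12% of the cheapest automated service, and at nothing at all if the month's free hour is still untouched.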

Let's see how it goes. I'll write a tutorial about it later this week.