Transcription is important. How important? So important that it’s a matter of national security.

True, for your particular endeavor U.S. national security may not be at stake, but the viability of your business, your organization, your academic research, or your publication may be. So many crucial things are said aloud – and recorded electronically – that get lost in the ether, only because no one had the time to go through the vast volumes of audio after the fact. It’s impossible for one or a few people to sit down and listen through it all. No one has that kind of time.

The U.S. National Security Agency is well aware of that. That’s why the NSA uses speech recognition technology to create transcripts of phone calls that can be easily searched and stored. According to The Intercept,

“Spying on international telephone calls has always been a staple of NSA surveillance, but the requirement that an actual person do the listening meant it was effectively limited to a tiny percentage of the total traffic. By leveraging advances in automated speech recognition, the NSA has entered the era of bulk listening.”

To sift through and analyze the massive amounts of voice communications collected by the NSA, “human listening was clearly not going to be the solution,” wrote The Intercept. “There weren’t enough ears,” an NSA whistleblower told the publication.

The speech recognition technology used by the NSA certainly has limitations. Without a doubt, it doesn’t transcribe with 100 percent accuracy. They’re probably lucky if they get 90 percent accuracy. And with speech recognition technology, you just get one large block of text, with no paragraphs, no speaker labels, and usually no periods where sentences are supposed to end. Instead, it’s just one gigantic run-on sentence. So for their important stuff, I’m sure the NSA uses human transcriptionists instead of speech recognition software.

We offer speech recognition services, but few folks take advantage of them because of the pitfalls mentioned above. For your important stuff, you need human transcriptionists. At least for the NSA, it’s a matter of national security.

If you don’t already audio-record your interviews, you most definitely should. Apart from those who know shorthand, few people can write or type fast enough to get everything that’s said. That could seriously impact the completeness and the accuracy of your final product.

If you do regularly record your interviews, then that’s great. But you know how tedious and time-consuming it is to transcribe them. The process can add several hours to the time it takes to complete an article. That’s valuable time you could better spend on more productive activities like writing, researching and interviewing.

So have your interviews transcribed.

And that leads to another benefit. For journalists, interviewers, researchers, authors, and others, one of the sweetest-sounding phrases in the English language is: copy and paste.

Imagine copying and pasting without having to worry one whit about copyright infringement. Also imagine kicking back and relaxing while conducting your interviews, without having to bother with taking notes. And, imagine cutting hours off of the time it takes you to write an article.

No, it isn’t too good to be true.

When someone talks to you during an interview or in another setting, they may explain how something works in their own words. If you record what they say and have it transcribed, you can copy and paste their words into your article without having to worry about whether it’s word-for-word with something already published. That’s because when people talk, it’s original phrasing. Just like snowflakes, no two off-the-cuff spoken paragraphs are alike.

To be sure, you still need to attribute the speaker, e.g. write “according to” so-and-so. And frankly, people most of the time don’t speak with perfect grammar and organization, so you’ll probably end up massaging and rearranging the copied-and-pasted words quite a bit anyway. But at least you have something to start with. Transcripts are a great cure for writer’s block.

Make sure your transcription service provider does the following:

* Writes contractions as contractions when the speaker uses them. People usually speak with contractions – like saying “didn’t” instead of “did not”. But for some reason transcriptionists can be prone to writing “did not” even though the contraction was used. So when someone says “won’t”, it should be written that way rather than “will not”. Same for “he’s” rather than “he is,” or “that’s” rather than “that is,” etc. The exception is a name followed by “’s”: often the speaker says something like “Mary’s going to the ….” In that case, it should be written out as “Mary is going to the…” (even if it sounds like the speaker said “Mary’s”). That eliminates any confusion with the possessive “Mary’s”, as in “Mary’s car”.

* Correctly differentiates between “it’s” and “its”. A common error is writing one where the other belongs. “Its” is possessive – indicating something that belongs to something, such as “its button” or “its handle.” By contrast, “it’s” is a contraction for “it is”. A good way to know the difference is to determine whether one can say “it is” in its place. If so, then it’s “it’s.” If not, then it’s “its.”

* Inserts a question mark when the speaker’s tone of voice indicates a question. Make sure there’s no period when there should be a question mark.


How should a transcription company handle instances of inaudibles, background conversations, periods of silence, and audio gaps?

* In the transcript, “(inaudible)” should be indicated for unintelligible words. Better yet is “(inaudible at 00:23:22)” or whatever the timestamp is, so that the reader can quickly find and listen to the spot in question.

* Periods of silence or of inaudible background conversation should be indicated in parentheses, e.g., “(inaudible background conversation)”. Periods of people speaking simultaneously should be noted as “(overlapping voices)” or “(crosstalk)” or “(interposing)”.

* Laughter and other non-speaking sounds should be noted as well, e.g. “(laughter)” or “(chuckles)”.

* If silence or background conversation goes on for a significant amount of time, the number of minutes of non-speaking should be indicated.

* If it sounds like there’s a break or interruption in the audio, “(audio gap)” should be indicated.

People, of course, don’t always speak in a crystal-clear manner. Or the quality of the audio recording could be subpar. Or there could be background noise. So sometimes it’s hard to discern certain words when listening to a recording. In those cases, when a transcriptionist is not 100 percent certain of what the person said, “(ph)” should be inserted after the word to indicate that it is the transcriptionist’s best guess or a phonetic rendering. And when a group of words or a phrase is unclear, then that phrase should be within brackets, followed by “(ph)”.

It’s a similar situation for words that are clear, but where the spelling is unknown, such as with people’s last names. In that case, “(sp?)” or “(sp)” should follow the word (oftentimes a name) whose spelling the listener is not sure of.

Google helps. A good transcription company should make sure its transcriptionists are Googling what unfamiliar words sound like, and seeing what comes up. Same with spellings of speakers’ names. One often can determine that by typing the person’s name based on how it sounds, and key words associated with the person, such as his or her profession or affiliation.

But if the word or name still is unknown, with (ph) or (sp) in place, the end user of the transcript – who often was present at the recording and who is more familiar with what was said and with the spellings of names – can do a quick search for each instance of (ph) or (sp) and make the appropriate corrections.
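That quick search can also be scripted. Here is a minimal Python sketch, assuming the markers appear exactly as “(ph)”, “(sp)” or “(sp?)” as described above; the function name `find_uncertain_spots` and the sample transcript are made up for illustration.

```python
import re

# Assumed marker forms, based on the conventions described above: (ph), (sp), (sp?)
MARKER = re.compile(r"\((ph|sp\??)\)")

def find_uncertain_spots(transcript: str):
    """Return (line_number, marker, line_text) for every flagged spot."""
    hits = []
    for number, line in enumerate(transcript.splitlines(), start=1):
        for match in MARKER.finditer(line):
            hits.append((number, match.group(0), line.strip()))
    return hits

# Hypothetical sample transcript
sample = """INTERVIEWER: Who led the project?
RESPONDENT: That was Dr. Kowalczyk (sp?) at the institute.
RESPONDENT: She called it a [stochastic resonance] (ph) effect."""

for number, marker, text in find_uncertain_spots(sample):
    print(f"line {number}: {marker} -> {text}")
```

Each hit gives the line to check, so the person who knows the names and terms can work through the list and fix them one by one.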


Most of the time you don’t need every little utterance included in your transcript, such as um, hm, uh-huh, OK (especially when it’s the interviewer), like, etc. unless it’s essential to the meaning of what’s being said. Those words are distracting to the reader – you want to read through the transcription smoothly. So if they’re just superfluous things that aren’t important, then best to leave them out.

Same with “false starts”. When people speak they often start a few words of a sentence, and then stop and start over, repeating those same words or altering them. Those also are distracting to the reader in a transcript, so it’s usually best not to transcribe them either.

A transcript where every utterance and false start is included is known as a verbatim transcript. They’re often desired for legal transcription or things such as student or patient evaluations. But the vast majority of the time, they just throw off or slow down the reader.

In a related vein, when people speak, even in formal settings, they often use words like wanna, cuz, gonna, etc. In those cases it’s best that in the transcript, they’re written out as want to, because, going to, etc. – unless the transcript should be verbatim. Again, writing out want to instead of wanna, because instead of cuz, etc., results in smoother reading.
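To make the idea concrete, here is a rough Python sketch of that kind of cleanup for a non-verbatim transcript. The word lists are short, hypothetical samples (a real style guide would be longer), and the function name `clean_read` is invented for this example. A human still has to judge whether a given “um” or “OK” matters to the meaning.

```python
import re

# Hypothetical, deliberately short word lists for illustration only.
FILLERS = ("um", "uh", "hm", "uh-huh")
INFORMAL = {"wanna": "want to", "gonna": "going to", "cuz": "because"}

# Longest fillers first, so "uh-huh" is tried before "uh".
_FILLER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(w) for w in sorted(FILLERS, key=len, reverse=True)) + r")\b,?\s*",
    re.IGNORECASE,
)
_INFORMAL_RE = re.compile(r"\b(?:" + "|".join(INFORMAL) + r")\b", re.IGNORECASE)

def _expand(match):
    """Spell out an informal word, keeping a leading capital if there was one."""
    word = match.group(0)
    full = INFORMAL[word.lower()]
    return full.capitalize() if word[0].isupper() else full

def clean_read(text: str) -> str:
    """Drop standalone filler words, then spell out informal forms."""
    text = _FILLER_RE.sub("", text)
    return _INFORMAL_RE.sub(_expand, text)

print(clean_read("Um, I wanna try it again, uh, cuz we're gonna need it."))
# I want to try it again, because we're going to need it.
```

For a verbatim transcript you would skip this step entirely and keep every utterance as spoken.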

Similarly, depending on what the transcript will be used for, obvious grammar errors in the spoken speech usually should be corrected in the transcript, such as changing “she don’t know” to “she doesn’t know.” An exception is “ain’t” – it should be written out if the speaker says it.

It often is not possible for the transcriptionist to know the names of each speaker unless they state their names clearly or if the client provides the names. So in lieu of names, descriptors like “interviewer”, “speaker” or “participant” are used.

When there are two speakers and it’s a question and answer format, the natural descriptor for the person asking the questions is INTERVIEWER. While the person answering the questions is technically the interviewee, that word is too close in appearance to “interviewer”, so a better term for labeling the interviewee is RESPONDENT.

When there are multiple speakers, and it is in a meeting or focus group format, SPEAKER or PARTICIPANT are common descriptors. The vast majority of the time, the transcriptionist works with audio rather than video, so it is often not easy to discern who is speaking. Female voices are particularly difficult to differentiate. But rather than writing PARTICIPANT each time, there are a few steps that can be taken to get a better idea of who is speaking when the names aren’t known.

It is usually easy to pinpoint the moderator so that speaker has the MODERATOR label. If the group is composed of both males and females, the descriptors can be FEMALE PARTICIPANT or MALE PARTICIPANT. And if anyone in the group has an accent, descriptors such as MALE VOICE/NONENGLISH ACCENT or FEMALE VOICE/SPANISH ACCENT should be used.

So when it’s a group of people being transcribed, you can still get a good idea of who is speaking. Make sure your transcription company follows those guidelines. And if you specifically want to know who said a certain passage, you can use the timestamps to cross-check the passage in the transcript with the audio. That’s why it’s also important that your transcripts contain timestamps.

There are of course higher-quality transcripts and lower-quality transcripts. Some of the characteristics of the latter include the following:

* Run-on sentences. A less-skilled transcriptionist may not insert periods and commas at pauses, or at the end of a thought or statement. A speaker may sound like his or her sentences are running together, but actually, they are not. A good transcriptionist knows where a thought ends and where a new one begins, and inserts a period (or if appropriate, a comma) there. If a speaker cuts off a sentence and switches to a new thought, some form of punctuation should be present, such as a dash ( — ), to indicate that.

* Too-long paragraphs. If the speaker is speaking for a long time, the text should be broken up into paragraphs, which makes it easier for the reader to follow.

* “Improvising”, i.e. where the transcriptionist conveys the meaning of what the speaker said using a different word or words than the speaker actually used.

* Omitting perfectly audible words or phrases. (And we’re not talking about words like um’s, uh-huh’s, OK’s, etc.) Or, if the word or phrase is inaudible, there may be no “(inaudible)” or some such indicator that a word was there.

So make sure the transcription company you go with avoids the above.

When you want to cross-check a word or passage in the transcript with the audio file, how do you find the correct spot in the audio file? That’s a daunting prospect – you’d have to estimate where in the audio file to go, based on how far into the transcript you are. But you almost never find the right spot right off the bat – you’re initially usually many minutes off – and it’s a hassle trying to find that point of convergence. It involves listening to a passage in the audio, then doing a search in the transcript to find out where you are in the audio in relation to the spot in the transcript you want to pinpoint, and probably repeating that several times before you find the right spot. How tedious and time-consuming!

Fortunately there’s an easier way: timestamps. If the transcript contains timestamps (a.k.a. timecodes) every minute or so based on the running time of the audio file, then finding the desired spot is easy. If you’re at a spot in the transcript that reads “he went to the store” and it’s near the [00:28:45] timestamp, then you just skip to the 28-minute, 45-second mark in the audio file and voilà, you found “he went to the store”.

So make sure your transcription company gives you transcripts with timestamps peppered throughout – every minute or so. If they don’t, you just may have a lot of annoying work in store for you.
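If you work with plain-text transcripts, the lookup is easy to automate. The following Python sketch assumes timestamps are written as [HH:MM:SS], as in the example above; the function name `seek_position` and the sample text are invented for illustration. It returns how many seconds into the audio to skip, based on the last timestamp that appears before the phrase you are looking for.

```python
import re

# Assumes timestamps look like [HH:MM:SS], as in the example above.
STAMP = re.compile(r"\[(\d{2}):(\d{2}):(\d{2})\]")

def seek_position(transcript: str, phrase: str):
    """Return the audio offset, in seconds, of the last timestamp before `phrase`.

    Returns None if the phrase is not in the transcript, and 0 if no
    timestamp precedes it (i.e. start listening from the beginning).
    """
    index = transcript.find(phrase)
    if index == -1:
        return None
    seconds = 0
    # Only look at timestamps that end before the phrase starts.
    for match in STAMP.finditer(transcript, 0, index):
        hours, minutes, secs = (int(part) for part in match.groups())
        seconds = hours * 3600 + minutes * 60 + secs
    return seconds

# Hypothetical sample transcript
sample = (
    "[00:27:45] We talked about the budget first.\n"
    "[00:28:45] Then he went to the store and came back.\n"
    "[00:29:45] After that we wrapped up."
)

print(seek_position(sample, "he went to the store"))  # 1725, i.e. 28 min 45 s
```

With timestamps every minute or so, the phrase is never more than about a minute past the position this returns.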

When the transcription company you’re working with delivers a transcript to you, its file name should be exactly the same as the name of the audio file from which the transcript was derived (apart from the file extensions, e.g. .doc vs. .mp3). That way, there’s no confusion as to which audio file the transcript corresponds to; you know exactly which transcript goes with which audio file.

And it’s important to keep the file names consistent because while reading through the transcript, you may often have to refer back to the audio file, for example in order to verify certain words or passages, decipher inaudibles, or listen to the speaking styles.
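When a batch of transcripts comes back, a few lines of Python can confirm that the names line up. This is only a sketch: the function name `unmatched_transcripts` and the file names are hypothetical, and it compares names exactly, ignoring only the extension.

```python
from pathlib import PurePath

def unmatched_transcripts(audio_files, transcript_files):
    """Return transcript names whose base name matches no audio file."""
    audio_stems = {PurePath(name).stem for name in audio_files}
    return [
        name for name in transcript_files
        if PurePath(name).stem not in audio_stems
    ]

# Hypothetical file names
audio = ["smith_interview.mp3", "focus_group_2.mp3"]
transcripts = ["smith_interview.doc", "focus group 2 final.doc"]

print(unmatched_transcripts(audio, transcripts))  # ['focus group 2 final.doc']
```

Anything the check reports is a transcript you would have to match to its audio file by hand.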

And that brings up another point that we’ll look at in the next post: timestamps.