Tutorial

Word Error Rate and normalization

In this step-by-step tutorial you will compare the Word Error Rate (WER) of two machine-generated transcripts. The WER is calculated against a less-than-perfect reference made from a human-generated subtitle file. You will also use normalization rules to improve the accuracy of the results.

To follow this tutorial you will need a working installation of benchmarkstt and these source files saved to your working folder: the subtitles file (qt_subs.xml) and the two machine-generated transcripts (qt_aws.json and qt_kaldi.json).

This demo shows the capabilities of Release 1 of the library, which benchmarks the accuracy of word recognition only. The library supports adding new metrics in future releases. Contributions are welcome.

Creating the plain text reference file

Creating accurate verbatim transcripts for use as a reference is time-consuming and expensive. As a quick and easy alternative, we will make a "reference" from a subtitles file. Subtitles are lightly edited and include additional text, such as descriptions of sounds and actions, so they are not a verbatim transcription of the speech. Consequently, they are not suitable for calculating absolute WER. However, we are only interested in relative WER for illustration purposes, so this use of subtitles is acceptable.

Warning

Evaluations in this tutorial are not done for the purpose of assessing tools. The use of subtitles as reference will skew the results, so they should not be taken as an indication of overall performance or as an endorsement of a particular vendor or engine.

We will use the subtitles file for the BBC's Question Time Brexit debate. This program was chosen for its length (90 minutes) and because live debates are particularly challenging to transcribe.

The subtitles file includes a lot of extra text in XML tags. This text shouldn't be used in the calculation: for both reference and hypotheses, we want to run the tool on plain text only. To strip out the XML tags, we will use the benchmarkstt-tools command, with the normalization subcommand:

benchmarkstt-tools normalization --inputfile qt_subs.xml --outputfile qt_reference.txt --regex "</?[?!\[\]a-zA-Z][^>]*>" " "

The normalization rule --regex takes two parameters: a regular expression pattern and the replacement string.

In this case all XML tags will be replaced with a space. This leaves a lot of extra space characters, but the diff algorithm used later ignores them, so we don't have to clean them up. --inputfile and --outputfile specify the input and output files.
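
If you want to see what this rule does before running it on the full subtitles file, you can try the same pattern with Python's re module. This is only an illustration: the subtitle fragment below is invented, and the real file will look different.

import re

# The same pattern used by the --regex rule above: match an XML tag.
pattern = r"</?[?!\[\]a-zA-Z][^>]*>"

# Invented fragment in the style of an XML subtitles file.
sample = '<p begin="00:00:01.000" end="00:00:03.000">APPLAUSE</p><p>Tonight, the leaders face the voters.</p>'

# Replace every tag with a space, exactly as the normalization rule does.
print(re.sub(pattern, " ", sample))
# prints: APPLAUSE  Tonight, the leaders face the voters.  (with extra spaces where the tags were)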

The file qt_reference.txt has been created. You can see that the XML tags are gone, but the file still contains non-dialogue text like 'APPLAUSE'.

For better results you can manually clean up the text, or run the command again with a different normalization rule (not included in this demo). But we will stop the normalization at this point.
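
Purely as an illustration of what such an extra rule could look like (this step is not part of the tutorial, and the output file name qt_reference_clean.txt is made up), a further --regex rule could blank out stand-alone runs of uppercase letters:

benchmarkstt-tools normalization --inputfile qt_reference.txt --outputfile qt_reference_clean.txt --regex '\b[A-Z]{2,}\b' ' '

A crude pattern like this would also strip legitimate all-caps words such as BBC, which is one reason a manual clean-up usually gives better results.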

We now have a simple text file that will be used as the reference. The next step is to get the machine-generated transcripts for benchmarking.

Creating the plain text hypotheses files

The first release of benchmarkstt does not integrate directly with STT vendors or engines, so transcripts for benchmarking have to be retrieved separately and converted to plain text.

For this demo, two machine transcripts were retrieved for the Question Time audio: from AWS Transcribe and from the BBC's version of Kaldi, an open-source STT framework.

Both AWS and BBC-Kaldi return the transcript in JSON format, with word-level timings. Both JSON files also contain a field with the entire transcript as a single string, and this is the value we will use (we don't benchmark timings in this version).

To make the hypothesis file for AWS, we will use the transcript JSON field from the transcript generated by AWS, and save it as a new document qt_aws_hypothesis.txt.

We can automate this again using benchmarkstt-tools normalization and a somewhat more complex regex parameter:

benchmarkstt-tools normalization --inputfile qt_aws.json --outputfile qt_aws_hypothesis.txt --regex '^.*"transcript":"([^"]+)".*' '\1'
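
The pattern captures everything between the quotes after "transcript": and the replacement \1 keeps only that captured group, discarding the rest of the line. Here is a minimal Python sketch of the same idea; the JSON string below is a simplified stand-in, not the exact AWS Transcribe output.

import re

# Simplified stand-in for the AWS JSON; the real file has many more fields.
raw = '{"jobName":"qt","results":{"transcripts":[{"transcript":"tonight the prime minister faces the voters"}]}}'

# Keep only the contents of the "transcript" field.
print(re.sub(r'^.*"transcript":"([^"]+)".*', r'\1', raw))
# prints: tonight the prime minister faces the voters

The BBC-Kaldi command below uses exactly the same approach, only with the "text" field instead.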

To make the BBC-Kaldi transcript file we will use the text JSON field from the transcript generated by Kaldi, and save it as a new document qt_kaldi_hypothesis.txt.

Again, benchmarkstt-tools normalization with a --regex argument will be used for this:

benchmarkstt-tools normalization --inputfile qt_kaldi.json --outputfile qt_kaldi_hypothesis.txt --regex '^.*"text":"([^"]+)".*' '\1'

You'll end up with two plain text files, each containing the entire transcript as a single block of text.
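
A quick way to sanity-check the files is to compare their word counts; the reference and the two hypotheses should all be in the same rough range. For example, with a few lines of Python (assuming the three files from the previous steps are in your working folder):

import pathlib

for name in ("qt_reference.txt", "qt_aws_hypothesis.txt", "qt_kaldi_hypothesis.txt"):
    words = pathlib.Path(name).read_text(encoding="utf-8").split()
    print(name, len(words))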

Benchmark!

We can now compare each of the hypothesis files to the reference in order to calculate the Word Error Rate. We process one file at a time, now using the main benchmarkstt command, with two flags: --wer is the metric we are most interested in, while --diffcounts outputs the number of insertions, deletions, substitutions and correct words (the basis for WER calculation).

Calculate WER for AWS Transcribe:

benchmarkstt --reference qt_reference.txt --hypothesis qt_aws_hypothesis.txt --wer --diffcounts

The output should look like this:

wer
===

0.336614

diffcounts
==========

equal: 10919
replace: 2750
insert: 675
delete: 1773
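
These two numbers are consistent with the standard WER formula: the reference length is the number of words that were matched, substituted or deleted, and the WER is the total number of errors (substitutions, insertions and deletions) divided by that length. You can check this with the counts printed above:

# Diff counts reported for the AWS hypothesis.
equal, replace, insert, delete = 10919, 2750, 675, 1773

reference_words = equal + replace + delete   # 15442 words in the reference
errors = replace + insert + delete           # 5198 errors
print(round(errors / reference_words, 6))    # prints: 0.336614

The same check works for the BBC-Kaldi counts below.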

Now calculate the WER and "diff counts" for BBC-Kaldi:

benchmarkstt --reference qt_reference.txt --hypothesis qt_kaldi_hypothesis.txt --wer --diffcounts

The output should look like this:

wer
===

0.379744

diffcounts
==========

equal: 10437
replace: 4006
insert: 859
delete: 999

After running these two commands, you can see that the WER for both transcripts is quite high (roughly 34% for AWS Transcribe and 38% for BBC-Kaldi). Let's see the actual differences between the reference and the hypotheses by using the --worddiffs flag:

benchmarkstt --reference qt_reference.txt --hypothesis qt_kaldi_hypothesis.txt --worddiffs

The output should look like this (example output is truncated):

worddiffs
=========

Color key: Unchanged Reference Hypothesis

[color-coded word-by-word diff of the reference and the Kaldi hypothesis, truncated. In the terminal, words that appear only in the reference and words that appear only in the hypothesis are shown in different colors. At this point most of the highlighted differences are capitalization ("Tonight," vs "tonight"), punctuation, and numbers written as digits vs words ("90" vs "ninety").]

Normalize

You can see that a lot of the differences are due to capitalization and punctuation. Because we are only interested in the correct identification of words, these types of differences should not count as errors. To get a more accurate WER, we will remove punctuation marks and convert all letters to lowercase. We will do this for the reference and both hypothesis files by using the benchmarkstt-tools normalization subcommand again, with two rules: the built-in --lowercase rule and the --regex rule:

benchmarkstt-tools normalization -i qt_reference.txt -o qt_reference_normalized.txt --lowercase --regex "[,.-]" " "

benchmarkstt-tools normalization -i qt_kaldi_hypothesis.txt -o qt_kaldi_hypothesis_normalized.txt --lowercase --regex "[,.-]" " "

benchmarkstt-tools normalization -i qt_aws_hypothesis.txt -o qt_aws_hypothesis_normalized.txt --lowercase --regex "[,.-]" " "
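
Taken together, the two rules lowercase the text and replace commas, full stops and hyphens with spaces. A quick illustration with plain Python, using a sentence adapted from the reference (the extra spaces are harmless because, as noted earlier, the diff algorithm ignores them):

import re

sample = "Welcome to Question Time. So, over the next 90 minutes - the leaders face the voters."

# Apply the same two rules: lowercase, then replace [,.-] with a space.
print(re.sub(r"[,.-]", " ", sample.lower()))
# prints: welcome to question time  so  over the next 90 minutes   the leaders face the voters
# (with doubled spaces and a trailing space where the punctuation was)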

We now have normalized versions of the reference and two hypothesis files.

Benchmark again

Let's run the benchmarkstt command again, this time calculating WER based on the normalized files:

benchmarkstt --reference qt_reference_normalized.txt --hypothesis qt_kaldi_hypothesis_normalized.txt --wer --diffcounts --worddiffs

The output should look like this (example output is truncated):

wer
===

0.196279

diffcounts
==========

equal: 13229
replace: 1284
insert: 789
delete: 965

worddiffs
=========

Color key: Unchanged Reference Hypothesis

[color-coded word-by-word diff of the normalized reference and the normalized Kaldi hypothesis, truncated. With capitalization and punctuation differences removed, far fewer words are highlighted; the remaining differences are mostly genuine recognition errors, such as "conservative next week" recognized as "conserve it", and "90" vs "ninety".]

You can see that this time there are fewer differences between the reference and hypothesis. Accordingly, the WER is much lower for both hypotheses. The transcript with the lower WER is closer to the reference made from subtitles.

Do it all in one step!

Above, we used two commands: benchmarkstt-tools for the normalization and benchmarkstt for calculating the WER. But we can combine all these steps into a single command using a rules file and a config file that references it.

First, let's create a file for the regex normalization rules. Create a text document with this content:

# Replace XML tags with a space
"</?[?!\[\]a-zA-Z][^>]*>"," "
# Replace punctuation with a space
"[,.-]"," "

Save this file as rules.regex.

Now let's create a config file that contains all the normalization rules. They must be listed under the [normalization] section (in this release, there is only one implemented section). The section references the regex rules file we created above, and also includes one of the built-in rules:

[normalization]
# Load regex rules file and tell the processor it's a regex type
Regex rules.regex
# Built-in rule
lowercase

Save the above as config.conf. These rules will be applied to both hypothesis and reference, in the order in which they are listed.

Now run benchmarkstt with the --config argument. We also need to tell the tool to treat the XML as plain text, otherwise it will look for an XML processor and fail. We do this with the --reference-type argument:

benchmarkstt --reference qt_subs.xml --reference-type plaintext --hypothesis qt_kaldi_hypothesis.txt --config config.conf --wer

Output:

wer
===

0.196279

And we do the same for the AWS transcript, this time using the short form for arguments:

benchmarkstt -r qt_subs.xml -rt plaintext -h qt_aws_hypothesis.txt --config config.conf --wer

Output:

wer
===

0.239889

You now have WER scores for each of the machine-generated transcripts, calculated against a subtitles reference file.

As a next step, you could add more normalization rules or implement your own metrics or normalizer classes and submit them back to this project.

Word Error Rate variants

In this tutorial we used the --wer metric with the mode argument omitted, which defaults to the strict WER variant. This variant uses Python's built-in diff algorithm to calculate the WER, which is stricter and results in a slightly higher WER than the commonly used Levenshtein distance algorithm.

If you use BenchmarkSTT to compare different engines then this is not a problem since the relative ranking will not be affected. However, for better compatibility with other benchmarking tools, a WER variant that uses the Levenshtein edit distance algorithm is provided. To use it, specify --wer levenshtein.
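
For example, reusing the normalized files from the earlier steps:

benchmarkstt --reference qt_reference_normalized.txt --hypothesis qt_kaldi_hypothesis_normalized.txt --wer levenshtein

Because the strict variant tends to report a slightly higher WER, the Levenshtein-based figure should come out a little lower than the one calculated earlier.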