Tutorial

Word Error Rate and normalization

In this step-by-step tutorial you will compare the Word Error Rate (WER) of two machine-generated transcripts. The WER is calculated against a less-than-perfect reference made from a human-generated subtitle file. You will also use normalization rules to improve the accuracy of the results.

To follow this tutorial you will need a working installation of benchmarkstt and these source files saved to your working folder:

  1. Subtitle file

  2. Transcript generated by AWS

  3. Transcript generated by Kaldi

This demo shows the capabilities of Release 1 of the library, which benchmarks the accuracy of word recognition only. The library supports adding new metrics in future releases. Contributions are welcome.

Creating the plain text reference file

Creating accurate verbatim transcripts for use as reference is time-consuming and expensive. As a quick and easy alternative, we will make a "reference" from a subtitles file. Subtitles are slightly edited and they include additional text like descriptions of sounds and actions, so they are not a verbatim transcription of the speech. Consequently, they are not suitable for calculating absolute WER. However, we are interested in calculating relative WER for illustration purposes only, so this use of subtitles is deemed acceptable.

Warning

Evaluations in this tutorial are not done for the purpose of assessing tools. The use of subtitles as reference will skew the results so they should not be taken as an indication of overall performance or as an endorsement of a particular vendor or engine.

We will use the subtitles file for the BBC's Question Time Brexit debate. This program was chosen for its length (90 minutes) and because live debates are particularly challenging to transcribe.

The subtitles file includes a lot of extra text in XML tags. This text shouldn't be used in the calculation: for both reference and hypotheses, we want to run the tool on plain text only. To strip out the XML tags, we will use the benchmarkstt-tools command, with the normalization subcommand:

benchmarkstt-tools normalization --inputfile qt_subs.xml --outputfile qt_reference.txt --regex "</?[?!\[\]a-zA-Z][^>]*>" " "

The normalization rule --regex takes two parameters: a regular expression pattern and the replacement string.

In this case, all XML tags will be replaced with a space. This leaves a lot of extra space characters, but the diff algorithm ignores them later, so we don't have to clean them up. --inputfile and --outputfile specify the input and output files.
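To see what this rule does under the hood, here is a minimal sketch using Python's re module. The subtitle line and its markup are invented for illustration; this is not benchmarkstt's own code.

import re

# A made-up subtitle line, for illustration only.
line = '<span tts:color="white">APPLAUSE</span> Tonight, the Prime Minister faces the voters.'

# The same pattern and replacement string as the --regex rule above.
print(re.sub(r"</?[?!\[\]a-zA-Z][^>]*>", " ", line))
# prints the line with both tags replaced by a space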

The file qt_reference.txt has been created. You can see that the XML tags are gone, but the file still contains non-dialogue text like 'APPLAUSE'.

For better results you can manually clean up the text, or run the command again with a different normalization rule (not included in this demo). But we will stop the normalization at this point.

We now have a simple text file that will be used as the reference. The next step is to get the machine-generated transcripts for benchmarking.

Creating the plain text hypotheses files

The first release of benchmarkstt does not integrate directly with STT vendors or engines, so transcripts for benchmarking have to be retrieved separately and converted to plain text.

For this demo, two machine transcripts were retrieved for the Question Time audio: from AWS Transcribe and from the BBC's version of Kaldi, an open-source STT framework.

Both AWS and BBC-Kaldi return the transcript in JSON format with word-level timings. Both files also contain a field holding the entire transcript as a single string, and this is the value we will use (timings are not benchmarked in this version).

To make the hypothesis file for AWS, we will use the transcript JSON field from the transcript generated by AWS, and save it as a new document qt_aws_hypothesis.txt.

We can automate this again using benchmarkstt-tools normalization and a somewhat more complex regex parameter:

benchmarkstt-tools normalization --inputfile qt_aws.json --outputfile qt_aws_hypothesis.txt --regex '^.*"transcript":"([^"]+)".*' '\1'

To make the BBC-Kaldi transcript file we will use the text JSON field from the transcript generated by Kaldi, and save it as a new document qt_kaldi_hypothesis.txt.

Again, benchmarkstt-tools normalization with a --regex argument will be used for this:

benchmarkstt-tools normalization --inputfile qt_kaldi.json --outputfile qt_kaldi_hypothesis.txt --regex '^.*"text":"([^"]+)".*' '\1'
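If you are curious how these extraction rules work, here is a minimal sketch in Python. The sample JSON is an invented stand-in (every field except the transcript string is made up); the real AWS and Kaldi files are much larger.

import json
import re

# Invented, minimal stand-in for the real transcript JSON.
sample = '{"jobName":"qt","transcript":"tonight the prime minister faces the voters","words":[]}'

# What the --regex rule does: keep only the captured field value.
text = re.sub(r'^.*"transcript":"([^"]+)".*', r'\1', sample)

# In your own scripts, a JSON parser is a more robust alternative to a regex.
assert text == json.loads(sample)["transcript"]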

You'll end up with two files similar to these:

  1. Text extracted from AWS transcript

  2. Text extracted from Kaldi transcript

Benchmark!

We can now compare each of the hypothesis files to the reference in order to calculate the Word Error Rate. We process one file at a time, now using the main benchmarkstt command, with two flags: --wer is the metric we are most interested in, while --diffcounts outputs the number of insertions, deletions, substitutions and correct words (the basis for WER calculation).

Calculate WER for AWS Transcribe:

benchmarkstt --reference qt_reference.txt --hypothesis qt_aws_hypothesis.txt --wer --diffcounts

The output should look like this:

wer
===

0.336614

diffcounts
==========

equal: 10919
replace: 2750
insert: 675
delete: 1773
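As a quick sanity check, the reported WER follows directly from these diff counts, assuming the usual WER definition of errors divided by the number of words in the reference:

# Diff counts reported above for the AWS hypothesis.
equal, replace, insert, delete = 10919, 2750, 675, 1773

reference_words = equal + replace + delete   # 15442 words in the reference
errors = replace + insert + delete           # 5198 word errors

print(round(errors / reference_words, 6))    # 0.336614, matching the reported WER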

Now calculate the WER and "diff counts" for BBC-Kaldi:

benchmarkstt --reference qt_reference.txt --hypothesis qt_kaldi_hypothesis.txt --wer --diffcounts

The output should look like this:

wer
===

0.379744

diffcounts
==========

equal: 10437
replace: 4006
insert: 859
delete: 999

After running these two commands, you can see that the WER for both transcripts is quite high (about 34% for AWS and 38% for Kaldi). Let's see the actual differences between the reference and the hypotheses by using the --worddiffs flag:

benchmarkstt --reference qt_reference.txt --hypothesis qt_kaldi_hypothesis.txt --worddiffs

The output should look like this (example output is truncated):

worddiffs
=========

Color key: Unchanged ​Reference​ ​Hypothesis​

​​·​BBC​·​2017​·​Tonight,​​​·​tonight​​·​the​​·​Prime​·​Minister,​·​Theresa​·​May,​​​·​prime​·​minister​·​theresa​·​may​​·​the​·​leader​·​of​·​the​​·​Conservative​·​Party,​​​·​conservative​·​party​​·​and​·​the​·​leader​·​of​​·​Labour​·​Party,​·​Jeremy​·​Corbyn,​​​·​the​·​labour​·​party​·​jeremy​·​corbyn​​·​face​·​the​​·​voters.​·​Welcome​·​to​·​Question​·​Time.​·​So,​​​·​voters​·​welcome​·​so​​·​over​·​the​·​next​​·​90​·​minutes,​​​·​ninety​·​minutes​​·​the​·​leaders​·​of​·​the​·​two​·​larger​·​parties​·​are​·​going​·​to​·​be​·​quizzed​·​by​·​our​·​audience​·​here​·​in​​·​York.​·​Now,​​​·​york​·​now​​·​this​·​audience​·​is​·​made​·​up​·​like​·​this​​·​-​​·​just​​·​a​·​third​​·​say​·​they​·​intend​·​to​·​vote​​·​Conservative​·​next​·​week.​·​The​​​·​conserve​·​it​·​the​​·​same​​·​number​​​·​numbers​​·​say​·​they're​·​going​·​to​·​vote​​·​Labour,​​​·​labour​​·​and​·​the​·​rest​·​either​·​support​·​other​​·​parties,​​​·​parties​​·​or​·​have​·​yet​·​to​·​make​·​up​·​their​​·​minds.​·​As​·​ever,​​​·​minds​·​and​·​as​·​ever​​·​you​·​can​·​comment​·​on​​·​all​·​of​·​this​·​from​·​home​​·​either​·​on​​·​Twitter​·​-​​​·​twitter​​·​our​·​hashtag​·​is​​·​#BBCQT​·​-​·​we're​​​·​bbc​·​two​·​were​​·​also​·​on​​·​Facebook,​​​·​facebook​​·​as​​·​usual,​​​·​usual​​·​and​·​our​·​text​·​number​·​is​​·​83981.​·​Push​​​·​a​·​three​·​nine​·​eight​·​one​·​push​​·​the​·​red​·​button​·​on​·​your​·​remote​·​to​·​see​·​what​·​others​·​are​​·​saying.​·​The​​​·​saying​·​and​·​their​​·​leaders​​·​-​​·​this​·​is​·​important​​·​-​​·​don't​·​know​·​the​·​questions​·​that​·​are​·​going​·​to​·​be​·​put​·​to​·​them​​·​tonight.​·​So,​​​·​tonight​·​so​​·​first​·​to​·​face​·​our​​·​audience,​​​·​audience​​·​please​·​welcome​·​the​·​leader​·​of​·​the​​·​Conservative​·​Party,​​​·​conservative​·​party​​·​the
...

Normalize

You can see that a lot of the differences are due to capitalization and punctuation. Because we are only interested in the correct identification of words, these types of differences should not count as errors. To get a more accurate WER, we will remove punctuation marks and convert all letters to lowercase. We will do this for the reference and both hypothesis files by using the benchmarkstt-tools normalization subcommand again, with two rules: the built-in --lowercase rule and the --regex rule:

benchmarkstt-tools normalization -i qt_reference.txt -o qt_reference_normalized.txt --lowercase --regex "[,.-]" " "

benchmarkstt-tools normalization -i qt_kaldi_hypothesis.txt -o qt_kaldi_hypothesis_normalized.txt --lowercase --regex "[,.-]" " "

benchmarkstt-tools normalization -i qt_aws_hypothesis.txt -o qt_aws_hypothesis_normalized.txt --lowercase --regex "[,.-]" " "
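Under the hood, the two rules amount to something like the following sketch (the example line is adapted from the reference text; this is not benchmarkstt's own code):

import re

line = "Welcome to Question Time. So, over the next 90 minutes, the leaders of the two larger parties are going to be quizzed by our audience."

line = line.lower()                  # the built-in --lowercase rule
line = re.sub(r"[,.-]", " ", line)   # the --regex rule: punctuation becomes spaces

print(line)  # welcome to question time  so  over the next 90 minutes ...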

We now have normalized versions of the reference and two hypothesis files.

Benchmark again

Let's run the benchmarkstt command again, this time calculating WER based on the normalized files:

benchmarkstt --reference qt_reference_normalized.txt --hypothesis qt_kaldi_hypothesis_normalized.txt --wer --diffcounts --worddiffs

The output should look like this (example output is truncated):

wer
===

0.196279

diffcounts
==========

equal: 13229
replace: 1284
insert: 789
delete: 965

worddiffs
=========

Color key: Unchanged Reference Hypothesis

​​·​bbc​·​2017​​·​tonight​·​the​·​prime​·​minister​·​theresa​·​may​·​the​·​leader​·​of​·​the​·​conservative​·​party​·​and​·​the​·​leader​·​of​​·​the​​·​labour​·​party​·​jeremy​·​corbyn​·​face​·​the​·​voters​·​welcome​​·​to​·​question​·​time​​·​so​·​over​·​the​·​next​​·​90​​​·​ninety​​·​minutes​·​the​·​leaders​·​of​·​the​·​two​·​larger​·​parties​·​are​·​going​·​to​·​be​·​quizzed​·​by​·​our​·​audience​·​here​·​in​·​york​·​now​·​this​·​audience​·​is​·​made​·​up​·​like​·​this​·​just​​·​a​·​third​​·​say​·​they​·​intend​·​to​·​vote​​·​conservative​·​next​·​week​​​·​conserve​·​it​​·​the​·​same​​·​number​​​·​numbers​​·​say​·​they're​·​going​·​to​·​vote​·​labour​·​and​·​the​·​rest​·​either​·​support​·​other​·​parties​·​or​·​have​·​yet​·​to​·​make​·​up​·​their​·​minds​​·​and​​·​as​·​ever​·​you​·​can​·​comment​·​on​​·​all​·​of​·​this​·​from​·​home​​·​either​·​on​·​twitter​·​our​·​hashtag​·​is​​·​#bbcqt​·​we're​​​·​bbc​·​two​·​were​​·​also​·​on​·​facebook​·​as​·​usual​·​and​·​our​·​text​·​number​·​is​​·​83981​​​·​a​·​three​·​nine​·​eight​·​one​​·​push​·​the​·​red​·​button​·​on​·​your​·​remote​·​to​·​see​·​what​·​others​·​are​·​saying​​·​the​​​·​and​·​their​​·​leaders​·​this​·​is​·​important​·​don't​·​know​·​the​·​questions​·​that​·​are​·​going​·​to​·​be​·​put​·​to​·​them​·​tonight​·​so​·​first​·​to​·​face​·​our​·​audience​·​please​·​welcome​·​the​·​leader​·​of​·​the​·​conservative​·​party
...

You can see that this time there are fewer differences between the reference and hypothesis. Accordingly, the WER is much lower for both hypotheses. The transcript with the lower WER is closer to the reference made from subtitles.

Do it all in one step!

Above, we used two commands: benchmarkstt-tools for the normalization and benchmarkstt for calculating the WER. But we can combine all these steps into a single command using a rules file and a config file that references it.

First, let's create a file for the regex normalization rules. Create a text document with this content:

# Replace XML tags with a space
"</?[?!\[\]a-zA-Z][^>]*>"," "
# Replace punctuation with a space
"[,.-]"," "

Save this file as rules.regex.

Now let's create a config file that contains all the normalization rules. They must be listed under the [normalization] section (in this release, there is only one implemented section). The section references the regex rules file we created above, and also includes one of the built-in rules:

[normalization]
# Load regex rules file and tell the processor it's a regex type
Regex rules.regex
# Built in rule
lowercase

Save the above as config.conf. These rules will be applied to both hypothesis and reference, in the order in which they are listed.

Now run benchmarkstt with the --conf argument. We also need to tell the tool to treat the XML as plain text, otherwise it will look for an xml processor and fail. We do this with the reference type argument --reference-type:

benchmarkstt --reference qt_subs.xml --reference-type plaintext --hypothesis qt_kaldi_hypothesis.txt --config config.conf --wer

Output:

wer
===

0.196279

And we do the same for the AWS transcript, this time using the short form for arguments:

benchmarkstt -r qt_subs.xml -rt plaintext -h qt_aws_hypothesis.txt --config config.conf --wer

Output:

wer
===

0.239889

You now have WER scores for each of the machine-generated transcripts, calculated against a subtitles reference file.

As a next step, you could add more normalization rules or implement your own metrics or normalizer classes and submit them back to this project.

Word Error Rate variants

In this tutorial we used the WER metric with the mode argument omitted, which defaults to the strict WER variant. This variant uses Python's built-in diff algorithm to align the transcripts, which is stricter and results in a slightly higher WER than the commonly used Levenshtein distance algorithm.

If you use BenchmarkSTT to compare different engines then this is not a problem since the relative ranking will not be affected. However, for better compatibility with other benchmarking tools, a WER variant that uses the Levenshtein edit distance algorithm is provided. To use it, specify --wer levenshtein.
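To illustrate what the Levenshtein variant measures, here is a minimal word-level sketch; it is not benchmarkstt's own implementation. The strict variant instead aligns the word sequences with Python's difflib, which can count a few more edits for the same pair of transcripts.

def levenshtein_wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # delete a reference word
                           dp[i][j - 1] + 1,         # insert a hypothesis word
                           dp[i - 1][j - 1] + cost)  # substitute (or match)
    return dp[-1][-1] / len(ref)

print(levenshtein_wer("the prime minister theresa may",
                      "the prime minister to reason may"))
# 0.4: one substitution and one insertion over 5 reference words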

Bag of Entities Error Rate (BEER)

In this part of the tutorial you will compute the Bag of Entities Error Rate (BEER) for a machine-generated transcript. It assumes you have followed the first part of the tutorial.

The Word Error Rate is the standard metric for benchmarking ASR models, but it can be a blunt tool. It treats all words as equally important but in reality some words, like proper nouns and phrases, are more significant than common words. When these are recognized correctly by a model, they should be given more weight in the assessment of the model.

Consider for example this sentence in the reference transcript: 'The European Union headquarters'. If engine A returns 'The European onion headquarters' and engine B returns 'The European Union headache', the Word Error Rate would be similar for both engines since in both cases one word was transcribed inaccurately. But engine B should be 'rewarded' for preserving the phrase 'European Union'. The BEER is the metric that takes such considerations into account.

Another use for this metric is compensating for distortions of WER that are caused by normalization rules. For example, you may convert both reference and hypothesis transcripts to lower case or remove punctuation marks so that they don't affect the WER. In this case, the distinction between 'Theresa May' and 'Theresa may' is lost. But you can instruct BenchmarkSTT to score higher the engine that produced 'Theresa May'.

The BEER is useful to evaluate:

  1. the suitability of transcript files as input to a tagging system,

  2. the performance of STT services on key entities in specific contexts, for instance highlights and player names for sports events,

  3. the performance on a list of entities automatically selected from the reference text by a TF/IDF approach, which is intended to reflect how important each word is.

An entity is a word or an ordered sequence of words, which may include capital letters and punctuation. To calculate the BEER, BenchmarkSTT needs a list of entities. It does not build this list for you: you are expected to create it outside of BenchmarkSTT, either manually or by using an NLP library to extract proper nouns from the reference.
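Here is one possible way to build such a list, sketched with the spaCy library; the choice of library, the entity types and the uniform weight of 0.5 are illustrative assumptions, not something benchmarkstt prescribes.

import json
import spacy

nlp = spacy.load("en_core_web_sm")

with open("qt_reference.txt") as f:
    doc = nlp(f.read())

# Keep unique people, organisations and places found in the reference,
# each with the same arbitrary weight.
entities = {ent.text: 0.5 for ent in doc.ents if ent.label_ in {"PERSON", "ORG", "GPE"}}

with open("entities.json", "w") as f:
    json.dump(entities, f, indent=2)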

BEER definition

The BEER is defined as the error rate per entity with a bag-of-words approach: the order of the entities in the documents does not affect the measure.

\[BEER \left ( entity \right ) = \frac{ \left | n_{hyp} - n_{ref} \right | }{n_{ref} }\]
\[n_{ref}=\textrm{number of occurrences of the entity in the reference document}\]
\[n_{hyp}=\textrm{number of occurrences of the entity in the hypothesis document}\]

The weighted average BEER of a set of entities e_1, e_2, ..., e_N measures the global performance over the N entities; a weight w_n is attributed to each entity.

\[WA\_BEER \left ( e_1, \ldots, e_N \right ) = w_1 \cdot BEER \left ( e_1 \right ) \frac{L_1}{L} + \ldots + w_N \cdot BEER \left ( e_N \right ) \frac{L_N}{L}\]
\[L_1=\textrm{number of occurrences of entity 1 in the reference document}\]
\[L=L_1 + ... + L_N\]

The weights are normalised by the tool so that

\[w_1 + ... + w_N=1\]
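Written out as code, the definitions above amount to something like this sketch (not benchmarkstt's implementation):

def beer(n_ref: int, n_hyp: int) -> float:
    # Error rate for a single entity.
    return abs(n_hyp - n_ref) / n_ref

def weighted_average_beer(entities):
    """entities: list of (weight, beer_value, n_ref) tuples, one per entity."""
    total_weight = sum(w for w, _, _ in entities)   # the weights are normalised to sum to 1
    total_ref = sum(n for _, _, n in entities)      # L = L_1 + ... + L_N
    return sum((w / total_weight) * b * (n / total_ref) for w, b, n in entities)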

Calculating BEER

BenchmarkSTT does not have a built-in list of entities. You must provide your own in a JSON input file defining the list of entities and the weight per entity.

The file has this structure:

{ "entity1" : weight1, "entity2" : weight2, "entity3" : weight2 .. }

Let's create an example list. Save the list below as a file named entities.json:

{"Theresa May" : 0.5, "Abigail" : 0.5, "EU": 0.75, "Griffin" : 0.5, "I" : 0.25}

We'll also tell BenchmarkSTT to normalize the reference and hypothesis files, but this time without lowercasing them, so that the capitalization of entities is preserved. We do this in the config.conf file:

[normalization]
# Load regex rules file and tell the processor it's a regex type
Regex rules.regex

Now compute the BEER in one line, using the same files from the previous section of this tutorial. For each entity, the tool reports the BEER and the number of occurrences in the reference file, followed by the weighted average BEER:

benchmarkstt --reference qt_subs.xml --reference-type plaintext --hypothesis qt_aws_hypothesis.txt --config config.conf --beer entities.json
beer
====
Theresa May: {'beer': 0.5, 'occurrence_ref': 2}
Abigail: {'beer': 0.333, 'occurrence_ref': 3}
EU: {'beer': 0.783, 'occurrence_ref': 23}
Griffin: {'beer': 0.0, 'occurrence_ref': 2}
I: {'beer': 0.073, 'occurrence_ref': 301}
w_av_beer: {'beer': 0.024, 'occurrence_ref': 331}
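As a check, the reported w_av_beer can be reproduced from the per-entity values above and the weights in entities.json, using the weighted-average formula from the previous section (the rounded per-entity values make the result approximate):

# (weight from entities.json, per-entity BEER, occurrences in the reference)
entities = [(0.5, 0.5, 2),        # Theresa May
            (0.5, 0.333, 3),      # Abigail
            (0.75, 0.783, 23),    # EU
            (0.5, 0.0, 2),        # Griffin
            (0.25, 0.073, 301)]   # I

total_weight = sum(w for w, _, _ in entities)   # 2.5
total_ref = sum(n for _, _, n in entities)      # 331

print(sum((w / total_weight) * b * (n / total_ref) for w, b, n in entities))  # ~0.024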

To automate the task, you can generate a JSON result file by adding the -o json option and redirecting the output to a file:

benchmarkstt --reference qt_subs.xml --reference-type plaintext --hypothesis qt_aws_hypothesis.txt --config config.conf --beer entities.json -o json >> beer_aws.json