BenchmarkSTT


About

This is a command line tool for benchmarking Automatic Speech Recognition engines.

It is designed for non-academic production environments, and prioritises ease of use and relative benchmarking over scientific procedure and high-accuracy absolute scoring.

Because of the wide range of languages, algorithms and audio characteristics, no single STT engine can be expected to excel in all circumstances. For this reason, this tool places responsibility on the users to design their own benchmarking procedure and to decide, based on the combination of test data and metrics, which engine is best suited for their particular use case.

Usage examples

Returns the number of word insertions, deletions, replacements and matches for the hypothesis transcript compared to the reference:

benchmarkstt --reference reference.txt --hypothesis hypothesis.txt --diffcounts

Returns the Word Error Rate after lowercasing both reference and hypothesis. This normalization improves the accuracy of the Word Error Rate as it removes diffs that might otherwise be considered errors:

benchmarkstt -r reference.txt -h hypothesis.txt --wer --lowercase

Returns a visual diff after applying all the normalization rules specified in the config file:

benchmarkstt -r reference.txt -h hypothesis.txt --worddiffs --config conf

Further information

This is a collaborative project to create a library for benchmarking AI/ML applications. It was created in response to the needs of broadcasters and providers of Access Services to media organisations, but anyone is welcome to contribute. The group behind this project is the EBU's Media Information Management & AI group.

Currently the group is focussing on Speech-to-Text, but it will consider creating benchmarking tools for other AI/ML services.

For general information about this project, including the motivations and guiding principles, please see the project wiki.

To install and start using the tool, go to the documentation.

License

Copyright 2019 EBU

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Installation

BenchmarkSTT requires Python version 3.5 or above. If you wish to make use of the API, Python version 3.6 or above is required.

From PyPI (preferred)

This is the easiest and preferred way of installing benchmarkstt.

  1. Install Python 3.5 or above (latest stable version for your OS is preferred):

    Use the guides available at The Hitchhiker’s Guide to Python

Warning

Some dependent packages require python-dev to be installed. On Debian-based systems this can be done using e.g. apt-get install python3.7-dev for Python 3.7; Red Hat-based systems would use e.g. yum install python3.7-devel.

  2. Install the package using pip; this will also install all requirements:

    python3 -m pip install benchmarkstt
    
  3. Test and use

    BenchmarkSTT should now be installed and usable.

    $> benchmarkstt --version
    benchmarkstt: 1.1
    $> echo IT WORKS! | benchmarkstt-tools normalization --lowercase
    it works!

    Use the --help option to get all available options:

    benchmarkstt --help
    benchmarkstt-tools normalization --help
    

    See Usage for more information on how to use the tool.

From the repository

For building the documentation locally and working with a development copy, see Development.

Removing benchmarkstt

BenchmarkSTT can be easily uninstalled using:

python3 -m pip uninstall benchmarkstt

Docker

See instructions for setting up and running as a docker image at:

Using docker

Warning

This assumes docker is already installed on your system.

Build the image
  1. Download the code from GitHub at https://github.com/ebu/benchmarkstt/archive/master.zip

  2. Unzip the file

  3. Inside the benchmarkstt folder run:

    docker build -t benchmarkstt:latest .
    
Run the image

You can change the port for the API: just change 1234 to the port you want to bind to:

docker run --name benchmarkstt -p 1234:8080 --rm benchmarkstt:latest

The JSON-RPC API is then automatically available at: http://localhost:1234/api

While the Docker container is running, you can use the CLI application like this (see Usage for more information about which commands are available):

docker exec -it benchmarkstt benchmarkstt --version
docker exec -it benchmarkstt benchmarkstt --help
docker exec -it benchmarkstt benchmarkstt-tools --help
Stopping the container

You can stop the running container with:

docker stop benchmarkstt

Tutorial

Word Error Rate and normalization

In this step-by-step tutorial you will compare the Word Error Rate (WER) of two machine-generated transcripts. The WER is calculated against a less-than-perfect reference made from a human-generated subtitle file. You will also use normalization rules to improve the accuracy of the results.

To follow this tutorial you will need a working installation of benchmarkstt and these source files saved to your working folder:

  1. Subtitle file

  2. Transcript generated by AWS

  3. Transcript generated by Kaldi

This demo shows the capabilities of Release 1 of the library, which benchmarks the accuracy of word recognition only. The library supports adding new metrics in future releases. Contributions are welcome.

Creating the plain text reference file

Creating accurate verbatim transcripts for use as reference is time-consuming and expensive. As a quick and easy alternative, we will make a "reference" from a subtitles file. Subtitles are slightly edited and they include additional text like descriptions of sounds and actions, so they are not a verbatim transcription of the speech. Consequently, they are not suitable for calculating absolute WER. However, we are interested in calculating relative WER for illustration purposes only, so this use of subtitles is deemed acceptable.

Warning

Evaluations in this tutorial are not done for the purpose of assessing tools. The use of subtitles as reference will skew the results so they should not be taken as an indication of overall performance or as an endorsement of a particular vendor or engine.

We will use the subtitles file for the BBC's Question Time Brexit debate. This program was chosen for its length (90 minutes) and because live debates are particularly challenging to transcribe.

The subtitles file includes a lot of extra text in XML tags. This text shouldn't be used in the calculation: for both reference and hypotheses, we want to run the tool on plain text only. To strip out the XML tags, we will use the benchmarkstt-tools command, with the normalization subcommand:

benchmarkstt-tools normalization --inputfile qt_subs.xml --outputfile qt_reference.txt --regex "</?[?!\[\]a-zA-Z][^>]*>" " "

The normalization rule --regex takes two parameters: a regular expression pattern and the replacement string.

In this case all XML tags will be replaced with a space. This results in a lot of extra space characters, but they are ignored by the diff algorithm later, so we don't need to clean them up. --inputfile and --outputfile specify the input and output files.
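For illustration, this is what that regex does to a single (made-up) subtitle line. The snippet below is a plain-Python sketch using the same pattern; it is not part of benchmarkstt:

import re

# The same pattern passed to --regex above: replace XML tags with a space.
line = '<p begin="00:00:01.000" end="00:00:03.000">Tonight, the Prime Minister</p>'
print(re.sub(r"</?[?!\[\]a-zA-Z][^>]*>", " ", line))
# -> " Tonight, the Prime Minister "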

The file qt_reference.txt has been created. You can see that the XML tags are gone, but the file still contains non-dialogue text like 'APPLAUSE'.

For better results you can manually clean up the text, or run the command again with a different normalization rule (not included in this demo). But we will stop the normalization at this point.

We now have a simple text file that will be used as the reference. The next step is to get the machine-generated transcripts for benchmarking.

Creating the plain text hypotheses files

The first release of benchmarkstt does not integrate directly with STT vendors or engines, so transcripts for benchmarking have to be retrieved separately and converted to plain text.

For this demo, two machine transcripts were retrieved for the Question Time audio: from AWS Transcribe and from the BBC's version of Kaldi, an open-source STT framework.

Both AWS and BBC-Kaldi return the transcript in JSON format, with word-level timings. They also contain a field with the entire transcript as a single string, and this is the value we will use (we don't benchmark timings in this version).

To make the hypothesis file for AWS, we will use the transcript JSON field from the transcript generated by AWS, and save it as a new document qt_aws_hypothesis.txt.

We can automate this again using benchmarkstt-tools normalization and a somewhat more complex regex parameter:

benchmarkstt-tools normalization --inputfile qt_aws.json --outputfile qt_aws_hypothesis.txt --regex '^.*"transcript":"([^"]+)".*' '\1'

To make the BBC-Kaldi transcript file we will use the text JSON field from the transcript generated by Kaldi, and save it as a new document qt_kaldi_hypothesis.txt.

Again, benchmarkstt-tools normalization with a --regex argument will be used for this:

benchmarkstt-tools normalization --inputfile qt_kaldi.json --outputfile qt_kaldi_hypothesis.txt --regex '^.*"text":"([^"]+)".*' '\1'
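If you prefer to work with the JSON structure directly rather than a regular expression, the same extraction could be done with a short Python script. This is an illustrative alternative, not part of benchmarkstt; it simply searches the parsed JSON for the field name described above ("transcript" for AWS, "text" for BBC-Kaldi):

import json

def find_text_field(node, key):
    """Return the first string value stored under `key`, searching nested JSON."""
    if isinstance(node, dict):
        if isinstance(node.get(key), str):
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_text_field(child, key)
        if found is not None:
            return found
    return None

with open("qt_aws.json") as fp:
    transcript = find_text_field(json.load(fp), "transcript")
with open("qt_aws_hypothesis.txt", "w") as fp:
    fp.write(transcript or "")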

You'll end up with two files similar to these:

  1. Text extracted from AWS transcript

  2. Text extracted from Kaldi transcript

Benchmark!

We can now compare each of the hypothesis files to the reference in order to calculate the Word Error Rate. We process one file at a time, now using the main benchmarkstt command, with two flags: --wer is the metric we are most interested in, while --diffcounts outputs the number of insertions, deletions, substitutions and correct words (the basis for WER calculation).

Calculate WER for AWS Transcribe:

benchmarkstt --reference qt_reference.txt --hypothesis qt_aws_hypothesis.txt --wer --diffcounts

The output should look like this:

wer
===

0.336614

diffcounts
==========

equal: 10919
replace: 2750
insert: 675
delete: 1773
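As a sanity check, the reported WER can be reproduced from these diff counts: the numerator is the number of errors (replacements, insertions and deletions) and the denominator is the number of reference words, which here works out to equal + replace + delete. A plain-Python illustration:

# Reproduce the reported WER from the diff counts above.
equal, replace, insert, delete = 10919, 2750, 675, 1773
wer = (replace + insert + delete) / (equal + replace + delete)
print(round(wer, 6))  # 0.336614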

Now calculate the WER and "diff counts" for BBC-Kaldi:

benchmarkstt --reference qt_reference.txt --hypothesis qt_kaldi_hypothesis.txt --wer --diffcounts

The output should look like this:

wer
===

0.379744

diffcounts
==========

equal: 10437
replace: 4006
insert: 859
delete: 999

After running these two commands, you can see that the WER for both transcripts is quite high (around 35%). Let's see the actual differences between the reference and the hypotheses by using the --worddiffs flag:

benchmarkstt --reference qt_reference.txt --hypothesis qt_kaldi_hypothesis.txt --worddiffs

The output should look like this (example output is truncated):

worddiffs
=========

Color key: Unchanged ​Reference​ ​Hypothesis​

​​·​BBC​·​2017​·​Tonight,​​​·​tonight​​·​the​​·​Prime​·​Minister,​·​Theresa​·​May,​​​·​prime​·​minister​·​theresa​·​may​​·​the​·​leader​·​of​·​the​​·​Conservative​·​Party,​​​·​conservative​·​party​​·​and​·​the​·​leader​·​of​​·​Labour​·​Party,​·​Jeremy​·​Corbyn,​​​·​the​·​labour​·​party​·​jeremy​·​corbyn​​·​face​·​the​​·​voters.​·​Welcome​·​to​·​Question​·​Time.​·​So,​​​·​voters​·​welcome​·​so​​·​over​·​the​·​next​​·​90​·​minutes,​​​·​ninety​·​minutes​​·​the​·​leaders​·​of​·​the​·​two​·​larger​·​parties​·​are​·​going​·​to​·​be​·​quizzed​·​by​·​our​·​audience​·​here​·​in​​·​York.​·​Now,​​​·​york​·​now​​·​this​·​audience​·​is​·​made​·​up​·​like​·​this​​·​-​​·​just​​·​a​·​third​​·​say​·​they​·​intend​·​to​·​vote​​·​Conservative​·​next​·​week.​·​The​​​·​conserve​·​it​·​the​​·​same​​·​number​​​·​numbers​​·​say​·​they're​·​going​·​to​·​vote​​·​Labour,​​​·​labour​​·​and​·​the​·​rest​·​either​·​support​·​other​​·​parties,​​​·​parties​​·​or​·​have​·​yet​·​to​·​make​·​up​·​their​​·​minds.​·​As​·​ever,​​​·​minds​·​and​·​as​·​ever​​·​you​·​can​·​comment​·​on​​·​all​·​of​·​this​·​from​·​home​​·​either​·​on​​·​Twitter​·​-​​​·​twitter​​·​our​·​hashtag​·​is​​·​#BBCQT​·​-​·​we're​​​·​bbc​·​two​·​were​​·​also​·​on​​·​Facebook,​​​·​facebook​​·​as​​·​usual,​​​·​usual​​·​and​·​our​·​text​·​number​·​is​​·​83981.​·​Push​​​·​a​·​three​·​nine​·​eight​·​one​·​push​​·​the​·​red​·​button​·​on​·​your​·​remote​·​to​·​see​·​what​·​others​·​are​​·​saying.​·​The​​​·​saying​·​and​·​their​​·​leaders​​·​-​​·​this​·​is​·​important​​·​-​​·​don't​·​know​·​the​·​questions​·​that​·​are​·​going​·​to​·​be​·​put​·​to​·​them​​·​tonight.​·​So,​​​·​tonight​·​so​​·​first​·​to​·​face​·​our​​·​audience,​​​·​audience​​·​please​·​welcome​·​the​·​leader​·​of​·​the​​·​Conservative​·​Party,​​​·​conservative​·​party​​·​the
...
Normalize

You can see that a lot of the differences are due to capitalization and punctuation. Because we are only interested in the correct identification of words, these types of differences should not count as errors. To get a more accurate WER, we will remove punctuation marks and convert all letters to lowercase. We will do this for the reference and both hypothesis files by using the benchmarkstt-tools normalization subcommand again, with two rules: the built-in --lowercase rule and the --regex rule:

benchmarkstt-tools normalization -i qt_reference.txt -o qt_reference_normalized.txt --lowercase --regex "[,.-]" " "

benchmarkstt-tools normalization -i qt_kaldi_hypothesis.txt -o qt_kaldi_hypothesis_normalized.txt --lowercase --regex "[,.-]" " "

benchmarkstt-tools normalization -i qt_aws_hypothesis.txt -o qt_aws_hypothesis_normalized.txt --lowercase --regex "[,.-]" " "
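Together, these two rules have the following effect on a sample phrase. The snippet below is a plain-Python illustration of the same transformations, not benchmarkstt itself:

import re

# Lowercase, then replace the punctuation characters , . - with a space.
text = "Tonight, the Prime Minister, Theresa May, faces the voters."
print(re.sub(r"[,.-]", " ", text.lower()))
# -> "tonight  the prime minister  theresa may  faces the voters "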

We now have normalized versions of the reference and two hypothesis files.

Benchmark again

Let's run the benchmarkstt command again, this time calculating WER based on the normalized files:

benchmarkstt --reference qt_reference_normalized.txt --hypothesis qt_kaldi_hypothesis_normalized.txt --wer --diffcounts --worddiffs

The output should look like this (example output is truncated):

wer
===

0.196279

diffcounts
==========

equal: 13229
replace: 1284
insert: 789
delete: 965

worddiffs
=========

Color key: Unchanged Reference Hypothesis

​​·​bbc​·​2017​​·​tonight​·​the​·​prime​·​minister​·​theresa​·​may​·​the​·​leader​·​of​·​the​·​conservative​·​party​·​and​·​the​·​leader​·​of​​·​the​​·​labour​·​party​·​jeremy​·​corbyn​·​face​·​the​·​voters​·​welcome​​·​to​·​question​·​time​​·​so​·​over​·​the​·​next​​·​90​​​·​ninety​​·​minutes​·​the​·​leaders​·​of​·​the​·​two​·​larger​·​parties​·​are​·​going​·​to​·​be​·​quizzed​·​by​·​our​·​audience​·​here​·​in​·​york​·​now​·​this​·​audience​·​is​·​made​·​up​·​like​·​this​·​just​​·​a​·​third​​·​say​·​they​·​intend​·​to​·​vote​​·​conservative​·​next​·​week​​​·​conserve​·​it​​·​the​·​same​​·​number​​​·​numbers​​·​say​·​they're​·​going​·​to​·​vote​·​labour​·​and​·​the​·​rest​·​either​·​support​·​other​·​parties​·​or​·​have​·​yet​·​to​·​make​·​up​·​their​·​minds​​·​and​​·​as​·​ever​·​you​·​can​·​comment​·​on​​·​all​·​of​·​this​·​from​·​home​​·​either​·​on​·​twitter​·​our​·​hashtag​·​is​​·​#bbcqt​·​we're​​​·​bbc​·​two​·​were​​·​also​·​on​·​facebook​·​as​·​usual​·​and​·​our​·​text​·​number​·​is​​·​83981​​​·​a​·​three​·​nine​·​eight​·​one​​·​push​·​the​·​red​·​button​·​on​·​your​·​remote​·​to​·​see​·​what​·​others​·​are​·​saying​​·​the​​​·​and​·​their​​·​leaders​·​this​·​is​·​important​·​don't​·​know​·​the​·​questions​·​that​·​are​·​going​·​to​·​be​·​put​·​to​·​them​·​tonight​·​so​·​first​·​to​·​face​·​our​·​audience​·​please​·​welcome​·​the​·​leader​·​of​·​the​·​conservative​·​party
...

You can see that this time there are fewer differences between the reference and hypothesis. Accordingly, the WER is much lower for both hypotheses. The transcript with the lower WER is closer to the reference made from subtitles.

Do it all in one step!

Above, we used two commands: benchmarkstt-tools for the normalization and benchmarkstt for calculating the WER. But we can combine all these steps into a single command using a rules file and a config file that references it.

First, let's create a file for the regex normalization rules. Create a text document with this content:

# Replace XML tags with a space
"</?[?!\[\]a-zA-Z][^>]*>"," "
# Replace punctuation with a space
"[,.-]"," "

Save this file as rules.regex.

Now let's create a config file that contains all the normalization rules. They must be listed under the [normalization] section (in this release, there is only one implemented section). The section references the regex rules file we created above, and also includes one of the built-in rules:

[normalization]
# Load regex rules file and tell the processor it's a regex type
Regex rules.regex
# Built in rule
lowercase

Save the above as config.conf. These rules will be applied to both hypothesis and reference, in the order in which they are listed.

Now run benchmarkstt with the --config argument. We also need to tell the tool to treat the XML as plain text, otherwise it will look for an XML processor and fail. We do this with the reference type argument --reference-type:

benchmarkstt --reference qt_subs.xml --reference-type plaintext --hypothesis qt_kaldi_hypothesis.txt --config config.conf --wer

Output:

wer
===

0.196279

And we do the same for the AWS transcript, this time using the short form for arguments:

benchmarkstt -r qt_subs.xml -rt plaintext -h qt_aws_hypothesis.txt --config config.conf --wer

Output:

wer
===

0.239889

You now have WER scores for each of the machine-generated transcripts, calculated against a subtitles reference file.

As a next step, you could add more normalization rules or implement your own metrics or normalizer classes and submit them back to this project.

Word Error Rate variants

In this tutorial we used the WER parameter with the mode argument omitted, defaulting to the 'strict' WER variant. This variant uses Python's built-in diff algorithm in the calculation of the WER, which is stricter and results in a slightly higher WER than the commonly used Levenshtein distance algorithm (see more detail here).

If you use BenchmarkSTT to compare different engines then this is not a problem since the relative ranking will not be affected. However, for better compatibility with other benchmarking tools, a WER variant that uses the Levenshtein edit distance algorithm is provided. To use it, specify --wer levenshtein.
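For intuition, the sketch below shows what the Levenshtein variant computes: the minimum word-level edit distance divided by the number of reference words. This is a plain-Python illustration, not benchmarkstt's implementation:

def word_levenshtein(ref, hyp):
    """Minimum number of word insertions, deletions and substitutions."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (rw != hw)))    # substitution
        prev = cur
    return prev[-1]

ref = "the prime minister theresa may"
hyp = "the prime minister to reece amay"
print(word_levenshtein(ref, hyp) / len(ref.split()))  # 0.6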

Bag of Entities Error Rate (BEER)

In this tutorial you compute the Bag of Entities Error Rate (BEER) on a machine-generated transcript. It assumes knowledge of the first part of this tutorial.

The Word Error Rate is the standard metric for benchmarking ASR models, but it can be a blunt tool. It treats all words as equally important but in reality some words, like proper nouns and phrases, are more significant than common words. When these are recognized correctly by a model, they should be given more weight in the assessment of the model.

Consider for example this sentence in the reference transcript: 'The European Union headquarters'. If engine A returns 'The European onion headquarters' and engine B returns 'The European Union headache', the Word Error Rate would be similar for both engines since in both cases one word was transcribed inaccurately. But engine B should be 'rewarded' for preserving the phrase 'European Union'. The BEER is the metric that takes such considerations into account.

Another use for this metric is compensating for distortions of WER that are caused by normalization rules. For example, you may convert both reference and hypothesis transcripts to lower case or remove punctuation marks so that they don't affect the WER. In this case, the distinction between 'Theresa May' and 'Theresa may' is lost. But you can instruct BenchmarkSTT to score higher the engine that produced 'Theresa May'.

The BEER is useful to evaluate:

  1. the suitability of transcript files as input to a tagging system,

  2. the performance of STT services on key entities depending on the context, for instance highlights and player names for sports events,

  3. the performance on a list of entities automatically selected from the reference text by a TF/IDF approach, which is intended to reflect how important each word is.

An entity is a word or an ordered list of words, including capital letters and punctuation. To calculate BEER, BenchmarkSTT needs a list of entities. It does not make this list for you: the user is expected to create the list outside of BenchmarkSTT, manually or by using an NLP library to extract proper nouns from the reference.
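As one very naive illustration (using only the Python standard library; a real workflow would more likely use an NLP library to extract proper nouns), a draft entity list could be generated from capitalized word sequences in the reference and written in the JSON format described below:

import json
import re

with open("qt_reference.txt") as fp:
    reference = fp.read()

# Capitalized word sequences only; crude, and will also pick up sentence starts.
candidates = re.findall(r"(?:[A-Z][a-z]+ )*[A-Z][a-z]+", reference)
entities = {entity: 1.0 for entity in sorted(set(candidates))}  # equal weights

with open("entities.json", "w") as fp:
    json.dump(entities, fp, indent=2)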

BEER definition

The BEER is defined as the error rate per entity with a bag of words approach. In this approach the order of the entities in the documents does not affect the measure.

\[{BEER} \left ( entity \right ) = \frac{ \left | n_{hyp} - n_{ref} \right | }{n_{ref} }\]
\[ \begin{align}\begin{aligned}n_{ref}=\textrm{number of occurrences of entity in the reference document}\\n_{hyp}=\textrm{number of occurrences of entity in the hypothesis document}\end{aligned}\end{align} \]

The weighted average BEER of a set of entities e_1, e_2, ..., e_N measures the global performance over the N entities; a weight w_n is attributed to each entity.

\[\begin{aligned} WA\_BEER (e_1, ... e_N) = w_1*BEER (e_1)\frac{L_1}{L} +... + w_N*BEER (e_N)\frac{L_N}{L} \end{aligned}\]
\[L_1=\textrm{number of occurrences of entity 1 in the reference document}\]
\[L=L_1 + ... + L_N\]

The weights are normalised by the tool:

\[w_1 + ... + w_N=1\]
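A minimal plain-Python sketch of these formulas, counting simple substring occurrences (an illustration only, not benchmarkstt's implementation):

def beer(entity, ref, hyp):
    n_ref, n_hyp = ref.count(entity), hyp.count(entity)
    return abs(n_hyp - n_ref) / n_ref

def weighted_average_beer(entities, ref, hyp):
    # entities maps each entity to its non-normalized weight W_n
    total_w = sum(entities.values())
    total_l = sum(ref.count(e) for e in entities)
    return sum((w / total_w) * beer(e, ref, hyp) * ref.count(e) / total_l
               for e, w in entities.items())

ref = "The European Union headquarters of the European Union"
hyp = "The European onion headquarters of the European Union"
print(beer("European Union", ref, hyp))                          # 0.5
print(weighted_average_beer({"European Union": 1.0}, ref, hyp))  # 0.5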
Calculating BEER

BenchmarkSTT does not have a built-in list of entities. You must provide your own in a JSON input file defining the list of entities and the weight per entity.

The file has this structure:

{ "entity1" : weight1, "entity2" : weight2, "entity3" : weight2 .. }

Let's create an example list. Save the below list as file entities.json:

{"Theresa May" : 0.5, "Abigail" : 0.5, "EU": 0.75, "Griffin" : 0.5, "I" : 0.25}

We'll also tell BenchmarkSTT to normalize the reference and hypothesis files, but without lowercasing them. We do this in the config.conf file:

[normalization]
# Load regex rules file and tell the processor it's a regex type
Regex rules.regex

Now compute the BEER in one line, using the same files from the previous section of this tutorial. For each entity the tool reports the BEER and the number of occurrences in the reference file, along with the weighted average BEER:

benchmarkstt --reference qt_subs.xml --reference-type plaintext --hypothesis qt_aws_hypothesis.txt --config config.conf --beer  entities.json
beer
====
Theresa May: {'beer': 0.5, 'occurrence_ref': 2}
Abigail: {'beer': 0.333, 'occurrence_ref': 3}
EU: {'beer': 0.783, 'occurrence_ref': 23}
Griffin: {'beer': 0.0, 'occurrence_ref': 2}
I: {'beer': 0.073, 'occurrence_ref': 301}
w_av_beer: {'beer': 0.024, 'occurrence_ref': 331}

To automate the task, you can generate a JSON result file by adding the -o option:

benchmarkstt --reference qt_subs.xml --reference-type plaintext --hypothesis qt_aws_hypothesis.txt --config config.conf --beer  entities.json -o json >> beer_aws.json
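The result file can then be consumed from a script, for example (assuming a single run has written to the file, so that it contains one JSON document):

import json

# Load the JSON results written by the command above.
with open("beer_aws.json") as fp:
    results = json.load(fp)
print(json.dumps(results, indent=2))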

Usage

The tool is accessible as:

Command line tool

usage: benchmarkstt -r REFERENCE -h HYPOTHESIS
                    [-rt {infer,argument,plaintext}]
                    [-ht {infer,argument,plaintext}]
                    [-o {json,markdown,restructuredtext}]
                    [--beer [entities_file]] [--cer [mode] [differ_class]]
                    [--diffcounts [mode] [differ_class]] [--wer [mode]
                    [differ_class]] [--worddiffs [dialect] [differ_class]]
                    [--config file [section] [encoding]]
                    [--file normalizer file [encoding] [path]] [--lowercase]
                    [--regex search replace] [--replace search replace]
                    [--replacewords search replace] [--unidecode] [--log]
                    [--version]
                    [--log-level {critical,fatal,error,warn,warning,info,debug,notset}]
                    [--load MODULE_NAME [MODULE_NAME ...]] [--help]
named arguments
-r, --reference

File to use as reference

-h, --hypothesis

File to use as hypothesis

-o, --output-format

Possible choices: json, markdown, restructuredtext

Format of the outputted results

Default: "restructuredtext"

--log

show normalization logs (warning: for large files with many normalization rules this will cause a significant performance penalty and a lot of output data)

Default: False

--version

Output benchmarkstt version number

Default: False

--log-level

Possible choices: critical, fatal, error, warn, warning, info, debug, notset

Set the logging output level

Default: warning

--load

Load external code that may contain additional classes for normalization, etc. E.g. if the classes are contained in a python file named myclasses.py in the directory where you are calling benchmarkstt from, you would pass --load myclasses. All classes that are recognized will be automatically documented in the --help command and available for use.

reference and hypothesis types

You can specify which file type the --reference/-r and --hypothesis/-h arguments should be treated as.

Available types:

  • 'infer': Load from a given filename. Automatically infer file type from the filename extension.

  • 'argument': Read the argument and treat as plain text (without reading from file).

  • 'plaintext': Load from a given filename. Treat file as plain text.

-rt, --reference-type

Possible choices: infer, argument, plaintext

Type of reference file

Default: "infer"

-ht, --hypothesis-type

Possible choices: infer, argument, plaintext

Type of hypothesis file

Default: "infer"

available metrics

A list of metrics to calculate. At least one metric needs to be provided.

--beer

Bag of Entities Error Rate, BEER, is defined as the error rate per entity with a bag of words approach:

                    abs(ne_hyp - ne_ref)
BEER (entity)   =   ----------------------
                        ne_ref
  • ne_hyp = number of detections of the entity in the hypothesis file

  • ne_ref = number of detections of the entity in the reference file

The WA_BEER for a set of N entities is defined as the weighted average of the BEER for the set of entities:

WA_BEER (entity_1, ... entity_N) =  w_1*BEER (entity_1)*L_1/L + ... + w_N*BEER (entity_N)*L_N/L

which is equivalent to:

                                    w_1*abs(ne_hyp_1 - ne_ref_1) + ... + w_N*abs(ne_hyp_N - ne_ref_N)
WA_BEER (entity_1, ... entity_N) =  ------------------------------------------------------------------
                                                                    L
  • L_1 = number of occurrences of entity 1 in the reference document

  • L = L_1 + ... + L_N

the weights being normalised by the tool:

  • w_1 + ... + w_N = 1

The input file defines the list of entities and the weight per entity, w_n. It is processed as a json file with the following structure:

{ "entity_1":W_1, "entity_2" : W_2, "entity_3" :W_3 .. }

W_n being the non-normalized weight, the normalization of the weights is performed by the tool as:

            W_n
w_n =   ---------------
        W_1 + ... +W_N

The minimum value for weight being 0.

--cer

Character Error Rate, basically defined as:

insertions + deletions + substitutions
--------------------------------------
    number of reference characters

Character error rate, CER, compares the differences between reference and hypothesis on a character level. A CER measure is usually lower than a WER measure, since words might differ in only one or a few characters, and yet be classified as fully different.

The CER metric might be useful as a perspective on the WER metric. Word endings might be less relevant if the text will be preprocessed with stemming, or minor spelling mistakes might be acceptable in certain situations. A CER metric might also be used to evaluate a source (an ASR) which outputs a stream of characters rather than words.

Important: The current implementation of the CER metric ignores whitespace characters. A string like 'aa bb cc' will first be split into words, ['aa','bb','cc'], and then merged into a final string for evaluation: 'aabbcc'.

param mode

'levenshtein' (default).

param differ_class

For future use.

--diffcounts

Get the number of differences between reference and hypothesis

--wer

Word Error Rate, basically defined as:

insertions + deletions + substitutions
--------------------------------------
      number of reference words

See: https://en.wikipedia.org/wiki/Word_error_rate

Calculates the WER using one of two algorithms:

[Mode: 'strict' or 'hunt'] Insertions, deletions and substitutions are identified using the Hunt–McIlroy diff algorithm. The 'hunt' mode applies 0.5 weight to insertions and deletions. This algorithm is the one used internally by Python.

See https://docs.python.org/3/library/difflib.html

[Mode: 'levenshtein'] In the context of WER, Levenshtein distance is the minimum edit distance computed at the word level. This implementation uses the Editdistance c++ implementation by Hiroyuki Tanaka: https://github.com/aflc/editdistance. See: https://en.wikipedia.org/wiki/Levenshtein_distance

param mode

'strict' (default), 'hunt' or 'levenshtein'.

param differ_class

For future use.

--worddiffs

Present differences on a per-word basis

param dialect

Presentation format. Default is 'ansi'.

example dialect

'html'

param differ_class

For future use.

available normalizers

A list of normalizers to execute on the input; one or more normalizers can be given, and they are applied sequentially. The program will automatically find the normalizer in benchmarkstt.normalization.core, then benchmarkstt.normalization and finally in the global namespace.

--config

Use config file notation to define normalization rules. This notation is a list of normalizers, one per line.

Each normalizer that needs a file is followed by the file name of a CSV file, and can optionally be followed by the file encoding (if different from the default). All options are loaded from this CSV and applied to the normalizer.

The normalizers can be any of the core normalizers, or you can refer to your own normalizer class (like you would use in a python import, eg. my.own.package.MyNormalizerClass).

Additional rules:
  • Normalizer names are case-insensitive.

  • Arguments MAY be wrapped in double quotes.

  • If an argument contains a space, newline or double quote, it MUST be wrapped in double quotes.

  • A double quote itself is represented in this quoted argument as two double quotes: "".

The normalization rules are applied top-to-bottom and follow this format:

[normalization]
# This is a comment

# (Normalizer2 has no arguments)
lowercase

# loads regex expressions from regexrules.csv in "utf 8" encoding
regex regexrules.csv "utf 8"

# load another config file, [section1] and [section2]
config configfile.ini section1
config configfile.ini section2

# loads replace expressions from replaces.csv in default encoding
replace     replaces.csv
param file

The config file

param encoding

The file encoding

param section

The subsection of the config file to use, defaults to 'normalization'

example text

"He bravely turned his tail and fled"

example file

"./resources/test/normalizers/configfile.conf"

example encoding

"UTF-8"

example return

"ha bravalY Turnad his tail and flad"

--file

Read one per line and pass it to the given normalizer

param str|class normalizer

Normalizer name (or class)

param file

The file to read rules from

param encoding

The file encoding

example text

"This is an Ex-Parakeet"

example normalizer

"regex"

example file

"./resources/test/normalizers/regex/en_US"

example encoding

"UTF-8"

example return

"This is an Ex Parrot"

--lowercase

Lowercase the text

example text

"Easy, Mungo, easy... Mungo..."

example return

"easy, mungo, easy... mungo..."

--regex

Simple regex replace. By default the pattern is interpreted as case-sensitive.

Case-insensitivity is supported by adding inline modifiers.

You might want to use capturing groups to preserve the case. When replacing a character not captured, the information about its case is lost...

Eg. would replace "HAHA! Hahaha!" to "HeHe! Hehehe!":

search

replace

(?i)(h)a

\1e

No regex flags are set by default; you can set them yourself in the regex and combine them at will, e.g. multiline, dotall and ignorecase.

Eg. would replace "New<CRLF>line" to "newline":

search

replace

(?msi)new.line

newline

example text

"HAHA! Hahaha!"

example search

'(?i)(h)a'

example replace

'\1e'

example return

"HeHe! Hehehe!"

--replace

Simple search replace

param search

Text to search for

param replace

Text to replace with

example text

"Nudge nudge!"

example search

"nudge"

example replace

"wink"

example return

"Nudge wink!"

--replacewords

Simple search and replace that only replaces whole "words"; the first letter is also checked case-insensitively, with preservation of case.

param search

Word to search for

param replace

Replace with

example text

"She has a heart of formica"

example search

"a"

example replace

"the"

example return

"She has the heart of formica"

--unidecode

Unidecode characters to ASCII form, see Python's Unidecode package for more info.

example text

"𝖂𝖊𝖓𝖓 𝖎𝖘𝖙 𝖉𝖆𝖘 𝕹𝖚𝖓𝖘𝖙ü𝖈𝖐 𝖌𝖎𝖙 𝖚𝖓𝖉 𝕾𝖑𝖔𝖙𝖊𝖗𝖒𝖊𝖞𝖊𝖗?"

example return

"Wenn ist das Nunstuck git und Slotermeyer?"

Implementation

The benchmarkstt command line tool links the different modules (input, normalization, metrics, etc.) in the following way:

CLI flow
Additional tools

Some additional helpful tools are available through benchmarkstt-tools, which provides these subcommands:

Subcommand api

See API for more information on usage and available jsonrpc methods.

Make benchmarkstt available through a rudimentary JSON-RPC interface

Attention

Only supported for Python versions 3.6 and above

usage: benchmarkstt-tools api [--debug] [--host HOST] [--port PORT]
                              [--entrypoint ENTRYPOINT] [--list-methods]
                              [--with-explorer]
                              [--log-level {critical,fatal,error,warn,warning,info,debug,notset}]
                              [--load MODULE_NAME [MODULE_NAME ...]] [--help]
Named Arguments
--debug

Run in debug mode

Default: False

--host

Hostname or ip to serve api

--port

Port used by the server

Default: 8080

--entrypoint

The jsonrpc api address

Default: "/api"

--list-methods

List the available jsonrpc methods

Default: False

--with-explorer

Also create the explorer to test api calls with; this is a rudimentary feature currently only meant for testing and debugging. Warning: the API explorer is provided as-is, without any tests or code reviews. This is marked as a low-priority feature.

Default: False

--log-level

Possible choices: critical, fatal, error, warn, warning, info, debug, notset

Set the logging output level

Default: warning

--load

Load external code that may contain additional classes for normalization, etc. E.g. if the classes are contained in a python file named myclasses.py in the directory where you are calling benchmarkstt from, you would pass --load myclasses. All classes that are recognized will be automatically documented in the --help command and available for use.

Subcommand normalization

Apply normalization to given input

usage: benchmarkstt-tools normalization [--log] [-i file] [-o file]
                                        [--config file [section] [encoding]]
                                        [--file normalizer file [encoding]
                                        [path]] [--lowercase]
                                        [--regex search replace]
                                        [--replace search replace]
                                        [--replacewords search replace]
                                        [--unidecode]
                                        [--log-level {critical,fatal,error,warn,warning,info,debug,notset}]
                                        [--load MODULE_NAME [MODULE_NAME ...]]
                                        [--help]
Named Arguments
--log

show normalization logs (warning: for large files with many normalization rules this will cause a significant performance penalty and a lot of output data)

Default: False

--log-level

Possible choices: critical, fatal, error, warn, warning, info, debug, notset

Set the logging output level

Default: warning

--load

Load external code that may contain additional classes for normalization, etc. E.g. if the classes are contained in a python file named myclasses.py in the directory where you are calling benchmarkstt from, you would pass --load myclasses. All classes that are recognized will be automatically documented in the --help command and available for use.

input and output files

You can provide multiple input and output files, each preceded by -i and -o respectively. If no input file is given, only one output file can be used. If using multiple input and output files, there should be an equal number of each. Each processed input file will then be written to the corresponding output file.

-i, --inputfile

read input from this file, defaults to STDIN

-o, --outputfile

write output to this file, defaults to STDOUT

available normalizers

A list of normalizers to execute on the input; one or more normalizers can be given, and they are applied sequentially. The program will automatically find the normalizer in benchmarkstt.normalization.core, then benchmarkstt.normalization and finally in the global namespace.

--config

Use config file notation to define normalization rules. This notation is a list of normalizers, one per line.

Each normalizer that needs a file is followed by the file name of a CSV file, and can optionally be followed by the file encoding (if different from the default). All options are loaded from this CSV and applied to the normalizer.

The normalizers can be any of the core normalizers, or you can refer to your own normalizer class (like you would use in a python import, eg. my.own.package.MyNormalizerClass).

Additional rules:
  • Normalizer names are case-insensitive.

  • Arguments MAY be wrapped in double quotes.

  • If an argument contains a space, newline or double quote, it MUST be wrapped in double quotes.

  • A double quote itself is represented in this quoted argument as two double quotes: "".

The normalization rules are applied top-to-bottom and follow this format:

[normalization]
# This is a comment

# (Normalizer2 has no arguments)
lowercase

# loads regex expressions from regexrules.csv in "utf 8" encoding
regex regexrules.csv "utf 8"

# load another config file, [section1] and [section2]
config configfile.ini section1
config configfile.ini section2

# loads replace expressions from replaces.csv in default encoding
replace     replaces.csv
param file

The config file

param encoding

The file encoding

param section

The subsection of the config file to use, defaults to 'normalization'

example text

"He bravely turned his tail and fled"

example file

"./resources/test/normalizers/configfile.conf"

example encoding

"UTF-8"

example return

"ha bravalY Turnad his tail and flad"

--file

Read one per line and pass it to the given normalizer

param str|class normalizer

Normalizer name (or class)

param file

The file to read rules from

param encoding

The file encoding

example text

"This is an Ex-Parakeet"

example normalizer

"regex"

example file

"./resources/test/normalizers/regex/en_US"

example encoding

"UTF-8"

example return

"This is an Ex Parrot"

--lowercase

Lowercase the text

example text

"Easy, Mungo, easy... Mungo..."

example return

"easy, mungo, easy... mungo..."

--regex

Simple regex replace. By default the pattern is interpreted as case-sensitive.

Case-insensitivity is supported by adding inline modifiers.

You might want to use capturing groups to preserve the case. When replacing a character not captured, the information about its case is lost...

Eg. would replace "HAHA! Hahaha!" to "HeHe! Hehehe!":

search

replace

(?i)(h)a

\1e

No regex flags are set by default; you can set them yourself in the regex and combine them at will, e.g. multiline, dotall and ignorecase.

Eg. would replace "New<CRLF>line" to "newline":

search

replace

(?msi)new.line

newline

example text

"HAHA! Hahaha!"

example search

'(?i)(h)a'

example replace

'\1e'

example return

"HeHe! Hehehe!"

--replace

Simple search replace

param search

Text to search for

param replace

Text to replace with

example text

"Nudge nudge!"

example search

"nudge"

example replace

"wink"

example return

"Nudge wink!"

--replacewords

Simple search and replace that only replaces whole "words"; the first letter is also checked case-insensitively, with preservation of case.

param search

Word to search for

param replace

Replace with

example text

"She has a heart of formica"

example search

"a"

example replace

"the"

example return

"She has the heart of formica"

--unidecode

Unidecode characters to ASCII form, see Python's Unidecode package for more info.

example text

"𝖂𝖊𝖓𝖓 𝖎𝖘𝖙 𝖉𝖆𝖘 𝕹𝖚𝖓𝖘𝖙ü𝖈𝖐 𝖌𝖎𝖙 𝖚𝖓𝖉 𝕾𝖑𝖔𝖙𝖊𝖗𝖒𝖊𝖞𝖊𝖗?"

example return

"Wenn ist das Nunstuck git und Slotermeyer?"

Subcommand metrics

Calculate metrics based on the comparison of a hypothesis with a reference.

usage: benchmarkstt-tools metrics -r REFERENCE -h HYPOTHESIS
                                  [-rt {infer,argument,plaintext}]
                                  [-ht {infer,argument,plaintext}]
                                  [-o {json,markdown,restructuredtext}]
                                  [--beer [entities_file]] [--cer [mode]
                                  [differ_class]] [--diffcounts [mode]
                                  [differ_class]] [--wer [mode]
                                  [differ_class]] [--worddiffs [dialect]
                                  [differ_class]]
                                  [--log-level {critical,fatal,error,warn,warning,info,debug,notset}]
                                  [--load MODULE_NAME [MODULE_NAME ...]]
                                  [--help]
Named Arguments
-r, --reference

File to use as reference

-h, --hypothesis

File to use as hypothesis

-o, --output-format

Possible choices: json, markdown, restructuredtext

Format of the outputted results

Default: "restructuredtext"

--log-level

Possible choices: critical, fatal, error, warn, warning, info, debug, notset

Set the logging output level

Default: warning

--load

Load external code that may contain additional classes for normalization, etc. E.g. if the classes are contained in a python file named myclasses.py in the directory where you are calling benchmarkstt from, you would pass --load myclasses. All classes that are recognized will be automatically documented in the --help command and available for use.

reference and hypothesis types

You can specify which file type the --reference/-r and --hypothesis/-h arguments should be treated as.

Available types:

  • 'infer': Load from a given filename. Automatically infer file type from the filename extension.

  • 'argument': Read the argument and treat as plain text (without reading from file).

  • 'plaintext': Load from a given filename. Treat file as plain text.

-rt, --reference-type

Possible choices: infer, argument, plaintext

Type of reference file

Default: "infer"

-ht, --hypothesis-type

Possible choices: infer, argument, plaintext

Type of hypothesis file

Default: "infer"

available metrics

A list of metrics to calculate. At least one metric needs to be provided.

--beer

Bag of Entities Error Rate, BEER, is defined as the error rate per entity with a bag of words approach:

                    abs(ne_hyp - ne_ref)
BEER (entity)   =   ----------------------
                        ne_ref
  • ne_hyp = number of detections of the entity in the hypothesis file

  • ne_ref = number of detections of the entity in the reference file

The WA_BEER for a set of N entities is defined as the weighted average of the BEER for the set of entities:

WA_BEER (entity_1, ... entity_N) =  w_1*BEER (entity_1)*L_1/L + ... + w_N*BEER (entity_N)*L_N/L

which is equivalent to:

                                    w_1*abs(ne_hyp_1 - ne_ref_1) + ... + w_N*abs(ne_hyp_N - ne_ref_N)
WA_BEER (entity_1, ... entity_N) =  ------------------------------------------------------------------
                                                                    L
  • L_1 = number of occurrences of entity 1 in the reference document

  • L = L_1 + ... + L_N

the weights being normalised by the tool:

  • w_1 + ... + w_N = 1

The input file defines the list of entities and the weight per entity, w_n. It is processed as a json file with the following structure:

{ "entity_1":W_1, "entity_2" : W_2, "entity_3" :W_3 .. }

W_n being the non-normalized weight, the normalization of the weights is performed by the tool as:

            W_n
w_n =   ---------------
        W_1 + ... +W_N

The minimum value for weight being 0.

--cer

Character Error Rate, basically defined as:

insertions + deletions + substitutions
--------------------------------------
    number of reference characters

Character error rate, CER, compares the differences between reference and hypothesis on a character level. A CER measure is usually lower than a WER measure, since words might differ in only one or a few characters, and yet be classified as fully different.

The CER metric might be useful as a perspective on the WER metric. Word endings might be less relevant if the text will be preprocessed with stemming, or minor spelling mistakes might be acceptable in certain situations. A CER metric might also be used to evaluate a source (an ASR) which outputs a stream of characters rather than words.

Important: The current implementation of the CER metric ignores whitespace characters. A string like 'aa bb cc' will first be split into words, ['aa','bb','cc'], and then merged into a final string for evaluation: 'aabbcc'.

param mode

'levenshtein' (default).

param differ_class

For future use.

--diffcounts

Get the number of differences between reference and hypothesis

--wer

Word Error Rate, basically defined as:

insertions + deletions + substitutions
--------------------------------------
      number of reference words

See: https://en.wikipedia.org/wiki/Word_error_rate

Calculates the WER using one of two algorithms:

[Mode: 'strict' or 'hunt'] Insertions, deletions and substitutions are identified using the Hunt–McIlroy diff algorithm. The 'hunt' mode applies 0.5 weight to insertions and deletions. This algorithm is the one used internally by Python.

See https://docs.python.org/3/library/difflib.html

[Mode: 'levenshtein'] In the context of WER, Levenshtein distance is the minimum edit distance computed at the word level. This implementation uses the Editdistance c++ implementation by Hiroyuki Tanaka: https://github.com/aflc/editdistance. See: https://en.wikipedia.org/wiki/Levenshtein_distance

param mode

'strict' (default), 'hunt' or 'levenshtein'.

param differ_class

For future use.

--worddiffs

Present differences on a per-word basis

param dialect

Presentation format. Default is 'ansi'.

example dialect

'html'

param differ_class

For future use.

Bash completion

Bash completion is supported through argcomplete.

Setting up bash completion

If you use bash as your shell, benchmarkstt and benchmarkstt-tools can use argcomplete for auto-completion.

For this argcomplete needs to be installed and enabled.

Installing argcomplete
  1. Install argcomplete using:

    python3 -m pip install argcomplete
    
  2. For global activation of all argcomplete enabled python applications, run:

    activate-global-python-argcomplete
    
Alternative argcomplete configuration
  1. For permanent (but not global) benchmarkstt activation, use:

    register-python-argcomplete benchmarkstt >> ~/.bashrc
    register-python-argcomplete benchmarkstt-tools >> ~/.bashrc
    
  2. For one-time activation of argcomplete for benchmarkstt only, use:

    eval "$(register-python-argcomplete benchmarkstt; register-python-argcomplete benchmarkstt-tools)"
    

API

BenchmarkSTT exposes its functionality through a JSON-RPC api.

Attention

Only supported for Python versions 3.6 and above!

Starting the server

You can launch a server to make the api available via:

Usage

All requests must be HTTP POST requests, with the content containing valid JSON.

Using curl, for example:

curl -X POST \
  http://localhost:8080/api \
  -H 'Content-Type: application/json-rpc' \
  -d '{
    "jsonrpc": "2.0",
    "method": "help",
    "id": null
}'
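The same request can be made from Python using only the standard library (a sketch; adjust the URL to match your host and port):

import json
import urllib.request

payload = json.dumps({"jsonrpc": "2.0", "method": "help", "id": None}).encode()
request = urllib.request.Request(
    "http://localhost:8080/api",
    data=payload,
    headers={"Content-Type": "application/json-rpc"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read().decode()))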

If you started the service with the parameter --with-explorer (see Subcommand api), you can easily test the available JSON-RPC API calls by visiting the API URL (e.g. http://localhost:8080/api in the above example).

Important

The API explorer is provided as-is, without any tests or code reviews. This is marked as a low-priority feature.

Available JSON-RPC methods

Attention

Only supported for Python versions 3.6 and above

version

Get the version of benchmarkstt

return str

BenchmarkSTT version

list.normalization

Get a list of available core normalization

return object

With key being the normalization name, and value its description

normalization.config

Use config file notation to define normalization rules. This notation is a list of normalizers, one per line.

Each normalizer that needs a file is followed by the file name of a CSV file, and can optionally be followed by the file encoding (if different from the default). All options are loaded from this CSV and applied to the normalizer.

The normalizers can be any of the core normalizers, or you can refer to your own normalizer class (like you would use in a python import, eg. my.own.package.MyNormalizerClass).

Additional rules:
  • Normalizer names are case-insensitive.

  • Arguments MAY be wrapped in double quotes.

  • If an argument contains a space, newline or double quote, it MUST be wrapped in double quotes.

  • A double quote itself is represented in this quoted argument as two double quotes: "".

The normalization rules are applied top-to-bottom and follow this format:

[normalization]
# This is a comment

# (Normalizer2 has no arguments)
lowercase

# loads regex expressions from regexrules.csv in "utf 8" encoding
regex regexrules.csv "utf 8"

# load another config file, [section1] and [section2]
config configfile.ini section1
config configfile.ini section2

# loads replace expressions from replaces.csv in default encoding
replace     replaces.csv
param file

The config file

param encoding

The file encoding

param section

The subsection of the config file to use, defaults to 'normalization'

example text

"He bravely turned his tail and fled"

example file

"./resources/test/normalizers/configfile.conf"

example encoding

"UTF-8"

example return

"ha bravalY Turnad his tail and flad"

param text

The text to normalize

param bool return_logs

Return normalization logs

normalization.file

Read one per line and pass it to the given normalizer

param str|class normalizer

Normalizer name (or class)

param file

The file to read rules from

param encoding

The file encoding

example text

"This is an Ex-Parakeet"

example normalizer

"regex"

example file

"./resources/test/normalizers/regex/en_US"

example encoding

"UTF-8"

example return

"This is an Ex Parrot"

param text

The text to normalize

param bool return_logs

Return normalization logs

normalization.lowercase

Lowercase the text

example text

"Easy, Mungo, easy... Mungo..."

example return

"easy, mungo, easy... mungo..."

param text

The text to normalize

param bool return_logs

Return normalization logs

normalization.regex

Simple regex replace. By default the pattern is interpreted as case-sensitive.

Case-insensitivity is supported by adding inline modifiers.

You might want to use capturing groups to preserve the case. When replacing a character not captured, the information about its case is lost...

Eg. would replace "HAHA! Hahaha!" to "HeHe! Hehehe!":

search

replace

(?i)(h)a

\1e

No regex flags are set by default; you can set them yourself in the regex and combine them at will, e.g. multiline, dotall and ignorecase.

Eg. would replace "New<CRLF>line" to "newline":

search

replace

(?msi)new.line

newline

example text

"HAHA! Hahaha!"

example search

'(?i)(h)a'

example replace

'\1e'

example return

"HeHe! Hehehe!"

param text

The text to normalize

param bool return_logs

Return normalization logs

normalization.replace

Simple search replace

param search

Text to search for

param replace

Text to replace with

example text

"Nudge nudge!"

example search

"nudge"

example replace

"wink"

example return

"Nudge wink!"

param text

The text to normalize

param bool return_logs

Return normalization logs

normalization.replacewords

Simple search and replace that only replaces whole "words"; the first letter is also checked case-insensitively, with preservation of case.

param search

Word to search for

param replace

Replace with

example text

"She has a heart of formica"

example search

"a"

example replace

"the"

example return

"She has the heart of formica"

param text

The text to normalize

param bool return_logs

Return normalization logs

normalization.unidecode

Unidecode characters to ASCII form, see Python's Unidecode package for more info.

example text

"𝖂𝖊𝖓𝖓 𝖎𝖘𝖙 𝖉𝖆𝖘 𝕹𝖚𝖓𝖘𝖙ü𝖈𝖐 𝖌𝖎𝖙 𝖚𝖓𝖉 𝕾𝖑𝖔𝖙𝖊𝖗𝖒𝖊𝖞𝖊𝖗?"

example return

"Wenn ist das Nunstuck git und Slotermeyer?"

param text

The text to normalize

param bool return_logs

Return normalization logs

list.metrics

Get a list of available core metrics

return object

With key being the metrics name, and value its description

metrics.beer

Bag of Entities Error Rate, BEER, is defined as the error rate per entity with a bag of words approach:

                    abs(ne_hyp - ne_ref)
BEER (entity)   =   ----------------------
                        ne_ref
  • ne_hyp = number of detections of the entity in the hypothesis file

  • ne_ref = number of detections of the entity in the reference file

The WA_BEER for a set of N entities is defined as the weighted average of the BEER for the set of entities:

WA_BEER (entity_1, ... entity_N) =  w_1*BEER (entity_1)*L_1/L + ... + w_N*BEER (entity_N)*L_N/L

which is equivalent to:

                                    w_1*abs(ne_hyp_1 - ne_ref_1) + ... + w_N*abs(ne_hyp_N - ne_ref_N)
WA_BEER (entity_1, ... entity_N) =  ------------------------------------------------------------------
                                                                    L
  • L_1 = number of occurrences of entity 1 in the reference document

  • L = L_1 + ... + L_N

the weights being normalised by the tool:

  • w_1 + ... + w_N = 1

The input file defines the list of entities and the weight per entity, w_n. It is processed as a json file with the following structure:

{ "entity_1":W_1, "entity_2" : W_2, "entity_3" :W_3 .. }

W_n being the non-normalized weight, the normalization of the weights is performed by the tool as:

            W_n
w_n =   ---------------
        W_1 + ... +W_N

The minimum value for weight being 0.

param ref

Reference text

param hyp

Hypothesis text

metrics.cer

Character Error Rate, basically defined as:

insertions + deletions + substitutions
--------------------------------------
    number of reference characters

Character error rate, CER, compares the differences between reference and hypothesis on a character level. A CER measure is usually lower than a WER measure, since words might differ in only one or a few characters, and yet be classified as fully different.

The CER metric might be useful as a perspective on the WER metric. Word endings might be less relevant if the text will be preprocessed with stemming, or minor spelling mistakes might be acceptable in certain situations. A CER metric might also be used to evaluate a source (an ASR) which outputs a stream of characters rather than words.

Important: The current implementation of the CER metric ignores whitespace characters. A string like 'aa bb cc' will first be split into words, ['aa','bb','cc'], and then merged into a final string for evaluation: 'aabbcc'.

param mode

'levenshtein' (default).

param differ_class

For future use.

param ref

Reference text

param hyp

Hypothesis text

metrics.diffcounts

Get the number of differences between reference and hypothesis

param ref

Reference text

param hyp

Hypothesis text

metrics.wer

Word Error Rate, basically defined as:

insertions + deletions + substitutions
--------------------------------------
      number of reference words

See: https://en.wikipedia.org/wiki/Word_error_rate

Calculates the WER using one of two algorithms:

[Mode: 'strict' or 'hunt'] Insertions, deletions and substitutions are identified using the Hunt–McIlroy diff algorithm. The 'hunt' mode applies 0.5 weight to insertions and deletions. This algorithm is the one used internally by Python.

See https://docs.python.org/3/library/difflib.html

[Mode: 'levenshtein'] In the context of WER, Levenshtein distance is the minimum edit distance computed at the word level. This implementation uses the Editdistance c++ implementation by Hiroyuki Tanaka: https://github.com/aflc/editdistance. See: https://en.wikipedia.org/wiki/Levenshtein_distance
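
For illustration, the difference between the 'strict' and 'hunt' weightings can be sketched with opcodes from Python's difflib (see the link above); this is a sketch, not the library's own code:

    import difflib

    ref = "hello darkness my old friend".split()
    hyp = "hello darkness my old foe".split()

    counts = {"insert": 0, "delete": 0, "replace": 0}
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=ref, b=hyp).get_opcodes():
        if tag == "insert":
            counts["insert"] += j2 - j1
        elif tag == "delete":
            counts["delete"] += i2 - i1
        elif tag == "replace":
            counts["replace"] += max(i2 - i1, j2 - j1)

    n = len(ref)
    wer_strict = (counts["insert"] + counts["delete"] + counts["replace"]) / n
    wer_hunt = (0.5 * (counts["insert"] + counts["delete"]) + counts["replace"]) / n
    print(wer_strict, wer_hunt)   # both 0.2 here: one substitution, five reference words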

param mode

'strict' (default), 'hunt' or 'levenshtein'.

param differ_class

For future use.

param ref

Reference text

param hyp

Hypothesis text

metrics.worddiffs

Present differences on a per-word basis

param dialect

Presentation format. Default is 'ansi'.

example dialect

'html'

param differ_class

For future use.

param ref

Reference text

param hyp

Hypothesis text

list.benchmark

Get a list of available core benchmarks

return object

With key being the benchmark name, and value its description

benchmark.beer

Bag of Entities Error Rate, BEER, is defined as the error rate per entity with a bag of words approach:

                    abs(ne_hyp - ne_ref)
BEER (entity)   =   ----------------------
                        ne_ref
  • ne_hyp = number of detections of the entity in the hypothesis file

  • ne_ref = number of detections of the entity in the reference file

The WA_BEER for a set of N entities is defined as the weighted average of the BEER for the set of entities:

WA_BEER([entity_1, ..., entity_N]) = w_1*BEER(entity_1)*L_1/L + ... + w_N*BEER(entity_N)*L_N/L

which is equivalent to:

                                     w_1*abs(ne_hyp_1 - ne_ref_1) + ... + w_N*abs(ne_hyp_N - ne_ref_N)
WA_BEER([entity_1, ..., entity_N]) = ------------------------------------------------------------------
                                                                     L
  • L_1 = number of occurrences of entity 1 in the reference document

  • L = L_1 + ... + L_N

the weights being normalised by the tool:

  • w_1 + ... + w_N = 1

The input file defines the list of entities and the weight per entity, w_n. It is processed as a JSON file with the following structure:

{ "entity_1": W_1, "entity_2": W_2, "entity_3": W_3, ... }

W_n being the non-normalized weight, the normalization of the weights is performed by the tool as:

            W_n
w_n =   ---------------
        W_1 + ... +W_N

The minimum value for a weight is 0.

param ref

Reference text

param hyp

Hypothesis text

param config

The config to use

param bool return_logs

Return normalization logs

example ref

'Hello darkness my OLD friend'

example hyp

'Hello darkness my old foe'

example config
[normalization]
# using a simple config file
Lowercase
example result

""

benchmark.cer

Character Error Rate, basically defined as:

insertions + deletions + substitutions
--------------------------------------
    number of reference characters

The character error rate (CER) compares the differences between reference and hypothesis at the character level. A CER measure is usually lower than a WER measure, since words may differ in only one or a few characters yet still be counted as fully different words.

The CER metric can be useful as a complement to the WER metric. Word endings might be less relevant if the text will be preprocessed with stemming, or minor spelling mistakes might be acceptable in certain situations. A CER metric can also be used to evaluate a source (an ASR) which outputs a stream of characters rather than words.

Important: The current implementation of the CER metric ignores whitespace characters. A string like 'aa bb cc' will first be split into words, ['aa','bb','cc'], and then merged into a final string for evaluation: 'aabbcc'.

param mode

'levenshtein' (default).

param differ_class

For future use.

param ref

Reference text

param hyp

Hypothesis text

param config

The config to use

param bool return_logs

Return normalization logs

example ref

'Hello darkness my OLD friend'

example hyp

'Hello darkness my old foe'

example config
[normalization]
# using a simple config file
Lowercase
example result

""

benchmark.diffcounts

Get the number of differences between reference and hypothesis

param ref

Reference text

param hyp

Hypothesis text

param config

The config to use

param bool return_logs

Return normalization logs

example ref

'Hello darkness my OLD friend'

example hyp

'Hello darkness my old foe'

example config
[normalization]
# using a simple config file
Lowercase
example result

""

benchmark.wer

Word Error Rate, basically defined as:

insertions + deletions + substitutions
--------------------------------------
      number of reference words

See: https://en.wikipedia.org/wiki/Word_error_rate

Calculates the WER using one of two algorithms:

[Mode: 'strict' or 'hunt'] Insertions, deletions and substitutions are identified using the Hunt–McIlroy diff algorithm. The 'hunt' mode applies 0.5 weight to insertions and deletions. This algorithm is the one used internally by Python.

See https://docs.python.org/3/library/difflib.html

[Mode: 'levenshtein'] In the context of WER, Levenshtein distance is the minimum edit distance computed at the word level. This implementation uses the Editdistance c++ implementation by Hiroyuki Tanaka: https://github.com/aflc/editdistance. See: https://en.wikipedia.org/wiki/Levenshtein_distance

param mode

'strict' (default), 'hunt' or 'levenshtein'.

param differ_class

For future use.

param ref

Reference text

param hyp

Hypothesis text

param config

The config to use

param bool return_logs

Return normalization logs

example ref

'Hello darkness my OLD friend'

example hyp

'Hello darkness my old foe'

example config
[normalization]
# using a simple config file
Lowercase
example result

""

benchmark.worddiffs

Present differences on a per-word basis

param dialect

Presentation format. Default is 'ansi'.

example dialect

'html'

param differ_class

For future use.

param ref

Reference text

param hyp

Hypothesis text

param config

The config to use

param bool return_logs

Return normalization logs

example ref

'Hello darkness my OLD friend'

example hyp

'Hello darkness my old foe'

example config
[normalization]
# using a simple config file
Lowercase
example result

""

help

Returns available api methods

return object

With key being the method name, and value its description
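
The methods listed above become available over JSON-RPC once the api is served (see the benchmarkstt.api package below). As a hedged example, assuming the server is reachable at http://localhost:8080/api (the actual host, port and path depend on how you launch it), a metrics.wer call could look like this:

    import json
    from urllib import request

    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "metrics.wer",
        "params": {
            "ref": "Hello darkness my OLD friend",
            "hyp": "Hello darkness my old foe",
        },
    }
    req = request.Request(
        "http://localhost:8080/api",   # assumed address, adjust to your setup
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as response:
        print(json.loads(response.read()))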

Development

Setting up environment

This assumes git and Python 3.5 or above are already installed on your system (see Installation).

  1. Fork the repository source code from github to your own account.

  2. Clone the repository from github to your local development environment (replace [YOURUSERNAME] with your github username):

    git clone https://github.com/[YOURUSERNAME]/benchmarkstt.git
    cd benchmarkstt
    
  3. Create and activate a local environment:

    python3 -m venv env
    source env/bin/activate
    
  4. Install the package; this will also install all requirements. It does an "editable" install, i.e. it creates a symbolic link to the source code:

    make dev
    
  5. You now have a local development environment where you can commit and push to your own forked repository. It is recommended to run the tests to check that your local copy passes all unit tests:

    make test
    

Warning

The development version of benchmarkstt and benchmarkstt-tools is only available in your current venv environment. Make sure to run source env/bin/activate to activate your local venv before making calls to benchmarkstt or benchmarkstt-tools.

Building the documentation

First install the dependencies for building the documentation (sphinx, etc.) using:

make setupdocs

This only needs to be done once.

Then to build the documentation locally:

make docs

The documentation will be created in /docs/build/html/

Contributing

Contributing

[Status: Draft]

This project has a Code of Conduct that we expect all of our contributors to abide by, please check it out before contributing.

Pull requests and branching
  • Before working on a feature, always create a new branch first (or fork the project).

  • Branches should be short-lived, except branches specifically labelled 'experiment'.

  • Once work is complete, push the branch to GitHub for review. Make sure your branch is up to date with master before making a pull request, e.g. by running git merge origin/master.

  • Once a branch has been merged into master, delete it.

master is never committed to directly unless the change is very trivial or a code review is unnecessary (code formatting or documentation updates for example).

License

By contributing to benchmarkstt, you agree that your contributions will be licensed under the LICENSE.md file in the root directory of this source tree.

Code of Conduct

Status: Draft

We are committed to providing a friendly, safe and welcoming environment for all, regardless of gender, sexual orientation, disability, ethnicity, religion, or similar personal characteristic.

We’ve written this code of conduct not because we expect bad behaviour from our community—which, in our experience, is overwhelmingly kind and civil—but because we believe a clear code of conduct is one necessary part of building a respectful community space.

We are committed to providing a welcoming and inspiring community for all and expect our code of conduct to be honored. Anyone who violates this code of conduct may be banned from the community.

Please be kind and courteous. There's no need to be mean or rude. Respect that people have differences of opinion and that every design or implementation choice carries a trade-off and numerous costs. There is seldom a right answer, merely an optimal answer given a set of values and circumstances.

Our open community strives to:

Be friendly and patient.

Be considerate: Your work will be used by other people, and you in turn will depend on the work of others. Any decision you take will affect users and colleagues, and you should take those consequences into account when making decisions. Remember that we’re a world-wide community, so you might not be communicating in someone else’s primary language.

Be respectful: Not all of us will agree all the time, but disagreement is no excuse for poor behaviour and poor manners. We might all experience some frustration now and then, but we cannot allow that frustration to impact others. It’s important to remember that a community where people feel uncomfortable or threatened is not a productive one.

Be careful in the words that we choose: we are a community of professionals, and we conduct ourselves professionally. Be kind to others. Do not insult or put down other participants. Harassment and other exclusionary behaviour aren’t acceptable.

Try to understand why we disagree: Disagreements, both social and technical, happen all the time. It is important that we resolve disagreements and differing views constructively. Remember that we’re different. The strength of our community comes from its diversity, people from a wide range of backgrounds. Different people have different perspectives on issues. Being unable to understand why someone holds a viewpoint doesn’t mean that they’re wrong. Don’t forget that it is human to err and blaming each other doesn’t get us anywhere. Instead, focus on helping to resolve issues and learning from mistakes.

What goes around comes around. We believe in open source, and are excited by what happens when people add value to each other's work in a collaborative way.

Take care of each other. Alert a member of the project team if you notice a dangerous situation, someone in distress, or violations of this code of conduct, even if they seem inconsequential.

If any participant engages in harassing behaviour, the project team may take any lawful action we deem appropriate, including but not limited to warning the offender or asking the offender to leave the project.

Diversity Statement

We encourage everyone to participate and are committed to building a community for all. Although we will fail at times, we seek to treat everyone both as fairly and equally as possible. Whenever a participant has made a mistake, we expect them to take responsibility for it. If someone has been harmed or offended, it is our responsibility to listen carefully and respectfully, and do our best to right the wrong.

Reporting Issues

NOTE: no contact has yet been decided

If you experience or witness unacceptable behaviour, or have any other concerns, please report it by contacting us via [TODO]. All reports will be handled with discretion. In your report please include:

  • Your contact information.

  • Names (real, nicknames, or pseudonyms) of any individuals involved. If there are additional witnesses, please include them as well.

  • Your account of what occurred, and whether you believe the incident is ongoing. If there is a publicly available record (e.g. a mailing list archive or a public slack channel), please include a link.

  • Any additional information that may be helpful.

After filing a report, a representative will contact you personally, review the incident, follow up with any additional questions, and make a decision as to how to respond. If the person who is harassing you is part of the response team, they will recuse themselves from handling your incident. If the complaint originates from a member of the response team, it will be handled by a different member of the response team. We will respect confidentiality requests for the purpose of protecting victims of abuse.

Feedback

We welcome your feedback on this and every other aspect of this project and we thank you for working with us to make it a safe, enjoyable, and friendly experience for everyone who participates.

Attribution & Acknowledgements

We all stand on the shoulders of giants across many open source communities. We’d like to thank the communities and projects that established codes of conduct and diversity statements as our inspiration.

benchmarkstt package

Package benchmarkstt

Subpackages

benchmarkstt.api package

Responsible for providing a JSON-RPC api.

Subpackages
benchmarkstt.api.entrypoints package
Submodules
benchmarkstt.api.entrypoints.api module
benchmarkstt.api.entrypoints.benchmark module
benchmarkstt.api.entrypoints.benchmark.callback(cls, ref: str, hyp: str, config: str = None, return_logs: bool = None, *args, **kwargs)[source]
Parameters
  • ref -- Reference text

  • hyp -- Hypothesis text

  • config -- The config to use

  • return_logs (bool) -- Return normalization logs

Example ref

'Hello darkness my OLD friend'

Example hyp

'Hello darkness my old foe'

Example config
[normalization]
# using a simple config file
Lowercase
Example result

""

benchmarkstt.api.entrypoints.metrics module
benchmarkstt.api.entrypoints.metrics.callback(cls, ref: str, hyp: str, *args, **kwargs)[source]
Parameters
  • ref -- Reference text

  • hyp -- Hypothesis text

benchmarkstt.api.entrypoints.normalization module
benchmarkstt.api.entrypoints.normalization.callback(cls, text: str, return_logs: bool = None, *args, **kwargs)[source]
Parameters
  • text -- The text to normalize

  • return_logs (bool) -- Return normalization logs

Submodules
benchmarkstt.api.gunicorn module

Entry point for a gunicorn server, serves at /api

benchmarkstt.api.jsonrpc module

Make benchmarkstt available through a rudimentary JSON-RPC interface

Warning

Only supported for Python versions 3.6 and above!

class benchmarkstt.api.jsonrpc.DefaultMethods[source]

Bases: object

static help(methods)[source]
static version()[source]

Get the version of benchmarkstt

Return str

BenchmarkSTT version

class benchmarkstt.api.jsonrpc.MagicMethods[source]

Bases: object

static is_safe_path(path)[source]

Determines whether the file or path is within the current working directory

Parameters

path (str|PathLike) --

Returns

bool

load(name, module)[source]

Load all possible callbacks for a given module

Parameters
  • name --

  • module (Module) --

possible_path_args = ['file', 'path']
register(name, callback)[source]

Register a callback as an api call

Parameters
  • name --

  • callback --

serve(config, callback)[source]

Responsible for creating a callback with proper documentation and arguments signature that can be registered as an api call.

Parameters
  • config --

  • callback --

Returns

callable

exception benchmarkstt.api.jsonrpc.SecurityError[source]

Bases: ValueError

Trying to do or access something that isn't allowed

benchmarkstt.api.jsonrpc.get_methods() → jsonrpcserver.methods.Methods[source]

Returns the available JSON-RPC api methods

Returns

jsonrpcserver.methods.Methods

benchmarkstt.cli package

Responsible for handling the command line tools.

class benchmarkstt.cli.CustomHelpFormatter(*args, **kwargs)[source]

Bases: argparse.HelpFormatter

Custom formatter for argparse that allows us to properly display _ActionWithArguments and docblock documentation, as well as allowing newlines inside the description.

benchmarkstt.cli.action_with_arguments(action, required_args, optional_args)[source]

Custom argparse action to support a variable number of arguments

Parameters
  • action -- name of the action

  • required_args (list) -- required arguments

  • optional_args (list) -- optional arguments

Return type

ActionWithArguments

benchmarkstt.cli.args_common(parser)[source]
benchmarkstt.cli.args_complete(parser)[source]
benchmarkstt.cli.args_from_factory(action, factory, parser)[source]
benchmarkstt.cli.args_help(parser)[source]
benchmarkstt.cli.before_parseargs()[source]
benchmarkstt.cli.create_parser(*args, **kwargs)[source]
benchmarkstt.cli.determine_log_level()[source]
benchmarkstt.cli.format_helptext(txt)[source]
benchmarkstt.cli.preload_externals()[source]
Subpackages
benchmarkstt.cli.entrypoints package
Submodules
benchmarkstt.cli.entrypoints.api module

Make benchmarkstt available through a rudimentary JSON-RPC interface

Attention

Only supported for Python versions 3.6 and above

benchmarkstt.cli.entrypoints.api.argparser(parser)[source]

Adds the help and arguments specific to this module

benchmarkstt.cli.entrypoints.api.create_app(entrypoint: str = None, with_explorer: bool = None)[source]

Create the Flask app

Parameters
  • entrypoint -- The HTTP path on which the api will be served

  • with_explorer (bool) -- Whether to also serve the JSON-RPC API explorer

Returns

benchmarkstt.cli.entrypoints.api.run(_parser, args)[source]
benchmarkstt.cli.entrypoints.benchmark module

Do a complete flow of input -> normalization -> segmentation -> metrics

benchmarkstt.cli.entrypoints.benchmark.argparser(parser: argparse.ArgumentParser)[source]
benchmarkstt.cli.entrypoints.benchmark.run(parser, args)[source]
benchmarkstt.cli.entrypoints.metrics module

Calculate metrics based on the comparison of a hypothesis with a reference.

benchmarkstt.cli.entrypoints.metrics.argparser(parser: argparse.ArgumentParser)[source]
benchmarkstt.cli.entrypoints.metrics.file_to_iterable(file, type_, normalizer=None)[source]
benchmarkstt.cli.entrypoints.metrics.run(parser, args, normalizer=None)[source]
benchmarkstt.cli.entrypoints.normalization module

Apply normalization to given input

benchmarkstt.cli.entrypoints.normalization.argparser(parser: argparse.ArgumentParser)[source]

Adds the help and arguments specific to this module

benchmarkstt.cli.entrypoints.normalization.args_inputfile(parser)[source]
benchmarkstt.cli.entrypoints.normalization.args_logs(parser: argparse.ArgumentParser)[source]
benchmarkstt.cli.entrypoints.normalization.args_normalizers(parser: argparse.ArgumentParser)[source]
benchmarkstt.cli.entrypoints.normalization.get_normalizer_from_args(args)[source]
benchmarkstt.cli.entrypoints.normalization.run(parser, args)[source]
Submodules
benchmarkstt.cli.main module
benchmarkstt.cli.main.argparser()[source]
benchmarkstt.cli.main.parser_context()[source]
benchmarkstt.cli.main.run()[source]
benchmarkstt.cli.tools module
benchmarkstt.cli.tools.argparser()[source]
benchmarkstt.cli.tools.run()[source]
benchmarkstt.diff package

Responsible for calculating differences.

class benchmarkstt.diff.Differ(a, b)[source]

Bases: abc.ABC

abstract get_opcodes()[source]

Return list of 5-tuples describing how to turn a into b.

Each tuple is of the form (tag, i1, i2, j1, j2). The first tuple has i1 == j1 == 0, and remaining tuples have i1 equals the i2 from the tuple preceding it, and likewise for j1 equals the previous j2.

The tags are strings, with these meanings:

  • 'replace': a[i1:i2] should be replaced by b[j1:j2]

  • 'delete': a[i1:i2] should be deleted. Note that j1==j2 in this case.

  • 'insert': b[j1:j2] should be inserted at a[i1:i1]. Note that i1==i2 in this case.

  • 'equal': a[i1:i2] == b[j1:j2]
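
A small sketch of what these opcodes look like in practice, using the RatcliffObershelp differ documented under benchmarkstt.diff.core below (the exact opcodes depend on the inputs):

    from benchmarkstt.diff.core import RatcliffObershelp

    a = "hello darkness my old friend".split()
    b = "hello darkness my old foe".split()
    for tag, i1, i2, j1, j2 in RatcliffObershelp(a, b).get_opcodes():
        print(tag, a[i1:i2], b[j1:j2])
    # expected: an 'equal' opcode for the first four words, then a 'replace'
    # for 'friend' -> 'foe'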

Submodules
benchmarkstt.diff.core module

Core Diff algorithms

class benchmarkstt.diff.core.RatcliffObershelp(a, b, **kwargs)[source]

Bases: benchmarkstt.diff.Differ

Diff according to Ratcliff and Obershelp (Gestalt) matching algorithm.

From difflib.SequenceMatcher (Copyright 2001-2020, Python Software Foundation.)

SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980's by Ratcliff and Obershelp under the hyperbolic name "gestalt pattern matching". The basic idea is to find the longest contiguous matching subsequence that contains no "junk" elements (R-O doesn't address junk). The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that "look right" to people.

get_opcodes()[source]

Return list of 5-tuples describing how to turn a into b.

Each tuple is of the form (tag, i1, i2, j1, j2). The first tuple has i1 == j1 == 0, and remaining tuples have i1 equals the i2 from the tuple preceding it, and likewise for j1 equals the previous j2.

The tags are strings, with these meanings:

  • 'replace': a[i1:i2] should be replaced by b[j1:j2]

  • 'delete': a[i1:i2] should be deleted. Note that j1==j2 in this case.

  • 'insert': b[j1:j2] should be inserted at a[i1:i1]. Note that i1==i2 in this case.

  • 'equal': a[i1:i2] == b[j1:j2]

benchmarkstt.diff.formatter module
class benchmarkstt.diff.formatter.ANSIDiffDialect(show_color_key=None)[source]

Bases: benchmarkstt.diff.formatter.Dialect

delete_format = '\x1b[31m%s\x1b[0m'
insert_format = '\x1b[32m%s\x1b[0m'
static preprocessor(txt)[source]
class benchmarkstt.diff.formatter.Dialect[source]

Bases: object

delete_format = '%s'
equal_format = '%s'
insert_format = '%s'
output()[source]
preprocessor = None
replace_format = None
property stream
class benchmarkstt.diff.formatter.DiffFormatter(dialect=None, *args, **kwargs)[source]

Bases: object

diff(a, b, opcodes=None, preprocessor=None)[source]
diff_dialects = {'ansi': <class 'benchmarkstt.diff.formatter.ANSIDiffDialect'>, 'html': <class 'benchmarkstt.diff.formatter.HTMLDiffDialect'>, 'json': <class 'benchmarkstt.diff.formatter.JSONDiffDialect'>, 'list': <class 'benchmarkstt.diff.formatter.ListDialect'>, 'rst': <class 'benchmarkstt.diff.formatter.RestructuredTextDialect'>, 'text': <class 'benchmarkstt.diff.formatter.UTF8Dialect'>}
classmethod has_dialect(dialect)[source]
class benchmarkstt.diff.formatter.HTMLDiffDialect[source]

Bases: benchmarkstt.diff.formatter.Dialect

delete_format = '<span class="delete">%s</span>'
insert_format = '<span class="insert">%s</span>'
static preprocessor(txt)[source]
class benchmarkstt.diff.formatter.JSONDiffDialect[source]

Bases: benchmarkstt.diff.formatter.ListDialect

output()[source]
class benchmarkstt.diff.formatter.ListDialect[source]

Bases: benchmarkstt.diff.formatter.Dialect

delete_format(txt)[source]
equal_format(txt)[source]
insert_format(txt)[source]
output()[source]
static preprocessor(txt)[source]
replace_format(a, b)[source]
class benchmarkstt.diff.formatter.RestructuredTextDialect(show_color_key=None)[source]

Bases: benchmarkstt.diff.formatter.ANSIDiffDialect

delete_format = '\\ :diffdelete:`%s`\\ '
insert_format = '\\ :diffinsert:`%s`\\ '
static preprocessor(txt)[source]
class benchmarkstt.diff.formatter.UTF8Dialect[source]

Bases: benchmarkstt.diff.formatter.Dialect

delete_format(txt)[source]
insert_format(txt)[source]
static preprocessor(txt)[source]
benchmarkstt.diff.formatter.format_diff(a, b, opcodes=None, dialect=None, preprocessor=None)[source]
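
A rough usage sketch of the formatter module (it assumes format_diff accepts two plain sequences, here strings, like the Differ classes above; the exact output depends on the chosen dialect):

    from benchmarkstt.diff.formatter import format_diff

    # 'json' is one of the registered diff_dialects listed above
    print(format_diff("hello old friend", "hello old foe", dialect="json"))
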
benchmarkstt.input package

Responsible for dealing with input formats and converting them to benchmarkstt native schema

class benchmarkstt.input.Input[source]

Bases: abc.ABC

Submodules
benchmarkstt.input.core module

Default input formats

class benchmarkstt.input.core.File(file, input_type=None, normalizer=None)[source]

Bases: benchmarkstt.input.Input

Load from a given filename.

classmethod available_types()[source]
class benchmarkstt.input.core.PlainText(text, normalizer=None, segmenter=None)[source]

Bases: benchmarkstt.input.Input

Plain text.

benchmarkstt.metrics package

Responsible for calculating metrics.

class benchmarkstt.metrics.Metric[source]

Bases: abc.ABC

Base class for metrics

abstract compare(ref: benchmarkstt.schema.Schema, hyp: benchmarkstt.schema.Schema)[source]
Submodules
benchmarkstt.metrics.core module
class benchmarkstt.metrics.core.BEER(entities_file=None)[source]

Bases: benchmarkstt.metrics.Metric

Bag of Entities Error Rate, BEER, is defined as the error rate per entity with a bag of words approach:

                    abs(ne_hyp - ne_ref)
BEER (entity)   =   ----------------------
                        ne_ref
  • ne_hyp = number of detections of the entity in the hypothesis file

  • ne_ref = number of detections of the entity in the reference file

The WA_BEER for a set of N entities is defined as the weighted average of the BEER for the set of entities:

WA_BEER([entity_1, ..., entity_N]) = w_1*BEER(entity_1)*L_1/L + ... + w_N*BEER(entity_N)*L_N/L

which is equivalent to:

                                     w_1*abs(ne_hyp_1 - ne_ref_1) + ... + w_N*abs(ne_hyp_N - ne_ref_N)
WA_BEER([entity_1, ..., entity_N]) = ------------------------------------------------------------------
                                                                     L
  • L_1 = number of occurrences of entity 1 in the reference document

  • L = L_1 + ... + L_N

the weights being normalised by the tool:

  • w_1 + ... + w_N = 1

The input file defines the list of entities and the weight per entity, w_n. It is processed as a JSON file with the following structure:

{ "entity_1": W_1, "entity_2": W_2, "entity_3": W_3, ... }

W_n being the non-normalized weight, the normalization of the weights is performed by the tool as:

            W_n
w_n =   ---------------
        W_1 + ... +W_N

The minimum value for a weight is 0.

compare(ref: benchmarkstt.schema.Schema, hyp: benchmarkstt.schema.Schema)[source]
compute_beer(list_hypothesis_entity, list_reference_entity)[source]
get_entities()[source]
get_weight()[source]
set_entities(entities)[source]
set_weight(weight)[source]
class benchmarkstt.metrics.core.CER(mode=None, differ_class=None)[source]

Bases: benchmarkstt.metrics.Metric

Character Error Rate, basically defined as:

insertions + deletions + substitutions
--------------------------------------
    number of reference characters

The character error rate (CER) compares the differences between reference and hypothesis at the character level. A CER measure is usually lower than a WER measure, since words may differ in only one or a few characters yet still be counted as fully different words.

The CER metric can be useful as a complement to the WER metric. Word endings might be less relevant if the text will be preprocessed with stemming, or minor spelling mistakes might be acceptable in certain situations. A CER metric can also be used to evaluate a source (an ASR) which outputs a stream of characters rather than words.

Important: The current implementation of the CER metric ignores whitespace characters. A string like 'aa bb cc' will first be split into words, ['aa','bb','cc'], and then merged into a final string for evaluation: 'aabbcc'.

Parameters
  • mode -- 'levenshtein' (default).

  • differ_class -- For future use.

MODE_LEVENSHTEIN = 'levenshtein'
compare(ref: benchmarkstt.schema.Schema, hyp: benchmarkstt.schema.Schema)[source]
class benchmarkstt.metrics.core.DiffCounts(mode=None, differ_class: benchmarkstt.diff.Differ = None)[source]

Bases: benchmarkstt.metrics.Metric

Get the number of differences between reference and hypothesis

MODE_LEVENSHTEIN = 'levenshtein'
compare(ref: benchmarkstt.schema.Schema, hyp: benchmarkstt.schema.Schema)benchmarkstt.metrics.core.OpcodeCounts[source]
class benchmarkstt.metrics.core.OpcodeCounts(equal, replace, insert, delete)

Bases: tuple

property delete

Alias for field number 3

property equal

Alias for field number 0

property insert

Alias for field number 2

property replace

Alias for field number 1

class benchmarkstt.metrics.core.WER(mode=None, differ_class: benchmarkstt.diff.Differ = None)[source]

Bases: benchmarkstt.metrics.Metric

Word Error Rate, basically defined as:

insertions + deletions + substitutions
--------------------------------------
      number of reference words

See: https://en.wikipedia.org/wiki/Word_error_rate

Calculates the WER using one of two algorithms:

[Mode: 'strict' or 'hunt'] Insertions, deletions and substitutions are identified using the Hunt–McIlroy diff algorithm. The 'hunt' mode applies 0.5 weight to insertions and deletions. This algorithm is the one used internally by Python.

See https://docs.python.org/3/library/difflib.html

[Mode: 'levenshtein'] In the context of WER, Levenshtein distance is the minimum edit distance computed at the word level. This implementation uses the Editdistance c++ implementation by Hiroyuki Tanaka: https://github.com/aflc/editdistance. See: https://en.wikipedia.org/wiki/Levenshtein_distance

Parameters
  • mode -- 'strict' (default), 'hunt' or 'levenshtein'.

  • differ_class -- For future use.

DEL_PENALTY = 1
INS_PENALTY = 1
MODE_HUNT = 'hunt'
MODE_LEVENSHTEIN = 'levenshtein'
MODE_STRICT = 'strict'
SUB_PENALTY = 1
compare(ref: benchmarkstt.schema.Schema, hyp: benchmarkstt.schema.Schema) → float[source]
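
A hedged end-to-end sketch of using this class from Python, based only on the signatures documented in this reference (in particular, it assumes a Schema can be filled from the PlainText input iterable via its extend() method); it is not presented as the canonical usage:

    from benchmarkstt.input.core import PlainText
    from benchmarkstt.metrics.core import WER
    from benchmarkstt.schema import Schema

    # Build word-level schemas from plain text (PlainText is iterable,
    # Schema.extend accepts an iterable per the class documentation above)
    ref, hyp = Schema(), Schema()
    ref.extend(PlainText("hello darkness my old friend"))
    hyp.extend(PlainText("hello darkness my old foe"))

    # Default mode is 'strict'; one substituted word over five reference
    # words should give a WER of 0.2
    print(WER().compare(ref, hyp))
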
class benchmarkstt.metrics.core.WordDiffs(dialect=None, differ_class: benchmarkstt.diff.Differ = None)[source]

Bases: benchmarkstt.metrics.Metric

Present differences on a per-word basis

Parameters
  • dialect -- Presentation format. Default is 'ansi'.

  • differ_class -- For future use.

Example dialect

'html'

compare(ref: benchmarkstt.schema.Schema, hyp: benchmarkstt.schema.Schema)[source]
benchmarkstt.metrics.core.get_differ(a, b, differ_class: benchmarkstt.diff.Differ)[source]
benchmarkstt.metrics.core.get_opcode_counts(opcodes)benchmarkstt.metrics.core.OpcodeCounts[source]
benchmarkstt.metrics.core.traversible(schema, key=None)[source]
benchmarkstt.normalization package

Responsible for normalization of text.

class benchmarkstt.normalization.File(normalizer, file, encoding=None, path=None)[source]

Bases: benchmarkstt.normalization.Normalizer

Reads rules one per line from the given file and passes them to the given normalizer

Parameters
  • normalizer (str|class) -- Normalizer name (or class)

  • file -- The file to read rules from

  • encoding -- The file encoding

Example text

"This is an Ex-Parakeet"

Example normalizer

"regex"

Example file

"./resources/test/normalizers/regex/en_US"

Example encoding

"UTF-8"

Example return

"This is an Ex Parrot"

_normalize(text: str) → str[source]
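
A brief hedged sketch of calling this wrapper directly from Python, reusing the example values above (normalizer name 'regex' and the example rules file); the actual output depends on the contents of the rules file:

    from benchmarkstt.normalization import File

    # Wrap the 'regex' normalizer; rules are read one per line from the file
    normalizer = File("regex", "./resources/test/normalizers/regex/en_US")
    print(normalizer.normalize("This is an Ex-Parakeet"))
    # expected (per the example above): "This is an Ex Parrot"
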
class benchmarkstt.normalization.FileFactory(base_class, allow_duck=None)[source]

Bases: benchmarkstt.factory.CoreFactory

create(name, file=None, encoding=None, path=None)[source]
class benchmarkstt.normalization.NormalizationAggregate(title=None)[source]

Bases: benchmarkstt.normalization.Normalizer

Combining normalizers

_normalize(text: str) → str[source]
add(normalizer)[source]

Adds a normalizer to the composite "stack"

class benchmarkstt.normalization.Normalizer[source]

Bases: benchmarkstt.normalization._NormalizerNoLogs

Abstract base class for normalization

abstract _normalize(text: str) → str[source]
normalize(text)

Returns normalized text with rules supplied by the called class.

class benchmarkstt.normalization.NormalizerWithFileSupport[source]

Bases: benchmarkstt.normalization.Normalizer

This kind of normalization class supports loading the values from a file, i.e. being wrapped in a core.File wrapper.

abstract _normalize(text: str) → str[source]
Submodules
benchmarkstt.normalization.core module

Some basic/simple normalization classes

class benchmarkstt.normalization.core.Config(file, section=None, encoding=None)[source]

Bases: benchmarkstt.normalization.Normalizer

Use config file notation to define normalization rules. This notation is a list of normalizers, one per line.

Each normalizer that needs a file is followed by the file name of a CSV file, and can optionally be followed by the file encoding (if different from the default). All options are loaded from this CSV and applied to the normalizer.

The normalizers can be any of the core normalizers, or you can refer to your own normalizer class (like you would use in a python import, eg. my.own.package.MyNormalizerClass).

Additional rules:
  • Normalizer names are case-insensitive.

  • Arguments MAY be wrapped in double quotes.

  • If an argument contains a space, newline or double quote, it MUST be wrapped in double quotes.

  • A double quote itself is represented in this quoted argument as two double quotes: "".

The normalization rules are applied top-to-bottom and follow this format:

[normalization]
# This is a comment

# (Normalizer2 has no arguments)
lowercase

# loads regex expressions from regexrules.csv in "utf 8" encoding
regex regexrules.csv "utf 8"

# load another config file, [section1] and [section2]
config configfile.ini section1
config configfile.ini section2

# loads replace expressions from replaces.csv in default encoding
replace     replaces.csv
Parameters
  • file -- The config file

  • encoding -- The file encoding

  • section -- The subsection of the config file to use, defaults to 'normalization'

Example text

"He bravely turned his tail and fled"

Example file

"./resources/test/normalizers/configfile.conf"

Example encoding

"UTF-8"

Example return

"ha bravalY Turnad his tail and flad"

MAIN_SECTION = <object object>
_normalize(text: str) → str[source]
classmethod default_section(section)[source]
classmethod refresh_docstring()[source]
exception benchmarkstt.normalization.core.ConfigSectionNotFoundError[source]

Bases: ValueError

Raised when a requested config section was not found

class benchmarkstt.normalization.core.Lowercase[source]

Bases: benchmarkstt.normalization.Normalizer

Lowercase the text

Example text

"Easy, Mungo, easy... Mungo..."

Example return

"easy, mungo, easy... mungo..."

_normalize(text: str) → str[source]
class benchmarkstt.normalization.core.Regex(search: str, replace: str)[source]

Bases: benchmarkstt.normalization.NormalizerWithFileSupport

Simple regex replace. By default the pattern is interpreted case-sensitive.

Case-insensitivity is supported by adding inline modifiers.

You might want to use capturing groups to preserve the case. When replacing a character not captured, the information about its case is lost...

E.g. this would replace "HAHA! Hahaha!" with "HeHe! Hehehe!":

search      replace
(?i)(h)a    \1e

No regex flags are set by default; you can set them yourself in the regex and combine them at will, e.g. multiline, dotall and ignorecase.

E.g. this would replace "New<CRLF>line" with "newline":

search            replace
(?msi)new.line    newline

Example text

"HAHA! Hahaha!"

Example search

'(?i)(h)a'

Example replace

'\1e'

Example return

"HeHe! Hehehe!"

_normalize(text: str) → str[source]
class benchmarkstt.normalization.core.Replace(search: str, replace: str)[source]

Bases: benchmarkstt.normalization.NormalizerWithFileSupport

Simple search replace

Parameters
  • search -- Text to search for

  • replace -- Text to replace with

Example text

"Nudge nudge!"

Example search

"nudge"

Example replace

"wink"

Example return

"Nudge wink!"

_normalize(text: str) → str[source]
class benchmarkstt.normalization.core.ReplaceWords(search: str, replace: str)[source]

Bases: benchmarkstt.normalization.NormalizerWithFileSupport

Simple search and replace that only replaces whole "words"; the first letter is also matched case-insensitively, and the case of the original is preserved in the replacement.

Parameters
  • search -- Word to search for

  • replace -- Replace with

Example text

"She has a heart of formica"

Example search

"a"

Example replace

"the"

Example return

"She has the heart of formica"

_normalize(text: str) → str[source]
class benchmarkstt.normalization.core.Unidecode[source]

Bases: benchmarkstt.normalization.Normalizer

Unidecode characters to ASCII form, see Python's Unidecode package for more info.

Example text

"𝖂𝖊𝖓𝖓 𝖎𝖘𝖙 𝖉𝖆𝖘 𝕹𝖚𝖓𝖘𝖙ü𝖈𝖐 𝖌𝖎𝖙 𝖚𝖓𝖉 𝕾𝖑𝖔𝖙𝖊𝖗𝖒𝖊𝖞𝖊𝖗?"

Example return

"Wenn ist das Nunstuck git und Slotermeyer?"

_normalize(text: str) → str[source]
benchmarkstt.normalization.logger module
class benchmarkstt.normalization.logger.DiffLoggingDictFormatterDialect[source]

Bases: benchmarkstt.normalization.logger.DiffLoggingFormatterDialect

format(title, stack, diff)[source]
class benchmarkstt.normalization.logger.DiffLoggingFormatter(dialect=None, diff_formatter_dialect=None, title=None, *args, **kwargs)[source]

Bases: logging.Formatter

diff_logging_formatter_dialects = {'dict': <class 'benchmarkstt.normalization.logger.DiffLoggingDictFormatterDialect'>, 'text': <class 'benchmarkstt.normalization.logger.DiffLoggingTextFormatterDialect'>}
format(record)[source]

Format the specified record as text.

The record's attribute dictionary is used as the operand to a string formatting operation which yields the returned string. Before formatting the dictionary, a couple of preparatory steps are carried out. The message attribute of the record is computed using LogRecord.getMessage(). If the formatting string uses the time (as determined by a call to usesTime()), formatTime() is called to format the event time. If there is exception information, it is formatted using formatException() and appended to the message.

classmethod get_dialect(dialect, strict=None)[source]
classmethod has_dialect(dialect)[source]
class benchmarkstt.normalization.logger.DiffLoggingFormatterDialect[source]

Bases: object

format(title, stack, diff)[source]
class benchmarkstt.normalization.logger.DiffLoggingTextFormatterDialect[source]

Bases: benchmarkstt.normalization.logger.DiffLoggingFormatterDialect

format(title, stack, diff)[source]
class benchmarkstt.normalization.logger.ListHandler[source]

Bases: logging.StreamHandler

emit(record)[source]

Emit a record.

If a formatter is specified, it is used to format the record. The record is then written to the stream with a trailing newline. If exception information is present, it is formatted using traceback.print_exception and appended to the stream. If the stream has an 'encoding' attribute, it is used to determine how to do the output to the stream.

flush()[source]

Flushes the stream.

property logs
class benchmarkstt.normalization.logger.LogCapturer(*args, **kwargs)[source]

Bases: object

property logs
class benchmarkstt.normalization.logger.Logger[source]

Bases: object

class benchmarkstt.normalization.logger.NormalizedLogItem(stack, original, normalized)

Bases: tuple

property normalized

Alias for field number 2

property original

Alias for field number 1

property stack

Alias for field number 0

benchmarkstt.normalization.logger.log(func)[source]

Log decorator for normalization classes

benchmarkstt.output package

Responsible for dealing with output formats

class benchmarkstt.output.Output[source]

Bases: abc.ABC

abstract result(title, result)[source]
class benchmarkstt.output.SimpleTextBase[source]

Bases: benchmarkstt.output.Output

static print(result)[source]
abstract result(title, result)[source]
Submodules
benchmarkstt.output.core module
class benchmarkstt.output.core.Json[source]

Bases: benchmarkstt.output.Output

result(title, result)[source]
class benchmarkstt.output.core.MarkDown[source]

Bases: benchmarkstt.output.SimpleTextBase

result(title, result)[source]
class benchmarkstt.output.core.ReStructuredText[source]

Bases: benchmarkstt.output.SimpleTextBase

result(title, result)[source]
benchmarkstt.segmentation package

Responsible for segmenting text.

class benchmarkstt.segmentation.Segmenter[source]

Bases: abc.ABC

Submodules
benchmarkstt.segmentation.core module

Core segmenters, each segmenter must be Iterable returning a Item

class benchmarkstt.segmentation.core.Simple(text: str, pattern='[\\n\\t\\s]+', normalizer=None)[source]

Bases: benchmarkstt.segmentation.Segmenter

Simplest case, split into words by white space

Submodules

benchmarkstt.config module
class benchmarkstt.config.SectionConfigReader(config)[source]

Bases: object

benchmarkstt.config.reader(file)[source]
benchmarkstt.csv module

Module providing a custom CSV file parser with support for whitespace trimming, empty lines filtering and comment lines

exception benchmarkstt.csv.CSVParserError(message, line, char, index)[source]

Bases: ValueError

Some error occurred while attempting to parse the file

class benchmarkstt.csv.DefaultDialect[source]

Bases: benchmarkstt.csv.Dialect

commentchar = '#'
delimiter = ','
ignoreemptylines = True
quotechar = '"'
trimleft = ' \t\n\r'
trimright = ' \t\n\r'
class benchmarkstt.csv.Dialect[source]

Bases: object

commentchar = None
delimiter = None
quotechar = None
trimleft = None
trimright = None
exception benchmarkstt.csv.InvalidDialectError[source]

Bases: ValueError

An invalid dialect was supplied

class benchmarkstt.csv.Line(iterable=(), /)[source]

Bases: list

property lineno
class benchmarkstt.csv.Reader(file: TextIO, dialect: benchmarkstt.csv.Dialect, debug=None)[source]

Bases: object

CSV-like file reader with support for comment chars, ignoring empty lines and whitespace trimming on both sides of each field.

exception benchmarkstt.csv.UnallowedQuoteError(message, line, char, index)[source]

Bases: benchmarkstt.csv.CSVParserError

A quote is not allowed there

exception benchmarkstt.csv.UnclosedQuoteError(message, line, char, index)[source]

Bases: benchmarkstt.csv.CSVParserError

A quote wasn't properly closed

exception benchmarkstt.csv.UnknownDialectError[source]

Bases: ValueError

An unknown dialect was requested

class benchmarkstt.csv.WhitespaceDialect[source]

Bases: benchmarkstt.csv.DefaultDialect

delimiter = ' \t'
benchmarkstt.csv.reader(file: TextIO, dialect: Union[None, str, benchmarkstt.csv.Dialect] = None, **kwargs)benchmarkstt.csv.Reader[source]
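
A short hedged sketch of using this reader from Python (it assumes the default dialect behaves like DefaultDialect above: comma delimited, '#' comment lines, empty lines ignored and whitespace trimmed):

    import io
    from benchmarkstt import csv

    rules = io.StringIO(
        '# this line is a comment\n'
        '"Ex-Parakeet", Ex Parrot\n'
        '\n'
        'formica , stone\n'
    )
    for line in csv.reader(rules):
        print(list(line))   # each Line behaves like a list of trimmed fields
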
benchmarkstt.decorators module
benchmarkstt.decorators.log_call(logger: logging.Logger, log_level=None, result=None)[source]

Decorator to log all calls to decorated function to given logger

>>> import logging, sys, io
>>>
>>> logger = logging.getLogger('logger_name')
>>> logger.setLevel(logging.DEBUG)
>>> ch = logging.StreamHandler(sys.stdout)
>>> ch.setFormatter(logging.Formatter('%(levelname)s:%(name)s: %(message)s'))
>>> logger.addHandler(ch)
>>>
>>> @log_call(logger, logging.WARNING)
... def test(*args, **kwargs):
...     return 'result'
>>> test('arg1', arg2='someval', arg3='someotherval')
WARNING:logger_name: test('arg1', arg2='someval', arg3='someotherval')
'result'
>>> @log_call(logger, result=True)
... def test(*args, **kwargs):
...     return 'result'
>>> test(arg2='someval', arg3='someotherval')
DEBUG:logger_name: test(arg2='someval', arg3='someotherval')
DEBUG:logger_name: test returned: result
'result'
benchmarkstt.deferred module
[UML class diagram: benchmarkstt.deferred (DeferredCallback, DeferredList)]
class benchmarkstt.deferred.DeferredCallback(cb, *args, **kwargs)[source]

Bases: object

Simple helper class to defer the execution of formatting functions until their result is needed

class benchmarkstt.deferred.DeferredList(cb)[source]

Bases: object

property list
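
A minimal sketch of DeferredCallback, assuming the wrapped callback is only invoked once a string representation is requested (for example by a logging call that actually gets emitted).

from benchmarkstt.deferred import DeferredCallback

def expensive_format(value):
    print("formatting now")             # side effect shows when the callback runs
    return "value=%r" % (value,)

deferred = DeferredCallback(expensive_format, 42)
# Nothing has been formatted yet; converting to str is assumed to trigger the callback.
print(str(deferred))
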
benchmarkstt.docblock module
[UML class diagram: benchmarkstt.docblock (Docblock, Param, DocblockParam, HTML5Writer, TextWriter)]
class benchmarkstt.docblock.Docblock(docs, params, result, result_type)

Bases: tuple

property docs

Alias for field number 0

property params

Alias for field number 1

property result

Alias for field number 2

property result_type

Alias for field number 3

class benchmarkstt.docblock.DocblockParam(name, type, value)

Bases: tuple

property name

Alias for field number 0

property type

Alias for field number 1

property value

Alias for field number 2

class benchmarkstt.docblock.HTML5Writer[source]

Bases: docutils.writers.html5_polyglot.Writer

apply_template()[source]
class benchmarkstt.docblock.Param(name, type, type_doc, is_required, description, examples)

Bases: tuple

property description

Alias for field number 4

property examples

Alias for field number 5

property is_required

Alias for field number 3

property name

Alias for field number 0

property type

Alias for field number 1

property type_doc

Alias for field number 2

class benchmarkstt.docblock.TextWriter[source]

Bases: docutils.writers.Writer

class TextVisitor(document)[source]

Bases: docutils.nodes.SparseNodeVisitor

text()[source]
visit_Text(node)[source]
visit_paragraph(node)[source]
translate()[source]

Do final translation of self.document into self.output. Called from write. Override in subclasses.

Usually done with a docutils.nodes.NodeVisitor subclass, in combination with a call to docutils.nodes.Node.walk() or docutils.nodes.Node.walkabout(). The NodeVisitor subclass must support all standard elements (listed in docutils.nodes.node_class_names) and possibly non-standard elements used by the current Reader as well.

benchmarkstt.docblock.decode_literal(txt: str)[source]
benchmarkstt.docblock.doc_param_parser(docstring, key, no_name=None, allow_multiple=None, replace_strat=None)[source]
benchmarkstt.docblock.format_docs(docs)[source]
benchmarkstt.docblock.parse(func)[source]
benchmarkstt.docblock.process_rst(text, writer=None)[source]
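
As a hedged sketch of how parse might be used, the snippet below inspects a small function; the reST-style docstring and the fields printed are assumptions based on the Docblock and Param structures documented above.

from benchmarkstt import docblock

def lowercase(text: str) -> str:
    """
    Lowercase the text.

    :param str text: The text to lowercase
    :return: The lowercased text
    """
    return text.lower()

info = docblock.parse(lowercase)            # assumed to return a Docblock namedtuple
print(info.docs)
for param in info.params:                   # each entry is a Param namedtuple
    print(param.name, param.type_doc, param.is_required)
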
benchmarkstt.factory module
[UML class diagram: benchmarkstt.factory (ClassConfigTuple, ClassConfig, Factory, CoreFactory)]
class benchmarkstt.factory.ClassConfig(name, cls, docs, optional_args, required_args)[source]

Bases: benchmarkstt.factory.ClassConfigTuple

property docs

Alias for field number 2

class benchmarkstt.factory.CoreFactory(base_class, allow_duck=None)[source]

Bases: object

classmethod add_supported_namespace(namespace)[source]
create(*args, **kwargs)[source]
is_valid(*args, **kwargs)[source]
keys()[source]
register(*args, **kwargs)[source]
class benchmarkstt.factory.Factory(base_class, namespaces=None, methods=None)[source]

Bases: benchmarkstt.registry.Registry

Factory class with auto-loading of namespaces according to a base class.

create(alias, *args, **kwargs)[source]
is_valid(tocheck)[source]

Checks that tocheck is a valid class extending base_class

Parameters

tocheck -- The class to check

Return type

bool

static normalize_class_name(clsname)[source]

Normalizes the class name for automatic lookup of a class. By default this means lowercasing the class name, but this behaviour may be overridden by a child class.

Parameters

clsname -- The class name

Returns

The normalized class name

Return type

str

register(cls, alias=None)[source]

Register an alias for a class

Parameters
  • cls (self.base_class) --

  • alias (str|None) -- The alias to use when trying to get the class back; by default the normalized class name is used.

Returns

None

register_classname(name, alias=None)[source]
register_namespace(namespace)[source]

Registers all valid classes from a given namespace

Parameters

namespace (str|module) --
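
A minimal sketch of the register/create cycle documented above, using hypothetical classes; it assumes that passing namespaces=[] skips namespace auto-loading and that the default alias is the normalized (lowercased) class name.

from benchmarkstt.factory import Factory

class BaseGreeter:
    pass

class ShoutingGreeter(BaseGreeter):
    def __init__(self, name):
        self.name = name

    def greet(self):
        return "HELLO %s!" % self.name.upper()

factory = Factory(BaseGreeter, namespaces=[])   # assumption: empty list disables auto-loading
factory.register(ShoutingGreeter)               # alias defaults to 'shoutinggreeter'
greeter = factory.create('shoutinggreeter', 'world')
print(greeter.greet())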

benchmarkstt.helpers module

Some helper methods that can be re-used across submodules

benchmarkstt.helpers.make_printable(char)[source]

Return a printable representation of ASCII/UTF-8 control characters

Parameters

char --

Return type

str
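
A quick sketch of make_printable in use; the exact representation returned for each control character is not specified above, so the snippet simply prints whatever comes back.

from benchmarkstt.helpers import make_printable

for char in ('\t', '\n', ' '):
    print(repr(char), make_printable(char))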

benchmarkstt.modules module
[UML class diagram: benchmarkstt.modules (Modules, HiddenModuleError)]
exception benchmarkstt.modules.HiddenModuleError[source]

Bases: Exception

class benchmarkstt.modules.Modules(sub_module=None)[source]

Bases: object

keys()[source]
benchmarkstt.modules.load_object(name, transform=None)[source]

Load an object based on a string.

Parameters
  • name -- The string representation of an object

  • transform -- Transform (callable) applied to the object name for comparison; if None, names are compared lowercased. Pass False to disable the transform.

benchmarkstt.registry module
[UML class diagram: benchmarkstt.registry (Registry)]
class benchmarkstt.registry.Registry[source]

Bases: object

Simple registry class holding aliases and their corresponding values

keys()[source]
register(key, value)[source]
unregister(key)[source]
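
A minimal sketch of the Registry interface listed above; the alias and value are arbitrary.

from benchmarkstt.registry import Registry

registry = Registry()
registry.register('lowercase', str.lower)    # alias -> value
print(list(registry.keys()))
print('lowercase' in registry)               # Registry supports membership tests
registry.unregister('lowercase')
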
benchmarkstt.schema module
[UML class diagram: benchmarkstt.schema (Item, Meta, Schema, JSONEncoder, JSONDecoder and the SchemaError hierarchy)]

Defines the main schema for comparison and implements JSON serialization

class benchmarkstt.schema.Item(*args, **kwargs)[source]

Bases: collections.abc.Mapping

Basic structure of each field to compare

Raises

ValueError, SchemaInvalidItemError

json(**kwargs)[source]
class benchmarkstt.schema.JSONDecoder(*args, **kwargs)[source]

Bases: json.decoder.JSONDecoder

Custom JSON decoding for schema

decode(*args, **kwargs)[source]

Return the Python representation of s (a str instance containing a JSON document).

static object_hook(obj)[source]
class benchmarkstt.schema.JSONEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]

Bases: json.encoder.JSONEncoder

Custom JSON encoding for schema

default(o)[source]

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
encode(obj)[source]

Return a JSON string representation of a Python data structure.

>>> from json.encoder import JSONEncoder
>>> JSONEncoder().encode({"foo": ["bar", "baz"]})
'{"foo": ["bar", "baz"]}'
class benchmarkstt.schema.Meta[source]

Bases: collections.defaultdict

Contains metadata for an item, such as whether it was skipped

class benchmarkstt.schema.Schema(data=None)[source]

Bases: object

Essentially a list of Item objects

append(obj: benchmarkstt.schema.Item)[source]
static dump(cls, *args, **kwargs)[source]
static dumps(cls, *args, **kwargs)[source]
extend(iterable)[source]
json(**kwargs)[source]
static load(*args, **kwargs)[source]
static loads(*args, **kwargs)[source]
exception benchmarkstt.schema.SchemaError[source]

Bases: ValueError

Top-level error class for all schema-related exceptions

exception benchmarkstt.schema.SchemaInvalidItemError[source]

Bases: benchmarkstt.schema.SchemaError

Raised when attempting to add an invalid item

exception benchmarkstt.schema.SchemaJSONError[source]

Bases: benchmarkstt.schema.SchemaError

Raised when loading incompatible JSON
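
To tie the schema classes together, here is a hedged sketch that builds a small Schema and serializes it with the custom JSON encoder; the item keys used ('item', 'type') are illustrative only.

from benchmarkstt.schema import Item, Schema

schema = Schema()
schema.append(Item(item='hello', type='word'))
schema.append(Item(item='world', type='word'))

print(len(schema))       # Schema supports len()
print(schema.json())     # serialized via the custom JSONEncoder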

Changelog

[Unreleased]

Added
  • Documentation:

    • add auto-generated UML diagrams

    • add tutorial Jupyter Notebooks

    • add support for loading external/local code (--load) #142

  • Tests:

    • add Python 3.8 to GitHub workflow, re-enable excluded Python versions

Changed
  • Cleanup/refactors:

    • group cli and api entrypoints in their respective packages

    • moved all documentation-specific code outside the main package

    • update sphinx to latest

    • use more descriptive names for Base classes (Normalizer, Differ, etc.)

    • rename CLIDiffDialect to ANSIDiffDialect, "cli" -> "ansi"

    • rename NormalizationComposite -> NormalizationAggregate

    • allow ducktyped custom classes to be recognized as valid

    • proper abstract base classes

  • Documentation:

    • custom autodoc templates

  • Unidecode normalizer and its dependency 'Unidecode>=1.1.0' replaced by a version that works with Python 3.9

Fixed
  • Makefile:

    • ensure pip is installed (needed for development in some cases; avoids user confusion)

    • use environment python if available, otherwise use python3

  • Dockerfile:

    • fixed missing Python package by specifying its version (#138)

1.0.0 - 2020-04-23

Initial version
