Command line tool

usage: benchmarkstt -r REFERENCE -h HYPOTHESIS
                    [-rt {infer,argument,plaintext}]
                    [-ht {infer,argument,plaintext}]
                    [-o {json,markdown,restructuredtext}]
                    [--beer [entities_file]] [--cer [mode] [differ_class]]
                    [--diffcounts [mode] [differ_class]] [--wer [mode]
                    [differ_class]] [--worddiffs [dialect] [differ_class]]
                    [--config file [section] [encoding]]
                    [--file normalizer file [encoding] [path]] [--lowercase]
                    [--regex search replace] [--replace search replace]
                    [--replacewords search replace] [--unidecode] [--log]
                    [--log-level {critical,fatal,error,warn,warning,info,debug,notset}]
                    [--load MODULE_NAME [MODULE_NAME ...]] [--help]

named arguments

-r, --reference

File to use as reference

-h, --hypothesis

File to use as hypothesis

-o, --output-format

Possible choices: json, markdown, restructuredtext

Format of the outputted results

Default: "restructuredtext"


--log

Show normalization logs (warning: for large files with many normalization rules this will cause a significant performance penalty and a lot of output data)

Default: False


Output benchmarkstt version number

Default: False


--log-level

Possible choices: critical, fatal, error, warn, warning, info, debug, notset

Set the logging output level

Default: warning


--load

Load external code that may contain additional classes for normalization, etc. E.g. if the classes are contained in a Python file in the directory you are calling benchmarkstt from, you would pass the module name, as in --load myclasses. All recognized classes will be automatically documented in the --help command and available for use.

reference and hypothesis types

You can specify which file type the --reference/-r and --hypothesis/-h arguments should be treated as.

Available types:

  • 'infer': Load from a given filename. Automatically infer the file type from the filename extension.

  • 'argument': Read the argument and treat it as plain text (without reading from a file).

  • 'plaintext': Load from a given filename. Treat the file as plain text.

-rt, --reference-type

Possible choices: infer, argument, plaintext

Type of reference file

Default: "infer"

-ht, --hypothesis-type

Possible choices: infer, argument, plaintext

Type of hypothesis file

Default: "infer"

available metrics

A list of metrics to calculate. At least one metric needs to be provided.


--beer

Bag of Entities Error Rate, BEER, is defined as the error rate per entity with a bag of words approach:

                    abs(ne_hyp - ne_ref)
BEER (entity)   =   ----------------------
                            ne_ref

  • ne_hyp = number of detections of the entity in the hypothesis file

  • ne_ref = number of detections of the entity in the reference file

The WA_BEER for a set of N entities is defined as the weighted average of the BEER for the set of entities:

WA_BEER ([entity_1, ..., entity_N]) = w_1*BEER (entity_1)*L_1/L + ... + w_N*BEER (entity_N)*L_N/L

which is equivalent to:

                                      w_1*abs(ne_hyp_1 - ne_ref_1) + ... + w_N*abs(ne_hyp_N - ne_ref_N)
WA_BEER ([entity_1, ..., entity_N]) = ------------------------------------------------------------------
                                                                      L

  • L_1 = number of occurrences of entity 1 in the reference document

  • L = L_1 + ... + L_N

the weights being normalised by the tool:

  • w_1 + ... + w_N = 1

The input file defines the list of entities and the weight per entity. It is processed as a JSON file with the following structure:

{ "entity_1": W_1, "entity_2": W_2, "entity_3": W_3, ... }

W_n is the non-normalized weight. The tool normalizes the weights as:

              W_n
w_n =   ---------------
        W_1 + ... + W_N

The minimum value for a weight is 0.
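
As a minimal sketch of the formulas above (not the tool's own implementation), the following Python computes BEER and WA_BEER from per-entity detection counts. The function names and the example counts are hypothetical. L_n is taken to equal ne_ref_n, which is what makes the two WA_BEER forms equivalent. The case ne_ref = 0 is not covered by the definitions above and is left unhandled here.

```python
def beer(ne_hyp, ne_ref):
    """BEER for one entity: abs(ne_hyp - ne_ref) / ne_ref."""
    return abs(ne_hyp - ne_ref) / ne_ref


def normalize_weights(raw_weights):
    """Scale the non-normalized weights W_n so that they sum to 1."""
    total = sum(raw_weights.values())
    return {entity: w / total for entity, w in raw_weights.items()}


def wa_beer(raw_weights, hyp_counts, ref_counts):
    """Weighted average BEER over the entities listed in raw_weights."""
    weights = normalize_weights(raw_weights)
    total_ref = sum(ref_counts[e] for e in weights)  # L = L_1 + ... + L_N
    return sum(
        weights[e] * beer(hyp_counts[e], ref_counts[e]) * ref_counts[e] / total_ref
        for e in weights
    )


# Hypothetical entity weights (as in the JSON input) and detection counts.
raw_weights = {"entity_1": 1, "entity_2": 3}
hyp_counts = {"entity_1": 2, "entity_2": 4}
ref_counts = {"entity_1": 4, "entity_2": 4}

print(beer(2, 4))                                    # 0.5
print(wa_beer(raw_weights, hyp_counts, ref_counts))  # 0.0625
```

Only entity_1 contributes an error in this example: its normalized weight is 0.25, its absolute count difference is 2, and L is 8, giving 0.25 * 2 / 8 = 0.0625.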


--cer

Character Error Rate, basically defined as:

insertions + deletions + substitutions
--------------------------------------
    number of reference characters

Character error rate, CER, compares the differences between reference and hypothesis on a character level. A CER measure is usually lower than a WER measure, since words that differ in only one or a few characters are still counted as fully different by WER.

The CER metric might be useful as a perspective on the WER metric. Word endings might be less relevant if the text will be preprocessed with stemming, or minor spelling mistakes might be acceptable in certain situations. A CER metric might also be used to evaluate a source (an ASR) that outputs a stream of characters rather than words.

Important: The current implementation of the CER metric ignores whitespace characters. A string like 'aa bb cc' will first be split into words, ['aa','bb','cc'], and then merged into a final string for evaluation: 'aabbcc'.

param mode

'levenshtein' (default).

param differ_class

For future use.
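
The following is a minimal sketch of a Levenshtein-based CER that mirrors the whitespace handling described above. It is not the tool's implementation, and the function names `levenshtein` and `cer` are hypothetical.

```python
def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions and substitutions turning ref into hyp."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,                           # deletion
                current[j - 1] + 1,                        # insertion
                previous[j - 1] + (ref_item != hyp_item),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate with whitespace removed before comparison."""
    ref_chars = "".join(reference.split())   # 'aa bb cc' -> 'aabbcc'
    hyp_chars = "".join(hypothesis.split())
    return levenshtein(ref_chars, hyp_chars) / len(ref_chars)


print(cer("aa bb cc", "aa bd cc"))  # 1 substitution over 6 reference characters
```

Because whitespace is dropped first, `cer("aa bb cc", "aabb cc")` is 0 in this sketch.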


--diffcounts

Get the number of differences between reference and hypothesis


--wer

Word Error Rate, basically defined as:

insertions + deletions + substitutions
--------------------------------------
     number of reference words


Calculates the WER using one of two algorithms:

[Mode: 'strict' or 'hunt'] Insertions, deletions and substitutions are identified using the Hunt–McIlroy diff algorithm. The 'hunt' mode applies 0.5 weight to insertions and deletions. This algorithm is the one used internally by Python.


[Mode: 'levenshtein'] In the context of WER, Levenshtein distance is the minimum edit distance computed at the word level. This implementation uses the Editdistance C++ implementation by Hiroyuki Tanaka. See:

param mode

'strict' (default), 'hunt' or 'levenshtein'.

param differ_class

For future use.
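
To illustrate how the 'strict' and 'hunt' weightings differ, here is a rough sketch built on Python's difflib. It is an illustration under stated assumptions rather than the tool's code: the function name `wer_sketch` is hypothetical, and splitting a 'replace' block of unequal length into substitutions plus leftover insertions or deletions is a choice made only for this example.

```python
import difflib


def wer_sketch(reference, hypothesis, mode="strict"):
    """Word error rate from difflib opcodes; 'hunt' halves insertions and deletions."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    matcher = difflib.SequenceMatcher(a=ref_words, b=hyp_words, autojunk=False)

    insertions = deletions = substitutions = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        ref_len, hyp_len = i2 - i1, j2 - j1
        if tag == "insert":
            insertions += hyp_len
        elif tag == "delete":
            deletions += ref_len
        elif tag == "replace":
            # Assumption: pair words up as substitutions, count the rest separately.
            substitutions += min(ref_len, hyp_len)
            insertions += max(hyp_len - ref_len, 0)
            deletions += max(ref_len - hyp_len, 0)

    weight = 0.5 if mode == "hunt" else 1.0
    return (weight * (insertions + deletions) + substitutions) / len(ref_words)


reference = "the quick brown fox"
hypothesis = "the quick brown dog jumps"
print(wer_sketch(reference, hypothesis, mode="strict"))  # 0.5
print(wer_sketch(reference, hypothesis, mode="hunt"))    # 0.375
```

Here "fox" against "dog jumps" is counted as one substitution and one insertion, so the unweighted rate is 2/4 and the 'hunt' weighting gives 1.5/4.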


--worddiffs

Present differences on a per-word basis

param dialect

Presentation format. Default is 'ansi'.

example dialect


param differ_class

For future use.

available normalizers

A list of normalizers to execute on the input. One or more normalizers can be given; they are applied sequentially. The program will automatically look for the normalizer in benchmarkstt.normalization.core, then benchmarkstt.normalization and finally in the global namespace.


--config

Use config file notation to define normalization rules. This notation is a list of normalizers, one per line.

Each normalizer that needs a file is followed by the file name of a CSV file, optionally followed by the file encoding (if different from the default). All options are loaded from this CSV file and applied to the normalizer.

The normalizers can be any of the core normalizers, or you can refer to your own normalizer class (as you would in a Python import, e.g. my.own.package.MyNormalizerClass).

Additional rules:
  • Normalizer names are case-insensitive.

  • Arguments MAY be wrapped in double quotes.

  • If an argument contains a space, newline or double quote, it MUST be wrapped in double quotes.

  • A double quote itself is represented in this quoted argument as two double quotes: "".
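
The quoting rules above can be illustrated with a small tokenizer. This is only a sketch of the notation, not the tool's parser; the helper name `parse_rule` and the quoted example argument are hypothetical.

```python
import re

# Either a quoted argument (where "" stands for one double quote)
# or a bare run of non-whitespace characters.
_ARGUMENT = re.compile(r'"((?:[^"]|"")*)"|(\S+)')


def parse_rule(line):
    """Split one rule into (normalizer_name, arguments); None for blanks and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = []
    for match in _ARGUMENT.finditer(line):
        quoted, bare = match.groups()
        fields.append(quoted.replace('""', '"') if quoted is not None else bare)
    # Normalizer names are case-insensitive, so compare them in lowercase.
    return fields[0].lower(), fields[1:]


print(parse_rule('regex regexrules.csv "utf 8"'))
# ('regex', ['regexrules.csv', 'utf 8'])
print(parse_rule('Replace "say ""hi""" hello'))
# ('replace', ['say "hi"', 'hello'])
```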

The normalization rules are applied top-to-bottom and follow this format:

# This is a comment

# (Normalizer2 has no arguments)

# loads regex expressions from regexrules.csv in "utf 8" encoding
regex regexrules.csv "utf 8"

# load another config file, [section1] and [section2]
config configfile.ini section1
config configfile.ini section2

# loads replace expressions from replaces.csv in default encoding
replace     replaces.csv

param file

The config file

param encoding

The file encoding

param section

The subsection of the config file to use, defaults to 'normalization'

example text

"He bravely turned his tail and fled"

example file


example encoding


example return

"ha bravalY Turnad his tail and flad"


--file

Read rules from a file, one per line, and pass each to the given normalizer

param str|class normalizer

Normalizer name (or class)

param file

The file to read rules from

param encoding

The file encoding

example text

"This is an Ex-Parakeet"

example normalizer


example file


example encoding


example return

"This is an Ex Parrot"


--lowercase

Lowercase the text

example text

"Easy, Mungo, easy... Mungo..."

example return

"easy, mungo, easy... mungo..."


--regex

Simple regex replace. By default the pattern is interpreted as case-sensitive.

Case-insensitivity is supported by adding inline modifiers.

You might want to use capturing groups to preserve the case. When replacing a character not captured, the information about its case is lost...

Eg. would replace "HAHA! Hahaha!" to "HeHe! Hehehe!":





No regex flags are set by default. You can set them yourself in the regex and combine them at will, e.g. multiline, dotall and ignorecase.

Eg. would replace "New<CRLF>line" to "newline":





example text

"HAHA! Hahaha!"

example search


example replace


example return

"HeHe! Hehehe!"

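Both replacements described in this section can be reproduced with Python's re module. The patterns below are assumptions chosen to give the stated results; they are not necessarily the ones used in the tool's own examples.

```python
import re

# Case-insensitive match via the inline (?i) modifier. The captured "h" or "H"
# is reused in the replacement, so its case is preserved.
laugh = re.sub(r"(?i)(h)a", r"\1e", "HAHA! Hahaha!")
print(laugh)  # HeHe! Hehehe!

# Inline flags can be combined: (?is) sets ignorecase and dotall, so each "."
# also matches the carriage return and the newline.
joined = re.sub(r"(?is)new..line", "newline", "New\r\nline")
print(repr(joined))  # 'newline'
```

Without the capturing group in the first pattern, a fixed replacement such as "he" would lowercase every match.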

--replace

Simple search and replace

param search

Text to search for

param replace

Text to replace with

example text

"Nudge nudge!"

example search


example replace


example return

"Nudge wink!"


--replacewords

Simple search and replace that only replaces whole "words". The first letter is also checked case-insensitively, with its case preserved.

param search

Word to search for

param replace

Replace with

example text

"She has a heart of formica"

example search


example replace


example return

"She has the heart of formica"

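One way to picture this behaviour is a word-boundary regex whose first letter is matched case-insensitively. The sketch below is an approximation, not the tool's implementation. The function name `replace_words` and the search and replace values ("a" and "the") are assumptions chosen to reproduce the example above.

```python
import re


def replace_words(text, search, replace):
    """Replace whole-word occurrences of search, keeping the case of the first letter."""
    pattern = re.compile(
        r"\b(?i:%s)%s\b" % (re.escape(search[0]), re.escape(search[1:]))
    )

    def keep_case(match):
        # Mirror the case of the matched first letter onto the replacement.
        if match.group(0)[0].isupper():
            return replace[0].upper() + replace[1:]
        return replace[0].lower() + replace[1:]

    return pattern.sub(keep_case, text)


print(replace_words("She has a heart of formica", "a", "the"))
# She has the heart of formica
print(replace_words("A heart of formica", "a", "the"))
# The heart of formica
```

The letters "a" inside "has", "heart" and "formica" are left alone because they are not standalone words.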

--unidecode

Unidecode characters to ASCII form. See Python's Unidecode package for more info.

example text

"𝖂𝖊𝖓𝖓 𝖎𝖘𝖙 𝖉𝖆𝖘 𝕹𝖚𝖓𝖘𝖙ü𝖈𝖐 𝖌𝖎𝖙 𝖚𝖓𝖉 𝕾𝖑𝖔𝖙𝖊𝖗𝖒𝖊𝖞𝖊𝖗?"

example return

"Wenn ist das Nunstuck git und Slotermeyer?"


The benchmarkstt command line tool links the different modules (input, normalization, metrics, etc.) in the following way:

CLI flow

Additional tools

Some additional helpful tools are available through benchmarkstt-tools, which provides these subcommands:

Bash completion

Bash completion is supported through argcomplete.