Subcommand metrics

Calculate metrics based on the comparison of a hypothesis with a reference.

usage: benchmarkstt-tools metrics -r REFERENCE -h HYPOTHESIS
                                  [-rt {infer,argument,plaintext}]
                                  [-ht {infer,argument,plaintext}]
                                  [-o {json,markdown,restructuredtext}]
                                  [--beer [entities_file]] [--cer [mode]
                                  [differ_class]] [--diffcounts [mode]
                                  [differ_class]] [--wer [mode]
                                  [differ_class]] [--worddiffs [dialect]
                                  [differ_class]]
                                  [--log-level {critical,fatal,error,warn,warning,info,debug,notset}]
                                  [--load MODULE_NAME [MODULE_NAME ...]]

Named Arguments

-r, --reference

File to use as reference

-h, --hypothesis

File to use as hypothesis

-o, --output-format

Possible choices: json, markdown, restructuredtext

Format of the outputted results

Default: "restructuredtext"


--log-level

Possible choices: critical, fatal, error, warn, warning, info, debug, notset

Set the logging output level

Default: "warning"


--load

Load external code that may contain additional classes for normalization, etc. For example, if the classes are contained in a Python file named myclasses.py in the directory from which you are calling benchmarkstt, you would pass --load myclasses. All recognized classes are automatically documented in the --help output and made available for use.

reference and hypothesis types

You can specify which file type the --reference/-r and --hypothesis/-h arguments should be treated as.

Available types:

  • 'infer': Load from a given filename. Automatically infer the file type from the filename extension.

  • 'argument': Read the argument and treat it as plain text (without reading from a file).

  • 'plaintext': Load from a given filename. Treat the file as plain text.

-rt, --reference-type

Possible choices: infer, argument, plaintext

Type of reference file

Default: "infer"

-ht, --hypothesis-type

Possible choices: infer, argument, plaintext

Type of hypothesis file

Default: "infer"

available metrics

A list of metrics to calculate. At least one metric needs to be provided.


--beer

Bag of Entities Error Rate, BEER, is defined as the error rate per entity with a bag of words approach:

                    abs(ne_hyp - ne_ref)
BEER (entity)   =   ----------------------
                            ne_ref

where:

  • ne_hyp = number of detections of the entity in the hypothesis file

  • ne_ref = number of detections of the entity in the reference file
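The per-entity definition can be illustrated with a minimal sketch. It is not the tool's implementation: the function name beer is hypothetical, and the sketch assumes that an entity is a single whitespace-separated token and that matching is case-sensitive.

```python
from collections import Counter


def beer(entity, reference_text, hypothesis_text):
    """Illustrative BEER for one entity (assumes single-token entities)."""
    # Bag-of-words view: word order is ignored, only counts matter.
    ne_ref = Counter(reference_text.split())[entity]
    ne_hyp = Counter(hypothesis_text.split())[entity]
    if ne_ref == 0:
        # The ratio is undefined when the entity never occurs in the reference.
        raise ValueError("entity does not occur in the reference")
    return abs(ne_hyp - ne_ref) / ne_ref


# "paris" occurs twice in the reference and once in the hypothesis.
print(beer("paris", "paris is paris", "paris is nice"))
```

With two detections in the reference and one in the hypothesis, the sketch yields abs(1 - 2) / 2.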

The WA_BEER for a set of N entities is defined as the weighted average of the BEER for the set of entities:

WA_BEER ([entity_1, ..., entity_N]) = w_1*BEER (entity_1)*L_1/L + ... + w_N*BEER (entity_N)*L_N/L

which is equivalent to:

                                      w_1*abs(ne_hyp_1 - ne_ref_1) + ... + w_N*abs(ne_hyp_N - ne_ref_N)
WA_BEER ([entity_1, ..., entity_N]) = ------------------------------------------------------------------
                                                                      L

where:

  • L_1 = number of occurrences of entity 1 in the reference document

  • L = L_1 + ... + L_N

the weights being normalised by the tool:

  • w_1 + ... + w_N = 1

The input file defines the list of entities and the weight per entity, w_n. It is processed as a JSON file with the following structure:

{ "entity_1": W_1, "entity_2": W_2, "entity_3": W_3, ... }

W_n is the non-normalized weight. The tool normalizes the weights as:

              W_n
w_n =   ---------------
        W_1 + ... + W_N

The minimum value for a weight is 0.
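As a worked illustration of the weight normalization and the weighted average, here is a minimal sketch. The function names normalize_weights and wa_beer are hypothetical and are not part of benchmarkstt. The entity counts are passed in directly as dictionaries, which is an assumption made to keep the example short.

```python
import json


def normalize_weights(raw_weights):
    """Scale non-normalized weights W_n so that the w_n sum to 1."""
    total = sum(raw_weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be greater than 0")
    return {entity: w / total for entity, w in raw_weights.items()}


def wa_beer(raw_weights, ref_counts, hyp_counts):
    """Weighted-average BEER, using the single-fraction form of the formula."""
    weights = normalize_weights(raw_weights)
    # L is the total number of reference occurrences over all listed entities.
    total_ref = sum(ref_counts.get(entity, 0) for entity in weights)
    if total_ref == 0:
        raise ValueError("no listed entity occurs in the reference")
    numerator = sum(
        w * abs(hyp_counts.get(entity, 0) - ref_counts.get(entity, 0))
        for entity, w in weights.items()
    )
    return numerator / total_ref


# Entity file content in the JSON structure described above (made-up values).
raw = json.loads('{"paris": 3, "london": 1}')
print(normalize_weights(raw))
print(wa_beer(raw, {"paris": 2, "london": 2}, {"paris": 1, "london": 2}))
```

In this made-up example the normalized weights are 0.75 and 0.25, L is 4, and only "paris" contributes to the numerator.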


--cer

Character Error Rate, basically defined as:

insertions + deletions + substitutions
--------------------------------------
    number of reference characters

Character error rate, CER, compares the differences between reference and hypothesis at the character level. A CER value is usually lower than the corresponding WER value, since WER counts words that differ in only one or a few characters as fully different.

The CER metric might be useful as a perspective on the WER metric. Word endings might be less relevant if the text will be preprocessed with stemming, or minor spelling mistakes might be acceptable in certain situations. A CER metric might also be used to evaluate a source (an ASR) which outputs a stream of characters rather than words.

Important: The current implementation of the CER metric ignores whitespace characters. A string like 'aa bb cc' will first be split into words, ['aa','bb','cc'], and then merged into a final string for evaluation: 'aabbcc'.

param mode

'levenshtein' (default).

param differ_class

For future use.
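The CER definition and the whitespace handling described above can be sketched as follows. This is an illustrative pure-Python version, not the code used by the tool. The function names levenshtein and cer are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        current = [i]
        for j, item_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                        # deletion
                current[j - 1] + 1,                     # insertion
                previous[j - 1] + (item_a != item_b),   # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate with whitespace removed before comparison."""
    # 'aa bb cc' -> ['aa', 'bb', 'cc'] -> 'aabbcc'
    ref_chars = "".join(reference.split())
    hyp_chars = "".join(hypothesis.split())
    if not ref_chars:
        raise ValueError("reference contains no characters")
    return levenshtein(ref_chars, hyp_chars) / len(ref_chars)


print(cer("aa bb cc", "aa bd cc"))   # one substitution over six characters
print(cer("aa bb cc", "aabb cc"))    # differs only in whitespace
```

Because whitespace is dropped first, the second call compares 'aabbcc' with 'aabbcc' and reports no errors.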


--diffcounts

Get the number of differences between reference and hypothesis


--wer

Word Error Rate, basically defined as:

insertions + deletions + substitutions
--------------------------------------
      number of reference words


Calculates the WER using one of two algorithms:

[Mode: 'strict' or 'hunt'] Insertions, deletions and substitutions are identified using the Hunt–McIlroy diff algorithm. The 'hunt' mode applies 0.5 weight to insertions and deletions. This algorithm is the one used internally by Python.


[Mode: 'levenshtein'] In the context of WER, Levenshtein distance is the minimum edit distance computed at the word level. This implementation uses the Editdistance C++ implementation by Hiroyuki Tanaka. See:

param mode

'strict' (default), 'hunt' or 'levenshtein'.

param differ_class

For future use.
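To make the 'levenshtein' mode concrete, here is a minimal pure-Python sketch of a word-level edit distance divided by the number of reference words. It stands in for the C++ library mentioned above and does not reproduce the 'strict' or 'hunt' modes. The function names word_edit_distance and wer are hypothetical.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over lists of words instead of characters."""
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, 1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, 1):
            current.append(min(
                previous[j] + 1,                            # deletion
                current[j - 1] + 1,                         # insertion
                previous[j - 1] + (ref_word != hyp_word),   # substitution or match
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference contains no words")
    return word_edit_distance(ref_words, hyp_words) / len(ref_words)


print(wer("the quick brown fox", "the quick brown dog"))  # one substitution
print(wer("a b c d", "a b d"))                            # one deletion
```

Both calls involve a single edit against four reference words. Note that a hypothesis with many insertions can push the ratio above 1.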


--worddiffs

Present differences on a per-word basis

param dialect

Presentation format. Default is 'ansi'.

example dialect


param differ_class

For future use.