Subcommand metrics

Calculate metrics based on the comparison of a hypothesis with a reference.

usage: benchmarkstt-tools metrics -r REFERENCE -h HYPOTHESIS
                                  [-rt {infer,argument,plaintext}]
                                  [-ht {infer,argument,plaintext}]
                                  [-o {json,markdown,restructuredtext}]
                                  [--beer [entities_file]] [--cer [mode]
                                  [differ_class]] [--diffcounts [mode]
                                  [differ_class]] [--wer [mode]
                                  [differ_class]] [--worddiffs [dialect]
                                  [differ_class]]
                                  [--log-level {critical,fatal,error,warn,warning,info,debug,notset}]
                                  [--load MODULE_NAME [MODULE_NAME ...]]
                                  [--help]

Named Arguments

-r, --reference

File to use as reference

-h, --hypothesis

File to use as hypothesis

-o, --output-format

Possible choices: json, markdown, restructuredtext

Format of the outputted results

Default: "restructuredtext"

--log-level

Possible choices: critical, fatal, error, warn, warning, info, debug, notset

Set the logging output level

Default: warning

--load

Load external code that may contain additional classes for normalization, etc. E.g. if the classes are contained in a python file named myclasses.py in the directory you are calling benchmarkstt from, you would pass --load myclasses. All classes that are recognized will automatically be documented in the --help output and be available for use.
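
A minimal sketch of what such a file might contain is shown below. The class name, the filler-word logic and the normalize method used here are purely illustrative assumptions; the exact base class or interface that benchmarkstt expects for custom normalizers is defined by its API documentation.

# myclasses.py -- illustrative only; check the benchmarkstt API documentation
# for the exact interface a custom normalizer must implement.
class RemoveFillers:
    """Hypothetical normalizer that strips common filler words."""

    FILLERS = {"uh", "um", "erm"}

    def normalize(self, text: str) -> str:
        # Keep only the words that are not filler words (method name assumed).
        return " ".join(w for w in text.split() if w.lower() not in self.FILLERS)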

reference and hypothesis types

You can specify which file type the --reference/-r and --hypothesis/-h arguments should be treated as.

Available types:

  • 'infer': Load from a given filename. Automatically infer the file type from the filename extension.

  • 'argument': Read the argument itself and treat it as plain text (no file is read).

  • 'plaintext': Load from a given filename. Treat the file as plain text.

-rt, --reference-type

Possible choices: infer, argument, plaintext

Type of reference file

Default: "infer"

-ht, --hypothesis-type

Possible choices: infer, argument, plaintext

Type of hypothesis file

Default: "infer"

available metrics

A list of metrics to calculate. At least one metric needs to be provided.

--beer

Bag of Entities Error Rate, BEER, is defined as the error rate per entity with a bag of words approach:

                    abs(ne_hyp - ne_ref)
BEER (entity)   =   ----------------------
                        ne_ref
  • ne_hyp = number of detections of the entity in the hypothesis file

  • ne_ref = number of detections of the entity in the reference file

The WA_BEER for a set of N entities is defined as the weighted average of the BEER for the set of entities:

WA_BEER ([entity_1, ..., entity_N]) = w_1*BEER (entity_1)*L_1/L + ... + w_N*BEER (entity_N)*L_N/L

which is equivalent to:

                                      w_1*abs(ne_hyp_1 - ne_ref_1) + ... + w_N*abs(ne_hyp_N - ne_ref_N)
WA_BEER ([entity_1, ..., entity_N]) = ------------------------------------------------------------------
                                                                      L
  • L_1 = number of occurrences of entity 1 in the reference document

  • L = L_1 + ... + L_N

The weights are normalised by the tool so that:

  • w_1 + ... + w_N = 1

The input file defines the list of entities and the non-normalized weight per entity, W_n. It is read as a JSON file with the following structure:

{ "entity_1":W_1, "entity_2" : W_2, "entity_3" :W_3 .. }

W_n is the non-normalized weight; the tool normalizes the weights as:

            W_n
w_n =   ----------------
        W_1 + ... + W_N

The minimum value for a weight is 0.
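
To make the arithmetic above concrete, the following is a minimal sketch of the BEER / WA_BEER computation. The entity names, counts and weights are made-up illustrations, and entities are treated as single exact tokens here; the tool's own entity matching may be more elaborate.

from collections import Counter

# Hypothetical entities file: {"acme": 2, "berlin": 1}  (non-normalized weights W_n)
weights = {"acme": 2.0, "berlin": 1.0}

reference  = "acme opened an office in berlin and acme hired locally".split()
hypothesis = "acme opened an office in dublin and acme hired locally".split()

ref_counts, hyp_counts = Counter(reference), Counter(hypothesis)

# Normalize the weights: w_n = W_n / (W_1 + ... + W_N)
total_w = sum(weights.values())
w = {e: W / total_w for e, W in weights.items()}

beer, L = {}, {}
for entity in weights:
    ne_ref = ref_counts[entity]          # occurrences in the reference
    ne_hyp = hyp_counts[entity]          # occurrences in the hypothesis
    beer[entity] = abs(ne_hyp - ne_ref) / ne_ref
    L[entity] = ne_ref                   # L_n

L_total = sum(L.values())                # L = L_1 + ... + L_N

wa_beer = sum(w[e] * beer[e] * L[e] / L_total for e in weights)

print(beer)     # {'acme': 0.0, 'berlin': 1.0}
print(wa_beer)  # 1/3 * 1.0 * 1/3 = 0.111...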

--cer

Character Error Rate, basically defined as:

insertions + deletions + substitutions
--------------------------------------
    number of reference characters

Character error rate, CER, compares the differences between reference and hypothesis on a character level. A CER measure is usually lower than a WER measure, since words may differ in only one or a few characters and yet be classified as fully different.

The CER metric might be useful as a perspective on the WER metric. Word endings might be less relevant if the text will be preprocessed with stemming, or minor spelling mistakes might be acceptable in certain situations. A CER metric might also be used to evaluate a source (an ASR) that outputs a stream of characters rather than words.

Important: The current implementation of the CER metric ignores whitespace characters. A string like 'aa bb cc' will first be split into words, ['aa','bb','cc'], and then merged into a final string for evaluation: 'aabbcc'.

param mode

'levenshtein' (default).

param differ_class

For future use.
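
As a rough illustration of the definition above (not the tool's internal code), the following sketch computes a character error rate in the way described: whitespace is removed first, then a plain Levenshtein edit distance is taken over the remaining characters.

def levenshtein(a, b):
    """Plain dynamic-programming edit distance (insert/delete/substitute, cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    # As noted above: split into words first, then merge without whitespace.
    ref = "".join(reference.split())
    hyp = "".join(hypothesis.split())
    return levenshtein(ref, hyp) / len(ref)

print(cer("aa bb cc", "aa bb cx"))   # 1 edit / 6 reference characters = 0.1666...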

--diffcounts

Get the number of differences between reference and hypothesis
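
As an illustration only, a word-level diff such as the one provided by Python's difflib can be reduced to such counts; the categories and exact counting conventions of the tool's own output may differ (counting a 'replace' span as the longer of its two sides is an assumption here).

from difflib import SequenceMatcher

def diff_counts(reference, hypothesis):
    """Count word-level diff operations; a sketch, not the tool's exact output."""
    ref, hyp = reference.split(), hypothesis.split()
    counts = {"equal": 0, "replace": 0, "insert": 0, "delete": 0}
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, ref, hyp).get_opcodes():
        if tag == "equal":
            counts["equal"] += i2 - i1
        elif tag == "replace":
            # Assumption: count a replaced span as the longer of its two sides.
            counts["replace"] += max(i2 - i1, j2 - j1)
        elif tag == "insert":
            counts["insert"] += j2 - j1
        elif tag == "delete":
            counts["delete"] += i2 - i1
    return counts

print(diff_counts("the quick brown fox", "the quick red fox jumps"))
# {'equal': 3, 'replace': 1, 'insert': 1, 'delete': 0}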

--wer

Word Error Rate, basically defined as:

insertions + deletions + substitutions
--------------------------------------
      number of reference words

See: https://en.wikipedia.org/wiki/Word_error_rate

Calculates the WER using one of two algorithms:

[Mode: 'strict' or 'hunt'] Insertions, deletions and substitutions are identified using the Hunt–McIlroy diff algorithm. The 'hunt' mode applies 0.5 weight to insertions and deletions. This algorithm is the one used internally by Python.

See https://docs.python.org/3/library/difflib.html

[Mode: 'levenshtein'] In the context of WER, Levenshtein distance is the minimum edit distance computed at the word level. This implementation uses the editdistance C++ implementation by Hiroyuki Tanaka: https://github.com/aflc/editdistance. See: https://en.wikipedia.org/wiki/Levenshtein_distance

param mode

'strict' (default), 'hunt' or 'levenshtein'.

param differ_class

For future use.
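
The 'levenshtein' mode can be illustrated with the same dynamic-programming edit distance as in the CER sketch, applied to lists of words instead of characters (the weighted 'hunt' variant is not shown):

def word_levenshtein(ref_words, hyp_words):
    """Minimum number of word insertions, deletions and substitutions."""
    prev = list(range(len(hyp_words) + 1))
    for i, rw in enumerate(ref_words, 1):
        cur = [i]
        for j, hw in enumerate(hyp_words, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (rw != hw)))    # substitution
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    ref = reference.split()
    return word_levenshtein(ref, hypothesis.split()) / len(ref)

print(wer("the quick brown fox", "the quick red fox jumps"))   # 2 edits / 4 words = 0.5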

--worddiffs

Present differences on a per-word basis

param dialect

Presentation format. Default is 'ansi'.

example dialect

'html'

param differ_class

For future use.
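
As a simple illustration of a per-word diff (not the 'ansi' or 'html' dialects themselves), the sketch below marks removed reference words with '-' and inserted hypothesis words with '+' using Python's difflib:

from difflib import SequenceMatcher

def word_diffs(reference, hypothesis):
    """Unchanged words as-is, removed words as '-word', added words as '+word'."""
    ref, hyp = reference.split(), hypothesis.split()
    out = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, ref, hyp).get_opcodes():
        if tag == "equal":
            out.extend(ref[i1:i2])
        else:
            out.extend("-" + w for w in ref[i1:i2])   # deleted / replaced reference words
            out.extend("+" + w for w in hyp[j1:j2])   # inserted / replacement hypothesis words
    return " ".join(out)

print(word_diffs("the quick brown fox", "the quick red fox jumps"))
# the quick -brown +red fox +jumps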