Subcommand metrics¶
Calculate metrics based on the comparison of a hypothesis with a reference.
usage: benchmarkstt-tools metrics -r REFERENCE -h HYPOTHESIS
[-rt {infer,argument,plaintext}]
[-ht {infer,argument,plaintext}]
[-o {json,markdown,restructuredtext}]
[--beer [entities_file]]
[--cer [mode] [differ_class]]
[--diffcounts [mode] [differ_class]]
[--wer [mode] [differ_class]]
[--worddiffs [dialect] [differ_class]]
[--log-level {critical,fatal,error,warn,warning,info,debug,notset}]
[--load MODULE_NAME [MODULE_NAME ...]]
[--help]
Named Arguments¶
- -r, --reference
File to use as reference
- -h, --hypothesis
File to use as hypothesis
- -o, --output-format
Possible choices: json, markdown, restructuredtext
Format in which to output the results
Default: "restructuredtext"
- --log-level
Possible choices: critical, fatal, error, warn, warning, info, debug, notset
Set the logging output level
Default: warning
- --load
Load external code that may contain additional classes for normalization, etc. E.g. if the classes are contained in a Python file named myclasses.py in the directory from which you are calling benchmarkstt, you would pass --load myclasses. All classes that are recognized will be automatically documented in the --help output and available for use.
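For example, assuming the extra classes live in a file named myclasses.py in the current working directory (the transcript filenames below are placeholders), the invocation might look like:
benchmarkstt-tools metrics -r reference.txt -h hypothesis.txt --wer --load myclasses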
reference and hypothesis types¶
You can specify which file type the --reference/-r and --hypothesis/-h arguments should be treated as.
- Available types:
'infer': Load from a given filename. Automatically infer the file type from the filename extension.
'argument': Read the argument itself and treat it as plain text (no file is read).
'plaintext': Load from a given filename. Treat the file as plain text.
- -rt, --reference-type
Possible choices: infer, argument, plaintext
Type of reference file
Default: "infer"
- -ht, --hypothesis-type
Possible choices: infer, argument, plaintext
Type of hypothesis file
Default: "infer"
available metrics¶
A list of metrics to calculate. At least one metric needs to be provided.
- --beer
Bag of Entities Error Rate, BEER, is defined as the error rate per entity with a bag-of-words approach:
BEER(entity) = abs(ne_hyp - ne_ref) / ne_ref
ne_hyp = number of detections of the entity in the hypothesis file
ne_ref = number of detections of the entity in the reference file
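As a worked example with made-up counts: if an entity occurs 4 times in the reference (ne_ref = 4) and 3 times in the hypothesis (ne_hyp = 3), then BEER(entity) = abs(3 - 4) / 4 = 0.25.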
The WA_BEER for a set of N entities is defined as the weighted average of the BEER for the set of entities:
WA_BEER([entity_1, ..., entity_N]) = w_1*BEER(entity_1)*L_1/L + ... + w_N*BEER(entity_N)*L_N/L
which is equivalent to:
WA_BEER([entity_1, ..., entity_N]) = (w_1*abs(ne_hyp_1 - ne_ref_1) + ... + w_N*abs(ne_hyp_N - ne_ref_N)) / L
L_1 = number of occurrences of entity_1 in the reference document
L = L_1 + ... + L_N
The weights are normalised by the tool such that:
w_1 + ... + w_N = 1
The input file defines the list of entities and the weight per entity, w_n. It is processed as a json file with the following structure:
{ "entity_1":W_1, "entity_2" : W_2, "entity_3" :W_3 .. }
W_n being the non-normalized weight, the normalization of the weights is performed by the tool as:
w_n = W_n / (W_1 + ... + W_N)
The minimum value for a weight is 0.
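As an illustration, a minimal entities file (the entity names and weights below are made up) could look like:
{"wednesday": 1, "sunday": 1, "thursday": 0.5}
and would be passed to the metric as (filenames are placeholders):
benchmarkstt-tools metrics -r reference.txt -h hypothesis.txt --beer entities.json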
- --cer
Character Error Rate, basically defined as:
CER = (insertions + deletions + substitutions) / (number of reference characters)
Character error rate, CER, compares the differences between reference and hypothesis on a character level. A CER measure is usually lower than a WER measure, since words might differ on only one or a few characters and yet be classified as fully different by a word-level comparison.
The CER metric can be useful as a complementary perspective on the WER metric. Word endings might be less relevant if the text will be preprocessed with stemming, or minor spelling mistakes might be acceptable in certain situations. A CER metric might also be used to evaluate a source (e.g. an ASR) that outputs a stream of characters rather than words.
Important: The current implementation of the CER metric ignores whitespace characters. A string like 'aa bb cc' will first be split into words, ['aa','bb','cc'], and then merged into a final string for evaluation: 'aabbcc'.
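Continuing that example with a made-up hypothesis: if the reference 'aa bb cc' (evaluated as 'aabbcc', 6 characters) is compared against the hypothesis 'aa bb cd' (evaluated as 'aabbcd'), there is one substitution, giving a CER of 1/6 ≈ 0.167.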
- param mode
'levenshtein' (default).
- param differ_class
For future use.
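For example, to calculate the CER with the default levenshtein mode (filenames are placeholders):
benchmarkstt-tools metrics -r reference.txt -h hypothesis.txt --cer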
- --diffcounts
Get the number of differences between reference and hypothesis
- --wer
Word Error Rate, basically defined as:
WER = (insertions + deletions + substitutions) / (number of reference words)
See: https://en.wikipedia.org/wiki/Word_error_rate
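As a worked example with made-up sentences: if the reference is 'the quick brown fox' and the hypothesis is 'the quick brown dog', there is one substitution and no insertions or deletions, giving a WER of 1/4 = 0.25.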
Calculates the WER using one of two algorithms:
[Mode: 'strict' or 'hunt'] Insertions, deletions and substitutions are identified using the Hunt–McIlroy diff algorithm. The 'hunt' mode applies a weight of 0.5 to insertions and deletions. This diff algorithm is the one used internally by Python.
See https://docs.python.org/3/library/difflib.html
[Mode: 'levenshtein'] In the context of WER, the Levenshtein distance is the minimum edit distance computed at the word level. This implementation uses the editdistance C++ implementation by Hiroyuki Tanaka: https://github.com/aflc/editdistance. See: https://en.wikipedia.org/wiki/Levenshtein_distance
- param mode
'strict' (default), 'hunt' or 'levenshtein'.
- param differ_class
For future use.
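For example, to calculate the WER using the Levenshtein algorithm (filenames are placeholders):
benchmarkstt-tools metrics -r reference.txt -h hypothesis.txt --wer levenshtein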
- --worddiffs
Present differences on a per-word basis
- param dialect
Presentation format. Default is 'ansi'.
- example dialect
'html'
- param differ_class
For future use.
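For example, to present the word differences using the 'html' dialect (filenames are placeholders):
benchmarkstt-tools metrics -r reference.txt -h hypothesis.txt --worddiffs html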