benchmarkstt.metrics.core module

(Class diagram: the Metric base class is implemented by WordDiffs, WER, CER, DiffCounts and BEER; each metric's compare() takes two Schema objects (sequences of Item mappings), and WordDiffs, WER and DiffCounts additionally depend on a Differ. DiffCounts returns an OpcodeCounts namedtuple with the fields equal, replace, insert and delete.)
class benchmarkstt.metrics.core.BEER(entities_file=None)[source]

Bases: benchmarkstt.metrics.Metric

Bag of Entities Error Rate, BEER, is defined as the error rate per entity with a bag-of-words approach:

                    abs(ne_hyp - ne_ref)
BEER (entity)   =   ----------------------
                        ne_ref
  • ne_hyp = number of detections of the entity in the hypothesis file

  • ne_ref = number of detections of the entity in the reference file

The WA_BEER for a set of N entities is defined as the weighted average of the BEER for the set of entities:

WA_BEER([entity_1, ..., entity_N]) = w_1*BEER(entity_1)*L_1/L + ... + w_N*BEER(entity_N)*L_N/L

which is equivalent to:

                                     w_1*abs(ne_hyp_1 - ne_ref_1) + ... + w_N*abs(ne_hyp_N - ne_ref_N)
WA_BEER([entity_1, ..., entity_N]) = ------------------------------------------------------------------
                                                                    L
  • L_n = number of occurrences of entity n in the reference document (i.e. L_n = ne_ref_n)

  • L = L_1 + ... + L_N

The weights are normalised by the tool so that:

  • w_1 + ... + w_N = 1

The input file defines the list of entities and the non-normalized weight per entity, W_n. It is processed as a JSON file with the following structure:

{ "entity_1": W_1, "entity_2": W_2, "entity_3": W_3, ... }

where W_n is the non-normalized weight. The tool normalizes the weights as:

            W_n
w_n =   ---------------
        W_1 + ... +W_N

The minimum value for a weight is 0.
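
As a worked illustration of the definitions above, the plain-Python sketch below normalizes a set of example weights and computes the per-entity BEER and the WA_BEER. The entity names, counts and weights are invented for the example; only the arithmetic follows the formulas.

    # Hypothetical counts; ne_ref_n doubles as L_n (occurrences in the reference).
    ref_counts = {"paris": 4, "bbc": 2, "euro": 2}           # ne_ref per entity
    hyp_counts = {"paris": 3, "bbc": 2, "euro": 4}           # ne_hyp per entity
    raw_weights = {"paris": 2.0, "bbc": 1.0, "euro": 1.0}    # W_n from the entities file

    # Normalize the weights so that w_1 + ... + w_N = 1.
    total_weight = sum(raw_weights.values())
    weights = {e: w / total_weight for e, w in raw_weights.items()}

    # BEER per entity: abs(ne_hyp - ne_ref) / ne_ref
    beer = {e: abs(hyp_counts[e] - ref_counts[e]) / ref_counts[e] for e in ref_counts}

    # WA_BEER with L_n = ne_ref_n and L = L_1 + ... + L_N
    L = sum(ref_counts.values())
    wa_beer = sum(weights[e] * beer[e] * ref_counts[e] / L for e in ref_counts)

    print(beer)     # {'paris': 0.25, 'bbc': 0.0, 'euro': 1.0}
    print(wa_beer)  # 0.125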

compare(ref: benchmarkstt.schema.Schema, hyp: benchmarkstt.schema.Schema)[source]
compute_beer(list_hypothesis_entity, list_reference_entity)[source]
get_entities()[source]
get_weight()[source]
set_entities(entities)[source]
set_weight(weight)[source]
class benchmarkstt.metrics.core.CER(mode=None, differ_class=None)[source]

Bases: benchmarkstt.metrics.Metric

Character Error Rate, basically defined as:

insertions + deletions + substitutions
--------------------------------------
    number of reference characters

Character error rate, CER, compares the differences between reference and hypothesis at the character level. A CER measure is usually lower than a WER measure, since words might differ in only one or a few characters yet still be counted as fully different at the word level.

The CER metric might be useful as a complementary perspective on the WER metric. Word endings might be less relevant if the text will be preprocessed with stemming, or minor spelling mistakes might be acceptable in certain situations. A CER metric might also be used to evaluate a source (an ASR) that outputs a stream of characters rather than words.

Important: The current implementation of the CER metric ignores whitespace characters. A string like 'aa bb cc' will first be split into words, ['aa','bb','cc'], and then merged into a final string for evaluation: 'aabbcc'.
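
A minimal, self-contained sketch of that behaviour (an illustration based on the formula above, not the library's own implementation): strip the whitespace, compute the character-level edit distance, and divide by the number of reference characters.

    def levenshtein(a: str, b: str) -> int:
        # Standard dynamic-programming edit distance (insertions, deletions, substitutions).
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                   # deletion
                               cur[j - 1] + 1,                # insertion
                               prev[j - 1] + (ca != cb)))     # substitution
            prev = cur
        return prev[-1]

    def cer(ref: str, hyp: str) -> float:
        # Whitespace is ignored: split into words, then re-join without spaces.
        ref_chars = ''.join(ref.split())
        hyp_chars = ''.join(hyp.split())
        return levenshtein(ref_chars, hyp_chars) / len(ref_chars)

    print(cer('aa bb cc', 'aa bd cc'))  # 1 substitution / 6 reference characters ≈ 0.167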

Parameters
  • mode -- 'levenshtein' (default).

  • differ_class -- For future use.

MODE_LEVENSHTEIN = 'levenshtein'
compare(ref: benchmarkstt.schema.Schema, hyp: benchmarkstt.schema.Schema)[source]
class benchmarkstt.metrics.core.DiffCounts(mode=None, differ_class: benchmarkstt.diff.Differ = None)[source]

Bases: benchmarkstt.metrics.Metric

Get the number of differences between reference and hypothesis

MODE_LEVENSHTEIN = 'levenshtein'
compare(ref: benchmarkstt.schema.Schema, hyp: benchmarkstt.schema.Schema) → benchmarkstt.metrics.core.OpcodeCounts[source]
class benchmarkstt.metrics.core.OpcodeCounts(equal, replace, insert, delete)

Bases: tuple

property delete

Alias for field number 3

property equal

Alias for field number 0

property insert

Alias for field number 2

property replace

Alias for field number 1
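
For a rough sense of what these counts represent, the sketch below tallies the opcodes produced by Python's difflib for two word sequences. It only illustrates the (equal, replace, insert, delete) fields; it is not the module's own get_opcode_counts implementation, and the counting convention used here is an assumption.

    import difflib
    from collections import Counter

    ref_words = 'the quick brown fox'.split()
    hyp_words = 'the quick red fox jumps'.split()

    # Each opcode is a tuple (tag, i1, i2, j1, j2), with tag in
    # {'equal', 'replace', 'insert', 'delete'}.
    opcodes = difflib.SequenceMatcher(a=ref_words, b=hyp_words).get_opcodes()

    counts = Counter()
    for tag, i1, i2, j1, j2 in opcodes:
        # Count hypothesis words for insertions, reference words otherwise.
        counts[tag] += (j2 - j1) if tag == 'insert' else (i2 - i1)

    print(opcodes)
    print(counts['equal'], counts['replace'], counts['insert'], counts['delete'])  # 3 1 1 0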

class benchmarkstt.metrics.core.WER(mode=None, differ_class: benchmarkstt.diff.Differ = None)[source]

Bases: benchmarkstt.metrics.Metric

Word Error Rate, basically defined as:

insertions + deletions + substitutions
--------------------------------------
      number of reference words

See: https://en.wikipedia.org/wiki/Word_error_rate

Calculates the WER using one of two algorithms:

[Mode: 'strict' or 'hunt'] Insertions, deletions and substitutions are identified using the Hunt–McIlroy diff algorithm, which is the algorithm Python uses internally (in difflib). The 'hunt' mode applies a 0.5 weight to insertions and deletions.

See https://docs.python.org/3/library/difflib.html

[Mode: 'levenshtein'] In the context of WER, the Levenshtein distance is the minimum edit distance computed at the word level. This implementation uses the editdistance C++ implementation by Hiroyuki Tanaka: https://github.com/aflc/editdistance. See: https://en.wikipedia.org/wiki/Levenshtein_distance
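
A standalone illustration of the 'levenshtein' mode, assuming the editdistance package referenced above is installed (this is not the module's own code): the WER is the word-level edit distance divided by the number of reference words.

    import editdistance  # the C++ edit-distance package referenced above

    ref_words = 'the quick brown fox jumps over the lazy dog'.split()
    hyp_words = 'the quick brown fox jumped over a lazy dog'.split()

    # Word-level minimum edit distance / number of reference words.
    distance = editdistance.eval(ref_words, hyp_words)
    print(distance / len(ref_words))  # 2 edits over 9 reference words ≈ 0.222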

Parameters
  • mode -- 'strict' (default), 'hunt' or 'levenshtein'.

  • differ_class -- For future use.

DEL_PENALTY = 1
INS_PENALTY = 1
MODE_HUNT = 'hunt'
MODE_LEVENSHTEIN = 'levenshtein'
MODE_STRICT = 'strict'
SUB_PENALTY = 1
compare(ref: benchmarkstt.schema.Schema, hyp: benchmarkstt.schema.Schema) → float[source]
class benchmarkstt.metrics.core.WordDiffs(dialect=None, differ_class: benchmarkstt.diff.Differ = None)[source]

Bases: benchmarkstt.metrics.Metric

Present differences on a per-word basis

Parameters
  • dialect -- Presentation format. Default is 'ansi'.

  • differ_class -- For future use.

Example dialect: 'html'

compare(ref: benchmarkstt.schema.Schema, hyp: benchmarkstt.schema.Schema)[source]
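
For a rough idea of what a per-word difference listing contains, Python's difflib produces something similar; this is only an illustration, and the dialect parameter above controls how benchmarkstt itself renders the result.

    import difflib

    ref_words = 'the quick brown fox'.split()
    hyp_words = 'the quick red fox jumps'.split()

    # '-' marks words only in the reference, '+' words only in the hypothesis,
    # ' ' unchanged words ('?' lines, if any, carry intra-word hints).
    for token in difflib.ndiff(ref_words, hyp_words):
        print(token)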
benchmarkstt.metrics.core.get_differ(a, b, differ_class: benchmarkstt.diff.Differ)[source]
benchmarkstt.metrics.core.get_opcode_counts(opcodes) → benchmarkstt.metrics.core.OpcodeCounts[source]
benchmarkstt.metrics.core.traversible(schema, key=None)[source]