benchmarkstt.normalization.core module

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#e7f2fa', 'lineColor': '#2980B9' }}}%%
classDiagram
  ReplaceWords
  Lowercase
  Regex
  ConfigSectionNotFoundError
  Unidecode
  Replace
  Config
  NormalizerWithFileSupport <|-- Replace
  NormalizerWithFileSupport <|-- ReplaceWords
  NormalizerWithFileSupport <|-- Regex
  Normalizer <|-- Lowercase
  Normalizer <|-- Unidecode
  ValueError <|-- ConfigSectionNotFoundError
  Normalizer <|-- Config
  class Replace {
    normalize(text)$
    search: str
    replace: str
  }
  class ReplaceWords {
    normalize(text)$
    search: str
    replace: str
  }
  class Regex {
    normalize(text)$
    search: str
    replace: str
  }
  class Lowercase {
    normalize(text)$
  }
  class Unidecode {
    normalize(text)$
  }
  class ConfigSectionNotFoundError {
  }
  class Config {
    +refresh_docstring()
    default_section(section)$
    file
    section=None
    encoding=None
    normalize(text)$
  }

Some basic/simple normalization classes

class benchmarkstt.normalization.core.Config(file, section=None, encoding=None)

Bases: benchmarkstt.normalization.Normalizer

Use config file notation to define normalization rules. This notation is a list of normalizers, one per line.

Each normalizer that requires a file is followed by the file name of a CSV, optionally followed by the file encoding (if different from the default). All options are loaded from this CSV and applied to the normalizer.

The normalizers can be any of the core normalizers, or you can refer to your own normalizer class (as you would in a Python import, e.g. my.own.package.MyNormalizerClass).

Additional rules:

Normalizer names are case-insensitive.
Arguments MAY be wrapped in double quotes.
If an argument contains a space, newline or double quote, it MUST be wrapped in double quotes.
A double quote itself is represented in this quoted argument as two double quotes: "".
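
The quoting rules above can be illustrated with a small tokenizer. This is only a sketch, assuming that arguments are separated by whitespace; `split_arguments` is a hypothetical helper and not part of benchmarkstt, whose own parser may treat edge cases differently.

```python
from typing import List


def split_arguments(line: str) -> List[str]:
    """Split one rule line into arguments (hypothetical helper, not benchmarkstt's parser)."""
    args: List[str] = []
    i, n = 0, len(line)
    while i < n:
        if line[i].isspace():
            # Whitespace separates arguments; runs of it are skipped.
            i += 1
        elif line[i] == '"':
            # Quoted argument: read up to the closing quote.
            i += 1
            buf = []
            while True:
                if i >= n:
                    raise ValueError("unterminated quoted argument")
                if line[i] == '"':
                    if i + 1 < n and line[i + 1] == '"':
                        # Two double quotes stand for one literal quote.
                        buf.append('"')
                        i += 2
                        continue
                    i += 1  # closing quote
                    break
                buf.append(line[i])
                i += 1
            args.append("".join(buf))
        else:
            # Unquoted argument: read up to the next whitespace.
            start = i
            while i < n and not line[i].isspace():
                i += 1
            args.append(line[start:i])
    return args


print(split_arguments('regex regexrules.csv "utf 8"'))
# ['regex', 'regexrules.csv', 'utf 8']
```

Since normalizer names are case-insensitive, a caller would presumably compare the first argument after lowercasing it.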

The normalization rules are applied top-to-bottom and follow this format:

[normalization]
# This is a comment

# (lowercase takes no arguments)
lowercase

# loads regex expressions from regexrules.csv in "utf 8" encoding
regex regexrules.csv "utf 8"

# load another config file, [section1] and [section2]
config configfile.ini section1
config configfile.ini section2

# loads replace expressions from replaces.csv in default encoding
replace     replaces.csv
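
Because rules run top-to-bottom, their order can change the result. The sketch below uses plain Python callables as stand-ins for normalizers; `apply_in_order`, `lowercase` and `nudge_to_wink` are hypothetical names for illustration, and nothing here reads an actual config file.

```python
from typing import Callable, Iterable


def apply_in_order(text: str, steps: Iterable[Callable[[str], str]]) -> str:
    """Feed the output of each step into the next one, first to last."""
    for step in steps:
        text = step(text)
    return text


def lowercase(text: str) -> str:
    return text.lower()


def nudge_to_wink(text: str) -> str:
    return text.replace("nudge", "wink")


# Same two rules, opposite order.
lower_first = apply_in_order("Nudge nudge!", [lowercase, nudge_to_wink])
replace_first = apply_in_order("Nudge nudge!", [nudge_to_wink, lowercase])

print(lower_first)    # wink wink!
print(replace_first)  # nudge wink!
```

Lowercasing first exposes both occurrences to the case-sensitive replacement, while replacing first only catches the one that was already lowercase.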

Parameters

file -- The config file
encoding -- The file encoding
section -- The subsection of the config file to use, defaults to 'normalization'

Example text

"He bravely turned his tail and fled"

Example file

"./resources/test/normalizers/configfile.conf"

Example encoding

"UTF-8"

Example return

"ha bravalY Turnad his tail and flad"

MAIN_SECTION = <object object>

_normalize(text: str) → str

classmethod default_section(section)

classmethod refresh_docstring()

exception benchmarkstt.normalization.core.ConfigSectionNotFoundError

Bases: ValueError

Raised when a requested config section was not found

class benchmarkstt.normalization.core.Lowercase

Bases: benchmarkstt.normalization.Normalizer

Lowercase the text

Example text: "Easy, Mungo, easy... Mungo..."
Example return: "easy, mungo, easy... mungo..."
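
The documented behavior matches what Python's built-in `str.lower()` produces for this input. The snippet is a stand-alone equivalent for illustration and does not show how the class is implemented.

```python
text = "Easy, Mungo, easy... Mungo..."

# Built-in lowercasing; punctuation and spacing are left untouched.
lowered = text.lower()

print(lowered)  # easy, mungo, easy... mungo...
```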

_normalize(text: str) → str

class benchmarkstt.normalization.core.Regex(search: str, replace: str)

Bases: benchmarkstt.normalization.NormalizerWithFileSupport

Simple regex replace. By default the pattern is interpreted case-sensitively.

Case-insensitivity is supported by adding inline modifiers.

You might want to use capturing groups to preserve the case. When replacing a character not captured, the information about its case is lost...

E.g., the following would replace "HAHA! Hahaha!" with "HeHe! Hehehe!":

search      replace
(?i)(h)a    \1e

No regex flags are set by default. You can set them yourself in the pattern using inline modifiers and combine them at will, e.g. multiline, dotall and ignorecase.

E.g., the following would replace "New<CRLF>line" with "newline":

search            replace
(?msi)new.line    newline

Example text: "HAHA! Hahaha!"
Example search: '(?i)(h)a'
Example replace: '\1e'
Example return: "HeHe! Hehehe!"
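
Both patterns can be tried directly with Python's standard `re` module, which accepts the same inline modifiers. This is a sketch of the matching behavior only, assuming the class delegates to `re` in a similar way. Note that `.` matches exactly one character even in dotall mode, so the second call uses a single `"\n"` as the line break; a literal two-character `"\r\n"` sequence would not match `new.line`.

```python
import re

# The capturing group keeps the original case of the "h".
case_preserved = re.sub(r"(?i)(h)a", r"\1e", "HAHA! Hahaha!")
print(case_preserved)  # HeHe! Hehehe!

# (?s) lets "." match the newline, (?i) makes "new" match "New".
joined = re.sub(r"(?msi)new.line", "newline", "New\nline")
print(joined)  # newline
```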

_normalize(text: str) → str

class benchmarkstt.normalization.core.Replace(search: str, replace: str)

Bases: benchmarkstt.normalization.NormalizerWithFileSupport

Simple search replace

Parameters

search -- Text to search for
replace -- Text to replace with
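
For the example values documented for this class, Python's built-in `str.replace()` gives the same result. The snippet is a stand-alone equivalent for illustration, not the class's implementation.

```python
text = "Nudge nudge!"

# Plain substring replacement is case-sensitive, so "Nudge" is left alone.
replaced = text.replace("nudge", "wink")

print(replaced)  # Nudge wink!
```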

Example text

"Nudge nudge!"

Example search

"nudge"

Example replace

"wink"

Example return

"Nudge wink!"

_normalize(text: str) → str

class benchmarkstt.normalization.core.ReplaceWords(search: str, replace: str)

Bases: benchmarkstt.normalization.NormalizerWithFileSupport

Simple search and replace that only replaces whole "words". The first letter is also matched case-insensitively, and its case is preserved in the replacement.

Parameters

search -- Word to search for
replace -- Replace with
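
One way to approximate this behavior with the standard `re` module is sketched below. The helper `replace_words` is hypothetical, and the exact word-boundary and case rules are assumptions: a word is taken to be a match with no word character directly before or after it, and an uppercase first letter in the match capitalizes the first letter of the replacement.

```python
import re


def replace_words(text: str, search: str, replace: str) -> str:
    """Whole-word replace with a case-insensitive first letter (illustrative sketch)."""
    if not search:
        raise ValueError("search must not be empty")

    # Only the first letter is case-insensitive; lookarounds enforce word edges.
    pattern = re.compile(
        r"(?<!\w)(?i:%s)%s(?!\w)" % (re.escape(search[0]), re.escape(search[1:]))
    )

    def substitute(match):
        # Carry an uppercase first letter over to the replacement.
        if match.group(0)[0].isupper():
            return replace[:1].upper() + replace[1:]
        return replace

    return pattern.sub(substitute, text)


print(replace_words("She has a heart of formica", "a", "the"))
# She has the heart of formica
print(replace_words("A heart of formica", "a", "the"))
# The heart of formica
```

The letters "a" inside "has", "heart" and "formica" are untouched because each is adjacent to another word character.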

Example text

"She has a heart of formica"

Example search

"a"

Example replace

"the"

Example return

"She has the heart of formica"

_normalize(text: str) → str

class benchmarkstt.normalization.core.Unidecode

Bases: benchmarkstt.normalization.Normalizer

Unidecode characters to ASCII form, see Python's Unidecode package for more info.

Example text: "𝖂𝖊𝖓𝖓 𝖎𝖘𝖙 𝖉𝖆𝖘 𝕹𝖚𝖓𝖘𝖙ü𝖈𝖐 𝖌𝖎𝖙 𝖚𝖓𝖉 𝕾𝖑𝖔𝖙𝖊𝖗𝖒𝖊𝖞𝖊𝖗?"
Example return: "Wenn ist das Nunstuck git und Slotermeyer?"
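
For readers without the Unidecode package at hand, a rough standard-library approximation is sketched below. It swaps in a different technique (Unicode NFKD decomposition followed by dropping non-ASCII characters), so it is not what this class does internally: it covers styled letters and accented Latin characters like those in the example, but unlike Unidecode it does not transliterate other scripts. `ascii_fold` is a hypothetical helper name.

```python
import unicodedata


def ascii_fold(text: str) -> str:
    """Approximate ASCII folding via NFKD decomposition (not the Unidecode package)."""
    # NFKD maps styled letters to plain ones and splits accents into combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping everything non-ASCII removes the combining marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(ascii_fold("Nunstück"))  # Nunstuck
```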

_normalize(text: str) → str