Subcommand normalization
Apply normalization to given input
usage: benchmarkstt-tools normalization [--log] [-i file] [-o file]
[--config file [section] [encoding]]
[--file normalizer file [encoding]
[path]] [--lowercase]
[--regex search replace]
[--replace search replace]
[--replacewords search replace]
[--unidecode]
[--log-level {critical,fatal,error,warn,warning,info,debug,notset}]
[--load MODULE_NAME [MODULE_NAME ...]]
[--help]
Named Arguments
- --log
show normalization logs (warning: for large files with many normalization rules this will cause a significant performance penalty and a lot of output data)
Default: False
- --log-level
Possible choices: critical, fatal, error, warn, warning, info, debug, notset
Set the logging output level
Default: warning
- --load
Load external code that may contain additional classes for normalization, etc. E.g. if the classes are contained in a Python file named myclasses.py in the directory you are calling benchmarkstt from, you would pass --load myclasses. All recognized classes are automatically documented in the --help output and available for use.
input and output files
You can provide multiple input and output files, each preceded by -i and -o respectively. If no input file is given, only one output file can be used. If using both multiple input and output files, there should be an equal number of each. Each processed input file will then be written to the corresponding output file.
- -i, --inputfile
read input from this file, defaults to STDIN
- -o, --outputfile
write output to this file, defaults to STDOUT
available normalizers
A list of one or more normalizers to execute on the input; they are applied sequentially. The program will automatically look for each normalizer in benchmarkstt.normalization.core, then in benchmarkstt.normalization, and finally in the global namespace.
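Because normalizers run in the order given, reordering them can change the result. A minimal Python sketch of that behaviour, using plain functions as stand-ins for normalizers (the `apply_sequentially` helper and the two stand-in functions are illustrative assumptions, not part of benchmarkstt):

```python
def apply_sequentially(text, normalizers):
    """Feed the output of each normalizer into the next one."""
    for normalize in normalizers:
        text = normalize(text)
    return text


def lowercase(text):
    # Hypothetical stand-in for --lowercase
    return text.lower()


def replace_nudge(text):
    # Hypothetical stand-in for --replace nudge wink
    return text.replace("nudge", "wink")


lower_first = apply_sequentially("Nudge nudge!", [lowercase, replace_nudge])
replace_first = apply_sequentially("Nudge nudge!", [replace_nudge, lowercase])
print(lower_first)    # wink wink!
print(replace_first)  # nudge wink!
```

Lowercasing first lets the case-sensitive replace match both words; replacing first leaves the capitalized word untouched.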
- --config
Use config file notation to define normalization rules. This notation is a list of normalizers, one per line.
Each normalizer that needs a file is followed by the file name of a CSV, optionally followed by the file encoding (if different from the default). All options are loaded from this CSV and applied to the normalizer.
The normalizers can be any of the core normalizers, or you can refer to your own normalizer class (as you would in a Python import, e.g. my.own.package.MyNormalizerClass).
- Additional rules:
Normalizer names are case-insensitive.
Arguments MAY be wrapped in double quotes.
If an argument contains a space, newline or double quote, it MUST be wrapped in double quotes.
A double quote itself is represented in this quoted argument as two double quotes: "".
The normalization rules are applied top-to-bottom and follow this format:
[normalization]
# This is a comment
# (Normalizer2 has no arguments)
lowercase
# loads regex expressions from regexrules.csv in "utf 8" encoding
regex regexrules.csv "utf 8"
# load another config file, [section1] and [section2]
config configfile.ini section1
config configfile.ini section2
# loads replace expressions from replaces.csv in default encoding
replace replaces.csv
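The quoting rules above resemble space-delimited CSV, so they can be illustrated with Python's standard `csv` module. This is only a sketch under that assumption, not benchmarkstt's own parser: `parse_rules` and the `MyNormalizer` line are hypothetical, separators are assumed to be single spaces, and quoted arguments that span several lines are not handled.

```python
import csv

SAMPLE = '''\
[normalization]
# This is a comment
lowercase
regex regexrules.csv "utf 8"
MyNormalizer "an ""odd"" name.csv"
'''


def parse_rules(config_text):
    """Return a list of (normalizer_name, arguments) tuples."""
    rules = []
    for line in config_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and section headers.
        if not line or line.startswith("#") or line.startswith("["):
            continue
        # A space-delimited csv reader keeps quoted arguments together
        # and turns a doubled quote ("") into one literal quote.
        name, *args = next(csv.reader([line], delimiter=" "))
        # Normalizer names are case-insensitive, so compare in lowercase.
        rules.append((name.lower(), args))
    return rules


for rule in parse_rules(SAMPLE):
    print(rule)
```

The rules come back in file order, which matches the top-to-bottom application described above.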
- param file
The config file
- param encoding
The file encoding
- param section
The subsection of the config file to use, defaults to 'normalization'
- example text
"He bravely turned his tail and fled"
- example file
"./resources/test/normalizers/configfile.conf"
- example encoding
"UTF-8"
- example return
"ha bravalY Turnad his tail and flad"
- --file
Read rules from a file, one per line, and pass each to the given normalizer
- param str|class normalizer
Normalizer name (or class)
- param file
The file to read rules from
- param encoding
The file encoding
- example text
"This is an Ex-Parakeet"
- example normalizer
"regex"
- example file
"./resources/test/normalizers/regex/en_US"
- example encoding
"UTF-8"
- example return
"This is an Ex Parrot"
- --lowercase
Lowercase the text
- example text
"Easy, Mungo, easy... Mungo..."
- example return
"easy, mungo, easy... mungo..."
- --regex
Simple regex replace. By default the pattern is interpreted case-sensitively.
Case-insensitivity is supported by adding inline modifiers.
You might want to use capturing groups to preserve the case. When replacing a character that is not captured, the information about its case is lost.
E.g. the following would replace "HAHA! Hahaha!" with "HeHe! Hehehe!":
search: (?i)(h)a
replace: \1e
No regex flags are set by default. You can set them yourself in the regex and combine them at will, e.g. multiline, dotall and ignorecase.
E.g. the following would replace "New<CRLF>line" with "newline":
search: (?msi)new.line
replace: newline
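Assuming the patterns follow the syntax of Python's `re` module (the inline modifiers suggest so, but this is an assumption), the effect of capturing groups and flags can be tried out directly with `re.sub`. The variable names are illustrative only.

```python
import re

text = "HAHA! Hahaha!"

# Case-insensitive match; the captured letter keeps its original case.
with_group = re.sub(r"(?i)(h)a", r"\1e", text)
print(with_group)  # HeHe! Hehehe!

# Without a capturing group, the case of the matched letter is lost.
without_group = re.sub(r"(?i)ha", "he", text)
print(without_group)  # hehe! hehehe!

# Combined flags: with dotall (s), "." also matches a line break.
# Here the line break is a single "\n" character.
joined = re.sub(r"(?msi)new.line", "newline", "New\nline")
print(joined)  # newline
```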
- example text
"HAHA! Hahaha!"
- example search
'(?i)(h)a'
- example replace
'\1e'
- example return
"HeHe! Hehehe!"
- --replace
Simple search replace
- param search
Text to search for
- param replace
Text to replace with
- example text
"Nudge nudge!"
- example search
"nudge"
- example replace
"wink"
- example return
"Nudge wink!"
- --replacewords
Simple search replace that only replaces "words". The first letter is also checked case-insensitively, and its case is preserved.
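One way to picture this behaviour is a word-boundary regex whose first letter matches in either case. The sketch below is an approximation, not benchmarkstt's implementation: `replace_words` is a hypothetical helper, it assumes a non-empty search term that starts and ends with a word character, and mirroring the matched first letter's case onto the replacement is one possible reading of "preservation of case".

```python
import re


def replace_words(text, search, replace):
    """Replace whole-word matches of `search`, mirroring the first letter's case."""
    # The first letter matches in either case; the rest must match exactly.
    pattern = re.compile(
        r"\b(?i:" + re.escape(search[0]) + ")" + re.escape(search[1:]) + r"\b"
    )

    def swap(match):
        first = replace[0]
        # Copy the case of the matched first letter onto the replacement.
        first = first.upper() if match.group(0)[0].isupper() else first.lower()
        return first + replace[1:]

    return pattern.sub(swap, text)


print(replace_words("She has a heart of formica", "a", "the"))
print(replace_words("A heart of formica", "a", "the"))
```

The letter "a" inside "has", "heart" and "formica" is left alone because it is not a word on its own.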
- param search
Word to search for
- param replace
Replace with
- example text
"She has a heart of formica"
- example search
"a"
- example replace
"the"
- example return
"She has the heart of formica"
- --unidecode
Unidecode characters to ASCII form, see Python's Unidecode package for more info.
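To get a feel for this kind of transliteration without installing anything, the standard library's `unicodedata` module can approximate it for accented and styled Latin letters. This is a swapped-in technique, not Unidecode itself: `ascii_fold` is a hypothetical helper, and it simply drops characters that have no ASCII decomposition, whereas Unidecode transliterates many more scripts.

```python
import unicodedata


def ascii_fold(text):
    """Rough ASCII folding via Unicode compatibility decomposition."""
    # NFKD splits accented letters into base letter + combining mark and
    # maps styled letters (e.g. mathematical Fraktur) to their plain forms.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping everything non-ASCII removes the combining marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(ascii_fold("Nunst\u00fcck"))   # Nunstuck
print(ascii_fold("\U0001D582enn"))   # Wenn
```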
- example text
"𝖂𝖊𝖓𝖓 𝖎𝖘𝖙 𝖉𝖆𝖘 𝕹𝖚𝖓𝖘𝖙ü𝖈𝖐 𝖌𝖎𝖙 𝖚𝖓𝖉 𝕾𝖑𝖔𝖙𝖊𝖗𝖒𝖊𝖞𝖊𝖗?"
- example return
"Wenn ist das Nunstuck git und Slotermeyer?"