Word tokenization: edge-cases for whitespace-delimited languages
- Parentheses/quotes: right now separated from the words they contain -- e.g., `(word)` -> `['(', 'word', ')']` or `"word"` -> `['"', 'word', '"']`. This generally feels like reasonable behavior.
- Contractions: right now kept if fully within a word, otherwise separated -- i.e., `don't` -> `["don't"]` but `somethin'` -> `["somethin", "'"]`. NOTE: this does not match the behavior on French Wikipedia, where e.g. if `l'Australie` is linked, then it appears as `l'[[Australie]]` (example page). Matching that behavior is not impossible, but would require e.g. an input set of language-specific rules for which abbreviations to split off.
- Abbreviations: right now `Dr.` -> `["Dr", "."]`, which is probably unexpected behavior for abbreviations. Instead, we might want to carry over our abbreviation list from sentence tokenization and apply it here as well.
- Different punctuation schemes -- placeholder for exploration of whether there are edge-cases in specific languages that we should account for.
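The paren/quote behavior in the first bullet (and the trailing-apostrophe behavior in the second) can be sketched as a whitespace split followed by peeling enclosing punctuation off each chunk. This is a hypothetical illustration, not the project's actual implementation; the `tokenize` name and the punctuation sets are assumptions.

```python
import re

# Leading punctuation | lazy core | trailing punctuation.
# The lazy core keeps internal apostrophes (don't) intact while a
# trailing apostrophe (somethin') falls into the trailing group.
PEEL = re.compile(r"""^(["'(]*)(.*?)([)"'.,!?]*)$""")

def tokenize(text):
    """Whitespace-split, then emit enclosing punctuation as separate tokens."""
    tokens = []
    for chunk in text.split():
        lead, core, trail = PEEL.match(chunk).groups()
        tokens.extend(lead)        # each leading punctuation char on its own
        if core:
            tokens.append(core)
        tokens.extend(trail)       # each trailing punctuation char on its own
    return tokens

print(tokenize('(word)'))      # ['(', 'word', ')']
print(tokenize("don't"))       # ["don't"]
print(tokenize("somethin'"))   # ['somethin', "'"]
```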
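For the French Wikipedia case, the language-specific rules mentioned in the contractions bullet might look like a per-language table of elided prefixes that get split off the following word. A hedged sketch only: `FRENCH_ELISIONS` and its prefix list are illustrative, not exhaustive, and not part of the current tokenizer.

```python
import re

# Hypothetical table of French elided forms (l', d', qu', ...) that
# Wikipedia-style links leave outside the link target, e.g. l'[[Australie]].
FRENCH_ELISIONS = re.compile(r"^((?:qu|l|d|j|n|s|t|c|m)')(.+)$", re.IGNORECASE)

def split_elision(token):
    """Split a known elided prefix off the token; otherwise leave it whole."""
    m = FRENCH_ELISIONS.match(token)
    if m:
        return [m.group(1), m.group(2)]
    return [token]

print(split_elision("l'Australie"))  # ["l'", 'Australie']
print(split_elision("don't"))        # ["don't"] -- 'don' is not an elided prefix
```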
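Carrying the abbreviation list over from sentence tokenization, as suggested in the abbreviations bullet, could look roughly like the following; `ABBREVIATIONS` here is a made-up stand-in for the real list.

```python
# Illustrative abbreviation set; in practice this would be the same list
# used by sentence tokenization.
ABBREVIATIONS = {'dr', 'mr', 'mrs', 'prof', 'etc'}

def detach_period(token):
    """Split a trailing '.' into its own token unless the token is a known abbreviation."""
    if token.endswith('.') and token[:-1].lower() in ABBREVIATIONS:
        return [token]                 # keep 'Dr.' intact
    if token.endswith('.') and len(token) > 1:
        return [token[:-1], '.']       # 'home.' -> ['home', '.']
    return [token]

print(detach_period('Dr.'))    # ['Dr.']
print(detach_period('home.'))  # ['home', '.']
```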