Word tokenization: edge-cases for whitespace-delimited languages
- Parentheses/quotes: right now separated from the words they contain -- e.g., `(word)` -> `['(', 'word', ')']` or `"word"` -> `['"', 'word', '"']`. This generally feels like reasonable behavior.
- Contractions: right now kept if fully within a word, otherwise separated -- i.e., `don't` -> `["don't"]` but `somethin'` -> `["somethin", "'"]`. NOTE: this does not match the behavior on French Wikipedia, where e.g. if `l'Australie` is linked, then it appears as `l'[[Australie]]` (example page). Matching that behavior is not impossible, but would require e.g. an input set of language-specific rules for which abbreviations to split off.
- Abbreviations: right now `Dr.` -> `["Dr", "."]`, which is probably unexpected behavior for abbreviations. Instead, we might want to carry over our abbreviation list from sentence tokenization and apply it here as well.
- Different punctuation schemes -- placeholder for exploration of whether there are edge-cases in specific languages that we should account for.
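The paren/quote behavior in the first bullet (and the trailing-apostrophe behavior in the second) can be sketched as a whitespace split followed by peeling enclosing punctuation off each chunk. This is a hypothetical illustration, not the project's actual implementation; the `tokenize` name and the punctuation sets are assumptions.

```python
import re

# Leading punctuation | lazy core | trailing punctuation.
# The lazy core keeps internal apostrophes (don't) intact while a
# trailing apostrophe (somethin') falls into the trailing group.
PEEL = re.compile(r"""^(["'(]*)(.*?)([)"'.,!?]*)$""")

def tokenize(text):
    """Whitespace-split, then emit enclosing punctuation as separate tokens."""
    tokens = []
    for chunk in text.split():
        lead, core, trail = PEEL.match(chunk).groups()
        tokens.extend(lead)        # each leading punctuation char on its own
        if core:
            tokens.append(core)
        tokens.extend(trail)       # each trailing punctuation char on its own
    return tokens

print(tokenize('(word)'))      # ['(', 'word', ')']
print(tokenize("don't"))       # ["don't"]
print(tokenize("somethin'"))   # ['somethin', "'"]
```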
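For the French Wikipedia case, the language-specific rules mentioned in the contractions bullet might look like a per-language table of elided prefixes that get split off the following word. A hedged sketch only: `FRENCH_ELISIONS` and its prefix list are illustrative, not exhaustive, and not part of the current tokenizer.

```python
import re

# Hypothetical table of French elided forms (l', d', qu', ...) that
# Wikipedia-style links leave outside the link target, e.g. l'[[Australie]].
FRENCH_ELISIONS = re.compile(r"^((?:qu|l|d|j|n|s|t|c|m)')(.+)$", re.IGNORECASE)

def split_elision(token):
    """Split a known elided prefix off the token; otherwise leave it whole."""
    m = FRENCH_ELISIONS.match(token)
    if m:
        return [m.group(1), m.group(2)]
    return [token]

print(split_elision("l'Australie"))  # ["l'", 'Australie']
print(split_elision("don't"))        # ["don't"] -- 'don' is not an elided prefix
```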
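Carrying the abbreviation list over from sentence tokenization, as suggested in the abbreviations bullet, could look roughly like the following; `ABBREVIATIONS` here is a made-up stand-in for the real list.

```python
# Illustrative abbreviation set; in practice this would be the same list
# used by sentence tokenization.
ABBREVIATIONS = {'dr', 'mr', 'mrs', 'prof', 'etc'}

def detach_period(token):
    """Split a trailing '.' into its own token unless the token is a known abbreviation."""
    if token.endswith('.') and token[:-1].lower() in ABBREVIATIONS:
        return [token]                 # keep 'Dr.' intact
    if token.endswith('.') and len(token) > 1:
        return [token[:-1], '.']       # 'home.' -> ['home', '.']
    return [token]

print(detach_period('Dr.'))    # ['Dr.']
print(detach_period('home.'))  # ['home', '.']
```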