Tokenizer: update asset loading and initialization
SentencePiece:
Currently, our tokenizer class loads the sentencepiece model by default, even when we are not planning to do any NWS word tokenization. In the future, we might have contexts where we load multiple separate sentencepiece models for different languages, so loading a single fixed model at initialization is not feasible.
Goal:
- Update the tokenizer class
- Accommodate dynamic loading of SPC models
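One possible shape for this, as a minimal sketch rather than the actual implementation: models are registered by language code and only loaded on first use, then cached. The `get_model` method, the `model_paths` and `loader` parameters, and the helper `_load_sentencepiece` are hypothetical names chosen for illustration; the only real library call is `sentencepiece.SentencePieceProcessor(model_file=...)`.

```python
from typing import Any, Callable, Dict


def _load_sentencepiece(model_path: str) -> Any:
    # Imported lazily so the dependency (and the model file) is only
    # touched when a model is actually requested.
    import sentencepiece as spm
    return spm.SentencePieceProcessor(model_file=model_path)


class Tokenizer:
    """Hypothetical tokenizer that loads SentencePiece models on demand."""

    def __init__(
        self,
        model_paths: Dict[str, str],
        loader: Callable[[str], Any] = _load_sentencepiece,
    ) -> None:
        # Assumed layout: language code -> path to that language's model file.
        self._model_paths = dict(model_paths)
        self._loader = loader
        # Cache of models that have already been loaded.
        self._models: Dict[str, Any] = {}

    def get_model(self, lang: str) -> Any:
        """Return the model for `lang`, loading it on first use."""
        if lang not in self._models:
            if lang not in self._model_paths:
                raise KeyError(f"no SentencePiece model registered for {lang!r}")
            self._models[lang] = self._loader(self._model_paths[lang])
        return self._models[lang]
```

With this layout, constructing a Tokenizer does no model I/O at all, and a caller that never asks for word tokenization never pays for a model load. The injectable `loader` is there mainly so the behaviour can be tested without a real model file.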
Abbreviation list:
Currently we load the entire list of abbreviations before filtering it down to the particular language. As a result, the start-up cost for a Tokenizer object is 5ms when an abbreviation file is passed vs. 50µs when none is passed.
Goal:
- Split up the abbreviation files into language-specific files so only the relevant set is loaded.
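A rough sketch of what loading from language-specific files could look like. The directory layout (one `<lang>.txt` file per language, one abbreviation per line) and the function name `load_abbreviations` are assumptions made for illustration, not the current file format.

```python
from pathlib import Path
from typing import FrozenSet, Optional


def load_abbreviations(abbrev_dir: Optional[Path], lang: str) -> FrozenSet[str]:
    """Read only the abbreviation file for `lang` (hypothetical layout)."""
    if abbrev_dir is None:
        # No abbreviation directory passed: nothing to read.
        return frozenset()
    path = Path(abbrev_dir) / f"{lang}.txt"
    if not path.is_file():
        # Assumption: a missing file means the language has no abbreviations.
        return frozenset()
    with path.open(encoding="utf-8") as fh:
        # One abbreviation per line; blank lines are skipped.
        return frozenset(line.strip() for line in fh if line.strip())
```

Because only one small file is opened, the load cost depends on the size of that language's list rather than on the combined list for all languages.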