Tokenizer: update asset loading and initialization

SentencePiece:

Currently, our tokenizer class loads the sentencepiece model by default, even when we are not planning to do any NWS word tokenization. In the future, we might have contexts where we load multiple separate sentencepiece models for different languages, so loading a single fixed model at initialization is not feasible.

Goal:

Update the tokenizer class
Accommodate dynamic loading of sentencepiece (SPC) models
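
One possible shape for this, as a minimal sketch rather than a description of the current class: the tokenizer keeps a mapping from language code to model path and loads each model only the first time it is requested. The names `Tokenizer.get_model`, `model_paths`, and `load_sentencepiece` are hypothetical, and the per-language mapping is an assumption about how the assets would be organized.

```python
from pathlib import Path
from typing import Any, Callable, Dict


def load_sentencepiece(path: Path) -> Any:
    """Default loader (hypothetical helper): import sentencepiece only when needed."""
    import sentencepiece as spm  # deferred so plain whitespace tokenization never pays for it

    return spm.SentencePieceProcessor(model_file=str(path))


class Tokenizer:
    """Hypothetical tokenizer that loads sentencepiece models on demand."""

    def __init__(
        self,
        model_paths: Dict[str, Path],
        loader: Callable[[Path], Any] = load_sentencepiece,
    ) -> None:
        self._model_paths = model_paths      # assumed layout: language code -> model file
        self._loader = loader                # injectable, which also makes testing easy
        self._models: Dict[str, Any] = {}    # cache of models that have been loaded so far

    def get_model(self, language: str) -> Any:
        """Return the model for `language`, loading and caching it on first use."""
        if language not in self._models:
            self._models[language] = self._loader(self._model_paths[language])
        return self._models[language]
```

With this layout, constructing a `Tokenizer` touches no model file, and supporting several languages only requires more entries in `model_paths`.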

Abbreviation list:

Currently, we load the entire list of abbreviations before filtering it down to the particular language. As a result, the start-up cost for a Tokenizer object is 5 ms, versus 50 µs when no abbreviation file is passed.

Goal:

Split up the abbreviation files into language-specific files so only the relevant set is loaded.
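
A minimal sketch of what loading could look like after the split, assuming one plain-text file per language named `<language>.txt` with one abbreviation per line. Both the file layout and the function name `load_abbreviations` are assumptions for illustration, not the existing format.

```python
from pathlib import Path
from typing import FrozenSet


def load_abbreviations(directory: Path, language: str) -> FrozenSet[str]:
    """Read only `<directory>/<language>.txt` (hypothetical layout)."""
    path = directory / f"{language}.txt"
    if not path.exists():
        return frozenset()  # no abbreviation file for this language
    with path.open(encoding="utf-8") as handle:
        # Skip blank lines and strip surrounding whitespace from each entry.
        return frozenset(line.strip() for line in handle if line.strip())
```

Because only one small file is read, the start-up cost should scale with the size of a single language's list rather than with the combined list.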

Edited Nov 21, 2023 by AKhatun
