← Community
ActivePython · Rust · TeX
Turkish Morphological Tokenizer
A modern tokenizer that splits text into morphological units faithful to Turkish phonetics and can recombine them.
tokenizermorphologyturkishbenchmarkresearch
Vision
Prove at an academic level - and release as open source - a hybrid tokenization approach that understands Turkish morphology, splits affixes correctly, and improves token efficiency.
What it does
Splits Turkish text into morphological units, performs hybrid tokenization with BPE, and evaluates on STS-TR and MTEB-TR. Includes a high-performance Rust tokenizer implementation.
Contribution areas
- Extend MTEB-TR evaluation
- Rust tokenizer performance improvements
- Morphological adaptation for new languages
- TurBLiMP evaluation tooling
- Academic replication and verification
Tech stack: Python · Rust · LaTeX · HuggingFace · MTEB
Resources & links
I want to join this project
Verify your Google account, fill out the form, then pick a task from the GitHub issue list to get started.