Turkish Morphological Tokenizer

A modern tokenizer that splits text into morphological units faithful to Turkish phonetics and can recombine them.

tokenizermorphologyturkishbenchmarkresearch

Vision

Prove at an academic level - and release as open source - a hybrid tokenization approach that understands Turkish morphology, splits affixes correctly, and improves token efficiency.

What it does

Splits Turkish text into morphological units, performs hybrid tokenization with BPE, and evaluates on STS-TR and MTEB-TR. Includes a high-performance Rust tokenizer implementation.

Contribution areas

Extend MTEB-TR evaluation
Rust tokenizer performance improvements
Morphological adaptation for new languages
TurBLiMP evaluation tooling
Academic replication and verification

Tech stack: Python · Rust · LaTeX · HuggingFace · MTEB

I want to join this project

Verify your Google account, fill out the form, then pick a task from the GitHub issue list to get started.

Turkish Morphological Tokenizer

Vision

What it does

Contribution areas

Resources & links

I want to join this project