← Community
ActivePython · Rust · TeX

Turkish Morphological Tokenizer

A modern tokenizer that splits text into morphological units faithful to Turkish phonetics and can recombine them.

tokenizermorphologyturkishbenchmarkresearch

Vision

Prove at an academic level - and release as open source - a hybrid tokenization approach that understands Turkish morphology, splits affixes correctly, and improves token efficiency.

What it does

Splits Turkish text into morphological units, performs hybrid tokenization with BPE, and evaluates on STS-TR and MTEB-TR. Includes a high-performance Rust tokenizer implementation.

Contribution areas

  • Extend MTEB-TR evaluation
  • Rust tokenizer performance improvements
  • Morphological adaptation for new languages
  • TurBLiMP evaluation tooling
  • Academic replication and verification

Tech stack: Python · Rust · LaTeX · HuggingFace · MTEB

I want to join this project

Verify your Google account, fill out the form, then pick a task from the GitHub issue list to get started.

Enterprise pilot, API access, investment, and partnership requests require a verified Google account.

Checking session…