Language-Native Embeddings
An open methodology that compiles the methods and steps anyone needs to build efficient tokenizers and embedding models for their own language and domain.
Vision
Define an open-source standard method for producing a language-native tokenizer and embedding model for any language or domain. The methodology starts with Turkish and applies to all underrepresented languages.
Process
Build a benchmark → train a BPE tokenizer → run frequency analysis → update the tokenizer → rebuild the embedding layer → align with the original model → continue training in the target language → evaluate on MTEB.
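As a rough illustration of the "rebuild the embedding layer" step, one common approach is to initialize each new token's vector as the mean of the vectors of the pieces the original tokenizer splits it into. This is an assumption for illustration, not necessarily the method this project uses. The names `init_new_embeddings`, `old_tokenize`, and `old_embeddings`, as well as the toy vocabulary, are hypothetical.

```python
from typing import Callable, Dict, List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def init_new_embeddings(
    new_vocab: List[str],
    old_tokenize: Callable[[str], List[str]],
    old_embeddings: Dict[str, Vector],
) -> Dict[str, Vector]:
    """Build an embedding table for new_vocab from the old model's table.

    Tokens already in the old vocabulary keep their vector. Every other
    token gets the mean of the vectors of its old sub-pieces.
    """
    new_embeddings: Dict[str, Vector] = {}
    for token in new_vocab:
        if token in old_embeddings:
            new_embeddings[token] = list(old_embeddings[token])
            continue
        pieces = [p for p in old_tokenize(token) if p in old_embeddings]
        if not pieces:
            raise ValueError(f"no old pieces cover token {token!r}")
        new_embeddings[token] = mean_vector([old_embeddings[p] for p in pieces])
    return new_embeddings


# Toy 2-dimensional embedding table (made-up values, for illustration only).
old_embeddings = {"ev": [1.0, 0.0], "ler": [0.0, 1.0], "de": [1.0, 1.0]}


def old_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for the original tokenizer: greedy longest match."""
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in old_embeddings:
                pieces.append(text[i:j])
                i = j
                break
        else:
            i += 1  # skip characters the toy vocabulary cannot cover
    return pieces


new_embeddings = init_new_embeddings(["ev", "evlerde"], old_tokenize, old_embeddings)
```

In a real pipeline the vectors would come from the original model's embedding matrix, and the alignment and continued-training steps would then refine these initial values.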
Contribution areas
- Prepare STSb/MTEB benchmarks for a new language or domain
- BPE tokenizer training and frequency analysis
- Embedding alignment experiments
- Model cards and documentation
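For contributors picking up the frequency analysis task, the following is a minimal sketch of what such an analysis might compute: per-token counts and the average number of tokens per whitespace-separated word (often called fertility). The choice of these two statistics is an assumption, and `token_stats` and `char_bigram_tokenize` are hypothetical names. The toy tokenizer only stands in for a trained BPE tokenizer.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def token_stats(
    corpus: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> Tuple[Counter, float]:
    """Return token frequencies and the mean number of tokens per word."""
    counts: Counter = Counter()
    n_words = 0
    for sentence in corpus:
        for word in sentence.split():
            n_words += 1
            counts.update(tokenize(word))
    n_tokens = sum(counts.values())
    fertility = n_tokens / n_words if n_words else 0.0
    return counts, fertility


def char_bigram_tokenize(word: str) -> List[str]:
    """Hypothetical stand-in tokenizer: split a word into 2-character chunks."""
    return [word[i:i + 2] for i in range(0, len(word), 2)]


corpus = ["evler ve okullar", "evde okul"]
counts, fertility = token_stats(corpus, char_bigram_tokenize)
print(counts.most_common(3), fertility)
```

A lower tokens-per-word figure on target-language text generally suggests the tokenizer fits the language better, which is one way to compare a candidate tokenizer against the original.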
Tech stack: Python · HuggingFace · sentence-transformers · MTEB
License: CC BY 4.0
Resources & links
I want to join this project
Verify your Google account, fill out the form, then pick a task from the GitHub issue list to get started.