Language-Native Embeddings

An open methodology compiling methods and steps for anyone to build efficient tokenizers and embedding models for their own language and domain.

embeddingstokenizermethodologymtebstsb

Vision

Define, as open source, the standard method for producing a language-native tokenizer and embedding model for any language or domain. Starting with Turkish, this methodology applies to all underrepresented languages.

Process

Build a benchmark → train BPE tokenizer → frequency analysis → renew the tokenizer → rebuild the embedding layer → align with the original model → continued training in the target language → measure on MTEB.

Contribution areas

Prepare STSb/MTEB benchmarks for a new language or domain
BPE tokenizer training and frequency analysis
Embedding alignment experiments
Model cards and documentation

Tech stack: Python · HuggingFace · sentence-transformers · MTEB · License: CC BY 4.0

I want to join this project

Verify your Google account, fill out the form, then pick a task from the GitHub issue list to get started.

Language-Native Embeddings

Vision

Process

Contribution areas

Resources & links

I want to join this project