Active · Documentation / Methodology

Language-Native Embeddings

An open methodology that compiles the methods and steps anyone needs to build efficient tokenizers and embedding models for their own language and domain.

embeddings · tokenizer · methodology · mteb · stsb

Vision

Define, as open source, a standard method for producing a language-native tokenizer and embedding model for any language or domain. The methodology starts with Turkish and is meant to apply to every underrepresented language.

Process

Build a benchmark → train a BPE tokenizer → run frequency analysis → replace the original tokenizer → rebuild the embedding layer → align it with the original model → continue training in the target language → evaluate on MTEB.
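
To make the tokenizer steps concrete, here is a minimal sketch of BPE training followed by a simple frequency check. It is a toy, standard-library illustration and not the project's implementation: the function names (train_bpe, encode_word, fertility), the "</w>" end-of-word marker, and the sample corpus are all assumptions made for this example. "Fertility" here means the average number of tokens per whitespace-separated word, one common way to compare tokenizers on a target language.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` BPE merges from a list of sentences (toy version)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    words = Counter(tuple(w) + ("</w>",) for line in corpus for w in line.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken alphabetically so runs are repeatable.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[apply_merge(symbols, best)] += freq
        words = new_words
    return merges


def encode_word(word, merges):
    """Tokenize one word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


def fertility(corpus, merges):
    """Average number of tokens per whitespace-separated word."""
    words = [w for line in corpus for w in line.split()]
    return sum(len(encode_word(w, merges)) for w in words) / len(words)


if __name__ == "__main__":
    # Hypothetical sample corpus, chosen only to show shared Turkish suffixes.
    sample = ["evler evlerde evlerden", "ev evde evler"]
    learned = train_bpe(sample, num_merges=10)
    print("merges:", learned)
    print("tokens for 'evlerde':", encode_word("evlerde", learned))
    print("fertility before:", fertility(sample, []))
    print("fertility after: ", fertility(sample, learned))
```

In practice this step would more likely use a production BPE trainer from the HuggingFace ecosystem on a large corpus; the sketch only shows what the merge table and the fertility number represent.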

Contribution areas

  • Prepare STSb/MTEB benchmarks for a new language or domain
  • BPE tokenizer training and frequency analysis
  • Embedding alignment experiments
  • Model cards and documentation
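
For contributors looking at the embedding-alignment area, the sketch below shows one common heuristic for rebuilding an embedding layer after the tokenizer changes: each new token is split with the old tokenizer, and its vector is initialized as the mean of the old vectors of those pieces. This is an assumption about how such a step might look, not a description of the project's method. The names mean_init_embeddings and greedy_tokenize, the greedy longest-match splitting, and the two-dimensional toy vectors are all hypothetical.

```python
def greedy_tokenize(text, vocab):
    """Split `text` into the longest pieces found in `vocab` (stand-in for the old tokenizer)."""
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            i += 1  # skip a character the old vocabulary cannot cover
    return pieces


def mean_init_embeddings(new_vocab, old_embeddings, dim):
    """Return {new_token: vector}, averaging the old vectors of each token's pieces."""
    table = {}
    for token in new_vocab:
        pieces = greedy_tokenize(token, old_embeddings)
        if not pieces:
            # No overlap with the old vocabulary: fall back to a zero vector.
            table[token] = [0.0] * dim
            continue
        table[token] = [
            sum(old_embeddings[p][k] for p in pieces) / len(pieces)
            for k in range(dim)
        ]
    return table


if __name__ == "__main__":
    # Toy "original model" embeddings with dimension 2 (illustrative values only).
    old = {"ev": [1.0, 0.0], "ler": [0.0, 1.0], "de": [1.0, 1.0]}
    new_tokens = ["ev", "evler", "evlerde", "xyz"]
    for tok, vec in mean_init_embeddings(new_tokens, old, dim=2).items():
        print(tok, vec)
```

Averaging keeps the new vectors in the same space as the original model, which gives the later alignment and continued-training steps a reasonable starting point. Tokens with no overlap would need a different fallback than zeros in a real experiment.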

Tech stack: Python · HuggingFace · sentence-transformers · MTEB
License: CC BY 4.0

I want to join this project

Verify your Google account, fill out the form, then pick a task from the GitHub issue list to get started.

Enterprise pilot, API access, investment, and partnership requests require a verified Google account.
