Building Khasi-English Word Embeddings for Real-World NLP in Northeast India
In Northeast India, linguistic diversity is both a cultural asset and a technical challenge, and building robust NLP systems takes more than off-the-shelf models. At MWire Consulting, we’ve been developing Khasi-English word embeddings to power real-world applications, from chatbots and search engines to document classification and voice interfaces.
Why Khasi Matters in NLP
Khasi is spoken by over a million people across Meghalaya, yet it remains underrepresented in mainstream NLP research. Most models trained on English or Hindi corpora fail to capture the nuances of Khasi syntax, morphology, and semantics. This gap limits the usability of AI tools in governance, education, and citizen services.
Our goal: create embeddings that reflect Khasi-English bilingual semantics, enabling downstream tasks like intent detection, translation, and retrieval-augmented generation (RAG).
Our Approach: Corpus First, Embeddings Second
We started by curating a clean, deduplicated corpus of Khasi-English sentence pairs. This involved:
- Regex-based cleaning in VS Code and Python
- Whitespace and punctuation normalization
- Manual filtering of noisy translations and code-mixed entries
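The automated part of these steps can be sketched in a few lines of Python. This is a minimal illustration, not our production pipeline: the function names `normalize` and `clean_pairs` are hypothetical, and the specific rules (NFC normalization, straightening curly quotes, removing spaces before punctuation, case-insensitive exact deduplication) are assumptions about what a reasonable cleaning pass might include. Manual review of noisy or code-mixed pairs still has to happen separately.

```python
import re
import unicodedata

_WHITESPACE = re.compile(r"\s+")
_SPACE_BEFORE_PUNCT = re.compile(r"\s+([,.;:!?])")
# Map curly quotes to their straight equivalents.
_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def normalize(text: str) -> str:
    """Normalize Unicode form, quotes, whitespace, and spacing before punctuation."""
    # NFC keeps letters such as "ï" and "ñ" as single composed code points.
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_QUOTES)
    text = _WHITESPACE.sub(" ", text).strip()
    return _SPACE_BEFORE_PUNCT.sub(r"\1", text)


def clean_pairs(pairs):
    """Return normalized (khasi, english) pairs, dropping empty and duplicate pairs."""
    seen = set()
    cleaned = []
    for khasi, english in pairs:
        kha, eng = normalize(khasi), normalize(english)
        if not kha or not eng:
            continue  # one side is empty after cleaning
        key = (kha.casefold(), eng.casefold())
        if key in seen:
            continue  # same pair already kept, ignoring case
        seen.add(key)
        cleaned.append((kha, eng))
    return cleaned
```

Deduplicating after normalization, rather than before, catches pairs that differ only in spacing, quote style, or capitalization.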
Once the corpus was ready, we trained FastText embeddings with subword support. FastText builds each word vector from character n-grams, so it copes well with rare words and with Khasi’s derivational prefixes. We also experimented with GloVe and Word2Vec, benchmarking the models on semantic similarity and clustering tasks.
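To make the subword idea concrete, here is a small sketch of how FastText-style character n-grams are extracted. It is an illustration of the technique, not FastText's own code: the helper names `char_ngrams` and `shared_ngrams` are hypothetical, and the n-gram range of 3 to 6 matches FastText's documented defaults rather than any setting confirmed for our models. FastText also keeps the whole word as its own unit, which this sketch leaves out.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"  # "<" and ">" mark the start and end of the word
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


def shared_ngrams(a: str, b: str) -> set[str]:
    """Return the n-grams two words have in common."""
    return set(char_ngrams(a)) & set(char_ngrams(b))


if __name__ == "__main__":
    # Two illustrative Khasi nouns that share the prefix "jing-".
    print(sorted(shared_ngrams("jingim", "jingiap")))
```

Because words with a common prefix share n-grams, they share part of their vector representation, and a word missing from the training vocabulary can still be given a vector from the n-grams it contains.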
Key Insights
- Subword models outperform word-level embeddings for Khasi, most likely because character n-grams capture its productive derivational prefixes and give usable vectors for rare or unseen words
- Bilingual embeddings help bridge semantic gaps in code-mixed queries
- Corpus quality matters more than size, especially for low-resource languages
Real-World Applications
These embeddings now power several prototypes:
- A Khasi-English chatbot for citizen services
- A semantic search engine for local governance documents
- A classifier for tourism-related queries in Shillong and Mylliem
We’re also integrating them into RAG pipelines for document Q&A and retrieval tasks on constrained hardware.
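The retrieval step behind a semantic search or RAG prototype can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of our deployed system: `TOY_VECTORS` holds invented three-dimensional vectors standing in for trained bilingual embeddings, and the functions `embed`, `cosine`, and `search` are hypothetical helpers. Averaging word vectors is one common baseline for sentence representations; tokens missing from the table are skipped here, whereas a FastText model could still produce vectors for them from their n-grams.

```python
import math

# Toy vectors invented for illustration only; real ones come from a trained model.
TOY_VECTORS = {
    "water": [0.9, 0.1, 0.0],
    "um": [0.8, 0.2, 0.0],       # Khasi: water
    "supply": [0.7, 0.3, 0.1],
    "village": [0.1, 0.9, 0.0],
    "shnong": [0.2, 0.8, 0.1],   # Khasi: village
    "council": [0.0, 0.8, 0.3],
}


def embed(text, vectors):
    """Average the vectors of known tokens; return None if no token is known."""
    tokens = [t for t in text.lower().split() if t in vectors]
    if not tokens:
        return None
    dim = len(next(iter(vectors.values())))
    return [sum(vectors[t][i] for t in tokens) / len(tokens) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query, documents, vectors, top_k=3):
    """Return up to top_k (score, document) pairs, most similar first."""
    query_vec = embed(query, vectors)
    if query_vec is None:
        return []
    scored = []
    for doc in documents:
        doc_vec = embed(doc, vectors)
        if doc_vec is not None:
            scored.append((cosine(query_vec, doc_vec), doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

With a shared Khasi-English vector space, a Khasi query word can rank an English document highly without an explicit translation step, which is what makes this approach attractive for code-mixed queries. A pure-Python loop like this is only practical for small collections; larger indexes would need vectorized math or an approximate nearest-neighbor library.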
What’s Next
If you’re working on NLP for low-resource languages or want to collaborate on regional deployments in Shillong or elsewhere in Meghalaya, let’s connect.