Building Khasi-English Word Embeddings for Real-World NLP in Northeast India

B. Nyalang
August 29, 2025

In Northeast India, where linguistic diversity is both a cultural asset and a technical challenge, building robust NLP systems requires more than just off-the-shelf models. At MWire Consulting, we’ve been developing Khasi-English word embeddings to power real-world applications—from chatbots and search engines to document classification and voice interfaces.

Why Khasi Matters in NLP

Khasi is spoken by over a million people across Meghalaya, yet it remains underrepresented in mainstream NLP research. Most models trained on English or Hindi corpora fail to capture the nuances of Khasi syntax, morphology, and semantics. This gap limits the usability of AI tools in governance, education, and citizen services.

Our goal: create embeddings that reflect Khasi-English bilingual semantics, enabling downstream tasks like intent detection, translation, and retrieval-augmented generation (RAG).

Our Approach: Corpus First, Embeddings Second

We started by curating a clean, deduplicated corpus of Khasi-English sentence pairs. This involved:

Regex-based cleaning in VS Code and Python
Whitespace and punctuation normalization
Manual filtering of noisy translations and code-mixed entries
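The automated part of these steps can be sketched roughly as follows. This is a minimal illustration rather than our production script: the function names `normalize` and `clean_pairs`, the specific regex rules, and the sample sentences are all assumptions made for the example, and the manual filtering step is not represented.

```python
import re
import unicodedata

# Illustrative rules only; a real pipeline would tune these to the corpus.
_WHITESPACE = re.compile(r"\s+")
_SPACE_BEFORE_PUNCT = re.compile(r"\s+([,.;:!?])")
_REPEATED_PUNCT = re.compile(r"([,;:!?])\1+")


def normalize(text: str) -> str:
    """Normalize Unicode form, whitespace, and punctuation spacing."""
    text = unicodedata.normalize("NFC", text)
    # Map curly apostrophes to straight ones so variants compare equal.
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    text = _WHITESPACE.sub(" ", text).strip()
    text = _SPACE_BEFORE_PUNCT.sub(r"\1", text)   # "word !" -> "word!"
    text = _REPEATED_PUNCT.sub(r"\1", text)       # "!!" -> "!"
    return text


def clean_pairs(pairs):
    """Return normalized, non-empty, deduplicated (khasi, english) pairs."""
    seen = set()
    cleaned = []
    for khasi, english in pairs:
        kha, eng = normalize(khasi), normalize(english)
        if not kha or not eng:
            continue  # drop pairs with an empty side
        key = (kha.casefold(), eng.casefold())
        if key in seen:
            continue  # drop case-insensitive duplicates
        seen.add(key)
        cleaned.append((kha, eng))
    return cleaned


if __name__ == "__main__":
    sample = [
        ("Khublei  shibun !", "Thank you very much!"),
        ("khublei shibun!", "thank you very much!"),
        ("", "empty source side"),
    ]
    for pair in clean_pairs(sample):
        print(pair)
```

Deduplicating on the casefolded pair, rather than on the Khasi side alone, keeps legitimate alternative translations of the same sentence.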

Once the corpus was ready, we trained FastText embeddings with subword support—ideal for morphologically rich languages like Khasi. We also experimented with GloVe and Word2Vec, benchmarking them on semantic similarity and clustering tasks.
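To make "subword support" concrete: FastText represents a word through its character n-grams, computed after wrapping the word in `<` and `>` boundary markers, so words that share a prefix also share part of their representation. The sketch below only illustrates that decomposition in plain Python. It does not train anything, the helper names `char_ngrams` and `shared_ngrams` are hypothetical, and the two example words are chosen purely for illustration. The 3 to 6 character range mirrors FastText's default `minn`/`maxn` settings.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> list[str]:
    """List the character n-grams of a word, FastText style.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    produce n-grams distinct from the same characters mid-word.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


def shared_ngrams(word_a: str, word_b: str, n: int = 3) -> set[str]:
    """Return the n-grams of length n that two words have in common."""
    return set(char_ngrams(word_a, n, n)) & set(char_ngrams(word_b, n, n))


if __name__ == "__main__":
    print(char_ngrams("jingim", 3, 3))
    print(sorted(shared_ngrams("jingim", "jingieit")))
```

Because a word's vector is built from the vectors of its n-grams, a form that never appeared in training can still get a usable vector from the pieces it shares with known words. Word2Vec and GloVe, which store one vector per whole word, cannot do this.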

Key Insights

Subword models outperform word-level embeddings for Khasi, likely because productive prefixes such as jing-, nong-, and pyn- yield many related word forms
Bilingual embeddings help bridge semantic gaps in code-mixed queries
Corpus quality matters more than size—especially for low-resource languages
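Comparisons like these depend on how semantic similarity is scored. One common approach, shown here as an assumption since this post does not specify the metric we used, is to compute the cosine similarity of each word pair under the model and take the Spearman rank correlation against human ratings. The sketch uses only the standard library. The function names, the toy two-dimensional vectors, and the ratings are invented for illustration and are not results from our benchmark.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y) if dev_x and dev_y else 0.0


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate(vectors, rated_pairs):
    """Score a model on (word1, word2, human_rating) triples.

    Pairs with an out-of-vocabulary word are skipped; the number of
    pairs actually scored is returned alongside the correlation.
    """
    model_scores, human_scores = [], []
    for word1, word2, rating in rated_pairs:
        if word1 in vectors and word2 in vectors:
            model_scores.append(cosine(vectors[word1], vectors[word2]))
            human_scores.append(rating)
    return spearman(model_scores, human_scores), len(model_scores)


if __name__ == "__main__":
    toy_vectors = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
    toy_ratings = [("a", "b", 9.0), ("a", "c", 5.0), ("a", "d", 1.0), ("a", "missing", 4.0)]
    print(evaluate(toy_vectors, toy_ratings))
```

Reporting the number of scored pairs matters when comparing models: a word-level model that skips many out-of-vocabulary pairs is being graded on an easier subset than a subword model that covers them all.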

Real-World Applications

These embeddings now power several prototypes:

A Khasi-English chatbot for citizen services
A semantic search engine for local governance documents
A classifier for tourism-related queries in Shillong and Mylliem

We’re also integrating them into RAG pipelines for document Q&A and retrieval tasks on constrained hardware.
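As a rough illustration of how word embeddings can drive semantic search and the retrieval step of a RAG pipeline, the sketch below averages the word vectors in each text and ranks documents by cosine similarity to the query. It is a simplified stand-in, not our deployed system: the `embed` and `search` helpers are hypothetical, and the three-dimensional vectors are hand-made toy values rather than trained embeddings. The code-mixed query pairs the Khasi word "um" (water) with an English word.

```python
import math


def embed(text, vectors, dim):
    """Average the vectors of known tokens; zero vector if none are known."""
    known = [vectors[token] for token in text.lower().split() if token in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query, documents, vectors, dim, top_k=3):
    """Return up to top_k (score, document) pairs, best match first."""
    query_vector = embed(query, vectors, dim)
    scored = [(cosine(query_vector, embed(doc, vectors, dim)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy shared space: "um" and "water" are placed close together by hand.
    toy_vectors = {
        "water": [1.0, 0.0, 0.0],
        "um": [0.9, 0.1, 0.0],
        "supply": [0.8, 0.2, 0.0],
        "tax": [0.0, 1.0, 0.0],
        "property": [0.1, 0.9, 0.0],
        "tourism": [0.0, 0.0, 1.0],
        "permit": [0.0, 0.2, 0.8],
    }
    documents = [
        "property tax notice",
        "water supply schedule",
        "tourism permit form",
    ]
    for score, doc in search("um supply", documents, toy_vectors, dim=3):
        print(f"{score:.3f}  {doc}")
```

Averaging needs no neural encoder at query time, only a vector lookup table and a few arithmetic operations per document, so it is a light option for constrained hardware. The trade-off is that word order is ignored.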

What’s Next

If you’re working on NLP for low-resource languages or want to collaborate on regional deployments in Shillong or Meghalaya, let’s connect.
