
Building Khasi-English Word Embeddings for Real-World NLP in Northeast India

In Northeast India, where linguistic diversity is both a cultural asset and a technical challenge, building robust NLP systems requires more than off-the-shelf models. At MWire Consulting, we’ve been developing Khasi-English word embeddings to power real-world applications, from chatbots and search engines to document classification and voice interfaces.

Why Khasi Matters in NLP

Khasi is spoken by over a million people across Meghalaya, yet it remains underrepresented in mainstream NLP research. Most models trained on English or Hindi corpora fail to capture the nuances of Khasi syntax, morphology, and semantics. This gap limits the usability of AI tools in governance, education, and citizen services.

Our goal: create embeddings that reflect Khasi-English bilingual semantics, enabling downstream tasks like intent detection, translation, and retrieval-augmented generation (RAG).

Our Approach: Corpus First, Embeddings Second

We started by curating a clean, deduplicated corpus of Khasi-English sentence pairs. This involved:

  • Regex-based cleaning in VS Code and Python
  • Whitespace and punctuation normalization
  • Manual filtering of noisy translations and code-mixed entries
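The automated part of these steps can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not our exact pipeline: the function names `normalize` and `clean_pairs`, the specific punctuation rules, and the sample sentence pair are all assumptions chosen for the example. Manual filtering of noisy or code-mixed entries would still happen afterwards.

```python
import re
import unicodedata

# Map curly quotes to straight quotes so variants compare equal.
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})


def normalize(text: str) -> str:
    """Normalize Unicode form, quote variants, punctuation spacing, and whitespace."""
    # NFC keeps characters such as "ï" and "ñ" in a single composed form.
    text = unicodedata.normalize("NFC", text)
    text = text.translate(QUOTE_MAP)
    # Drop stray spaces before common punctuation marks.
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    # Collapse runs of whitespace (including tabs and newlines) to one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def clean_pairs(pairs):
    """Normalize (khasi, english) pairs; drop empty and duplicate entries."""
    seen = set()
    cleaned = []
    for kha, eng in pairs:
        kha, eng = normalize(kha), normalize(eng)
        if not kha or not eng:
            continue  # one side is missing
        key = (kha.casefold(), eng.casefold())  # case-insensitive dedup key
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((kha, eng))
    return cleaned


# Illustrative input only: a noisy pair, a near-duplicate, and an empty entry.
raw = [
    ("Khublei   shibun !", "Thank you  very much ."),
    ("khublei shibun!", "thank you very much."),
    ("", "empty source side"),
]
print(clean_pairs(raw))
```

Deduplicating on a case-folded key while keeping the first surface form is one possible choice; a stricter or looser notion of "duplicate" may fit other corpora better.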

Once the corpus was ready, we trained FastText embeddings with subword (character n-gram) support, which is well suited to morphologically rich languages like Khasi. We also trained GloVe and Word2Vec embeddings and benchmarked all three on semantic similarity and clustering tasks.
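To see why subword support helps, it is useful to look at the character n-grams FastText works with. The sketch below is a simplified illustration, not the library's implementation: it reproduces the `<` and `>` boundary markers and the usual default n-gram lengths of 3 to 6, but skips the hashing of n-grams into buckets. The helper names and the word pair "stad" / "jingstad" are chosen for illustration.

```python
def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> set[str]:
    """Return the character n-grams of a word, using < and > as boundary markers."""
    marked = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            grams.add(marked[i:i + n])
    return grams


def ngram_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the n-gram sets of two words."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    return len(ga & gb) / len(ga | gb)


# A base word and a prefixed form share several n-grams, so a subword model
# can relate them even if one of the two is rare in the training corpus.
shared = char_ngrams("stad") & char_ngrams("jingstad")
print(sorted(shared))
print(round(ngram_overlap("stad", "jingstad"), 3))
```

Because a word's vector is built from its n-gram vectors, a form that never appeared in training can still receive a usable embedding, which whole-word models like Word2Vec and GloVe cannot provide.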

Key Insights

  • Subword models outperformed whole-word embeddings for Khasi in our benchmarks, likely because character n-grams capture its productive prefixing and compounding
  • Bilingual embeddings help bridge semantic gaps in code-mixed queries
  • Corpus quality matters more than size, especially for low-resource languages
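Comparisons like these are commonly scored by checking how well a model's similarity scores agree in rank with human judgments for the same word pairs. The sketch below shows one way to compute that agreement (Spearman rank correlation) in plain Python. It is an assumed setup for illustration: the function names are hypothetical and the scores are made-up placeholders, not our benchmark data or results.

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (sx * sy)


# Placeholder numbers only: human ratings and one model's cosine similarities
# for the same five word pairs.
human_scores = [9.0, 7.5, 6.0, 3.0, 1.0]
model_scores = [0.82, 0.75, 0.40, 0.45, 0.10]
print(round(spearman(human_scores, model_scores), 3))
```

Running the same scoring for each embedding model on one shared list of word pairs gives directly comparable numbers.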

Real-World Applications

These embeddings now power several prototypes:

  • A Khasi-English chatbot for citizen services
  • A semantic search engine for local governance documents
  • A classifier for tourism-related queries in Shillong and Mylliem

We’re also integrating them into RAG pipelines for document Q&A and retrieval tasks on constrained hardware.
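The retrieval step in such a pipeline can be sketched in a few lines. The example below is a toy illustration under stated assumptions: the three-dimensional vectors are made up rather than trained, and `embed` and `search` are hypothetical helper names. It averages word vectors into a text vector and ranks documents by cosine similarity, so a code-mixed query can match an English document.

```python
import math

# Toy 3-dimensional vectors, invented for this example (not trained values).
VECTORS = {
    "um": [0.9, 0.1, 0.0],
    "water": [0.8, 0.2, 0.0],
    "supply": [0.7, 0.1, 0.2],
    "lynti": [0.0, 0.9, 0.1],
    "road": [0.1, 0.8, 0.1],
    "repair": [0.2, 0.6, 0.3],
}


def embed(text):
    """Average the vectors of known tokens; return None if none are known."""
    vecs = [VECTORS[t] for t in text.lower().split() if t in VECTORS]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search(query, documents, top_k=3):
    """Return up to top_k (score, document) tuples, best match first."""
    q = embed(query)
    if q is None:
        return []
    scored = []
    for doc in documents:
        d = embed(doc)
        if d is not None:
            scored.append((cosine(q, d), doc))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]


docs = ["water supply", "road repair"]
for score, doc in search("um supply", docs):  # code-mixed query
    print(f"{score:.3f}  {doc}")
```

This sketch skips unknown tokens; with a subword model, an out-of-vocabulary token would instead get a vector from its character n-grams. Averaged static embeddings are also cheap to compute and store, which matters on constrained hardware.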

What’s Next

If you’re working on NLP for low-resource languages or want to collaborate on regional deployments in Shillong or Meghalaya, let’s connect.
