Why we start with Pakistani dialects
A reasonable question, when you say you're going to work on under-served languages: why these ones?
The honest answer is the personal one: this is home. Beyondlex is being built in Pakistan, by people from Pakistan, in conversation with the speech communities here. There is no version of this work in which we start with languages we don't live with.
But that is not the whole answer. The technical case for starting in this region is, we think, the more interesting one — and worth writing down.
Dialect richness, in concrete terms
Pakistan is one of the most dialect-rich regions in the world per square kilometer. A short, deliberately incomplete tour:
- Urdu is itself a register continuum. The Urdu of Karachi, of Lahore, of Hyderabad, and of literary tradition are not the same Urdu. The honorifics shift. The lexical borrowings shift. The boundary with Hindi is, depending on register, anywhere from porous to invisible.
- Punjabi divides into several major dialect groups (Majhi, Pothohari, Lahnda among them) with significant lexical and phonological differences. The Punjabi of Lahore and the Punjabi of Mirpur are mutually intelligible but cleanly distinct.
- Saraiki, spoken across south Punjab, is treated by some classifications as a Punjabi dialect, by others as its own language. The classification is a research question, not a settled fact.
- Sindhi has its own range of regional varieties — and is written in an extended Perso-Arabic script whose additional characters and diacritics most modeling pipelines normalize away or drop.
- Pashto divides along a major north–south axis (Yusufzai vs. Kandahari being the canonical poles), with phonological differences that affect speech recognition in ways most ASR systems handle poorly.
- Balochi, Brahui, Hindko, Shina, and Balti each carry their own dialect inventories and their own gaps in the literature.
This is a partial list. Within almost every entry, there are smaller divisions a careful linguist would not collapse. There is enough dialect work here to keep a research lab busy for a long time.
Dialect is where current systems quietly fail
Most language-AI systems treat a language as a single thing. You can verify this yourself: ask a major LLM the same casual question in Lahori Punjabi, then in Pothohari, then in Punjabi transliterated into Urdu script (three forms of the same language a speaker would switch between without thinking), and watch how the quality, accuracy, and apparent comprehension of its replies shift.
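The probe described above can be scripted. A minimal sketch follows; `ask_model` and the prompt placeholders are assumptions for illustration, not a real client or real prompts — wire `ask_model` to whichever LLM API you use, and fill in the same question written out in each variety.

```python
from typing import Callable, Dict


def probe_dialects(ask_model: Callable[[str], str],
                   prompts: Dict[str, str]) -> Dict[str, str]:
    """Send the same underlying question, phrased in each variety,
    through the model and collect the replies side by side."""
    return {variety: ask_model(text) for variety, text in prompts.items()}


# Illustration only: ask_model is a stand-in, and each "..." would be
# the same casual question written in that variety.
replies = probe_dialects(
    ask_model=lambda prompt: "...model reply...",
    prompts={
        "Lahori Punjabi": "...",
        "Pothohari": "...",
        "Punjabi in Urdu script": "...",
    },
)
for variety, reply in replies.items():
    print(f"{variety}: {reply}")
```

Comparing the three replies by hand, or against a simple rubric for accuracy and register, turns the gap in dialect coverage from an anecdote into something you can tabulate.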
Speech recognition is worse. An ASR model trained on a single dialect of a language can collapse to near-uselessness on a neighboring one. This is not a flaw users complain about loudly, because speakers in those communities have generally given up on the technology working for them; the quiet abandonment is itself the failure mode.
Translation has a related problem. Translating a Saraiki sentence by routing through Punjabi or Urdu loses register information that the original speaker had encoded deliberately. A translation that smooths a dialect into its parent language is a translation that has decided the dialect didn't matter.
We think it matters.
Why this is also good for AI more broadly
A research program built around dialect-aware modeling forces a discipline that high-resource AI has been able to skip. You cannot model dialect well by averaging it. You cannot evaluate dialect well by translating from English benchmarks. You cannot align a dialect-aware model by sampling preference data only from English-speaking annotators.
Each of these constraints points at a research question that, in our view, the field is going to have to answer eventually. We think starting with the constraints in place produces better answers than starting with the constraints removed and adding them back.
Pakistani dialects are where we begin because the dialect questions are dense, the speech communities are willing collaborators, and the work is needed. We expect what we learn here to generalize widely. We expect, more than anything, to be surprised, and to write down what surprises us here.
If you are working on or thinking about dialect-rich language AI, in this region or elsewhere, we'd like to hear from you.