The 7,000 languages problem
A language rarely dies loudly. It slips out of the schoolyard, then the workplace, then the screen, until one day the last person who thinks in it is gone — and a particular way of seeing the world goes quiet at the same time. By the count kept by the Endangered Languages Project, more than forty percent of the world's roughly seven thousand languages are now at risk of receding this way, many of them within the century.
Modern language AI, meanwhile, works well for a few dozen languages, acceptably for perhaps a hundred, and effectively not at all for the rest. These two facts are usually told as separate stories, by different people in different rooms: one about culture and loss, the other about missing training data. We think they are the same story, and that telling them together is the only way to see either one clearly. This is what we are working on, so it is worth setting down plainly.
How a language dies
Extinction is seldom a single event. A language recedes, generation by generation, and UNESCO's Atlas of the World's Languages in Danger describes that recession as a scale. The axis that matters is intergenerational transmission: whether the language is still being handed to children.
- Vulnerable — most children still speak it, but its use is narrowing to certain settings, usually the home.
- Definitely endangered — children no longer learn it as a mother tongue.
- Severely endangered — only grandparents and older speakers use it; the parents' generation may understand it but does not speak it back.
- Critically endangered — the youngest fluent speakers are elderly, and the language has dropped out of daily life.
- Extinct — no one is left who speaks or remembers it.
Read that scale carefully and you notice it is not really about speaker counts. A language with millions of speakers can sit at vulnerable if it is quietly losing domains — the spheres of life in which it gets used. That detail matters more than it first appears, and we will come back to it.
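For readers who think in code, the scale can be sketched as a small classifier. This is an illustration only, not UNESCO's assessment method: the field names are hypothetical, each level is reduced to the yes/no conditions listed above, and a language whose children learn it with no narrowing of domains is treated as off the scale. The speaker count is stored but never consulted.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Endangerment(IntEnum):
    """The five levels listed above, ordered from least to most severe."""
    VULNERABLE = 1
    DEFINITELY_ENDANGERED = 2
    SEVERELY_ENDANGERED = 3
    CRITICALLY_ENDANGERED = 4
    EXTINCT = 5


@dataclass
class LanguageProfile:
    # Every field name here is hypothetical, chosen for this sketch.
    name: str
    speakers: int                  # recorded, but deliberately unused below
    children_learn_it: bool        # acquired by children as a mother tongue
    domains_narrowing: bool        # use is shrinking to settings like the home
    parents_speak_it: bool         # the parents' generation actively speaks it
    used_in_daily_life: bool       # still heard in everyday settings
    any_speakers_left: bool = True


def classify(lang: LanguageProfile) -> Optional[Endangerment]:
    """Place a language on the scale; None means it is not on the scale."""
    if not lang.any_speakers_left:
        return Endangerment.EXTINCT
    if lang.children_learn_it:
        # Transmission is intact; only domain loss puts the language at risk.
        return Endangerment.VULNERABLE if lang.domains_narrowing else None
    if lang.parents_speak_it:
        return Endangerment.DEFINITELY_ENDANGERED
    if lang.used_in_daily_life:
        return Endangerment.SEVERELY_ENDANGERED
    return Endangerment.CRITICALLY_ENDANGERED


# Two invented profiles: the larger language is still on the scale.
large = LanguageProfile("large", 5_000_000, True, True, True, True)
small = LanguageProfile("small", 40_000, False, False, True, True)
print(classify(large).name, classify(small).name)
```

The two profiles are made up. The point of the sketch is that changing `speakers` alone never moves a language along the scale; only transmission and domains of use do.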
Why languages go
The drivers are rarely mysterious, and they are rarely about the language itself. A language declines when speaking it stops paying — economically, socially, institutionally.
A dominant language takes over the places where the future is decided: the school, the office, the screen, the marketplace. Education is delivered in the majority tongue; media and entertainment arrive in it; migration and urbanization scatter tight-knit speech communities into cities where the lingua franca wins by default. And often a generation comes to feel a quiet shame about the mother tongue — that it marks them as rural, or poor, or behind — so they raise their children in the prestige language as an act of love. Each of these is a domain lost. Enough of them, and transmission breaks.
There is a newer driver, and it is the one we sit closest to: the digital domain. More of ordinary life now happens through software that assumes you speak one of the few languages it was built for. A language in which you cannot search, dictate, translate, or be understood by a machine has lost yet another sphere of use, and domain loss, as the scale above makes clear, is exactly how endangerment begins. The gap in AI is not a neutral absence. It is a force pushing in the wrong direction.
Why the loss matters
It is tempting to treat this as sentiment — a museum case for things that were going to fade anyway. It isn't.
A language is not a swappable set of labels for a shared world. It is a particular way of carving the world up: a system of categories, relationships, and distinctions that took centuries to settle. Much of what a culture knows is stored only in its language and never written down — how to read a coastline, which plants treat which ailments, the genealogies and histories and laws carried in oral tradition. When the last fluent speaker of such a language dies, that knowledge does not transfer to the majority tongue. It is simply gone, and it does not come back.
There is also a quieter loss, harder to price but real: the narrowing of how human beings can think. Every language is evidence of a different solution to the problem of making meaning. A world that converges on a handful of them is a world that has thrown away most of its working notes on what language — and thought — can be.
Preservation, and the harder thing after it
The response to all this has two halves, and they are not the same.
The first is documentation — recording a language while fluent speakers remain, so that something survives even in the worst case. This is the work of catalogs like Ethnologue, which maps the world's living languages, and the Endangered Languages Project, whose catalog of endangered languages is governed by the First Peoples' Cultural Council and the University of Hawaiʻi at Mānoa. Documentation is necessary, and it is not enough. A recorded language with no living speakers is a specimen, not a tongue.
The second, harder half is revitalization — getting a language spoken again, by children, in daily life. It is rarer, but it happens. Hawaiian went from a few hundred native-speaking children to immersion schools and a new generation of speakers; Māori followed a similar arc through "language nests" that put elders and infants in the same room. What these recoveries share is the thing documentation can't supply on its own: they returned the language to use, in the domains where it had been pushed out.
This is where technology stops being a bystander. Built carelessly, AI accelerates concentration — it makes the dominant languages even more convenient and everything else more marginal. Built deliberately, it can hand a language back some of its lost domains: speech recognition that lets people dictate in their own tongue, translation that lets them be read by others, tools that make the language usable in exactly the digital spaces where it was disappearing. The technology is not neutral. It tilts, and we get to choose which way.
The data problem, honestly stated
So why hasn't AI simply served these languages already? Here the "not enough data" line is true, and worth being precise about. The systems that handle English or Mandarin were trained on quantities of text and audio accumulated over decades of the internet being written largely in those languages. A language that came late to digital life, or lives mostly in speech, has no comparable corpus — not because it is small, but because no one ever built it.
The field has three broad ways to close that gap. Transfer learning adapts a big multilingual model to a new language with a little local data, which is how models such as Whisper are typically fine-tuned for new languages, but it fades the further a language sits, in script or sound or grammar, from the ones the model already knows. Synthetic data manufactures more training material through back-translation and generated speech, but it can only photocopy the patterns already present; it cannot invent the registers and regional variants that were never in the seed. Community-sourced data, in which speakers record and validate their own language (the way Mozilla Common Voice and benchmarks like FLEURS were built), is the only one of the three that creates genuinely new linguistic information.
Community data has a ceiling too, but its ceiling is not technical. It is one of incentive. The data is slow and costly to gather, and the people who can produce it usually have no reason to, because the tools that result are built somewhere else, for someone else, and never come back to the community that supplied the raw material.
Which is where we start
That last gap is the one Beyondlex is built to close, and it is the reason the company exists, not a feature we bolted on. Everything we make rests on one bet: that the data which makes a language usable again has to be gathered with its speakers, not extracted from them. Our products (Dialect Speech and Dialect Translate, with Beyondlex Studio as the workbench beneath them) are the visible end of that bet. The part we are building first is the less visible end: Boli, a platform where native speakers record, transcribe, and validate their own dialects, and watch the tools improve for them as they do. The name is the Urdu and Punjabi word for everyday speech, which is the whole intention.
It is not open yet. When it is, the people who fill it will be the first to benefit from it — that ordering is the whole design. A loop in which the speech community is also the first user is the only version of this we think is worth building, because it is the only one that keeps producing the living variety the other approaches cannot reach.
We begin with Pakistani dialects, and the endangerment frame is part of why. Dialect is where erosion starts: the first thing a speaker quietly drops is often a regional way of speaking, smoothed toward a prestige standard long before the language itself is at risk. A region this dialect-rich, with speech communities willing to collaborate, is the right place to test whether the community-centered loop produces better systems, and better outcomes, than the extract-and-deploy one. We think it does. We expect what we learn here to carry well beyond the languages we start with.
Seven thousand languages, more than forty percent of them at risk, and modern AI serving a few dozen. Filed as a data problem, it sounds like something that resolves itself once the datasets fill in. Told honestly, it is a question of who the technology is built with, and which way it tilts. That is the version of the problem we are working on.
If you are building toward the same question — in this region or elsewhere — we'd like to hear from you.