The Beyondlex thesis: AGI through language diversity
There are, by the most-cited estimate, around seven thousand living languages on Earth. Modern AI systems serve fewer than fifty of them in any meaningful sense. Most of the rest are absent from training data, absent from evaluation, and absent from the conversations that determine what counts as progress.
Beyondlex is built around a suspicion that this is not a footnote. We think the gap is the work — and that closing it is, plausibly, the same problem as building generally intelligent systems.
This post is the long version of why.
Languages are not labels
A common framing in NLP research treats a language as a tag attached to a row of data: lang=ur, lang=hi, lang=pa. Once tagged, the row joins all the other rows, the model is trained, and metrics are computed per-tag. The unit of analysis is the row.
We think the unit of analysis is the language. A language is not a label on data — it is a system of meaning. Its honorifics encode social relations. Its dialect borders encode geography. Its code-switching encodes affiliation. Its silences, often, encode more than its words. A model that has never seen a language has not just missed words; it has missed a way of thinking.
When a system encounters Urdu, it inherits not only a vocabulary but a grammatical attention to politeness levels (tum / aap / janab) that English speakers learn to handle through tone. When it encounters Pashto, it inherits a semantic field — pashtunwali — that no English word approximates. To serve a language is, at minimum, to be able to reason inside its categories. To serve a dialect is to do that with more precision than the language as a whole would allow.
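To make "reasoning inside its categories" concrete, here is a toy sketch (every name in it is ours, invented for illustration, not any real API): generating Urdu second-person address forces a choice over a social-register variable that English generation can leave entirely implicit.

```python
# Toy illustration: Urdu second-person address depends on a social register
# that English collapses into the single word "you". All names here are
# hypothetical and for illustration only.

from enum import Enum

class Register(Enum):
    INTIMATE = "tum"       # close friends, younger family
    POLITE = "aap"         # default respectful address
    DEFERENTIAL = "janab"  # honorific term of address, formal contexts

def second_person(register: Register) -> str:
    """Return the Urdu address form for a given social register."""
    return register.value

# English generation never has to bind this variable; Urdu generation must.
assert second_person(Register.POLITE) == "aap"
```

The point of the sketch is only that the variable exists: a model serving Urdu has to represent and condition on it, where an English-only model never needed the slot at all.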
The case for under-served languages, technically
There is a humanitarian argument for working on under-served languages. We make it. But there is also a technical argument, and we suspect it is the more important one for AI's long-term trajectory.
Modern frontier models are extraordinary in domains they have seen. They are weaker, often dramatically, in domains they have not. The set of domains they have not seen is vast — and concentrated, disproportionately, in the languages outside the top fifty.
Working on those languages forces a research agenda that high-resource AI has had the luxury of postponing:
- Sample efficiency. When you have ten million words instead of ten trillion, every modeling choice that improves data efficiency becomes a research priority rather than a curiosity.
- Representation that respects structure. Tokenization built for English collapses morphological information that languages with rich case systems carry. Fixing this is not optional in a low-resource setting.
- Evaluation that survives translation. Borrowed benchmarks measure translation quality more than native capability. New evaluation has to be co-designed with the speech community itself.
- Alignment beyond English-speaking annotators. Preference data, refusal behaviors, and instruction-following all shift across linguistic and cultural contexts. We're starting to learn how much.
Each of these problems looks niche from the high-resource side. From the low-resource side, each is foundational.
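One facet of the tokenization problem can be seen with no ML machinery at all, just the standard library (a minimal sketch; the example words are our choice): in UTF-8, Perso-Arabic script costs two bytes per character, so a byte-level tokenizer whose merges were learned on English fragments an Urdu word roughly twice as finely as its English equivalent, cutting through characters and morpheme boundaries alike.

```python
# Minimal sketch of the tokenization asymmetry: Perso-Arabic script costs
# 2 bytes per character in UTF-8, so before any Urdu-specific merges are
# learned, a byte-level vocabulary spends about twice the tokens per
# character on Urdu as on English.

def bytes_per_char(text: str) -> float:
    """Average UTF-8 bytes per character of a string."""
    return len(text.encode("utf-8")) / len(text)

english = "books"   # 5 characters, 5 UTF-8 bytes
urdu = "کتابوں"      # "books" in Urdu: 6 characters, 12 UTF-8 bytes

print(bytes_per_char(english))  # 1.0
print(bytes_per_char(urdu))     # 2.0
```

Byte cost is only the crudest symptom; the deeper loss is that splits made at arbitrary byte offsets ignore morpheme boundaries, which is exactly the structure a rich case system carries.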
A claim about AGI
The standard story about general intelligence in language models is that scale, a frontier-class architecture, and some unspecified amount of clever post-training will get us there. In that story, language is incidental — a substrate the model happens to use.
We don't believe it. We think language is closer to a sense organ. The way a system carves its inputs determines the categories in which it can think. A model trained on one carving — call it the English-language-internet carving, with certain dominant cultural norms — has a particular cognitive geometry. It will be very good at reasoning available to that carving, and structurally limited at reasoning available only to others.
Closing that gap is not done by translating training data. Translation lossy-compresses one carving into another. Closing the gap requires modeling the other carvings natively — which requires, at minimum, that the languages they live in be served.
We think this generalizes. The path to general intelligence, in our view, runs through the world's full inventory of ways that humans have organized meaning. AGI through scale alone strikes us as plausible only if you assume the existing scale is representative. It mostly isn't.
What this means for our work
Beyondlex starts with the languages of Pakistan — Urdu, Punjabi, Sindhi, Pashto, Saraiki, Balochi, Hindko, Brahui, Shina, Balti, and the dialect families inside each — because they are our home, because they are richly under-served, and because the dialect richness of this region is one of the most interesting open problems in modern language modeling.
We don't intend to stop there. The set of languages we plan to reach is the set of languages worth reaching, which is approximately the set that exists. It will take a long time. We will publish what we learn as we go.
If a piece of this resonates — as a reader, a collaborator, a partner, or a sceptic — we'd like to hear from you.