We study language modeling where the data runs out.
Beyondlex Research is a small group working on the parts of language modeling that high-resource AI has had the luxury of ignoring.
Active tracks
Four threads of work, each with its own internal research agenda. Publications and notes will appear here as they mature.
- 01
Dialect representation
How do you encode the boundary between two dialects when the boundary itself is gradient? Our work treats dialect as a continuous variable, not a label, and asks what that costs and what it buys.
- 02
Low-resource modeling
Pretraining at scale assumes the data exists. For most languages it does not. We study how far transfer, augmentation, and curated pretraining can carry a model when the corpus is measured in millions of tokens, not billions.
- 03
Culturally grounded evaluation
An evaluation that doesn't reflect how a language is actually used will reward the wrong models. We build evaluations with the speech communities themselves rather than borrowing translated benchmarks from English.
- 04
Alignment in low-resource settings
Most alignment research presumes English-language norms and English-language reviewers. We study how preference data, refusal behaviors, and instruction following transfer across dialects, and where they break down.
Recent essays
Field notes and short essays from the research team. The longer-form work appears here first.
Why we start with Pakistani dialects
There are technically interesting reasons to start anywhere, but Pakistan is where the technical and the personal sit on top of each other. A note on the choice.
The Beyondlex thesis: AGI through language diversity
There are roughly seven thousand living languages. Modern AI serves fewer than fifty in any meaningful sense. We think the gap is not a footnote — it's the work.
Publications, in time.
We are deliberate about what we publish and when. As work clears review, papers, datasets, and reproducible artifacts will be linked from this page.