We study language modeling where the data runs out.
Beyondlex Research is a small group working on the parts of language modeling that high-resource AI has had the luxury of ignoring.
Active tracks
Four threads of work, each with its own internal research agenda. Publications and notes will appear here as they mature.
- 01
Dialect representation
How do you encode the boundary between two dialects when the boundary itself is gradient? Our work treats dialect as a continuous variable, not a label, and asks what that costs and what it buys.
- 02
Low-resource modeling
Pretraining at scale assumes the data exists. For most languages it does not. We study how far transfer, augmentation, and curated pretraining can carry a model when the corpus is measured in millions of tokens, not billions.
- 03
Culturally grounded evaluation
An evaluation that doesn't reflect how a language is actually used will reward the wrong models. We build evaluations with the speech communities themselves rather than borrowing translated benchmarks from English.
- 04
Alignment in low-resource settings
Most alignment research presumes English-language norms and English-language reviewers. We study how preference data, refusal behaviors, and instruction following transfer across dialects, and where they break down.
Recent essays
Field notes and short essays from the research team. The longer-form work appears here first.
Why we start with Pakistani dialects
There are technically interesting reasons to start anywhere, but Pakistan is where the technical and the personal sit on top of each other. A note on the choice.
The Beyondlex thesis: AGI through language diversity
There are roughly seven thousand living languages. Modern AI serves fewer than fifty in any meaningful sense. We think the gap is not a footnote — it's the work.
Publications, in time.
We are deliberate about what we publish and when. As work clears review, papers, datasets, and reproducible artifacts will be linked from this page.