Research

We study language where the data runs out.

Beyondlex Research is a small group working on the parts of language modeling that high-resource AI has had the luxury of ignoring.

Active tracks

Four threads of work, each with its own internal research agenda. Publications and notes will appear here as they mature.

  1. Dialect representation

    How do you encode the boundary between two dialects when the boundary itself is gradient? Our work treats dialect as a continuous variable, not a label, and asks what that costs and what it buys.

  2. Low-resource modeling

    Pretraining at scale assumes the data exists. For most languages it does not. We study how far transfer, augmentation, and curated pretraining can carry a model when the corpus is measured in millions of tokens, not billions.

  3. Culturally grounded evaluation

    An evaluation that doesn't reflect how a language is actually used will reward the wrong models. We build evaluations together with the speech communities themselves, not translated benchmarks borrowed from English.

  4. Alignment in low-resource settings

    Most alignment research presumes English-language norms and English-language reviewers. We study how preference data, refusal behaviors, and instruction following transfer across dialects, and where they break.

Publications, in time.

We are deliberate about what we publish and when. As work clears review, papers, datasets, and reproducible artifacts will be linked from this page.