
Founding cohort · $145 → $120 for the first cohort. · Tuition returns to list once cohort 1 closes.
Language Data Engineering
The quiet work that decides how good a language model gets.
How language data is sourced, cleaned, annotated, and evaluated for low-resource and dialect-rich settings — the work that quietly determines model quality.
For researchers, data leads, and engineers working at the language layer.
You will be able to do the work.
- Design annotation pipelines and inter-annotator workflows
- Build evaluation suites for dialect-sensitive applications
- Reason about bias, coverage, and consent in language datasets
Prerequisites
- Working familiarity with NLP or ML
- Comfortable with Python and data tooling
What we cover.
- Module 1: Sourcing and consent
  Where language data comes from, what it costs, and what consent actually requires.
- Module 2: Annotation at scale
  Guidelines, inter-annotator agreement, calibration. Designing workflows that hold up.
- Module 3: Evaluation for dialects
  Why standard benchmarks miss the work. Building suites that reflect real usage.
- Module 4: Bias, coverage, and audit
  How datasets fail quietly, and the audits that catch it before a model ships.
Reserve a place.
We open enrollment a few weeks before each cohort starts. Tell us about your work — we'll write to you first when a place opens.
The founding cohort is currently taking shape. Join the waitlist to influence the curriculum and lock in the founding rate.