Khalid Yusuf Dahir

Independent researcher in multilingual AI safety and low-resource NLP, studying transformer-based large language models from the linear-algebra layer up. Building the corpora, tokenizers, and safety benchmarks that Somali language models will be measured against. Previously architected Somalia's first national Electronic Health Record system, now serving 100+ clinics.

Research
  • SomaliWeb v1 · arXiv:2605.18232 · quality-filtered Somali corpus + BPE-16K tokenizer + Somali LID benchmark
  • SomaliBench v0 · first native-verified Somali safety-evaluation benchmark · 100 probes, English + Somali
  • multilingual-safety-probe · Llama-3 refusal-rate gradient across 5 languages, ρ=0.97
  • somaliweb-v1 dataset · 819k documents, ~303M tokens on Hugging Face
Writing
Elsewhere