# Tony Kipkemboi > AI Engineer and Content Creator specializing in AI automations, agent systems, and developer education. US Army veteran. Former CrewAI, Snowflake, Bloomberg, Booz Allen Hamilton. I build AI automations and agent systems that help teams work smarter. I create technical content about AI across social platforms including YouTube, where my most popular video on building PDF RAG systems with Ollama has 189K+ views. I've spoken at PyCon US, ODSC, Harvard Kennedy School, IBM TechXchange, MLOps World, and more. I'm passionate about open-source software, AI agents, and developer education. ## Contact & Social - Website: https://tonykipkemboi.com - GitHub: https://github.com/tonykipkemboi - YouTube: https://www.youtube.com/@tonykipkemboi - LinkedIn: https://linkedin.com/in/tonykipkemboi - X/Twitter: https://x.com/tonykipkemboi - RSS Feed: https://tonykipkemboi.com/rss ## Expertise - AI Agents & Multi-Agent Systems (CrewAI, LangChain, LlamaIndex) - RAG (Retrieval-Augmented Generation) - Python, Streamlit, Next.js - Local LLMs (Ollama, Groq) - Developer Education & Technical Content Creation ## Background - US Army Veteran - Former Developer Advocate at CrewAI - Former Snowflake - Former Bloomberg - Former Booz Allen Hamilton - University of Pennsylvania (Penn Engineering) --- ## Open Source Projects ### Ollama PDF RAG (520 GitHub stars) A locally-hosted RAG (Retrieval-Augmented Generation) system that allows users to chat with their PDF documents using Ollama and LangChain. Features include document chunking, vector embeddings, and semantic search. - Link: https://github.com/tonykipkemboi/ollama_pdf_rag - GitHub: https://github.com/tonykipkemboi/ollama_pdf_rag - Tech: Python, Ollama, LangChain, Streamlit, ChromaDB ### CrewAI Gmail Automation (189 GitHub stars) Automate Gmail inbox management using CrewAI agents. Intelligently categorizes, responds to, and organizes emails using AI-powered workflows. 
- Link: https://github.com/tonykipkemboi/crewai-gmail-automation - GitHub: https://github.com/tonykipkemboi/crewai-gmail-automation - Tech: Python, CrewAI, Gmail API, LangChain ### Resume Optimization Crew (149 GitHub stars) AI-powered resume optimization system using CrewAI. Analyzes and enhances resumes to match job descriptions and ATS requirements. - Link: https://github.com/tonykipkemboi/resume-optimization-crew - GitHub: https://github.com/tonykipkemboi/resume-optimization-crew - Tech: Python, CrewAI, AI Optimization ### Trip Planner Agent (142 GitHub stars) CrewAI agents that can plan your vacation. Uses multi-agent collaboration to create detailed itineraries based on your preferences. - Link: https://github.com/tonykipkemboi/trip_planner_agent - GitHub: https://github.com/tonykipkemboi/trip_planner_agent - Tech: Python, CrewAI, Streamlit, LangChain ### Streamlit Replicate Image App (103 GitHub stars) Image generation application built with Streamlit and Replicate API. Generate AI images using various models through an intuitive interface. - Link: https://github.com/tonykipkemboi/streamlit-replicate-img-app - GitHub: https://github.com/tonykipkemboi/streamlit-replicate-img-app - Tech: Python, Streamlit, Replicate, Image Generation ### Groq Streamlit Demo (85 GitHub stars) Demo showcasing Groq's ultra-fast LLM inference with Streamlit. Experience lightning-fast AI responses in an interactive web interface. - Link: https://github.com/tonykipkemboi/groq_streamlit_demo - GitHub: https://github.com/tonykipkemboi/groq_streamlit_demo - Tech: Python, Groq, Streamlit, LLM ### Ollama Streamlit Demos (82 GitHub stars) Collection of Streamlit demos showcasing various Ollama local LLM capabilities. Run AI models locally with no API keys required. 
- Link: https://github.com/tonykipkemboi/ollama_streamlit_demos - GitHub: https://github.com/tonykipkemboi/ollama_streamlit_demos - Tech: Python, Ollama, Streamlit, Local LLM ### CrewAI Streamlit Demo (73 GitHub stars) Demo showcasing how to output CrewAI agent task outputs on the Streamlit UI. - Link: https://github.com/tonykipkemboi/crewai-streamlit-demo - GitHub: https://github.com/tonykipkemboi/crewai-streamlit-demo - Tech: Python, CrewAI, Streamlit ### Research Paper to Podcast (69 GitHub stars) Automated system that transforms academic research papers into engaging podcast conversations using CrewAI and ElevenLabs. - Link: https://github.com/tonykipkemboi/research-paper-to-podcast - GitHub: https://github.com/tonykipkemboi/research-paper-to-podcast - Tech: Python, CrewAI, ElevenLabs ### YouTube Yapper Trapper (68 GitHub stars) Extract and analyze YouTube video transcripts. Perfect for researchers, content creators, and anyone who wants to quickly digest video content. - Link: https://github.com/tonykipkemboi/youtube_yapper_trapper - GitHub: https://github.com/tonykipkemboi/youtube_yapper_trapper - Tech: Python, YouTube API, Transcription --- ## Speaking & Media Appearances ### MLOps World Conference - Austin - Type: talk - Source: MLOps World - Date: 2025-10-08 - Link: https://mlopsworld.com/speakers/ Speaker demonstrating how agent orchestration, paired with rigorous evaluation, accelerates the path from prototype to production. ### IBM TechXchange Conference - Type: talk - Source: IBM - Date: 2025-10-06 - Link: https://www.linkedin.com/posts/tonykipkemboi_ibmtechxchange-activity-7381001218820681728-Njde/ Speaker discussing AI agents and enterprise AI adoption strategies. ### Building AI Agents with CrewAI - DataCamp Course - Type: course - Source: DataCamp - Date: 2025-10-01 - Link: https://www.datacamp.com/courses/building-ai-agents-with-crewai Comprehensive course teaching developers how to build AI agent systems with CrewAI. 
### Creating a Podcast Generation AI Multi-Agent - DataCamp Code-Along - Type: course - Source: DataCamp - Date: 2025-08-13 - Link: https://www.datacamp.com/code-along/creating-a-podcast-generation-ai-multi-agent-with-crew-ai Interactive code-along tutorial teaching how to use CrewAI to build a multi-agent system. ### ODSC AI X Podcast - AI Agents - Type: podcast - Source: Open Data Science Conference - Date: 2025-06-11 - Link: https://podcasts.apple.com/us/podcast/odsc-east-2025-minisodes/id1721516836?i=1000712490491 Featured on ODSC's AI X Podcast discussing foundational AI agent building skills. ### Convergence 2025 - GenAI Engineering Conference - Type: talk - Source: Comet ML - Date: 2025-05-13 - Link: https://www.comet.com/site/about-us/news-and-events/events/convergence-2025/ Speaking at Comet's virtual conference on GenAI Engineering. ### Build agentic systems with CrewAI and Amazon Bedrock - Type: article - Source: AWS Machine Learning Blog - Date: 2025-03-31 - Link: https://aws.amazon.com/blogs/machine-learning/build-agentic-systems-with-crewai-and-amazon-bedrock/ Co-authored AWS ML Blog post on building agentic systems with CrewAI and Amazon Bedrock. ### ODSC East 2025 Workshop - Type: talk - Source: Open Data Science Conference - Date: 2025-05-13 - Link: https://odsc.com/boston/ Led workshop on 'Agentic AI in Action: Build Autonomous, Multi-Agent Systems Hands-On in Python'. ### Guest Lecture at Harvard Kennedy School - Type: talk - Source: Harvard University - Date: 2025-02-27 - Link: https://www.linkedin.com/posts/tonykipkemboi_aiagents-hks-activity-7301069792810008576-H7os/ Guest speaker on AI agents for Prof. Hu's data and information visualization class. ### PyCon US 2024 - Type: talk - Source: PyCon US - Date: 2024-05-15 - Link: https://us.pycon.org/2024/speaker/profile/90/index.html Selected speaker at PyCon US 2024, the largest annual gathering for the Python programming community. 
--- ## Blog Posts ### Fine-tuning Whisper on Kalenjin: a $25 LoRA experiment - Published: 2026-04-27 - Category: Speech AI - Tags: Whisper, LoRA, Kalenjin, Speech Recognition, Low-Resource Languages, African Languages, PEFT, Modal - URL: https://tonykipkemboi.com/blog/fine-tuning-whisper-kalenjin Whisper does not speak my first language. So I trained a LoRA adapter on it for a weekend, on a $25 GPU budget. Here is everything that broke, every number that surprised me, and the lesson I came away with. #### Full Content ![Magazine-style collage celebrating Kalenjin language and running heritage: handwritten Kalenjin proverbs, an IPA chart of Nilotic sounds, the Nilotic language family tree, a map of the Kenyan Rift Valley, an age-set (ipinda) cycle diagram, vintage Kipchoge Keino race photos, marathon split tables, an altitude vs VO₂max chart, beadwork, a Korosiot drum, and modern ML elements — Mel spectrograms, audio waveforms, a neural-net diagram, a model card. Centered title: "Ng'alal Kalenjin · KELE · SOOMET · LOGOEK".](/blog/kalenjin-hero.png) > *Adding a 100th language to a 99-language model, for the cost of a steak dinner.* Kalenjin is spoken by roughly 6 million people in Kenya. It's my first language. It's also not one of the 99 languages OpenAI's Whisper was trained on. Feed a Kalenjin recording to Whisper and you get confident nonsense: phonetic mush, repetition loops, occasionally words with Icelandic diacritics. That's the practical problem. Voice interfaces, captioning, accessibility tools, transcription for archives and oral histories — they all quietly exclude anyone who speaks outside the 99. Usually the fix requires either a big lab to add your language to their next release, or training a new model from scratch. Neither is accessible to an individual. There's a third option, which is the subject of this post: train a small adapter on top of the existing model. Retraining Whisper from scratch would mean weeks on a GPU cluster for 809M parameters. 
LoRA (Low-Rank Adaptation) adds a few million trainable parameters on top of a frozen base model. Think of it like a sticky note on a textbook: the textbook stays the same, the sticky note teaches the reader the new thing it needs to know for this specific page. A few million parameters, a few hours on a rented GPU, a few dollars. You end up with a model that handles your language and still handles everything the base model already knew.

Here's how that went for Kalenjin, end to end. Budget target: ~$24. Weekend. Solo.

A quick glossary before we dive in. Read these once, then follow along.

### WER: Word Error Rate

The percentage of *words* a model gets wrong. Measured as the number of word-level edits (substitutions + deletions + insertions) needed to turn the model's output into the reference transcript, divided by the reference word count.

- **0% = perfect.** Every word matches.
- **100% = every word is wrong or missing.**
- **Over 100%** happens when the model invents extra words the reference didn't have. An untrained model hallucinating `njia njiawe ndo njiawe ndo njiawe ndo...` for 50 repetitions can easily produce 200-500% WER.

WER counts **whole words**. One letter wrong = the whole word is wrong. One extra space splitting a word in two = two errors.

### CER: Character Error Rate

The same calculation as WER but at the *character* level. Edit distance of characters, divided by reference character count.

- **0% = every character matches.**
- **100% = every character is wrong.**

### Why both, with Kalenjin examples

For English, WER and CER usually track each other. For Kalenjin (agglutinative, with inconsistent orthographic conventions) they diverge sharply. Three real examples from our eval data:

**Example 1: word-boundary split.**

|      | Transcript   |
|------|--------------|
| REF  | Chopchinkee  |
| PRED | Chopchin gei |

The model heard the right sounds but inserted a space where the reference has none. To a Kalenjin speaker listening back, these are the same. To WER, this is a 200% error: one reference word against two predicted words, so one substitution plus one insertion. To CER, 8 of the 11 reference characters match, which is about 27% error.

**Example 2: orthographic variant.**

|      | Transcript |
|------|------------|
| REF  | ngokenik   |
| PRED | ngogenik   |

One letter differs (`k` vs `g`). The two sound identical to a native ear, and both spellings appear in Kalenjin writing. WER counts this as 100% wrong. CER counts it as 1 character off in 8, about 12.5%.

**Example 3: proper noun.**

|      | Transcript |
|------|------------|
| REF  | Bosnia     |
| PRED | Osinia     |

The model heard the phonemes but chose a different spelling. WER: 100%. CER: 2 characters off in 6, about 33%.

Across hundreds of clips, small orthographic mismatches add up to a WER floor we can't cross, even on predictions a native reader would call correct. CER floats below that floor and tracks closer to what a human ear perceives. For African agglutinative languages specifically, published work (Ali & Renals; "Advocating CER" 2024) finds CER correlates ~5 points better with human judgment than WER.

I still report WER because every ASR paper does and readers want comparable numbers. CER is the one I'd pick if I only got one.

### Coverage

A metric I ended up introducing after discovering that the model was transcribing only half the audio and WER didn't flag it.

`coverage = predicted_word_count / reference_word_count`

- `1.0` = the model produced roughly as many words as the reference expected
- `< 0.6` = severe truncation (model gave up early)
- `> 1.4` = over-generating (usually a repetition loop)

Coverage doesn't measure quality. It measures *how much* the model said. Paired with CER it's a much better diagnostic than WER alone. WER catches "wrong half" but misses "missing half"; coverage catches missing-half in one number.

### Base model, adapter, shard

- **Base model:** The pre-trained Whisper checkpoint I started from. Weights are frozen during my training.
- **Adapter:** The small set of new weights LoRA learns on top of the base. Stored as a ~50 MB file. At inference, base + adapter = Kalenjin-specialized model.
- **Shard:** A single parquet file containing a few thousand audio clips plus transcripts. Anv-ke/Kalenjin is 204 shards, ~450 hours of audio.

### Reproducibility: the canonical recipe + metrics file

Every WER/CER/coverage number in the verdict and the comparison tables is reproduced under one consistent recipe so they're directly comparable.
The recipe and the source-of-truth file:

**Recipe (chunked + beam=5, the recommended decoding setup):**

```python
generate_kwargs = dict(
    language="sw",            # Whisper has no Kalenjin token, so Swahili is forced
    task="transcribe",
    num_beams=5,              # beam search width
    chunk_length_s=30,        # chunked decoding in 30 s windows
    stride_length_s=(5, 5),   # 5 s of overlap on each side of a chunk
    return_timestamps=True,
    compression_ratio_threshold=1.35,
    no_repeat_ngram_size=3,
    repetition_penalty=1.15,
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # temperature fallback schedule
)
```

**Normalization (applied identically to refs and predictions before scoring):**

- lowercase
- replace `[cs]` with space (preserves the inner code-switched word)
- drop `[pause]`, `[sigh]`, `[laugh]`, `[breath]`, `[noise]`, `[silence]`
- strip standalone punctuation `.,;:?!"/\\`
- collapse whitespace

**Source of truth:** [`canonical_metrics.json`](https://github.com/tonykipkemboi/whisper-kalenjin-lora/blob/main/04-results/artifacts/canonical_metrics.json). Every number in the verdict table and the dialect tables is reproduced from this single file. The code that computes it lives at [`audit_recompute.py`](https://github.com/tonykipkemboi/whisper-kalenjin-lora/blob/main/04-results/audit_recompute.py).

If a number elsewhere in the post differs, it's a *story-beat* number, i.e., what the eval said at an earlier point in the project under a different recipe. Those tables are explicitly labeled with their recipe (`G-seq`, `G-chunk`, or `B5-chunk`).

---

## Story beats

### Day zero

- MacBook Air M2, 16 GB RAM. Fine for data exploration, not for training.
- Compute: Modal with $30 in starter credits, A10G GPU at $1.10/hr. Budget target: ~$24 total.
- Model: `openai/whisper-large-v3-turbo`. 809M params, 4-layer decoder. The turbo variant trades a little accuracy for roughly 6x inference speed.
- Dataset: [`Anv-ke/Kalenjin`](https://huggingface.co/datasets/Anv-ke/Kalenjin), curated by the African Next Voices team in Kenya. Gated on HuggingFace; access request had been pending a week. Got approved on 2026-04-22. 143.86 GB of audio and transcripts.
This project literally does not happen without their work. Full acknowledgements below. ### The data The AfriVoices-KE Kalenjin dataset is bigger than I expected. 143.86 GB across 204 parquet shards. Roughly 270,000 clips at ~450 hours of audio. Two dialects (Kipsigis 56.7%, Nandi 43.3%), two collection modes (scripted readings vs. unscripted spontaneous speech), and eleven domain tags ranging from Healthcare to Agriculture to Digital Government Services. The first five random clips I pulled were all medical — `Dandy Walker cyst`, `Medullary cystic kidney disease`, `astrocytomas`. Curious whether that was representative or just a sampling artifact, I queried the full `domain` column. The real breakdown: **Scripted** (read speech): Everyday Scenarios 72%, Unspecified 12%, Healthcare 11%, News 3%, Agriculture 1%, rest under 1%. **Unscripted** (spontaneous): Agriculture and Food 53%, Everyday 12%, Digital Government 9%, Extempore Stories 7%, Financial 5%, Healthcare 4%, Customer Care 3%, the rest under 2% each. ![Anv-ke/Kalenjin dataset domain composition — scripted (read) speech is 72% Everyday content; unscripted (spontaneous) is 53% Agriculture & Food.](/blog/kalenjin-dataset-composition.png) More interesting: transcripts contain `[cs]` markers and inline English words (`[cs]Reverend[cs]`, `Rose Cheboi`, `scaman`, `huduma`). This is real Kenyan speech. Code-switching between Kalenjin, English, and sometimes Swahili is the default. Nobody in Eldoret or Kericho speaks in a single-language bubble. The model has to handle it gracefully, or it's not useful on actual recordings. ### Which Kalenjin? A note that matters before any results, because "Kalenjin" is an umbrella, not a single language. The Kalenjin people are a cluster of related Nilotic communities in Kenya. Common groupings include Kipsigis, Nandi, Tugen, Keiyo, Marakwet, Sabaot, Pokot, Terik, and Sengwer. Each has its own distinct dialect, with mutual intelligibility ranging from "obvious" to "not really." 
The AfriVoices-KE corpus uses a binary `dialect` label: `KIPSIGIS` or `NANDI`. From reading every `meta.csv` across all six splits: - **369 Kipsigis speakers** total across train, dev, and dev_test - **275 Nandi speakers** total - **8 speakers with empty dialect labels** - 652 unique speakers in total That's the entire taxonomy the dataset commits to. The catch: speakers were recorded across counties that include other Kalenjin sub-tribes' homelands. **Elgeyo-Marakwet** county (31 speakers) is Keiyo and Marakwet territory. **Baringo** (18 speakers) is Tugen territory. **Trans-Nzoia** (2 speakers) is partly Sabaot. Those speakers are present in the data, but their actual sub-tribe was rolled into "KIPSIGIS" or "NANDI" when the corpus was labeled. So three things are true at once, and I'd rather state them all clearly than pick a clean-sounding one: 1. The model is trained on Kalenjin audio — it speaks the umbrella. 2. The dataset's two-way dialect label is **Kipsigis and Nandi only**. Everything else gets aggregated into one of those two buckets, or into the empty-label group. 3. **Pokot, Terik, Sengwer**, and most likely **Sabaot** speakers are essentially absent from this dataset. A speaker from those communities trying the model should expect noticeably worse accuracy than a Kipsigis or Nandi speaker. I'll call the model and the project "Kalenjin" because that's what it serves and what the underlying dataset claims. But "Kalenjin" here means specifically what's measurable from this corpus, not a guarantee that every Kalenjin sub-tribe is equally well-served. When I report dialect-stratified numbers later in this post, "Kipsigis vs Nandi" is what the dataset gave me to slice on, not the full sub-tribe picture. ### The baseline Before training, I wanted to know how bad plain Whisper is on Kalenjin. 
I ran it on 200 clips from the held-out `dev_test/scripted` split, forcing the language token to `sw` (Swahili — the closest Bantu-family language Whisper actually knows). **Word Error Rate: 124%.** Over 100% because when the model hallucinates more tokens than the reference has, word-level edits exceed reference length. A few samples:

|      | Transcript |
|------|------------|
| REF  | Ile nyanyawet ok ile mi barak ng'enda ketesyin chumbikab temisyet. |
| PRED | njia njiawe ndo njiawe ndo njiawe ndo njiawe ndo njiawe ndo ... (50+ repetitions) |
| REF  | kabatishiet kokonu tuguk eng sabet |
| PRED | Hvað væt í séttra ronu, tóku engðar veit? |
| REF  | Much konyaa kora petet ap kawek che ipu cancer alak. |
| PRED | Muzko njaka rafi ti tafkawi teipu kanta alak. (partial phonetic match; the model preserves the English loanword "cancer" as "kanta") |

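To make the over-100% figure concrete, here is a minimal sketch of word-level WER in plain Python, applied to the first reference above and a looping prediction. The loop length of 20 repetitions is an arbitrary illustrative choice, not the real clip's output, and the function names are hypothetical. The project itself scores with `jiwer`, so treat this as an illustration of the arithmetic rather than the scoring code.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over word lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # substitution, or a free match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word edits divided by the reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    return word_edit_distance(ref_words, hyp_words) / len(ref_words)


reference = "Ile nyanyawet ok ile mi barak ng'enda ketesyin chumbikab temisyet."
# Assumption: 20 repetitions, chosen only to illustrate the effect.
looped = "njia " + "njiawe ndo " * 20

print(f"reference words: {len(reference.split())}")
print(f"predicted words: {len(looped.split())}")
print(f"WER: {wer(reference, looped):.0%}")
```

Every predicted word beyond the reference length counts as an insertion, so the numerator keeps growing while the denominator stays fixed at the reference word count. That is how a single looping clip can drag an aggregate WER past 100%.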
Whisper is guessing. Loops, Icelandic, phonetic mush. The floor is the floor. Any reasonable fine-tune should demolish this baseline. ### Four launches before a training step Here's a thing most ML writeups skip: the part between writing the script and getting the first usable number. The "run this script, get results" narrative is the version you tell in a talk. The real version is uglier, more educational, and usually the most useful part to read. I ran seven launches before I had a working training loop. Six of them failed. Here's what each one taught me. **Launch 1. Dtype mismatch.** I loaded the base model in `bf16` for memory efficiency but fed it `float32` mel features from the HF feature extractor. First conv layer choked. Wasted ~$0.02 of Modal time because I didn't read my own code before running it. My reviewer (me, two minutes later) would have caught this in 30 seconds. **Launch 2. HF `datasets` Arrow cache mangles binary payloads.** My data pipeline used `Dataset.from_list([{"audio_bytes": ...}])` then `.map(decode_fn)` to lazily decode WAV bytes into mel features. The same raw bytes decoded fine in a standalone Python REPL. Inside the datasets pipeline, soundfile threw `Format not recognised`. Arrow's binary cache was silently corrupting the bytes. Fix: ditch `datasets.Dataset` entirely and use a plain Python class with `__len__` and `__getitem__`. PyTorch's DataLoader duck-types map-style datasets. No inheritance required. **Launch 3. DataLoader fork plus LibsndfileError with an exception that can't even stringify itself.** With `num_workers=2`, PyTorch forks worker processes to parallelize decoding. On Modal containers, something about the fork plus soundfile plus pyarrow combination triggered `LibsndfileError: `. A C-string from libsndfile that couldn't be converted to a Python string. Fix: `num_workers=0` for the smoke test, revisit parallelization once the baseline is confirmed. **Launch 4. 
The one where I actually read my code carefully.** After launch 3 I stopped running anything. Spawned a subagent for a thorough review, then a second subagent cross-referencing the current HF Whisper fine-tuning blog, the PEFT 0.13.2 release notes, and community reports on Whisper-turbo quirks. That second review found 6 more issues the first missed:

1. `tokenizer=feature_extractor` is deprecated in transformers 4.46. Should be `processing_class=processor`.
2. `load_best_model_at_end=True` is silently broken with PEFT. It looks for `model.safetensors` which PEFT doesn't save. PEFT only writes `adapter_model.safetensors`.
3. `generation_config` set on the PEFT wrapper doesn't always propagate to the base model in transformers 4.46. Set it on both.
4. `enable_input_require_grads()` hooks the wrong module on Whisper. The first trainable op going backward is a conv, not an embedding. Need an explicit forward hook on `encoder.conv1`.
5. `num_workers=0` wastes 30–50% of A10G time on CPU decoding. Fine for the smoke test; fix before scaling.
6. Labels need `truncation=True, max_length=448`. An overlong transcript would silently crash the decoder.

The lesson here has nothing to do with being bad at this. A 200-line training script on a rapidly-evolving stack (transformers 4.46 shipped October 2024, PEFT 0.13.2 shipped November 2024) has more surface area than I can keep in my head. Pair-reviewing with two fresh agents, one reading the code and one cross-referencing docs, caught things a single read with a single reviewer would have missed.

**Launch 5. Training actually starts, then gets killed.** The run progressed past setup, wandb logging kicked in, and I saw the loss bar ticking up. I also saw hundreds of log lines like `[dataset] skip row 6814: LibsndfileError: Format not recognised`. Between my skip-on-decode-error safety net and `num_workers=2`, training was silently skipping a huge fraction of rows and retrying.
Overhead plus GPU eval was projecting to ~2 hours and ~$2.35. The smoke was supposed to cost ~$0.30. I killed it after 6 optimizer steps. Diagnosis: the corruption only appears when DataLoader uses multiple workers. `num_workers=0` had decoded 200/200 clips cleanly earlier. So the bytes were being corrupted specifically by the worker-fork path. PyArrow's `.as_py()` returns what looks like normal Python bytes, but the underlying memory may share with Arrow's buffer pool in a way that doesn't survive `fork()`. Fix, once you know the mechanism: force a deep copy into Python-owned memory at row-construction time (`bytes(bytearray(x))` goes through a mutable intermediate so CPython actually allocates new bytes instead of returning the same object). **Launch 6. Training worked, then died at the first epoch-end eval.** This one hurt because the model was actually learning. Loss dropped cleanly from 4.23 to 1.5 over 210 steps. Grad norms calm. Learning rate schedule doing the right dance. Then it hit epoch 1, Seq2SeqTrainer called `trainer.evaluate()`, and the exact same dtype error from Launch 1 came back through a different code path. During training, `bf16=True` activates a `torch.autocast` context around the forward pass; inputs get auto-cast to bf16 to match the model weights. During eval, `Seq2SeqTrainer.prediction_step` calls `model.generate()` and there is no autocast context there by default. Inputs stay fp32, weights are bf16, conv1 says no. The first code review flagged this as a theoretical risk. My fix was "load the model in bf16," which addresses half the mismatch but leaves the inputs fp32. Watching loss drop for 10 minutes and then losing the run to a dtype bug at the epoch boundary is character-building. **Launch 7. Subclass `Seq2SeqTrainer`, wrap `prediction_step` in `torch.autocast("cuda", dtype=torch.bfloat16)`.** Four lines of code. If this completes we get our first real post-training WER number. Cost so far across all launches: ~$1 total. 
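
The deep-copy fix from Launch 5, combined with the plain map-style dataset from Launch 2, can be sketched in a few lines of standard-library Python. The names `own_bytes` and `ClipRows` are hypothetical, and the sketch leaves out PyArrow and audio decoding entirely. It only shows the copying idiom and the two methods a map-style dataset needs, under the post's diagnosis that Arrow-backed memory was the problem.

```python
def own_bytes(payload):
    """Return a copy of `payload` held in freshly allocated Python memory."""
    # In CPython, bytes(payload) can return the very same object when payload
    # is already an exact bytes instance. Going through a mutable bytearray
    # forces a new allocation.
    return bytes(bytearray(payload))


class ClipRows:
    """Minimal map-style dataset: only __len__ and __getitem__ are required."""

    def __init__(self, raw_rows):
        # Copy once at row-construction time, before any worker process forks.
        self.rows = [own_bytes(row) for row in raw_rows]

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        return self.rows[idx]


original = b"RIFF\x24\x00\x00\x00WAVEfmt "  # stand-in for a WAV payload
copied = own_bytes(original)
dataset = ClipRows([original])

print(copied == original)   # same content
print(copied is original)   # but not the same object
print(len(dataset))
```

A class like this needs no base class to work with PyTorch's `DataLoader`, which is the duck-typing point made under Launch 2.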
### T1 smoke, the result

> **Zero-shot WER: 124%**
>
> **Epoch 1 WER: 73.2%**
>
> **Drop: 51 percentage points, ~$0.55 of GPU time.**

The first real number came in after 250 training steps. A 51-point drop in one epoch is the curve that tells you the model actually has the capacity to learn Kalenjin. It just needed to be pointed at the data.

73% WER is still not deployable. Roughly three of every four words are wrong. But the failure mode changed entirely. Before fine-tuning, Whisper was emitting Icelandic-looking mush and infinite `njiawe ndo` loops. After one epoch of LoRA, it's producing actual Kalenjin-shaped sequences that just aren't quite right yet. The difference between guessing randomly and speaking your language poorly is the entire battle.

Final numbers across three epochs:
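
|           | Zero-shot | Epoch 1 | Epoch 2 | Epoch 3  |
|-----------|-----------|---------|---------|----------|
| WER       | 124%      | 73.2%   | 70.3%   | 68.8%    |
| Δ         |           | −51 pts | −3 pts  | −1.5 pts |
| eval_loss |           | 1.28    | 1.22    | 1.21     |
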
Textbook LoRA curve. A massive epoch-one drop, then compounding slows to a crawl. T1 was at diminishing returns by the end. To move further I needed more data, more LoRA capacity, or both. The point of the smoke was never to hit deployment quality. It was to prove the whole stack could train a Kalenjin-aware model for one dollar. Done. ![Kalenjin scripted Word Error Rate across four LoRA fine-tuning iterations: Zero-shot 124.0%, T1 smoke 68.8%, T2 60.5%, T3 (the production model) 56.2%.](/blog/kalenjin-wer-journey.png) ### Eyeballing the transcripts Looking at actual (reference, prediction) pairs after the fine-tune is where the story gets interesting. The failure modes have structure. **Phonetic neighborhood substitutions, roughly 60% of errors.** The model hears the right sound, picks the wrong letter: - `Sokek` → `Sagek` (o/a confusion) - `kenyisiek` → `kenyishek` (s → sh) - `Kiribchin` → `Kiriptin` (b → p) - `mugunkok` → `mugungok` (k → g) These are phonemes that sit on a continuum Kalenjin speakers distinguish but the model's tokenizer doesn't have enough signal to lock in yet. More data plus more LoRA capacity should fix this. **Rare proper nouns.** English names the model hasn't seen get mangled or skipped entirely: `Nyantakyi`, `Kalyango`, `Haaland, Bale, Giroud, Eriksen, Suarez, Shaqiri`. One long clip ending in six footballer names had its output simply cut short. Expected. Nothing a tuned adapter on smaller training data can do about arbitrary rare names. **Numbers translated into words.** When the model sees the number 29, it writes `tuptem ak sakol` — "twenty and nine" in Kalenjin. A date like 20.08.2018 comes out as `tarek tuptei arabet ab sisiit kenyit ab elboeng ak tamat ak sisiit`. Per word-error-rate these are total misses. They're also arguably correct transcriptions of what a Kalenjin speaker would say aloud. The training data clearly contains a lot of spelled-out numbers, and the model learned to mirror that convention. 
WER penalizes this heavily. Real-world users might love it.

**Common English loanwords.** Handled well. `Tanzania`, `Twitter`, `Ukraine`, `Belarus` all come through cleanly. The model is learning code-switching on the frequent ones.

### What this says about T2

The error distribution changes the T2 priority. I'd planned three variants (language token A/B, LoRA target expansion, and 10x data) and thought I'd run them in parallel. After seeing the real errors, the story simplifies:

- Phonetic errors = capacity problem → expand LoRA targets from `q,v` to `q,k,v,out,fc1,fc2`
- Vocabulary-coverage errors (rare nouns, uncommon words) = data problem → train on 5–10x more data
- Language token (`sw` vs `en`) = probably low-impact now that the model is already in Kalenjin-mode. Skip the A/B.

T2 combines both promising changes in one run on an A100.

### T2, complete

T2 used 20,000 clips (5x more data), expanded LoRA targets (`q, k, v, out, fc1, fc2` instead of just `q, v`), and ran on an A100 40GB (faster and cheaper per step than the A10G used for T1).
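
|           | Zero-shot | T1 final | T2 epoch 1 | T2 epoch 2 (best) | T2 epoch 3 |
|-----------|-----------|----------|------------|-------------------|------------|
| WER       | 124%      | 68.8%    | 62.1%      | 60.5%             | 62.3%      |
| eval_loss |           | 1.21     | 0.95       | 0.90              | 0.89       |
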
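
For a rough sense of how much capacity the expanded target list adds, the arithmetic can be sketched as below. LoRA adds `rank * (d_in + d_out)` trainable parameters per wrapped linear layer. Everything else here is an assumption rather than something the post states: the layer sizes (hidden size 1280, feed-forward size 5120, 32 encoder layers alongside the 4-layer decoder), the rank of 16, and the premise that each target name matches both self-attention and cross-attention in the decoder.

```python
# Assumed architecture of whisper-large-v3-turbo; only the 4-layer decoder
# is stated in the post.
D_MODEL, D_FFN = 1280, 5120
ENCODER_LAYERS, DECODER_LAYERS = 32, 4

# (input dim, output dim) of each linear layer LoRA can wrap.
SHAPES = {
    "q": (D_MODEL, D_MODEL),
    "k": (D_MODEL, D_MODEL),
    "v": (D_MODEL, D_MODEL),
    "out": (D_MODEL, D_MODEL),
    "fc1": (D_MODEL, D_FFN),
    "fc2": (D_FFN, D_MODEL),
}
ATTENTION = {"q", "k", "v", "out"}


def lora_param_count(targets, rank):
    """Trainable parameters LoRA adds: rank * (d_in + d_out) per wrapped layer."""
    total = 0
    for name in targets:
        d_in, d_out = SHAPES[name]
        per_layer = rank * (d_in + d_out)
        # Encoder layers have one attention block. Decoder layers have two
        # (self-attention and cross-attention). Each layer has one fc1/fc2 pair.
        decoder_copies = 2 if name in ATTENTION else 1
        total += per_layer * (ENCODER_LAYERS + DECODER_LAYERS * decoder_copies)
    return total


RANK = 16  # hypothetical; the post does not state the rank used
narrow = lora_param_count(["q", "v"], RANK)
wide = lora_param_count(["q", "k", "v", "out", "fc1", "fc2"], RANK)

print(f"q,v only:         {narrow:,} trainable parameters")
print(f"expanded targets: {wide:,} trainable parameters")
print(f"ratio:            {wide / narrow:.2f}x")
```

The rank cancels out of the ratio, so under these assumed layer sizes the expanded target list carries a bit over four times the trainable parameters of `q, v` alone, whatever rank was actually used.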
T2 beat T1's final number by 8.3 points across the same 3 epochs. It also overfit in epoch 3. WER ticked up by 1.8 points while eval_loss kept dropping, a classic sign the model was memorizing training patterns. Because `load_best_model_at_end` is known-broken with PEFT in transformers 4.46, I manually promoted the epoch-2 checkpoint as the "best T2" adapter instead of trusting the trainer's default save.

Lesson: with expanded LoRA capacity, you overfit faster at the same epoch count. Either train fewer epochs, scale data proportionally, or add regularization.

### The base64 bug

The most interesting finding of the whole project came while investigating training-time decode errors. The dataset loader had lines like `[dataset] skip row 6814: LibsndfileError: Format not recognised` sprinkled throughout training. The safety net in my `__getitem__` (retry next row on decode failure) kept training stable, but ~2% of rows were being silently dropped.

A full sweep of two "bad" shards revealed the cause. Healthy rows had `magic=52494646`, the hex bytes of ASCII `RIFF`, the standard WAV file header. Failing rows had `magic=556b6c47`, the hex bytes of ASCII `UklG`.

That's not random corruption. `UklG` is the first four characters of what you get when you **base64-encode** the byte sequence `RIFF`. Someone's upload pipeline had accidentally base64-encoded the raw binary before writing it into the parquet's `audio.bytes` column.

The fix was three lines:

```python
if audio_bytes[:4] == b"UklG":
    audio_bytes = base64.b64decode(audio_bytes, validate=False)
wav, sr = soundfile.read(io.BytesIO(audio_bytes))
```

100% rescue rate on every base64-encoded row across the shards I checked. That's ~373 clips recovered in the 20k subset I was training on, and roughly 4,500 clips that would be lost in a full-dataset run. Worth every minute of the deep-dive.

Broader lesson: when your training pipeline is logging errors, actually read them. The safety net buys you time to finish the run. It doesn't fix the data.
Real datasets in the wild have real upload pipeline bugs.

### T3, full scripted run

The T2 postmortem gave T3 a clear recipe. Same expanded-target LoRA, same A100. Train on all 41 scripted shards instead of a 20k-clip subset. Cut from 3 epochs to 2 to head off the overfit pattern we'd just watched happen. Total: ~9,962 optimizer steps, one A100-40GB, no other architectural changes.

T3 finished cleanly. Loss curves were calm, eval WER improved monotonically, and the last checkpoint (`checkpoint-9962`) was the best. No late-epoch climb this time. Dropping from 3 to 2 epochs was the right call. I traded a tiny slice of under-fit for not having to manually promote an intermediate checkpoint again.


| Metric | Zero-shot | T1 final | T2-best (ep 2) | T3-best (ep 2/final) |
| --- | --- | --- | --- | --- |
| Scripted WER | 124% | 68.8% | 60.5% | 56.20% |
| eval_loss | | 1.21 | 0.90 | |


4.3 more WER points on scripted for roughly the same training cost as T2. The curve is still bending, not flat yet, which is useful information for the whitepaper. The obvious next move would be more data, except "more data" on the scripted split doesn't exist. We already trained on all of it.

### Meeting real speech for the first time

Up to this point every WER number in the project was measured on `dev_test/scripted`: people reading sentences into a microphone. That's the easy half of the dataset. The hard half is `unscripted`: spontaneous speech, real conversational pacing, `[cs]English phrase[cs]` markers where the speaker code-switches mid-sentence, `[pause]` tokens where they pause to think. We had not evaluated on this split even once.

198 clips, both models. Numbers below are at the **G-seq recipe** (sequential decoding, raw refs), which is what the project shipped with first; the canonical recommended-recipe (B5-chunk) numbers are in the verdict table later in this post.


| Model | Scripted WER | Unscripted WER (G-seq raw) |
| --- | --- | --- |
| T2-best | 60.5% | 87.64% |
| T3-best | 56.20% | 83.53% |

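
Every number in these tables is a word error rate, so it helps to see what the metric counts. The project scores with `jiwer`; the function below is a dependency-free sketch of the same idea (word-level edit distance divided by reference length), not jiwer's implementation. Because insertions count as errors, WER can exceed 100%, which is how a zero-shot score of 124% is possible.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words consumed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_word != hyp_word),  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


print(wer("a b c d", "a b c d"))  # identical strings
print(wer("a b", "a x y z"))      # one substitution plus two insertions over two words
```

The second call returns 1.5, i.e. 150% WER: a hypothesis that runs longer than the reference is penalized for every extra word.
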

On unseen speech types, T3 still beats T2 by about 4 points. Good. The improvements generalize. But 83% WER is the kind of number where you stop celebrating and start asking what's actually going wrong.

![Scripted vs unscripted WER for T2-best and T3-best. T2: 60.5% scripted vs 75.4% unscripted. T3: 56.2% scripted vs 65.6% unscripted. T3 closes the domain gap from 14.9 to 9.4 percentage points.](/blog/kalenjin-scripted-vs-unscripted.png)

A 30-sample inspection of T3's predictions on scripted clips showed the model is closer than WER suggests. Errors cluster into two phonetically-reasonable categories:

- **Word-boundary splits.** `Chopchinkee` (reference) vs. `Chopchin gei` (prediction). Same sound, whitespace in a different place. WER counts two edits.
- **Orthographic variants on proper nouns.** `Osinia` vs. `Bosnia`, `Kadribe` vs. `Candreve`. The model hears a plausible name, it's just not the one the transcriber chose.

To a Kalenjin-speaking ear the predictions sound right. To `jiwer.wer`, they're mistakes. Some of that 83% is real domain gap. Some of it is the metric punishing faithful transcription.

### The normalizer experiment

Before spending more money on Phase 4, I wanted to know how much of the unscripted number is the model genuinely failing and how much is formatting mismatch. The transcribers decorate unscripted references with `[cs]...[cs]` markers around English code-switches and `[pause]` tokens where the speaker paused. The model doesn't emit any of that markup. Every `[cs]` tag in a reference is a guaranteed error regardless of how well the model transcribes the surrounding speech.

I wrote a small normalizer ([`normalize_unscripted_wer.py`](https://github.com/tonykipkemboi/whisper-kalenjin-lora/blob/main/04-results/normalize_unscripted_wer.py)) that strips `[cs]` tags while keeping the inner text, drops disfluency tokens, and normalizes punctuation. Then re-scored.

**Recipe: G-seq** (sequential decoding, normalized refs+preds).


| Model | Raw WER | Normalized WER | Δ |
| --- | --- | --- | --- |
| T3-best | 83.54% | 79.67% | −3.87 pts |
| T2-best | 87.64% | 84.93% | −2.71 pts |

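
For readers who want the shape of that normalization without opening the repo, here is a minimal sketch. It is not the contents of `normalize_unscripted_wer.py`: the regexes, the lowercasing step, and the choice to keep apostrophes are all assumptions made for illustration.

```python
import re

# Markup tokens the transcribers add; the model never emits them.
_TAGS = re.compile(r"\[(?:cs|pause)\]", re.IGNORECASE)
# Anything that is not a word character, whitespace, or apostrophe (assumed rule).
_PUNCT = re.compile(r"[^\w\s']")
_SPACES = re.compile(r"\s+")


def normalize(text: str) -> str:
    """Strip [cs]/[pause] markers, drop punctuation, collapse whitespace."""
    text = _TAGS.sub(" ", text)    # removing both [cs] markers keeps the inner text
    text = _PUNCT.sub(" ", text)   # replace with a space so "a,b" becomes two words
    return _SPACES.sub(" ", text).strip().lower()  # lowercasing is an assumption


print(normalize("Tuget neyoche alen [cs] chakula [cs] ichotet"))
print(normalize("kibet, [pause] icheket."))
```

Applying the same function to references and predictions before scoring is what keeps the comparison symmetric; normalizing only one side would trade one mismatch for another.
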

79 out of 198 references (40%) contained `[cs]` tags; 2 had `[pause]`. The normalizer claws back a few points but not many. The markup is a measurable tax, not the main story.

The remaining WER of roughly 80% is a real domain gap between scripted read-speech and spontaneous speech. Different pacing, different vocabulary, different everything. Data cleaning isn't going to fix it. Phase 4 needs actual unscripted training data, and we can stop second-guessing the evaluation.

### English regression, now with a third data point

Revisiting the catastrophic-forgetting question across all three adapters on 50 LibriSpeech clips:


| Model | English WER | Δ vs. base |
| --- | --- | --- |
| Base whisper-large-v3-turbo | 12.68% | |
| + T1 smoke adapter (4k clips, q/v) | 16.39% | +3.71 pts |
| + T2-best (20k clips, expanded LoRA) | 16.94% | +4.26 pts |
| + T3-best (all scripted, expanded LoRA) | 17.70% | +5.02 pts |


Five WER points worse on English over the course of the project. That sounds worse than it is. Looking at the actual predictions, the diffs are almost entirely punctuation and capitalization: commas, periods, hyphens in compound words (`queen-mother` vs `queen mother`). The model is still transcribing the English words correctly, just decorating them differently than the LibriSpeech references. Stylistic drift, not catastrophic forgetting. Production systems can normalize this away, and a small English-replay mix in Phase 4 should take most of it back.

![English LibriSpeech WER across base Whisper (12.68%), T1 smoke (16.39%), T2-best (16.94%), and T3-best (17.70%), a ~5 point regression, mostly punctuation drift.](/blog/kalenjin-english-regression.png)

### Shipping to Hugging Face

Both T3 artifacts are published:

- [`Tonykip/whisper-kalenjin-lora-v3-turbo`](https://huggingface.co/Tonykip/whisper-kalenjin-lora-v3-turbo). The LoRA adapter by itself, ~50 MB. Compose with the base model at load time.
- [`Tonykip/whisper-kalenjin-v3-turbo`](https://huggingface.co/Tonykip/whisper-kalenjin-v3-turbo). The full merged model, ~1.6 GB. Drop-in replacement for `openai/whisper-large-v3-turbo`.

Both MIT-licensed, matching the base. Attribution goes to the Anv-ke/Kalenjin dataset authors for the audio. The released artifacts ship only model weights, no dataset text or audio. The dataset is gated on HF, so anyone who wants to reproduce has to request access themselves.

Small bug caught shortly after publishing: `adapter_config.json` had `task_type: null` instead of `SEQ_2_SEQ_LM`. PEFT's auto-loading works either way for inference, but the null value made some tooling skip adapter-specific code paths. Patched directly on the Hub and fixed in [`modal_train.py`](https://github.com/tonykipkemboi/whisper-kalenjin-lora/blob/main/03-training/modal_train.py) so future runs won't repeat it.
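
A guard like the following would catch the null `task_type` before an adapter is uploaded. It is a sketch of the idea rather than the change made in `modal_train.py`: `patch_task_type` is a hypothetical helper, and the demo config is a throwaway file with illustrative values.

```python
import json
import tempfile
from pathlib import Path


def patch_task_type(config_path, task_type="SEQ_2_SEQ_LM"):
    """Fill in `task_type` in an adapter_config.json when it is missing or null.

    Returns True if the file was rewritten, False if it was already set.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    if config.get("task_type") is not None:
        return False
    config["task_type"] = task_type
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return True


# Demo on a temporary file; the keys other than task_type are illustrative.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "adapter_config.json"
    cfg.write_text(json.dumps({"peft_type": "LORA", "r": 16, "task_type": None}))
    changed = patch_task_type(cfg)
    patched = json.loads(cfg.read_text())
    changed_again = patch_task_type(cfg)

print(changed, patched["task_type"], changed_again)
```

Setting the task type when the LoRA config is first built is the cleaner fix; a post-hoc check like this is mainly useful for adapters that already exist on disk.
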

### Microsoft Paza: the benchmark ran, the ceiling didn't hold

While this project was running, Microsoft released [`paza-whisper-large-v3-turbo`](https://huggingface.co/microsoft/paza-whisper-large-v3-turbo) covering five East African languages including Kalenjin. Obvious thing to do: compare. Less obvious: actually getting a usable number out of it.

The earlier attempts had failed in predictable ways. `language="sw"` made Paza hallucinate Swahili and auto-detect Kikuyu (`<|kik|>`) mid-output. 916% WER, not a fair measurement. `language="kln"` crashed outright because `transformers.models.whisper` hardcodes the original 99 Whisper languages in `language_to_id`, and Paza's Kalenjin token extension isn't in that map. Fix: bypass `language=` entirely, construct `decoder_input_ids` as raw token IDs.

**Attempt 1. No timestamps.** Built the prompt by hand: `[<|startoftranscript|>, <|kln|>, <|transcribe|>, <|notimestamps|>]`. Added `repetition_penalty=1.3` and `no_repeat_ngram_size=3` to suppress Whisper's well-known hallucination tails. Ran it on the same 198 unscripted clips we'd been benching T3 against.

**Result: 112.55% WER.**

The story in the transcripts is revealing. Sample 1's REF starts `Mianwogik Che nootin...` and Paza's PRD starts `mionwogik che nootin...`. Nearly perfect on the first phrase. Then the output keeps going. It hallucinates past natural silence, drifts into Swahili, emits doubled `<|kln|><|kln|>` language tokens in the middle of sentences, eventually lands in word salad. The first sentence is genuinely Kalenjin. Everything after is a lottery.

**Attempt 2. Timestamps plus decoder attention mask.** This is the canonical fix for Whisper's "keeps generating past end of short clip" failure mode. Dropped `<|notimestamps|>`, set `return_timestamps=True`, passed `decoder_attention_mask=torch.ones_like(decoder_input_ids)` so the forced prefix tokens get proper attention.

**Result: 99.14% WER.** Marginal.
New failure modes made themselves visible: sample 7 terminated immediately with empty output after `<|kln|><|kln|>`; sample 8 fell into a tight repetition loop (`kibuko kibuki kibu kibo`); sample 9 mid-sentence language-switched to `<|kik|>`. Different pathologies, same ceiling. Two attempts and roughly $1–2 in compute later, I called it.

Honest conclusion: Paza is clearly Kalenjin-capable. The opening phrase on most clips is coherent and recognizable. But with stock `transformers.generate()`, it hallucinates past audio end and/or falls into repetition loops. A real ceiling measurement would need either Microsoft's internal inference recipe (not published) or VAD-based audio trimming to cut silence tails before generation. Both out of scope.

What this means for positioning: I cannot honestly claim "we beat Paza." I can say this. The LoRA approach here, vastly smaller and simpler, produces parseable Kalenjin output out of the box at **65.56% normalized unscripted WER under the recommended recipe** (or 79.67% under the G-seq recipe Paza was effectively benchmarked at; pick the comparison that matters to you). Paza has a higher capability ceiling that I couldn't measure without more work. T3's numbers stay as the model-we-can-measure baseline.

### Phase 4: the $3 probe that said no

With the unscripted-domain gap confirmed real, Phase 4 is a second LoRA warm-started from `t3_best_adapter` and trained on unscripted data. The shape of the problem, going in:

- **The unscripted split is 3.4x the size of scripted.** 137 shards / 97.76 GB, vs. 41 shards / 28.90 GB for scripted. A full 2-epoch pass over all of it runs ~$50–70. A lot to bet on one curve.
- **Gains on unscripted were uncertain.** Honest prior from the pattern of T1→T2→T3 and the 79.67% baseline: expected outcome (~55%) was 58–66% normalized unscripted WER, a 15–25 point drop. Home run (~20%) was 50–55%. Modest (~20%) was 67–73%. Flat-or-regression was a real ~5% tail.
- **Known side effects.** Scripted WER will regress; T3 stays the scripted specialist. English regression gets worse (already at +5 points). Diminishing returns on data: the last 60 shards probably buy only 3–5 WER points beyond the first 40.

So the decision: don't commit $50+ upfront. Run **T4a first**. 40 unscripted shards (the first ~30% of the corpus), warm-started from T3, 2 epochs, learning rate dropped to 5e-5 (T3 used 1e-4; halving it to keep the warm-start from blowing away T3's scripted knowledge), LoRA r=16 on `q/k/v/out/fc1/fc2`, batch 12 on an A100-40GB, `normalize_text=True` baked into both training labels and eval so neither side gets penalized for the `[cs]`/`[pause]` markup tax. Budget: ~$15–18, ~6–10 hours.

One change worth naming. The text normalizer that was used only for post-hoc rescoring is now baked into the training pipeline. `_KalenjinDataset.normalize_text=True` strips `[cs]` and `[pause]` markers from labels before tokenization, so T4a won't learn to emit markup the model doesn't need. Rows that normalize to empty strings get skipped by the existing bad-row fallback. The whole point of measuring the markup tax in the normalizer experiment was to decide whether to bake it in. Now it's baked in.

Two small footnotes from getting it launched:

- **Modal rejects 48-hour timeouts.** First T4a submission set `timeout=172800` (48h). Modal's hard cap is 86400s (24h). It failed at submission, not partway in. Dropped to 24h and relaunched.
- **HF Hub dropped a connection mid-shard-pull.** First T4a run died at ~20/40 shards with a `ChunkedEncodingError`. Added a 5-retry loop with exponential backoff around the shard download. It resumed cleanly, skipping the 20 shards that had already landed.

Data infrastructure always has edge cases, and you find them at the worst possible time.

### T4a, the result

T4a ran. It didn't work.
Training loss started at 2.40, plateaued in the 1.7–1.9 band through most of epoch 1, dipped briefly into 1.5–1.7 at the start of epoch 2, and ended around 1.6–1.8. The 1.2–1.4 range I'd been hoping for mid-training never arrived. 50 minutes wall time, ~$3 total including the shard pull. Much cheaper than the $15–18 I'd budgeted, because the plateau showed up so early there was nothing to wait for.

Final eval on the same 198-clip unscripted dev shard. **Recipe: G-seq** (sequential decoding, normalized refs+preds; this was the current recipe at the time of T4a).


| Model | Scripted WER | Unscripted WER (G-seq norm) | eval_loss |
| --- | --- | --- | --- |
| T2-best | 60.5% | 84.93% | 0.90 |
| T3-best | 56.20% | 79.67% | |
| T4a (40 shards, warm-start, LR 5e-5) | | 80.63% | 2.12 |


T4a came in ~1 WER point worse than T3 on unscripted. Inside the noise band for 198 clips. Call it flat with a slight regression, not a definitive "worse." The 15–25 point drop I was hoping for is nowhere in that number. T4a landed squarely in the 5% flat-or-regression tail of the prior distribution.

This is what probe runs are for. $3 of honest signal beats burning the remaining ~$50 on T4-full discovering the same ceiling, or worse.

### Why it might have flopped

Several plausible root causes, none provable without more runs:

1. **Learning rate still too high.** 5e-5 was already half T3's 1e-4. Half may not be enough when warm-starting into a different domain. Something like 1e-5 would let the adapter drift more gently instead of thrashing.
2. **Catastrophic forgetting wins at this data scale.** 40 unscripted shards is enough to start erasing T3's scripted-learned patterns but not enough to learn unscripted-optimal ones. Model ends up stranded between two local minima. Specialist in neither.
3. **Loss plateau at ~1.7 says the optimizer found a basin early and stopped.** If that basin isn't meaningfully better than T3's starting point, every step after was wasted compute.
4. **The domain gap may be bigger than hyperparams can close at this data scale.** Maybe the full 137-shard run would have broken through. The probe can't rule that in, only out.

Most likely it's some mix of 1 and 2. Either way, throwing more shards at the same recipe won't fix it.

### T4b and T4c, two more probes

T4a told me something was wrong with the recipe. So I ran two more probes, same 40 shards, each testing a different hypothesis.

**T4b: lower LR.** Same 40 unscripted shards, warm-start from T3, but LR 1e-5 instead of 5e-5. Tests "is the adapter drifting too fast." Cost: ~$3, ~50 min.

**Result: 81.42% normalized WER.** Worse than T4a. Lower LR didn't slow drift in a useful way. It barely learned anything new and still walked slightly away from T3's basin.
Two variants, neither better than T3.

**T4c: scripted replay.** Same 40 unscripted shards plus 10 scripted shards mixed in, 80/20 by shard count. Same LR as T4a (5e-5). Tests the catastrophic-forgetting hypothesis directly. Cost: ~$5, ~80 min (scripted shards have more rows per shard, so step count grew from 1664 to 4268).

First run crashed at step 2416 when my local Modal client lost heartbeat. Only checkpoint-1500 was saved. Relaunched it clean. Second attempt completed.

**Result: 79.64% normalized WER.** T3 is 79.67%. That's a tie at the G-seq recipe.


| Model | Config | Normalized unscripted WER (G-seq) |
| --- | --- | --- |
| T3-best | scripted only, no Phase 4 | 79.67% |
| T4a | 40 unscripted, LR 5e-5, warm-start | 80.63% |
| T4b | 40 unscripted, LR 1e-5, warm-start | 81.42% |
| T4c | 40 unscripted + 10 scripted, LR 5e-5, warm-start | 79.64% |

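
With only 198 clips, gaps of a point or less between these rows are hard to separate from sampling noise. The post doesn't say how the noise band was estimated; one common way to put a number on it is a percentile bootstrap over clips, sketched below. The function names are hypothetical and the per-clip error counts are synthetic, generated only so the example runs.

```python
import random


def corpus_wer(errors, ref_words):
    """Corpus-level WER: total word errors over total reference words."""
    return sum(errors) / sum(ref_words)


def bootstrap_ci(errors, ref_words, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for corpus WER, resampling whole clips."""
    rng = random.Random(seed)
    n = len(errors)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample clips with replacement
        stats.append(
            sum(errors[i] for i in idx) / sum(ref_words[i] for i in idx)
        )
    stats.sort()
    low = stats[int((alpha / 2) * n_resamples)]
    high = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Synthetic per-clip counts for 198 clips (NOT the project's data).
rng = random.Random(42)
ref_words = [rng.randint(20, 130) for _ in range(198)]
errors = [round(w * rng.uniform(0.5, 1.1)) for w in ref_words]

point = corpus_wer(errors, ref_words)
low, high = bootstrap_ci(errors, ref_words)
print(f"WER {point:.2%}, 95% interval {low:.2%} to {high:.2%}")
```

If two models' intervals overlap heavily, as a 0.03-point gap almost certainly would, treating them as tied at that recipe is the defensible reading.
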

T4c's internal training eval read 79.32%, which looked like a 0.35-point win when I first saw it. The standalone eval pipeline reproduces it at 79.64%, same as T3. At G-seq, T3 and T4c are indistinguishable.

*That conclusion is overturned later in this post once the recipe is upgraded: the verdict table at the recommended recipe (B5-chunk) shows T3 at 65.56% and T4c at 77.42%, an 11.86-point gap. Dialect stratification at the recommended recipe further surfaces a 22-point T4c skew toward Nandi.*

![Unscripted Kalenjin WER at the recommended recipe (chunked + beam=5): T3 (shipping) 65.56%, T2-best 75.37%, T4c 77.42%, T4a 79.86%. T3 wins by 9-14 WER points; every Phase 4 attempt to beat T3 made things worse.](/blog/kalenjin-phase-4-comparison.png)

### Listen and judge

I almost missed this. Playing the first clip on my phone after everything had shipped, the T3 transcript looked fine next to the short REF I had displayed, until the audio kept going. The clip is 60.6 seconds. The REF I had pasted was one sentence. The real REF is six sentences, and T3 had transcribed the first two.

That sent me back to measure coverage across all 198 clips:


| Metric | T3 | T4c |
| --- | --- | --- |
| Mean coverage (predicted words / reference words) | 68% | 94% |
| Median coverage | 67% | 97% |
| Clips with <60% coverage (severely truncated) | 90 (46%) | 32 (16%) |
| Clips with >140% coverage (over-generating) | 1 (0.5%) | 13 (6.6%) |

And by REF length:

| REF length (words) | n | T3 coverage | T4c coverage |
| --- | --- | --- | --- |
| < 30 | 27 | 92% | 97% |
| 30–59 | 45 | 91% | 104% |
| 60–89 | 74 | 59% | 99% |
| 90–119 | 41 | 50% | 80% |
| ≥ 120 | 11 | 40% | 63% |

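
The coverage metric in these tables is simple enough to recompute by hand. The sketch below uses the word counts printed in the headers of the ten sample clips in this post; the helper names are hypothetical, and counting words by whitespace splitting is an assumption about how the original numbers were produced.

```python
from statistics import mean, median

# (REF words, T3 words, T4c words) from the ten sample-clip headers in this post.
CLIPS = [
    (92, 56, 52), (101, 52, 96), (56, 56, 105), (117, 56, 98), (124, 63, 92),
    (50, 45, 48), (65, 59, 92), (119, 64, 91), (81, 45, 55), (87, 42, 103),
]


def summarize(pairs, low=0.60, high=1.40):
    """pairs: iterable of (ref_words, pred_words). Returns coverage statistics."""
    cov = [pred / ref for ref, pred in pairs]
    return {
        "mean": mean(cov),
        "median": median(cov),
        "truncated": sum(c < low for c in cov),         # below 60% coverage
        "over_generating": sum(c > high for c in cov),  # above 140% coverage
    }


def bucket(ref_words):
    """Map a reference length to the REF-length bands used in the table."""
    for upper, label in ((30, "< 30"), (60, "30-59"), (90, "60-89"), (120, "90-119")):
        if ref_words < upper:
            return label
    return ">= 120"


t3_stats = summarize((r, a) for r, a, _ in CLIPS)
t4c_stats = summarize((r, b) for r, _, b in CLIPS)
print(t3_stats)
print(t4c_stats)
print([bucket(r) for r, _, _ in CLIPS])
```

On these ten clips T3 is flagged as truncated six times and T4c once, while T4c over-generates twice, which matches the direction of the 198-clip numbers. Coverage says nothing about whether the words are correct, so it complements WER rather than replacing it.
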

T3 is silently dropping about half the audio when clips get long. On long-form speech (60+ words of reference, typically 30+ seconds of audio), T3 stops transcribing around the halfway point. T4c keeps going.

That has a mundane cause: Whisper's audio encoder has a 30-second native window, and my eval pipeline doesn't chunk long audio into overlapping 30-second passes. Inside that constraint, T3 stops cleanly at the end of what it sees. T4c, trained with `[pause]` markers stripped from labels, learned to keep generating through silences instead of treating them as stop signals.

But T4c's extra coverage comes with a catch. On several clips it gets most of the real content and then falls into a repetition loop, `"kora kora kora kora…"` for another 20 or 30 words. Clips 2, 3, 4, 5, 7, 8, and 10 below show this pattern clearly. It's a classic Whisper hallucination tail: the decoder keeps going past the audio and picks a token it likes the sound of.

Both models are imperfect. The failure modes are different in a way WER can't distinguish, but a human listener can. T3 gives you the first half of what was said, clean. T4c gives you most of what was said, with a sometimes-garbled tail that's trivial to crop in post-processing.

Read the pairs below with that in mind. The REFs are full: nothing truncated, nothing cleaned up. The durations, REF word counts, and coverage percentages are labelled so you can see exactly where each model stops tracking the audio.

**[1] (60.6s · REF 92w · T3 56w=61% · T4c 52w=57%)**

REFMianwogik Che nootin chebo ingokenik ko miando nebo amaldaa nenomdoogee ingokyet kongeten ingokyeet age bokoi age.Mionitok koyon kokochut igere kobirto meet ingokyet I akonaam kochirireem akorue kityok en betut tugul amomuche okot kowendot.Imuch kobar ingokenik eleo en sait agetugul,en saisiek tuten ko kakobaar ingokenik chechang.Ele kinyoito miondo neunitok koyoche iweibu kerichek ak inyon isochi,kisochin serunek I kotoretii kora kosto mionotok yon kokochut.Ne sir e missing koyoche kochonchonii chito ingokenik en kila arawek somok kochanchan sikobiit komachut mionotok.Miten oretage nekeboishien nebo kipgaa neimuch ichulchi beek ak tangorotwet kosto kora mionotok akokochi ingokenik konyoor chemetabgee.
T3Mionwokik chenootin chebo ngogenik ko miondo nebo amalda ne inomdoi eng' ngokyet kong'eten ngokyet agenge akoi age sanasana ko yon bendot ingogenik. Mioniton ko yon kokochut igere kobirto meti ngokyet ak konam kochoriren ak korue kityok eng' betut tugul amamucho kot kwendot.Imuch kot kobar ingogenik che leu en sait agenge en saisiek tuten kokakobar ingogenik chechang.
T4cMionwokik chenootin chebo ngokenik ko miondo nebo amalda neinomdogei ngokiet kongeten ngokiet agenge akoi age sanasana koyon bendote ngokenik Mioniton ko yon kakochut igere kobirto meti ngokiet akonam kochoriren akorue kityok en betutuk kole mamuche okot kwendot Imuch okot kobar ngokenik che leu en sait agenge anan saishek tuten kokakobar ngokenik chechang
**[2] (46.5s · REF 101w · T3 52w=51% · T4c 96w=95% + repetition tail)**
REFBatainik ak chepokitkit keribe kipagenge [cs] sanasana [cs] inyoru chito kogabai chepokitkit ak kotesta batainik. Komoche ripet neo mising moche amitwogik chemi barak, moe ko ele rue koburgei mising, ak moyoche komanda koba ole loo. Ak ngandan batainik komoe ele uwon kotwonit asikobit konyor tuguk cheu kanyitonik kwam amun amitwogik kywak che chome mising. Ogere ko roisi ripet ab ngokenik kosir nebo chepokitkit ak batainik, ingokenik ko malazima ibeiolchi amitwogik en tuget ak imuche ibaen kora kityo amitwogik chebo gaa yon memouche ial chebo tuget, ngokenik ko imuche ityakte kityo kobendat ak kochengige amitwogik, ak momoe ripset neo kou chepokitkit ak batainik
T3Batainik ak chebokitkit keribe kibagenge sanasana,inyoru chito kobai chebokitkit ak kotestab batainik,komoche ribet nekimising,moche amitwogik chemi barak,moe ko ele ruwe koburgei missing ak moyoche komanda kobo elelu,angana batainik komo ele uon katuonit asikobit konyoru tuguk cheu kanyitonik koam amu amitwogikchwak chechob mising.Akere kora isir ribet ab ingogenik kosir nebo chebokitkit ak batainik.
T4cBatainik ak chebokitkit keribe kibagenge sana sana inyoru chito kokabai chebokitkit ak kotestai batainik komoche ribet neki mising moche amitwokik chemi barak moche ko ele ruwe koburgei mising ak moyoche komanda kobai leloo nganan batainik komoche ele uon kotuanit asikobit konyoru tukuk cheu kanyitoni kwam amun amitwokikwak chechome mising akere koroisi ribetab ingokenik kosir nebo chebokitkit ak batainik ak kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora
**[3] (32.5s · REF 56w · T3 56w=100% · T4c 105w=188% — repetition tail)**
REFKit neyoe ingokenik missing koam mainik yon kokolok i,ko tuguuk che rortotiin en bortoo chekikuuren calcium [cs].Yon kokobek tuguchotet chetoretii ingokenik kokimitee konome ingokenik koame mainik.Kitage kora ko tabiet anan atebet nekakonaitaa ingokyeet nekwokoloku konaam kwam mainik kotesetai anyone koame mainik en kila ak kila.Yon miten ingokenik chekeren koame mainik komuch koib atebonotet koam akichek mainik.
T3Kit neyae ngogenik mising kwa mainik yon kakolog ko tuguk che rartotin en borto che kikuren calcium,yon kakobek tuguk chotet che toreti ngogenik kokimite koname mainik,kit age kora ko tabiet anan atebet nekakonaita ngokiet en bogol logu konam kwa mainik,kotesetai anyun koame mainik en kila ak kila.Yomiten korak ngogenik che keren koame mainik komuch koiboto maatik.
T4cKit neyae ngokenik mising ko maainik yon kakolok ko tukuk che rartotin en borto che kikuren calcium yon kakobek tukuchotet che toreti ngokenik kokimite konome maainik ngokenik kwome maainik kit age kora ko tabiet anan tebet ne kakonaita ngokiet en bokol loku konam kwa maainik kotesetai anyun kwome maainik en kila ak kila yon miten kora ngokenik che keren kwome maainik komuch koiboto maainik kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora
**[4] (49.5s · REF 117w · T3 56w=48% · T4c 98w=84% + repetition tail)**
REFOreet [?] yoon kakekes en mbar koindoi koot nenaat kekuren chokoo, indoi yotet ak ikonor koyomnyo kokut koristo sikobit kokororonekitu ak kobur anyun kasarian kiten imuch ibir anan ibor iboren chebore nekikoit en kasari , yoon koibor anyun inde kinuok chekikoit chebo kasari chemalasima okot inde kerichek , indoi kinuo chotet sikobit komangem tiongik cheuu susurik ak kiit neyoe kotom inam inde kinuok chotet konyolu korong imaa iker ilee koyomnyo ak isach ak inde kinuochotet isach komongemokse anyun oo ngetu kokororon, ko biik chemoboishen kinuo chotet komuch konde kerichek ak ichek ak kokonor en choko elee naat ko koot nekararan mising neimuch korib bandek en kasarta neko ak imuch kobur kabisa kokakee ak konget kokororon en yotet.
T3Oret elemonit nekolepian kakekesen imbar koinde ibot manas kekerer koget,inde yotet akikoler koyonyo kobit beriton sikogit chekororon ak itin,iburanyun kasari ak kiten,imuch iwer anan ibor,iboren chebore nekikoiten kasari,ango iboranyun indegenuok chekikoit chebo kasari chemo lasima ak kotinde kerichek indegenuok chotet asikobit komong'em tiongik cheu susurik.Ak kit neyoe kotominam indegenuok chotok konyolu korong'in maa iger ele koyobo.
T4cOret ole maat nekaketyon kakekeseen imbar ko indoi kot manat kekereen kowek Indoi yotet ak ikonor koyomyo kobit koriton sikobit chekororon ak itin Ibure anyun kasari ak kiten imuchi iwer anan ibor iboren cheibure nekikoiten kasari anan koibor anyun inde kinuok chekikoit chebo kasari chemo lazima akot inde kerichek inde kinuok chotet asikobit komang'em tiong'ik cheu susurik Ak kit neyae kotomo inam inde kinuok chotet konyolu korong imaa iger ile koyomyo kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora k
**[5] (53.7s · REF 124w · T3 63w=51% · T4c 92w=74% + repetition tail)**
REFYon omoche anam abai ngokenik ko kit neokere netai koyoche otindoi ole anyon aboen ngokenik chotet. Yoche ko korochob kot ne korataban ne kikuren kapingoken neimuch koteben ngokenik ak ke ripe en yotet sikonyor chametab ge nekararan. Kit age neokere kora ko amitwogik che nyon kwome ingokenik. Yoche kebai ngokiet sikobit kenyorunen kelchin en maainik ko kit ne ogere nebo oeng kora koyoche komiten amitwogik chenyon koame kotko chebo gaa anan che yoche aal en tuget. Tuget neyoche alen [cs] chakula [cs] ichotet anan amitwogik chotet koyoche ataban rabinik che yomege ak amitwogik chotet. Nyolu ager kora ole miten beek cheimuche koye ngokenik kosibge ak notet koyoche komoche koye ngokiet beek che kororon che tililen sait age tugul, sikobit konyor chamet ab ge nebo bortonyin.
T3Yon amache anam abaa ingogenik,kobiit ne ageren netab koyoche atindoi olenyon abo en ingogenichotet,yache kokorochob kot nekora taban nekiburen kap ingogen,neimuch kotoben ingogenik ak keriben yotet sikonyor chametabge nekararan.Kit age kora ne agere ko amitwogik che nyon ngwome ingogenik,yache keba ingokiet asikobit kenyorunen kelchin en maainik,ko kin ageren nebo oen kora koyoche komiten amitwogik che nyon ngwome,ngotko chebo gaa anan che yonche alak.
T4cYon amache anam abaa ingokenik ko kit neokere netab koyoche atindoi ole nyon aboen ingokenik chotet yache kokorochob kot nekora taban nekikuren kap ingokenik neimuch koteben ingokenik ak keriben yotet sikonyor chametabkei nekararan Kit age kora neokere ko amitwokik chenyon koame ingokenik yache kebaa ingokiet asikobit kenyorunen kelchin en maainik Ko kit neokere nebo oeng kora koyoche komiten amitwokik chenyon koame ngotko chebo gaa anan cheyoche ale ngokiet asikobit kenyorunen kelchin en maainik Ko kit neokere nebo oeng kora koyoche komiten amitwokik chenyon koame ngotko chebo gaa anan cheyoche ale ngokiet as
**[6] (27.1s · REF 50w · T3 45w=90% · T4c 48w=96%)**
REFWoo keljin nekenyorune basetab ingokenik ibakobai chito ingokenik konyoru keljinoik cheu kwalda mainik kii,imuch konyorunen mainik chechang missing ak kwalda anyun konyorunen rabinik yotet ak komuch ketoretengee en sobet nebo Kila betut.kit age netoreti ngokenik kii komii ngokenik chebo panyek komuch koraa koboishen chito panyek chotet kwalda kosigen yotet rabinik.
T3Woo kelchinik kenyorunen baetab ingogenik,ngwaboi chito ingogenik,konyoru kelchinaik cheu koalda mainik,imuch konyorunen mainik chechang missing ak koalda anyun konyorunen rabinik yotet ak komuch kotoretin gee en sobet nebo kila betut,kitage netoreti kora ingogenik komi ingogenik chebo banyek,komuch kora koboisie chito banyechotet koalda kosigen yotet rabinik.
T4cUu kelchinik kenyorunen baetab ingokenik ngwabai chito ingokenik konyoru kelchinoik cheu kwalda maainik imuch konyorunen maainik chechang mising ak kwalda anyun konyorunen rabinik yotet akomuch kotoretinkei en sobet nebo kila betut kit age netoreti kora ingokenik komi ingokenik chebo banyek komuch kora koboisien chito banyechotet kwalda kosigenyotet rabinik
**[7] (42.1s · REF 65w · T3 59w=91% · T4c 92w=142% — repetition tail)**
REFAyoonii ole ingogenik chebendotee konomin miondo missing kosir chekakigeer.Amune koni,angobendat ingokenik komuch kotuyoo ak alaak chemiondoos,ak konaam konamdage miondo nekinomdogeeii.Nenaat ot konebo amaldaa,neyoon kokochut ingokyeet konome kochorireni ingokyeet ak kobirtoo meet.Akomuchi okot kobaar,Kitage kora neyoe ingokenik chebendote koik onion miondoo,konenootok konesungukonii koboishien kimnotet tugul netindii kobendotii ele loo,konotok konyumnyum missing koik che miondos anan konan miondoo ak kibet icheket.Ko oyooni ole kororon ingokenik che kekeere.
T3Ayonii ale ingogenik chebendate ko konomin miondo missing kosir che kakiker amune ko ne ngobendat ingogenik komuch kotuiyek alak che miondos ak konam konamdage miondo nekinomdogei,nenatot ko nebo amalda neyon kakochut ingokyet konomin kochoriren ingokiet ak kobirto metik ak komuch okot kobar.Kitai kora neyoe ingogenik chebendate koek namin miondo ko nenoton ko nesungukoni koboishien ngukut tugul anan kimnotietab miondo.
T4cAyani ale ingokenik chebendate konamin miondo mising kosir che kakiker amune ko ngobendat ingokenik komuch kotuiyek alak che miondos ak konam konamdage miondo ne kinamdogei nenatot ko nebo amalda ne yon kakochut ingokiet koname kochoriren ingokiet ak kobirto meti ak komuch okot kobar Kitok kora neyoe ingokenik chebendate koek nomin miondo ko nenoton ko nesungukoni koboisien ngukut tugul anan kimnotet nebo miondo kochut kora nebo miondo kochut kora nebo miondo kochut kora nebo miondo kochut kora nebo miondo kochut kora nebo miondo kochut kora nebo miondo kochut kora nebo miondo kochut kora
**[8] (49.4s · REF 119w · T3 64w=54% · T4c 91w=76% + repetition tail)**
REFKiit netienge olietab mainik en ndonyo nebo kenya nguni ko tienge kolee unee mainik, kikere ngot ko kororon maini chotet anan kosomisen anan mainik chesomisen kisto en ndonyo chekororon konyoru oliet nemii barak mising, kiit akee nekikere ketebe kelee tos banee mainichu, kimoe kenai ngot ko ngokenik chekoloku ko chebo kipkaa anan [cs] kienyechi [cs] anan ko chekoloku chemobo kipkaa kibeshoen en yotet amun chome biik mising mainik chebo [cs] kienyeji [cs] ak woo mising olienyin anan mii barak kosir chebo kiret, kiit akee olekikere en olietab mainik ketebe kelee toos kingeloku mainichu ko auu, kimoe kenae kelee kikiloku auu, olee biik mainik che [cs] fresh [cs] alaan chekororon chekoto keloku konotet koboru kolee ne beit nekikoin maini chotet.
T3Kit netiengei olyetab mainik en indonyo nebo kenya nguni kotiengei kole unen mainik,kikere kotkororon mainik chotet anan kosomisen mainik chesomisen kistoi indonyo,kik chekororon konyoru olyet nemi barak missing,kit age nekikere ketebe kole tos banen mainikchu,kimoe kenai kotko ngogenik chekolok chebo kipkai anan kenyech,anan ko chekolok ko chemobo kipkaa,kibeshoen yotet amu chame missing bik mainik chebo kenyech ak kwo mising olyo nyin anan kenyek chokok.
T4cKit netienge olietab mainik en indonyo nebo Kenya nguni kotienge kole une mainik kikere ngot ko kororon mainichotet anan kosomisen mainik chesomisen kiistoi en indonyo ak chekororon konyoru oliet nemi barak mising kit age nekikere ketebe kole tos banen mainichu kimwae kenai kot ko ingokenik chekoloku chebo kipkaa anan kenyech anan ko chekoloku chemobo kipkaa kibeshoen yotet amun chome mising biik mainik chebo kenyech ak kwo mising olienyin anan kochome koraa koraa koraa koraa koraa koraa koraa koraa koraa koraa koraa koraa koraa koraa koraa koraa koraa koraa koraa koraa kora
**[9] (48.8s · REF 81w · T3 45w=56% · T4c 55w=68%)**
REFKit neoyoe siariib tililindo en kapingoken i,ko agere alee tililil yotet olemiten ingokenik ak abuche kila tuguchotet chebo ingokenik.Aistoo abuyee,kit age kora neoyoe ko ageer ale mo twon chotet amun ingotwonit komurituu missing.Apongoni kabisa yotet ak apangan tuguuk chekindoo beek,chekikitokos sikobiit komasertaa beek tuguchoton amun kot kosertaa koitwonitee yoteet koyee komurut missing yon kanyanyawan.Moe kora kit nekindoo siratik en orit,siyon koronde en yotet aib koba olemokebelen anan olebokeloktoen.Ondoi Dustbin [cs] chechang yote chomuche Ande saratik anan murindo nemiten yote.Sianaam anyun ak ababeel.
T3Kit neoyesio rib tililindo en ngab ngogen,koageer ale tililiotet elemiten ngogenik ak kobuche kila tuguchotet chebo ngogenik,aistoi abuche.Kitegen ayoe kora koageer ale motuoniotetik amun ngotwonit komuritu mising,abongoni kapsayotet akobangan tuguk chekindo beek chekikitek kosi,sikobit komaserta beek tuguchotet amun kotkosert ko tuonit yotet koyoe komurit kianganyanyawan.
T4cKit ne ayoe si arip tililindo en kap ingoken ko agere ale tililiyotet ele miten ingokenik akobuche kila tukuchotet chebo ingokenik aistoi abuche Kit ake ne ayoe kora ko agere ale matwaniyotet amun ngotwonit komuritu mising abangani kapsayotet akobangan tukuk chekindoi beek chekikitekos sikobit komaserta beek tukuchotet amun ngotkoserta kotwonit en yotet koyoe komurit kianganyanywan
**[10] (48.8s · REF 87w · T3 42w=48% · T4c 103w=118% + repetition tail)**
REFIngogenikyuk kolokuu kila betut,ak ayumii mainik chenekiit ite tamaan en betut.Kosiibge ak ne lokuu ingokenik chechang i,ako atindoi ingokenik chechang.Kit neoyoe yon amoche ayuum mainik,atokyini saisiek che loogu ingikenik.Ak yon kakolook kila betut ingobore lokuu ko koroiste mainiik en kapingogen.Awendi ak ayumm ak onde kit newu indoor anan ko kikaput ak aibu.Yeityoo anyun apangan en tureit sikobiit aalde.Atindoi anyun olindet neoldewon anan neolwon mainik en beit nekararan i,ako aldoi kila en tureisiek.Amuche ot okoito tureisiek oeng kila wikit.Ko tureit agenge kiolunon bogol tisap konotok nonyorunen kelchun missing.
T3Ingogenikyuk koloku kila betut ak ayumi mainik chenekite taman en betut kosiib kongu nelogu ingogenik chechang akotindo ingogenikyuk chechang.Kit neoyoyon amache ayu mainik ko atokyin saisiek che logu ingogenik ak yon kogolok kila betut ngobegu betut ko korosto mainik en kap ingogen.
T4cIngogenikyuk koloku kila betut ak ayumi mayainik chenekit nde taman en betut kosipko ang'un neloku ingogenik chechang ak atindoi ingogenikyuk chechang kit ne ayoe yon amoche ayuu mayainik ko atokyin saishek cheloku ingogenik ak yon kakolok kila betut ngobeku betut kokorosto mayainik en kap ingogen awendi ak ayum anyun akonde kit neu ndoit anan ko kikabut akoibu yeityo anyun abangan en tireit sikobit kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora kora k
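The per-clip headers above report hypothesis length as a share of the reference (for example, 64 words against a 119-word reference is 54%), and flag repetition tails by eye. A minimal sketch of both checks follows, assuming coverage is simply hypothesis words divided by reference words and using a zlib-based compression ratio as the loop detector. The function names are hypothetical, and the 1.35 threshold is borrowed from the decode settings used later in this post, not from a separate calibration.

```python
import zlib


def coverage(reference: str, hypothesis: str) -> float:
    """Hypothesis word count as a fraction of reference word count."""
    ref_words = reference.split()
    if not ref_words:
        return 0.0
    return len(hypothesis.split()) / len(ref_words)


def compression_ratio(text: str) -> float:
    """Raw byte length divided by zlib-compressed length.

    Highly repetitive text compresses well, so a looping tail pushes
    this ratio up sharply.
    """
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


# A clean fragment and a looping tail, both taken from the T4c samples above.
clean = "kit ne ayoe si arip tililindo en kap ingoken ko agere ale"
looped = "anan kochome " + "koraa " * 20

THRESHOLD = 1.35  # assumption: same value as the decode-time guard
flags = {name: compression_ratio(t) > THRESHOLD
         for name, t in {"clean": clean, "looped": looped}.items()}
```

Coverage above 1.0 combined with a high compression ratio is the signature of the repetition tails shown in clips [8] and [10]; coverage well below 1.0 points at truncation instead.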
A full dump of all 198 clips with T3 and T4c side by side is at [`t3-vs-t4c-unscripted.md`](https://github.com/tonykipkemboi/whisper-kalenjin-lora/blob/main/04-results/t3-vs-t4c-unscripted.md).

### Two more honesty checks: the inference fix, then dialect stratification

After everything above, I went back and did two follow-up runs that reshaped the story enough to flag clearly.

**1. Chunked inference + hallucination-mitigation decoding.** The 30-second encoder window was silently truncating ~46% of T3's predictions on long clips. I switched to chunked long-form inference (`chunk_length_s=30`, `stride=5`, `return_timestamps=True`, `compression_ratio_threshold=1.35`, temperature fallback). Under the recommended recipe (described below), mean coverage on T3 went from 67.8% (sequential) to 91.3% (B5-chunk). WER went *down* materially because beam search compounded the chunking fix.

**2. Per-dialect WER, the question I avoided asking.** The dataset has dialect labels (KIPSIGIS / NANDI) hiding in the meta.csv. Joining those to the 198 eval clips and computing under the recommended recipe (chunked + beam=5) gives this:
| Slice | n | T3 WER | T3 CER | T4c WER | T4c CER |
|---|---|---|---|---|---|
| Overall | 198 | 65.56% | 21.10% | 77.42% | 34.09% |
| Kipsigis | 156 | 65.51% | 21.40% | 79.14% | 34.62% |
| Nandi | 42 | 66.08% | 20.21% | 56.99% | 26.32% |
All numbers from [`canonical_metrics.json`](https://github.com/tonykipkemboi/whisper-kalenjin-lora/blob/main/04-results/artifacts/canonical_metrics.json), normalized identically across both models (lowercased, `[cs]`/`[pause]` stripped, punctuation collapsed). Recipe: chunk_length_s=30, stride=(5,5), num_beams=5, temperature fallback, hallucination guards.

Two findings that matter:

**T3 is balanced across dialects.** Kipsigis and Nandi WER differ by 0.57 pts (Kipsigis is *very slightly* better). Same model, both dialects, equivalent quality. CER is ~1 point better on Nandi, which is small and within sampling noise (n=42 vs 156).

**T4c is dialect-skewed by 22 WER points.** Compared to T3, it is about 9 WER points better on Nandi (56.99% vs 66.08%) and about 14 WER points worse on Kipsigis (79.14% vs 65.51%). The earlier "T4c ≈ T3" framing was based on G-seq normalized scoring; once we move to the recommended recipe and stratify, the asymmetry is much sharper. T4c ships better Nandi at the cost of worse Kipsigis. Not what the project's overall mission would call success.

Why T4c skews this way is something I can't fully explain. The most plausible hypothesis: the 40 unscripted shards T4c trained on may have been Nandi-skewed by chance. I haven't checked the per-shard dialect distribution to confirm, and that would be the next investigation if I were continuing. There are other plausible causes (acoustic features, training dynamics) I can't rule out without more runs.

What I will commit to:

- T3 serves Kipsigis and Nandi speakers comparably on this evaluation set. That's the primary recommendation.
- T4c is a research artifact showing that small-sample LoRA fine-tuning can specialize toward one sub-tribe. Worth documenting; not what I'd ship.
- The Kalenjin sub-tribes outside Kipsigis and Nandi (Tugen, Keiyo, Marakwet, Sabaot, Pokot, Terik, Sengwer) are not represented as labeled groups in the data.
Speakers from some of those communities are *probably* in the training data (counted as Kipsigis or Nandi for labeling purposes), but I cannot evaluate per-sub-tribe performance because the labels don't exist. A speaker from those communities trying the model may see worse results than a Kipsigis or Nandi speaker — and I can't tell them how much worse. That last point is the part I want to be especially clear about. The model and the project carry the name "Kalenjin" because that's what the dataset claims to cover and what the umbrella community calls itself. But the empirical evidence backs only Kipsigis and Nandi as measured groups. Anyone who depends on this for production should hear that as a caveat, not a closing-credits roll. ### The verdict, with full data I rewrote this section three times. Each rewrite came after a check that changed what the data said. The current version is built on every check we ran — chunked inference, CER reporting, dialect stratification, training-data composition audit, and a confirmation pass on a third adapter. The story it tells is much simpler than the path that got me here. **T3 ships as the only Kalenjin model.** Here's why, with the receipts. All numbers are at the recommended recipe (chunk_length_s=30, stride=(5,5), num_beams=5, temperature fallback, hallucination guards), normalized identically across all four adapters. Source: [`canonical_metrics.json`](https://github.com/tonykipkemboi/whisper-kalenjin-lora/blob/main/04-results/artifacts/canonical_metrics.json).
| Adapter | Overall WER | Overall CER | Kipsigis WER | Nandi WER | Dialect gap | Coverage |
|---|---|---|---|---|---|---|
| T3 (all 41 scripted, 2 epochs) | 65.56% | 21.10% | 65.51% | 66.08% | −0.57 | 0.913 |
| T2-best (20k subset, expanded LoRA) | 75.37% | 26.59% | 75.95% | 68.57% | 7.38 | 0.943 |
| T4a (40 unscripted, LR 5e-5) | 79.86% | 38.03% | 81.64% | 58.73% | 22.91 | 1.193 |
| T4c (40 unscripted + 10 scripted, LR 5e-5) | 77.42% | 34.09% | 79.14% | 56.99% | 22.15 | 1.169 |
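For readers who want to reproduce the per-dialect columns on their own predictions, here is a minimal sketch of the stratified scoring. It assumes the normalization described in this post (lowercasing, stripping `[cs]`/`[pause]`, collapsing punctuation), pools word errors per dialect rather than averaging per clip, and takes the dialect gap as Kipsigis WER minus Nandi WER, which matches the sign in the table. The row keys (`dialect`, `ref`, `hyp`) and the toy rows are hypothetical, not the project's actual schema or data.

```python
import re
from collections import defaultdict


def normalize(text: str) -> str:
    """Lowercase, drop [cs]/[pause] tags, collapse punctuation and whitespace."""
    text = text.lower()
    text = re.sub(r"\[(?:cs|pause)\]", " ", text)
    text = re.sub(r"[^\w\s']", " ", text)
    return " ".join(text.split())


def word_errors(ref: str, hyp: str) -> tuple[int, int]:
    """Word-level Levenshtein distance and the reference word count."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, ref_word in enumerate(r, 1):
        cur = [i]
        for j, hyp_word in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                             # deletion
                           cur[j - 1] + 1,                          # insertion
                           prev[j - 1] + (ref_word != hyp_word)))   # substitution
        prev = cur
    return prev[-1], len(r)


def wer_by_dialect(rows):
    """Pooled WER per dialect label, plus an 'Overall' bucket."""
    errors, words = defaultdict(int), defaultdict(int)
    for row in rows:
        e, n = word_errors(normalize(row["ref"]), normalize(row["hyp"]))
        for key in (row["dialect"], "Overall"):
            errors[key] += e
            words[key] += n
    return {key: errors[key] / words[key] for key in errors}


# Toy rows for illustration only; these are not real eval clips.
toy_rows = [
    {"dialect": "KIPSIGIS", "ref": "Kit ne ayoe, kora [cs] fresh [cs]",
     "hyp": "kit ne ayoe kora fresh"},
    {"dialect": "NANDI", "ref": "a b c d", "hyp": "a x c"},
]
scores = wer_by_dialect(toy_rows)
dialect_gap = scores["KIPSIGIS"] - scores["NANDI"]
```

Pooling versus per-clip averaging can shift the numbers on a set with a few very long clips, so whichever convention is used should be held fixed across adapters.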
T3 wins on every column that matters: lowest WER by roughly 10-14 points and lowest CER by roughly 5-17 points, a dialect gap of −0.57 (essentially zero, Kipsigis very slightly better), and coverage near 1.0 with no inflated values from repetition tails. T2 is second; T4a and T4c land last.

Both T4 variants over-generate (coverage > 1.16): they produce more output than the reference contains, primarily through repetition tails on Kipsigis clips. The dialect skew is also at its sharpest under B5-chunk: T4a gives 23 WER points better service to Nandi speakers than to Kipsigis. Better decoding made the training-data-composition bias more visible, not less.

**Why T4a and T4c are biased toward Nandi: it's the training data, full stop.** I scanned every row in the first 40 unscripted shards (which both T4a and T4c trained on) and joined to the meta.csv:
| Dialect | Rows in T4a/T4c training data | Share | Speakers |
|---|---|---|---|
| Nandi | 7,734 | 77.5% | 55 |
| Kipsigis | 2,231 | 22.4% | 46 |
| (empty) | 9 | 0.1% | 1 |
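The audit behind this table, and the stratified alternative to taking the first N shards, can be sketched in a few lines. This is an illustration under assumptions: rows are plain dicts with a hypothetical `dialect` key, the row counts are copied from the table, and the per-label cap of 2,000 is an arbitrary choice for the example, not a recommendation from the project.

```python
import random
from collections import Counter, defaultdict


def dialect_shares(rows):
    """Fraction of rows per dialect label; blank labels become '(empty)'."""
    counts = Counter(row["dialect"] or "(empty)" for row in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def stratified_sample(rows, labels, per_label, seed=0):
    """Draw up to `per_label` rows for each requested dialect label."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        if row["dialect"] in labels:
            groups[row["dialect"]].append(row)
    sample = []
    for label in labels:
        members = groups[label]
        sample.extend(rng.sample(members, min(per_label, len(members))))
    rng.shuffle(sample)  # avoid feeding the trainer one dialect at a time
    return sample


# Synthetic rows that reproduce the counts in the table.
rows = ([{"dialect": "NANDI"}] * 7734
        + [{"dialect": "KIPSIGIS"}] * 2231
        + [{"dialect": ""}] * 9)

shares = dialect_shares(rows)
balanced = stratified_sample(rows, labels=("KIPSIGIS", "NANDI"), per_label=2000)
balanced_counts = Counter(row["dialect"] for row in balanced)
```

Running `dialect_shares` before training costs seconds and would have surfaced the 77/22 split before any GPU time was spent.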
Both T4a and T4c saw roughly 3.5× more Nandi rows than Kipsigis rows (7,734 vs 2,231). They both inherited the bias. T4c's scripted-replay mix added some Kipsigis exposure back through the 10 scripted shards, which narrowed the dialect gap only slightly (22.91 → 22.15 points in the verdict table) and didn't close it.

There's no exotic explanation needed. Models learned what was in front of them. T3 was trained on all 41 scripted shards, which appear to have a more balanced dialect mix (we'd need to scan to confirm exact %s, but the speaker-level numbers from earlier, 68 Kipsigis vs 107 Nandi in train/scripted, suggest closer to dataset-level balance than 22/77).

**The lesson worth keeping for the next project.** Shard ordering in the dataset isn't randomized for dialect balance. Training on the first N shards of an unscripted split silently selects for whichever dialect happened to be uploaded first. For any future low-resource fine-tune, either train on the entire split or explicitly stratify-sample for the dimension you care about (dialect, gender, age, county, whatever the metadata gives you). Don't trust shard order to be uniform on metadata you haven't checked.

**Honest scope of the verdict.** This is measured on 198 unscripted dev_test clips. Sample sizes for the dialect stratification are 156 Kipsigis vs 42 Nandi. Nandi is the smaller side, which means its numbers carry more sampling noise. The 22-point dialect gap on T4a/T4c is large enough to be real signal even with that noise. The 0.57-point gap on T3 is well within sampling noise, which is why I'd call T3 balanced rather than provably-equal.

What I won't claim: that T3 is provably equally good for every Kalenjin sub-tribe. The dataset only labels Kipsigis and Nandi. Speakers from Keiyo, Marakwet, Tugen counties are likely in there (rolled into one of the two main labels), but I cannot evaluate per-sub-tribe accuracy because per-sub-tribe labels don't exist.
Speakers from communities likely absent from the dataset (Pokot, Sabaot, Terik, Sengwer) should expect worse performance. I cannot tell them how much worse. **Total Phase 4 spend:** roughly $20 across T4a, T4b, T4c-crashed, T4c-retry, all the follow-up chunked evals, and the dialect-stratification audits. Decision-quality data for the cost of a coffee shop dinner. **The T4-full question ($50+, all 137 unscripted shards plus replay) stays open but with a clearer specification.** If anyone runs it, stratify-sample by dialect during training to ensure both Kipsigis and Nandi exposure roughly match dataset proportions, otherwise you'll inherit whatever the shard-ordering happens to give you. The recipe (replay + chunked inference + decode-time hallucination guards) is documented; what's missing is the discipline to compose the training set carefully. **Companion languages** in the AfriVoices-KE family (Dholuo, Kikuyu, Maasai, Somali) are the obvious follow-up once the Kalenjin recipe is solid. The pipeline transfers. The same caveat about dialect / sub-group labels applies to each, and each language's `meta.csv` deserves the same audit-before-training treatment. ### One more decoding fix that mattered more than expected After everything above shipped, I tried two more decoder tweaks: beam search (instead of greedy decoding) and KenLM language-model rescoring on top of the beams. The expectation going in: beam helps a little, KenLM adds the bigger win. The published Whisper-LM paper reports 34-51% relative WER reduction from KenLM rescoring on low-resource languages. What actually happened:
| Decode recipe | WER (norm) | CER (norm) | Notes |
|---|---|---|---|
| T3 + chunked + greedy (the production recommendation we shipped with first) | 82.79% | 33.88% | Single best token per step |
| T3 + chunked + beam=5 (the recommended recipe) | 65.56% | 21.10% | −17.2 WER, −12.8 CER |
| T3 + chunked + beam=5 + KenLM rescoring (β sweep 0.1 → 100) | 65.56% | 21.10% | 0 swaps across all β values, no change |
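To make the "0 swaps" column concrete, here is a minimal sketch of n-best rescoring, assuming the common linear combination of acoustic log-probability plus β times a language-model log-probability. It is not the project's rescoring script: `lm_score` is a stand-in callable where a real setup would query the trained 5-gram model, and the candidate strings and scores are invented for illustration.

```python
def rescore(nbest, lm_score, beta):
    """Pick the candidate maximizing acoustic + beta * LM log-probability.

    `nbest` is a list of (text, acoustic_logprob) pairs, best-first.
    """
    return max(nbest, key=lambda cand: cand[1] + beta * lm_score(cand[0]))


def count_swaps(nbest_lists, lm_score, betas):
    """For each beta, count utterances where rescoring changes the top-1."""
    return {
        beta: sum(rescore(nbest, lm_score, beta)[0] != nbest[0][0]
                  for nbest in nbest_lists)
        for beta in betas
    }


# One toy utterance whose beams differ only in a spelling variant.
nbest_lists = [[("chebo ngogenik", -1.00), ("chebo ngokenik", -1.01)]]


def indifferent_lm(text):
    """Stand-in LM that scores both spellings identically."""
    return -2.0 * len(text.split())


def opinionated_lm(text):
    """Stand-in LM that prefers one spelling."""
    return -1.0 if "ngokenik" in text else -3.0


betas = (0.001, 0.1, 100)
swaps_indifferent = count_swaps(nbest_lists, indifferent_lm, betas)
swaps_opinionated = count_swaps(nbest_lists, opinionated_lm, betas)
```

With an indifferent language model no β can change the winner, while an opinionated one flips the result once β is large enough to outweigh the 0.01 acoustic margin. A swap count that stays at zero across a whole β sweep therefore means the language model never ranked any alternative above the acoustic top-1.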
**Beam search by itself dropped WER by about 17 points and CER by about 13 points** (82.79% → 65.56% and 33.88% → 21.10%), bigger than every LoRA-recipe change explored across T1, T2, T3, T4a, T4b, T4c. The model was always capable of producing better output, but greedy decoding was systematically picking less-fluent tokens because they happened to score 0.01 higher acoustically than a more-fluent neighbor.

KenLM rescoring did nothing. I trained a 5-gram model on 82,372 sentences from the train transcripts (~4.5 MB of clean Kalenjin text), then swept the language-model weight β from 0.1 all the way to 100. At no setting did KenLM ever pick a different candidate from the acoustic top-1. The most likely explanation: the 5 beams returned by Whisper differ in only one or two tokens that are equally plausible Kalenjin spellings (e.g., `ngokenik` vs `ngogenik`), so the language model has nothing to disambiguate. To get useful rescoring, we'd need a wider beam (10-20) or diverse beam search to force the candidates to actually disagree.

I'm leaving KenLM rescoring as a documented negative result. The recipe ships with beam=5 baked in; KenLM does not.

This is the second time on this project the inference pipeline turned out to matter more than the training. First it was chunked inference (which fixed the ~46% of transcripts that were being truncated). Now it's beam search (which fixed picking the wrong token). The lesson I'm walking away with: **fix the inference pipeline before retraining.** Most low-resource ASR writeups I've read since publishing this don't audit the inference side at all. They retrain with new techniques and report results. For this project, every inference fix outperformed every training-side change I tried.

The current recommended decode recipe lives in the "Try it yourself" section below, with `num_beams=5` baked in.

### The meta-lesson

Probe runs earn their keep by being allowed to fail, and they earn their keep again by being allowed to tie.
A $15 probe told me scripted replay is the right direction and 40 shards isn't enough data. A $50 full run would have told me the same thing louder. That's a win, even if the WER number isn't.

### Try it yourself

The shipping model is **T3**. After all the post-hoc checks (chunked inference, CER reporting, dialect stratification, training-data audit), T3 wins on every quality dimension and is balanced across both labeled dialects.

Two artifact formats on Hugging Face:

- [`Tonykip/whisper-kalenjin-lora-v3-turbo`](https://huggingface.co/Tonykip/whisper-kalenjin-lora-v3-turbo): LoRA adapter only, ~50 MB. Compose with the base model at load time.
- [`Tonykip/whisper-kalenjin-v3-turbo`](https://huggingface.co/Tonykip/whisper-kalenjin-v3-turbo): merged full model, ~1.6 GB. Drop-in replacement for `openai/whisper-large-v3-turbo`.

Both are MIT-licensed.

**Quickstart (Python):**

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="Tonykip/whisper-kalenjin-v3-turbo",
    chunk_length_s=30,
    stride_length_s=(5, 5),
    return_timestamps=True,
)

result = pipe(
    "your_audio.wav",
    generate_kwargs={
        "language": "sw",  # Swahili token as the anchor; Kalenjin itself is Nilotic, not Bantu
        "task": "transcribe",
        "num_beams": 5,  # the single biggest decoder win we found
        "compression_ratio_threshold": 1.35,  # cut hallucinated repetition tails
        "no_repeat_ngram_size": 3,
        "repetition_penalty": 1.15,
        "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # OpenAI's fallback schedule
    },
)
print(result["text"])
```

This recipe handles audio of any length, uses beam=5 (~17-WER-point improvement over greedy), suppresses the repetition-tail failure mode, and falls back through temperatures if a chunk's compression ratio spikes (a reliable signal of hallucination).

The numbers in this post's earlier *story-beat* sections used the original sequential or chunked-greedy decoders because that's the order I discovered the fixes in.
With the recommended recipe (chunked + beam=5) — same model, same audio — the canonical eval numbers on the 198-clip unscripted set are **WER 65.56% / CER 21.10%** (source: [`canonical_metrics.json`](https://github.com/tonykipkemboi/whisper-kalenjin-lora/blob/main/04-results/artifacts/canonical_metrics.json)). The verdict table above and the dialect stratification both report this recipe. **Caveats anyone deploying this should hear:** - The model is a research artifact, not a production product. No SLA, no warranty, no guarantee of any specific WER on your audio. - Trained almost entirely on **scripted** read speech. Spontaneous unscripted speech transcribes meaningfully worse (~66% normalized WER, ~21% CER on our eval at the recommended recipe) — usable for "get the gist" but not for high-fidelity transcription of long conversational audio yet. - Dialect coverage measured for **Kipsigis and Nandi** only. Other Kalenjin sub-tribes (Tugen, Keiyo, Marakwet, Sabaot, Pokot, Terik, Sengwer) are partially or not represented in the training data — expect lower accuracy from speakers in those communities. - English regression: the model is roughly 5 WER points worse than base Whisper on English (mostly punctuation drift, not word errors). If you need bilingual ASR, run base Whisper for English audio. - If you're using `whisper.cpp` via `whisper-rs`, don't set `set_single_segment(true)` for audio longer than 30 seconds — that flag bypasses the sliding-window mechanism and was the cause of one specific failure mode I hit when first testing the merged model in a Rust pipeline. If you test it and find something interesting — good or bad, especially as a native Kalenjin speaker — reach out. The metric is a starting point. Your ear is the ground truth. **T4a and T4c are archived as research artifacts**, not deployed publicly. 
The story they tell is in the verdict above: training data composition matters more than fine-tune recipe at this scale, and the two of them serve as documented case studies of how shard-ordering bias shows up in low-resource fine-tunes. Worth keeping for the whitepaper. Not what I'd ship. ### Where this goes next Companion languages in the AfriVoices-KE family (Dholuo, Kikuyu, Maasai, Somali) are the obvious follow-up. The pipeline transfers. Each language has its own data access story, its own orthographic quirks, its own dialect variance. The recipe is public. The work is repeatable. Bigger picture: the gap between "99 languages Whisper knows" and "~6,000 languages humans speak" is not going to close from the top down. It closes from people who speak these languages running experiments on weekends with a small budget. That's a better fix than waiting for a lab to prioritize your language, and it's accessible to anyone who can get access to an open dataset and ~$25 of compute credits. If you speak a low-resource language and have a few hundred hours of transcribed audio, this whole playbook is yours. The training recipe lives on the [HF model card](https://huggingface.co/Tonykip/whisper-kalenjin-v3-turbo), the model is published, the costs are real. The hardest part is the data access, not the training. --- ## Acknowledgements This project does not exist without the [African Next Voices](https://huggingface.co/Anv-ke) team. They built the [AfriVoices-KE](https://arxiv.org/abs/2604.08448) corpus — roughly 3,000 hours of audio across Dholuo, Kikuyu, Kalenjin, Maasai, and Somali, collected from 4,777 native speakers across diverse Kenyan regions — and the Kalenjin slice I trained on is their work. 
Collecting that much quality audio, transcribing it accurately, splitting it into scripted and unscripted modes across eleven domains, and making it available to researchers under CC BY 4.0 is the kind of foundational data work that quietly makes low-resource language AI possible, and it usually goes unthanked. Thank you to the speakers, the transcribers, and the curators. Kalenjin on this model is your data, not mine. I'm just the person who fed it to a LoRA and wrote about the result. The project runs through the **KenCorpus consortium**, funded by the **Bill & Melinda Gates Foundation**: - **Maseno University** — project lead; Dholuo and Somali collection - **Kabarak University** — Kalenjin collection (the data this post is built on) - **United States International University (USIU)** — Maasai collection - **Dedan Kimathi University of Technology (DeKUT) and LDRI** — Kikuyu collection Specific people I want to name: - **[Dr. Lilian D. A. Wanzare](https://wanzare.github.io/)** ([LinkedIn](https://www.linkedin.com/in/liliwanzie/)) — Principal Investigator, Maseno University School of Computing & Informatics, research lead at the Maseno Centre for Applied Artificial Intelligence (MCAAI). Co-founder of KenCorpus. The whole project is hers to lead. - **[Dr. Andrew Kiprop Kipkebut](https://profiles.kabarak.ac.ke/eprofile/150)** — Kabarak University, Department of Computer Science. Led the Kalenjin collection effort. The audio I trained on came through his team. - **Prof. Collins Ouma** — Director of Research & Innovation, Maseno University. And the full list of [AfriVoices-KE](https://arxiv.org/abs/2604.08448) paper authors, all of whom contributed to making this corpus exist: > Lilian Wanzare, Cynthia Amol, Ezekiel Maina, Nelson Odhiambo, Hope Kerubo, Leila Misula, Vivian Oloo, Rennish Mboya, Edwin Onkoba, Edward Ombui, Joseph Muguro, Ciira wa Maina, Andrew Kipkebut, Alfred Omondi Otom, Ian Ndung'u Kang'ethe, Angela Wambui Kanyi, Brian Gichana Omwenga. 
A few other shoulders this stood on: - **OpenAI's Whisper team** for releasing `whisper-large-v3-turbo` under MIT — the base model everything here is built on. - **The HuggingFace PEFT and `transformers` maintainers** — most of this code is one or two lines away from examples they ship. - **Modal** for $30 in starter credits that covered the entire project end-to-end. - The authors of **LoRA** (Hu et al. 2021) and its 2024-2026 successors (**DoRA**, **rsLoRA**, **PiSSA**, **LoRA+**). Their work made parameter-efficient fine-tuning real, and the next iteration of this project will lean on them directly. - **Microsoft Research** for releasing [Paza](https://huggingface.co/microsoft/paza-whisper-large-v3-turbo). Even though the head-to-head didn't land cleanly, their existence sets the ceiling I'm trying to reach. - Every practitioner who posted a Whisper fine-tuning notebook, debug trace, or obscure transformers-version-specific bug report on GitHub. Hours of my life saved by your willingness to document your confusion publicly. ### Dataset citation If you use or reference the AfriVoices-KE Kalenjin corpus, please cite the paper: ``` Wanzare, L., Amol, C., Maina, E., Odhiambo, N., Kerubo, H., Misula, L., Oloo, V., Mboya, R., Onkoba, E., Ombui, E., Muguro, J., wa Maina, C., Kipkebut, A., Otom, A.O., Kang'ethe, I.N., Kanyi, A.W., Omwenga, B.G. (2025). AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages. arXiv:2604.08448. ``` Dataset homepage: [huggingface.co/datasets/Anv-ke/Kalenjin](https://huggingface.co/datasets/Anv-ke/Kalenjin). Project website: [maseno.ac.ke/african-next-voices-workshop](https://www.maseno.ac.ke/african-next-voices-workshop). The dataset is released under **CC BY 4.0** with a gated-access agreement. Please respect the access agreement and request access directly rather than reusing audio from this project's artifacts. This blog post ships only model weights and predictions text — no raw audio from the dataset is redistributed here. 
--- ## Lessons log - **Read the script before you run it.** A dtype mismatch cost $0.02. Not much, but the habit scales. On T2 ($2+) or T3 ($10+) it would have hurt. - **HF `datasets.Dataset.from_list` plus `.map()` mangles binary payloads.** I built a Dataset from rows containing raw WAV bytes, then called `.map()` to decode them in a preprocessing step. The Arrow cache layer silently corrupted the bytes. Soundfile threw `Format not recognised` trying to read them back. The same bytes decoded cleanly outside the datasets pipeline. Fix: ditch `datasets.Dataset` entirely for this workload. Use a plain `torch.utils.data.Dataset` subclass with lazy per-example decoding. Simpler, faster, works. - **DataLoader workers plus soundfile plus pyarrow fork poorly.** `LibsndfileError: ` — the error message itself couldn't be stringified. Classic sign of a C-string that got freed during a process fork. `num_workers=0` sidesteps it cleanly for now. Revisit for scale. - **Don't inherit from `torch.utils.data.Dataset` in a file that Modal parses locally.** Modal CLI evaluates the whole file on your laptop to build the app graph before shipping to the container. A module-level `import torch` or class inheritance from a torch class will crash if your local venv doesn't have torch. DataLoader duck-types any class with `__len__` and `__getitem__`. No inheritance needed. Modal forces you to think about local vs. remote evaluation context in a way other deploy tools don't. - **`load_best_model_at_end=True` is silently broken with PEFT.** The Seq2SeqTrainer reloads from `model.safetensors`; PEFT only saves `adapter_model.safetensors`. In transformers 4.46 this either errors at end-of-training or silently reloads base weights, making final eval meaningless. Either skip this flag and pick the best adapter manually, or write a callback. I skipped it. - **Two reviewers beat one, and one reviewer beats none.** After three failed launches, I ran a code-only review agent. 
Found 6 issues in my pre-review script. Then a doc-cross-referencing review agent. Found 6 MORE issues the first missed, including the `load_best_model_at_end` PEFT bug I'd never have found without reading the actual PEFT changelog. A 200-line training script on a fast-moving stack has more surface area than a single read can cover. - **Whisper's `enable_input_require_grads()` hooks the wrong module.** The PEFT docs tell you to call this for gradient checkpointing compat. It works on text models where the first trainable op going backward is an embedding. On Whisper the first op is a conv (`encoder.conv1`). You need an explicit forward hook on that conv or training will fail with `element 0 of tensors does not require grad`. The fix is a 3-line forward hook. You won't find it in the standard PEFT docs. You need to look at actual Whisper LoRA notebooks (Vaibhavs10/fast-whisper-finetuning was my source). - **`bf16=True` autocast wraps training, NOT `Seq2SeqTrainer.prediction_step`.** Under the hood, `bf16=True` in TrainingArguments enables `torch.autocast(cuda, bf16)` for the training forward/backward. Eval goes through `Seq2SeqTrainer.prediction_step` which calls `model.generate()`, and `generate()` is NOT wrapped in autocast. If your model weights are bf16 and your collator outputs fp32 features, training works (autocast casts inputs) but epoch-end eval crashes with a dtype mismatch at conv1. Fix: subclass Seq2SeqTrainer and wrap `prediction_step` with your own autocast context. Four lines. - **Whisper's `bos_token_id` is not its `decoder_start_token_id`.** Many LoRA fine-tuning cookbooks copy-paste a collator that strips duplicate BOS. For Whisper specifically, `bos_token_id = <|endoftext|>` and `decoder_start_token_id = <|startoftranscript|>`. The strip silently fails if you check the wrong one. Worth a closer look at official HF scripts. - **First samples lie. Always query the metadata column.** The first five random clips I pulled were all Healthcare. 
Twenty lines of Python on the `domain` column showed Healthcare was actually 11% of scripted and 4% of unscripted. Sampling alone, even with a uniform RNG, can land you in a domain pocket, so read the metadata if you can.
- **Language-token choice for unseen languages is a free hyperparameter.** I picked `sw` (Swahili) because Kalenjin speakers in Kenya also speak Swahili, which makes it the nearest regional anchor among Whisper's language tokens (Kalenjin itself is Nilotic while Swahili is Bantu, so the two are not closely related). `en` or `hi` (Hindi, another catch-all) are reasonable alternatives. Worth an A/B once the pipeline works.

---

### Text Measurement Without the DOM

- Published: 2026-03-30
- Category: Developer Tools
- Tags: JavaScript, Developer Tools, Performance
- URL: https://tonykipkemboi.com/blog/pretext-text-measurement

A tiny JS library that calculates text height at any width without touching the DOM. Built by the creator of react-motion.

#### Full Content

via [@_chenglou](https://x.com/_chenglou/status/2037713766205608234)

Came across [pretext](https://github.com/chenglou/pretext) on X this weekend and spent some time playing with it. It solves a specific problem: calculating how tall a block of text will be at a given width, without rendering it in the DOM.

Normally you'd have to put the text in a div, measure `offsetHeight`, and deal with the reflow cost. Pretext does it in pure JavaScript using the browser's font engine via an off-screen canvas.

Two functions. **`prepare(text, font)`** measures all the word segments once. **`layout(prepared, width, lineHeight)`** calculates the height at any width. The prepare step is the expensive one. The layout step runs in about 0.09ms for 500 texts, so you can call it on every resize without thinking about it.

Built by [Cheng Lou](https://github.com/chenglou), who created react-motion. Few KB, supports all languages including RTL and CJK, validated against the Great Gatsby rendered across browsers.

## Try it

Type something, drag the width slider, see the numbers update. Left side is pretext (no DOM).
Right side is the actual rendered text.

## Where this is useful

Virtualized lists where you need item heights before rendering. Canvas or WebGL text where there is no DOM. Preventing layout shift by calculating dimensions before painting. AI-driven UI testing where you want to verify layout without a browser.

Where it is not useful: static sites, server-rendered pages, anything where the text is already in the DOM and you can just read `offsetHeight`.

## Install

```bash
npm install @chenglou/pretext
```

```typescript
import { prepare, layout } from '@chenglou/pretext'

const prepared = prepare('Your text here', '16px sans-serif')
const result = layout(prepared, 400, 24) // width, lineHeight
console.log(result.height) // pixel height
```

Writing this mostly so I remember to reach for it next time I need text measurement in a project. If you are building anything with virtualized lists or canvas text, worth bookmarking.

Source: [github.com/chenglou/pretext](https://github.com/chenglou/pretext)

---

### AI Agent Skills and Plugins Explained (2026)

- Published: 2026-03-27
- Category: AI Agents
- Tags: AI Agents, Agent Skills, Plugins, Enterprise AI
- URL: https://tonykipkemboi.com/blog/agent-skills-and-plugins-explained

Skills are reusable instructions for AI tools. Plugins bundle them for distribution. Almost every major AI tool uses the same format.

#### Full Content

![A recipe box labeled My Skills with colorful recipe cards for Daily Standup, Meeting Prep, Schedule, Get Todos, and Weekly Recall](/blog/skills-and-plugins-hero.jpeg)

Two terms keep coming up in AI tooling conversations: skills and plugins. They sound similar. They are not.

A **skill** is a set of instructions that tells an AI tool how to do one specific task. A **plugin** bundles multiple skills (and other components) into a package you can install and share. Skill is the recipe. Plugin is the cookbook.

Both matter because they change how organizations scale AI adoption.
Instead of every person writing their own prompts from scratch, you write a skill once and everyone uses it.

## What a skill looks like

A skill is a folder with one file: `SKILL.md`. Plain markdown. No programming required.

```yaml
---
name: review-pr
description: Review a pull request for security, tests, and style violations
---

# Pull Request Review

1. Read the diff of the current pull request.
2. Check for common security issues.
3. Verify that new functions have tests.
4. Flag style guide violations.
5. Post review comments on the PR.
```

The top section is metadata: a name and a description. The bottom is instructions the AI follows step by step. That is the entire format. The AI reads the description to decide when to use it. When a skill matches what you asked for, it loads the full instructions. When nothing matches, it handles your request normally. You can also call a skill directly by name.

## What a plugin looks like

A plugin wraps skills and other components into one installable package. It has a manifest file (`plugin.json`) with a name, version, and description. Beyond skills, a plugin can include:

- **Agents**: Specialized AI assistants tuned for specific jobs
- **Hooks**: Automated actions triggered by events (like running a linter after every edit)
- **MCP servers**: Connections to external tools like Slack, Jira, or a database
- **Commands**: Custom slash commands that kick off workflows

You install a plugin with a single command. Everything inside it, including the external tool connections, comes with it. No separate setup.

## When to use a skill vs a plugin

**Use a skill** when you have one repeatable workflow. Deploying to staging. Reviewing a PR. Prepping for a sales call. Writing a status update in a specific format. If the task is self-contained and the instructions fit in one file, a skill is enough.
**Use a plugin** when you have a collection of related skills that should travel together, or when the workflow needs external tool connections. A "sales enablement" plugin might bundle skills for call prep, account research, and deal analysis, plus an MCP connection to your CRM. The plugin keeps all of that packaged and versioned as one thing. **Use neither** when the task is genuinely one-off. Not everything needs to be a skill. If you are only going to do it once, just ask the AI directly. ## The format is universal This is the part worth knowing. Almost every major AI tool adopted the same `SKILL.md` format. [Claude Code](https://code.claude.com) (Anthropic), [Codex CLI](https://developers.openai.com/codex) (OpenAI), [Gemini CLI](https://geminicli.com) (Google), [GitHub Copilot](https://github.com/features/copilot), [Cursor](https://cursor.com), and [Windsurf](https://windsurf.com) all read the same files. Write a skill once and it works across all of them. Anthropic released the [Agent Skills specification](https://agentskills.io) as an open standard in December 2025. OpenAI and Google adopted it shortly after. All three co-founded the [Agentic AI Foundation](https://www.linuxfoundation.org/press/linux-foundation-announces-the-formation-of-the-agentic-ai-foundation) under the Linux Foundation to govern the standard. Over 30 tools support it now. **The reason it spread fast is that skills are just markdown.** No binary format, no compilation, no runtime. The barrier to adoption is zero. Some tools that do not support the shared format (Amazon Q, JetBrains AI, Aider, Continue.dev) have their own customization systems, but those are siloed. You cannot take them across providers. ## How companies distribute them Every provider supports a similar pattern for pushing skills and plugins to an organization. An admin creates a central repository (usually on GitHub) with a list of approved plugins. Employees get prompted to install them when they open a project. 
Admins control what gets installed: available for self-service, installed by default, required, or hidden for staging. Admins can also lock down which sources employees are allowed to install from. This matters because **[12-20% of skills on at least one public marketplace were found to be malicious](https://snyk.io/blog/toxicskills-malicious-ai-agent-skills-clawhub/)** in security audits. I wrote more about [the broader AI agent security landscape](/blog/ai-agent-security) separately. The governance model across all providers is the same: anyone can create a skill locally, sharing goes through version control, and org-wide deployment requires admin privileges. Low friction to experiment. Controlled friction to distribute. ## How skills connect to external tools Skills tell the AI what to do. But a skill that says "post the result to Slack" needs a way to actually reach Slack. There are a few ways to wire that up. Direct API calls work if your skill includes a script that hits an endpoint. Some plugins bundle custom tool integrations that call APIs directly. But the approach gaining the most traction is [MCP](https://modelcontextprotocol.io) (Model Context Protocol), an open standard for connecting AI tools to external services. Every major AI tool supports it now, and MCP servers can be bundled inside plugins so they install automatically. MCP is not the only option, but it is becoming the default because it is standardized. You write the connection once and it works across providers, rather than building separate integrations for each tool. 
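
To make the direct API route concrete, here is a minimal Python sketch of the kind of script a skill could bundle to post a result to Slack. It assumes a Slack incoming webhook, which accepts a JSON body with a `text` field. The webhook URL and the function names (`build_slack_request`, `post_to_slack`) are hypothetical placeholders of mine, not part of any skill specification.

```python
import json
import urllib.request

# Hypothetical placeholder. A real incoming-webhook URL is issued per Slack workspace.
WEBHOOK_URL = "https://hooks.slack.com/services/EXAMPLE/PLACEHOLDER"


def build_slack_request(text: str, webhook_url: str = WEBHOOK_URL) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a Slack message."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_to_slack(text: str, webhook_url: str) -> int:
    """Send the message and return the HTTP status code."""
    request = build_slack_request(text, webhook_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    # Only inspect the request here; call post_to_slack() with a real URL to send it.
    req = build_slack_request("Weekly report is ready")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

A script like this works, but it is tied to one service and one payload shape. That per-tool wiring is the part an MCP server is meant to take over.
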
## The security side

Three numbers worth knowing:

- **12-20%** of skills on [ClawHub](https://openclaw.com) (a public marketplace) were [found to be malicious](https://snyk.io/blog/toxicskills-malicious-ai-agent-skills-clawhub/)
- **1 in 8** enterprise security incidents now involve an AI agent system ([CrowdStrike 2025](https://www.crowdstrike.com/en-us/global-threat-report/))
- **78%** of compromised agents had broader permissions than they needed

The [OWASP Foundation](https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/) published a top 10 for AI agent security. The top threats: goal hijacking, tool misuse, over-permissioned agents, supply chain attacks through malicious plugins, and unexpected code execution.

The practical version: install skills from trusted sources only, keep permissions tight, review skills like you review code, and pin your versions.

## Where this is going

Only [20% of companies](https://www.deloitte.com/us/en/what-we-do/capabilities/applied-artificial-intelligence/content/state-of-ai-in-the-enterprise.html) have a mature governance model for AI agents. The skills ecosystem went from nothing to hundreds of thousands of indexed skills in under four months. Over 30 products adopted the same format in the same period. 60,000+ GitHub repos use the cross-tool AGENTS.md instruction standard.

**The companies that figure out how to create, share, and govern skills are going to move faster than those still asking every person to write their own prompts.** The format is markdown. The barrier is low. The hardest part is identifying which workflows are worth standardizing, and the people doing the work already know the answer to that.

---

### The Case for One Company Harness (Not More Agents)

- Published: 2026-03-13
- Category: AI Agents
- Tags: AI Agents, Enterprise AI, Claude Code, Cowork, Agent Harness, Skills
- URL: https://tonykipkemboi.com/blog/one-company-harness

You do not need a new agent for every use case.
You need a harness flexible enough to serve different roles, with skills as the configuration layer.

#### Full Content

![A Swiss army knife standing upright surrounded by scattered individual tools, illustrating one harness replacing many single-purpose agents](/blog/one-company-harness-hero.png)

I spend most of my time figuring out what needs automating inside the company. Stakeholders come to me with a workflow that's slow or manual or both, and we figure out whether to build or buy. Most of the time I end up building custom agents. Internal-facing, always aimed at making employees faster and automating the boring stuff.

I started doing this late last year. Custom agents were the default for anything interesting. And for a while it made sense.

The work starts before any code gets written. I meet with the person or team that owns the workflow and have them walk me through every single thing they do manually. We turn that into a PRD including an initial eval dataset. Sounds straightforward but it never is. People forget steps they take unconsciously because they've done them a thousand times. That context lives entirely in their heads. Unless you pull it out and put it somewhere the agent can access, the agent fails. Not because the agent is bad but because it's missing pieces that nobody thought to write down.

**This is one of the silent killers of agent projects.** You can bolt on every guardrail you want but if the context isn't there the agent is guessing. Getting that knowledge transfer right is the actual hard part.

I came from the agent framework world so naturally that's where I started. But I felt the constraints almost immediately. You inherit the framework's opinions and sometimes those don't match what you need. I wrote about this shift on my [blog](/blog/agent-frameworks-getting-squeezed) if you're curious why. I tried a couple of SDKs and APIs, and landed on one that fit the deployment environment.
Agent backend on a cloud platform, chat frontend as the interface, observability bolted on for tracing and evals. The whole setup was solid for what it needed to be at the time. ## Then December 2025 happened New model capabilities dropped with Opus 4.6 and everything shifted. Reasoning and tool use got dramatically better. Then [Cowork](https://claude.ai) launched with [skills and plugins](/blog/agent-skills-and-plugins-explained). A skill is a single recipe for the model to execute a given workflow. A plugin is the cookbook: a collection of recipes you pull from depending on what you need. That changed the math on everything I'd already built. The custom agents were rigid and slow to ship new features. Meanwhile the prompts, workflows, and data transformations I'd already written could be packaged as skills and suddenly scale in ways the custom agent couldn't. ## The pivot Most of what lived inside my custom agents could be converted into skills. The prompts, the output logic, even custom scripts, all expressible as a skill or bundled into a plugin. [MCP](https://modelcontextprotocol.io) connectors let me bolt on new capabilities on demand without rewriting any code. Slack messages now send directly from the Claude harness, instead of needing a one-off integration. I can also set up scheduled tasks once and have them run on a regular cadence as long as my laptop stays on. It's not the most efficient setup, but it works for now. **Every custom integration I create is something I'm on the hook to maintain indefinitely.** Skills and plugins move that maintenance burden onto the platform instead. ## The observability tradeoff On a custom agent, debugging a workflow, for example a bad sales forecast, means pulling a full trace, stepping through all the tool calls, and hunting for the one prompt or retrieval that went sideways. 
On Cowork, the same person tweaks the skill prompt, reruns it on a handful of deals, and sees within minutes whether the numbers finally match their spreadsheet. That's a great trade for content, ops, and reporting workflows but it's a bad one for payments, compliance reviews, or anything you may need to audit in detail later. ## This isn't for everyone What I'm describing does not apply to every organization, industry, or use case. If your company already has the license and native connectors, teams like sales and marketing can build playbooks as skills and start moving faster immediately. But regulated industries, strict data residency, compliance-heavy environments: those aren't migrating to Cowork and calling it a day. Custom agents still make sense when the workflow needs deep proprietary integration, when you need granular control over every agent decision, or when audit requirements demand full stack ownership. What's changed is the bar. **The default should now be to ask whether an agent skill can handle it first.** ## Case for the harness [Claude Code](https://code.claude.com) started as a developer tool. Terminal-based, built for engineers. [A harness, not a framework](/blog/agent-frameworks-vs-harnesses). But people used it for things well beyond coding. Anthropic saw that and built a new harness, Cowork, for the non-developers who don't want a terminal. Same model. Different harness. Different audience. That's the mental model. **You don't need a new agent for every use case. You need a harness flexible enough to serve different roles** and robust enough to support skills, plugins, scheduled tasks, and whatever comes next. The harness is the product. Skills and plugins are the configuration layer. The harness should get smaller over time. As models improve, things currently in the harness get trained into the model itself. So, theoretically, your harness code should be shrinking as more capable models are released. Prompts get shorter. 
Custom tools get replaced by native capabilities. The harness compresses toward the minimum viable wrapper around whatever the model can't yet do on its own.

A company-owned harness also serves as a backup. It can swap models if a provider goes down. No single provider should be a single point of failure.

## The people topic

There are too many platforms. Every new tool is another tab, another login. Even for me it's a lot. The adoption unlock is bringing the workflow to where people already are (for example [Slack](https://slack.com)) rather than guiding them to yet another interface. Friction kills adoption.

This is another argument for the single harness. One interface. The harness routes to the right skills and data behind the scenes. The user doesn't need to know which model is running or which plugin is active. They describe what they need and the harness figures it out.

Everything here is a point-in-time view. Most of the specific tooling will look different in a few months. That's the job right now. You build while the ground moves and stay ready to throw things away when something better shows up.

---

### Agent Frameworks Are Getting Squeezed

- Published: 2026-03-02
- Category: Industry Insights
- Tags: AI Agents, Agent Frameworks, Enterprise AI, Automation, AI Infrastructure
- URL: https://tonykipkemboi.com/blog/agent-frameworks-getting-squeezed

Agent frameworks emerged in the gap between when models got good enough and when infrastructure caught up. That gap is closing from both directions.

#### Full Content

![Timeline of agent frameworks from 2020 to 2026, showing the evolution from GPT-3 through framework genesis, explosion, enterprise push, the squeeze, and platform consolidation](/blog/agent-frameworks-timeline.png)

When you look at what most agent frameworks actually do, it's workflow orchestration. You define tasks, chain them together, route data between steps, add conditional logic, call external APIs.
The core mechanics look familiar because we've been doing this with automation platforms for over a decade.

Agent frameworks emerged in 2023 when models became capable enough to reliably use tools and reason through multi-step tasks. They were built around LLMs and reasoning as first-class primitives. Automation platforms like [Zapier](https://zapier.com) and [Make](https://www.make.com) had been built around apps and triggers since the early 2010s. Both solve the same fundamental problem: coordinating work across multiple systems.

For about two years, agent frameworks had a clear opening. Models were good enough to be useful, but the existing automation platforms hadn't adapted yet. Frameworks like [LangChain](https://www.langchain.com), [AutoGen](https://github.com/microsoft/autogen), [CrewAI](https://www.crewai.com), and others filled that gap with developer-friendly tools for building agentic workflows.

By mid-2025 and into 2026, the market started closing in from both directions.

## From Above: AI Labs

More capable models have been released, and products like [Claude Desktop](https://claude.ai) now ship with Computer Use and multiple connectors. With something like Claude Cowork, you can connect to your data sources, spin up sub-agents for specific tasks, schedule tasks, and orchestrate everything from one command center. It runs on your desktop. The model, the orchestration, and the integrations all come from one place.

[OpenAI](https://openai.com) is building similar capabilities, and probably more with the recent acqui-hire of the [OpenClaw](https://openclaw.com) founder. So is [Google with Gemini](https://gemini.google.com). **The AI labs aren't just providing models anymore. They're providing the entire agent runtime.**

For enterprises, this matters. Most are already paying for these services. They have licenses for their employees to use Claude, OpenAI, or Gemini, and in most instances all of them are provisioned.
Why layer in a separate orchestration framework when the lab that built the model also built the agent infrastructure? The integration is tighter. The debugging is easier. The responsibility is clearer. And there's no additional vendor to manage or budget line to justify. The decision point is becoming harder to justify. Why add another framework to do something the existing tooling already handles? That cost conversation gets difficult fast, especially in enterprises where every new tool needs security review, procurement approval, and ongoing maintenance. Then came plugins and skills. Claude launched [agent skills and plugins](/blog/agent-skills-and-plugins-explained) that let organizations build and share domain-specific capabilities. Finance plugins. Legal plugins. Productivity plugins. You can add skills for evaluating NDAs, verifying contracts, processing specific workflows. These are shareable across the organization and built directly into the platform employees are already using. This hit the market hard. The announcement affected valuations for vertical AI companies because, in my opinion, it changed the pricing conversation. Enterprises can now argue they can do most of what specialized tools offer using Claude. That doesn't replace those companies outright, but it compresses what they can charge. Revenue expectations shift when the baseline capability is free with an existing license. ## From Below: Automation Platforms [Zapier](https://zapier.com) launched Agents. [Make](https://www.make.com) launched AI Agents. [UiPath](https://www.uipath.com) calls it Agentic Automation. They already had thousands of pre-built connectors, OAuth handling, permission management, and enterprise governance. They just needed to add reasoning on top. And they did. These platforms spent over a decade building integration infrastructure. 
Adding LLM-based reasoning to existing workflow orchestration is straightforward compared to building thousands of enterprise integrations from scratch. ## The Vendor Lock-In Question The strongest case for agent frameworks is model-agnostic flexibility. Build once, swap providers with a config change. No lock-in to a single lab's ecosystem. Recent events show why this matters. The Pentagon just designated Anthropic a supply chain risk over a dispute about autonomous weapons and surveillance guardrails. The company lost its $200 million contract, and military contractors can no longer use Claude for defense work. The situation is fluid and contained to government contracts for now, but it demonstrates the risk of platform dependence. What happens when an enterprise decides they can't use a specific lab anymore? Policy disagreements, pricing changes, compliance requirements. If you built everything on Claude or a similar single-vendor platform, you're ripping out infrastructure. If you built on a framework with swappable model providers, you're changing a config file. That's real value. But it's not the moat frameworks think it is. OpenAI [launched Frontier](https://openai.com/index/introducing-openai-frontier/) in February. An enterprise platform for building and managing AI agents with integrated access to business systems, data warehouses, and internal apps. It's open to agents built outside OpenAI's ecosystem. It has governance, permissions, and compliance tooling. ![OpenAI Frontier architecture showing interfaces, agents, evaluation, execution, and business context layers with enterprise security and governance](/blog/openai-frontier-architecture.png) It's OpenAI's bid to become "the operating system of the enterprise." And it directly addresses the vendor lock-in concern by positioning itself as a control plane that can work across providers. Google will build their version. Microsoft already has paths through Azure. 
You can bet all the big labs are working hard to capture this enterprise market. They're all building platform layers that reduce single-vendor risk while keeping you in their ecosystem. The competition between labs actually gives enterprises options. Each lab will have different policies, different pricing, different compliance stances. That diversity is its own form of protection against lock-in. And most enterprises would rather manage relationships with two or three major labs than maintain a separate orchestration framework. Frameworks still have a role for teams that need code-level control or specific orchestration patterns. But the vendor lock-in argument gets weaker when the labs themselves are building multi-provider management platforms. The freedom frameworks offer comes with its own dependencies on integration layers, observability tools, and ecosystem partners. True portability requires discipline at the architecture level, not just picking the right vendor. And most enterprises will bet on the labs that own the models rather than add another layer to maintain. ## The Middle Collapses Agent frameworks emerged in the gap between when models got good enough to be useful (2023) and when the infrastructure caught up (2025-2026). That gap is closing. The integration disadvantage that frameworks faced in 2023 and 2024 is gone. Companies like [Composio](https://composio.dev) and [Arcade.dev](https://arcade.dev) built integration layers specifically for agents. Most frameworks now use these external tool companies for connections. But solving integrations doesn't solve the squeeze. **AI labs are building down into orchestration. Automation platforms are building up into reasoning.** Agent frameworks are in the middle of a compression event from both sides. Frameworks still have an architectural advantage. They were designed with agents as the default primitive from day one. The developer experience is built for agentic workflows. 
That matters for prototyping and experimentation. As the underlying technology improves, architectural advantages compress. Models get better at reasoning and tool use. The abstraction layer matters less. The orchestration patterns start looking similar regardless of where they come from. There's also a learning curve problem. Agent frameworks are opinionated. You have to learn their language, understand how they structure things, adapt to their patterns. That's friction. Compare that to going to Claude and letting it figure things out. The path of least resistance wins in enterprise adoption. ## Who Actually Uses Agent Frameworks? Fortune 500 companies might experiment with agent frameworks. Some are using them now. But there's a shelf life to that adoption. The cost justification gets harder when AI labs and automation platforms fill the capability gaps. The companies that stick with agent frameworks long-term are primarily consultancies. Large system integrators and boutique AI consulting firms build on these frameworks to deliver custom solutions faster than building from scratch. They white-label agentic transformation for enterprise clients, maintaining ongoing engagements through customization and integration work. That's a real market, but it's narrower than the original total addressable market frameworks were pitching. Consultancies are intermediaries, not end customers. And they'll switch frameworks as easily as they switch any other tooling if something better comes along. ## Where Frameworks Go From Here Historically in infrastructure, value concentrates around integration points and operational tooling, not orchestration patterns. Orchestration logic is portable. Integrations are becoming portable too. The moats frameworks thought they had are evaporating. Agent frameworks had a moment between when models got good enough and when the infrastructure caught up. That window is closing. 
What's left is open source projects maintained by communities and niche tools for teams that need control over convenience. Some will pivot to agent management services for consultancies and small shops but most will settle into being developer tools for prototyping before production deployment elsewhere. Both paths are arguably profitable, but neither is the venture-scale platform play frameworks pitched in 2023. --- ### Agent Frameworks vs Agent Harnesses - Published: 2026-02-27 - Category: AI Agents - Tags: AI Agents, Agent Frameworks, Developer Tools, LangChain, CrewAI - URL: https://tonykipkemboi.com/blog/agent-frameworks-vs-harnesses Frameworks give you building blocks. Harnesses give you a complete system. The distinction matters if you are building agents. #### Full Content ![Spectrum showing Raw Code on the left (least opinionated), Agent Frameworks in the middle (modular, composable, swappable), and Agent Harnesses on the right (most opinionated, batteries included)](/blog/frameworks-vs-harnesses-spectrum.png) If you've been hearing "agent framework" and "agent harness" thrown around and can't tell the difference, you're probably not alone. The terms sound interchangeable but they're not. I worked at [CrewAI](https://www.crewai.com), which is an agent framework, so I have a sense of where the boundaries are. I also wrote about [the market compression squeezing frameworks](/blog/agent-frameworks-getting-squeezed) from both directions. Agent frameworks and agent harnesses sit at **different points on a spectrum of opinionation**. Understanding where they sit matters if you're building agents, because it changes what you're responsible for and what the tool handles for you. If you put agent development on a line, raw code with no abstractions sits on the far left. You're calling APIs directly, managing state yourself, building every piece from scratch. Total flexibility, total responsibility. Agent frameworks sit in the middle. 
They give you structure and abstractions, but you still make a lot of decisions. You pick the memory system, you configure the tools, you define the orchestration logic. The framework has opinions about how things should connect, but it's modular. You can swap components. Agent harnesses sit on the far right. They're maximally opinionated. Everything is baked in. You add your API keys, maybe point it at a few tools, and it runs. Memory, context management, the agent loop, safety checks. All of that is decided for you. A framework gives you abstractions for building agents. You define roles, tasks, tools. You specify how agents coordinate, whether they work sequentially or hierarchically. The framework handles the plumbing. Calling the LLM, routing tool outputs, managing the execution loop. But you're still making architectural decisions. The framework is opinionated about what the building blocks look like. It has a memory abstraction, a tool interface, a task structure. But those pieces are swappable. If you don't like the default memory implementation, you can plug in your own. If you want to use a different LLM provider, you configure it. The framework gives you a standard interface, but you're still composing the system. That modularity is the point. Frameworks are built for people who want to build agents, not just use them. You're expected to understand how the pieces fit together, because you're the one deciding which pieces to use. A harness doesn't give you building blocks. **It gives you a complete system.** The best recent example is [OpenClaw](https://openclaw.com), which went viral a few weeks ago. It's a harness. You download it, add your API keys, and suddenly you have an agent you can chat with on WhatsApp, Telegram, and other platforms. Memory is handled. Context management is handled. The agent loop is handled. Tool calling, permissions, state persistence. All of it is built in. You're not configuring a memory system. 
You're not deciding how tools get registered or how the agent recovers from errors. Those decisions were made by whoever built the harness. Your job is to point it at a task and let it run. That's the tradeoff. You get something that works immediately, but you don't get to change how it works under the hood. The harness has an opinion about everything, and you're accepting that opinion when you use it. The spectrum matters because it maps to different problems. If you're prototyping, experimenting, or building something custom, you want a framework. You need the flexibility to swap components, test different approaches, and control the details. The framework gives you structure without locking you in. If you need something that works now, reliably, for a specific use case, you want a harness. You're trading control for speed. The harness has already solved the hard problems. Context management, durable execution, error recovery. You're just using the solution. **Frameworks are for builders. Harnesses are for users.** That doesn't mean one is better. It means they're solving different problems. What you're solving determines which one you need. The line isn't always clean, and I'm not sure it should be. Some frameworks are adding harness-like features. [LangChain](https://www.langchain.com) is a good example. They released [Deep Agents](https://blog.langchain.com/deep-agents/), which they explicitly call an "agent harness" that sits on top of their framework. It comes with built-in planning tools, file system access for context management, subagent spawning, and memory persistence. You're still using LangChain under the hood, but Deep Agents gives you batteries-included defaults so you don't have to wire everything together yourself. LangChain actually [distinguishes between three layers](https://blog.langchain.com/agent-frameworks-runtimes-and-harnesses-oh-my/) in their own stack. LangChain (the original library) is the framework. 
[LangGraph](https://www.langchain.com/langgraph) is what they call the "agent runtime," which handles execution, state management, and durability. Deep Agents is the harness that sits on top of both. That's one company spanning the entire spectrum. Framework for composing agents, runtime for executing them reliably, harness for using them out of the box. That's a framework company moving right on the spectrum. Deep Agents is still modular. You can swap backends, configure tools, adjust prompts. But it gives you a working system without requiring you to assemble every piece. On the flip side, harnesses aren't as locked down as they might sound. Take [OpenClaw](https://openclaw.com). It's maximally opinionated out of the box, but if you download the source code, you can swap implementations. You can change how memory works, adjust the agent loop, modify tool handling. It's just that most people won't, because the default already works. The distinction is about what's already decided when you start. **A harness ships with decisions baked in. A framework ships with options exposed.** If you're using a harness, you're accepting most of those decisions and configuring around the edges. If you're using a framework, you're making those decisions yourself and assembling the system. What you're solving determines which one you need. Sometimes you need to bypass the agent frameworks entirely and build a simple [ReAct agent](https://arxiv.org/pdf/2210.03629) using the model endpoints directly. How much you want already built determines which one you pick. --- ### Be a Generalist (if you have to choose) - Published: 2026-02-26 - Category: Career - Tags: Career, AI Agents, Generalist, Enterprise AI - URL: https://tonykipkemboi.com/blog/be-a-generalist AI got really good at going deep. What it cannot do is range across domains and know which room to open. That is the generalist advantage. 
#### Full Content ![A person standing on a ridge at golden hour, looking out over a landscape where desert, farmland, forest, coastline, and city skyline all converge](/blog/be-a-generalist-hero.jpg) There's a version of career advice that has felt true for a long time. Get deep in one thing. Become the expert. Own a vertical so completely that you become the person people call. The T-shape model, broad surface with one deep spike, has been the framework for a while now. That advice isn't wrong. But I think the T is shifting. AI got really good at going deep. Not at everything, and not perfectly, but direct an agent toward a specific problem with enough context and it will outpace most generalists and keep pace with many specialists. It can write the SQL, parse the research paper, draft the legal summary, generate the code. **Depth, for a lot of tasks, is becoming a commodity.** What AI is genuinely not good at is ranging. Connecting a pattern from genomics to data engineering to agent orchestration and knowing why that matters. Holding five different mental models at once and deciding which one applies. That's not a training problem. That's a life problem. And it's one generalists have been quietly solving for years. A generalist carries a lot of surface area. They might have a foot of understanding across a dozen domains, not deep, but real. Enough to have a conversation, ask the right question, recognize when something's off. With AI, that foot becomes a foundation. You point the model at the vertical, give it direction, and now you're operating with something closer to depth, on demand, across all of them. A specialist with a narrow spike is still valuable. They can verify. They can catch what the model gets wrong. That matters. But they're working in one room. The generalist is moving between rooms and knows which ones to open. I didn't plan to be a generalist. I don't think anyone does. I grew up in Kenya, came to the U.S. 
for college, joined the Army, spent years in medical labs and genomics research, taught myself Python from a book someone handed me, enrolled in a CS program at Penn, became a data engineer, then a developer advocate, and now I build agentic systems in production. None of that was a strategy. It was just following what felt interesting at each turn. But the through line is surface area. Each of those environments gave me a different mental model. The lab gave me rigor, you don't trust output you haven't validated. The Army gave me structure, define the mission, assign the resources, execute, debrief. Growing up in a different country gave me something harder to name, a kind of distance from assumptions that most people carry without knowing it. You learn to observe a system before you assume you understand it. All of that is what I draw on now when I'm working with AI systems. Not any single deep expertise. The range. So the question isn't whether you should be a generalist or a specialist. Most people don't get to choose cleanly anyway. Life makes you both things at different times. The better question is what have you already lived through that you're not counting. The career change you made five years ago. The industry you left. The country you grew up in. The job that felt like a detour. Every one of those things changed how you see, and that's what you're actually working with now. For a long time, the quiet anxiety of being a generalist was feeling one step away from being found out when something got hard enough. That you needed to pick a lane, go deep, stop moving around. That breadth was a liability. That's inverting. The depth you need on any given problem is increasingly something you can get to. **The range you've built over a lifetime isn't.** The people who are going to build the most interesting things over the next few years aren't necessarily the ones who know one domain the best. 
They're the ones who can hold multiple domains at once, move between them, and point the right tools at the right problems. There's a good chance that's already who you are, and you just haven't had a reason to call it an advantage until now. --- ### Securing the AI Frontier - Published: 2025-05-01 - Category: Security - Tags: AI Security, AI Agents, Enterprise AI, Cybersecurity, Prompt Injection - URL: https://tonykipkemboi.com/blog/ai-agent-security AI agents are amplifying the need for AI security. The global AI cybersecurity market is projected to reach $135 billion by 2030. #### Full Content ![A brass key with circuit board trace patterns etched into its teeth, lying on a light wooden surface](/blog/ai-agent-security-hero.jpg) My prediction: **_"2025 is the year of AI agents but 2026 will be the year of AI security."_** We're almost halfway through 2025 and AI agents are already in production! The next natural evolution is security threats and reports of hacks, because hackers love exploiting productionized products, especially new and innovative ones. Most cybersecurity companies are already positioning themselves as AI security providers. Consolidation has started and will continue for the rest of this year. Palo Alto Networks just acquired Protect AI for over [$500 million](https://www.geekwire.com/2025/palo-alto-networks-acquires-protect-ai/), highlighting how crucial AI security has become, if that wasn't already clear. The global AI cybersecurity market is expanding dramatically, from $25 billion in 2023 to a projected [$135 billion by 2030](https://lakera.ai/reports/security-trends-2024). Startups and established cybersecurity firms alike are attracting significant investment, positioning themselves as essential providers of AI security solutions. LLMs are known to have security vulnerabilities, and AI agents will be no exception. In fact, AI agents magnify the risk of these vulnerabilities.
## Unique Threats in AI Agent Security AI agents introduce specific vulnerabilities beyond traditional cybersecurity risks: - **Data Poisoning**: Attackers deliberately corrupt AI training datasets, resulting in incorrect or malicious outcomes. - **Prompt Injection**: Adversaries manipulate AI inputs to bypass security controls and cause unintended disclosures. - **Model Theft**: Proprietary AI models and critical business data risk being stolen, potentially leading to severe competitive disadvantages. - **Tool Misuse**: Attackers can manipulate AI agents to misuse their integrated tools, potentially triggering harmful actions or exploiting vulnerabilities. - **Credential Leakage**: Exposed service tokens or secrets can lead to impersonation, privilege escalation, or infrastructure compromise. - **Unauthorized Code Execution**: Unsecured code interpreters in AI agents can expose systems to arbitrary code execution and unauthorized access. These are not hypothetical risks. High-profile incidents, including [Microsoft's Bing Chat revealing sensitive internal rules](https://www.zdnet.com/article/microsofts-chatgpt-powered-bing-reveals-its-codename-and-rules-and-argues-with-users/) and [financial frauds involving deepfake technologies](https://www.fincen.gov/news/news-releases/fincen-issues-alert-fraud-schemes-involving-deepfake-media-targeting-financial), have demonstrated the real-world impact of these threats. ## The Need for Secure AI Agents As AI agents become more autonomous and powerful, they require specialized security approaches which involve addressing multiple layers: - The foundation model itself - The agent framework - The tools and integrations - The runtime environment - The data being processed Each layer presents unique challenges and requires specific security controls. I wrote a deeper dive on [how RBAC and identity management apply to agents](/blog/agent-authentication-rbac) separately. 
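As a rough illustration of a control at the tools-and-integrations layer, the sketch below wraps tool calls in a per-agent allowlist, so a manipulated agent cannot invoke a tool it was never granted. All names here (`ToolGuard`, `ToolNotAllowed`, and the two example tools) are hypothetical and not taken from any specific framework. Treat it as a minimal sketch of one mitigation for tool misuse, not a complete defense against prompt injection.

```python
from typing import Any, Callable, Dict, Set


class ToolNotAllowed(Exception):
    """Raised when an agent requests a tool outside its allowlist."""


class ToolGuard:
    """Hypothetical wrapper that only dispatches allowlisted tools."""

    def __init__(self, tools: Dict[str, Callable[..., Any]], allowed: Set[str]):
        self._tools = tools
        self._allowed = allowed

    def call(self, name: str, **kwargs: Any) -> Any:
        # Deny by default: unknown or non-allowlisted tools never run.
        if name not in self._allowed or name not in self._tools:
            raise ToolNotAllowed(f"tool '{name}' is not permitted for this agent")
        return self._tools[name](**kwargs)


# Illustrative tools; real ones would call external services.
def search_docs(query: str) -> str:
    return f"results for {query}"


def delete_records(table: str) -> str:
    return f"deleted all rows in {table}"


guard = ToolGuard(
    tools={"search_docs": search_docs, "delete_records": delete_records},
    allowed={"search_docs"},
)

print(guard.call("search_docs", query="agent security"))

try:
    # Simulates an injected instruction asking for a destructive tool.
    guard.call("delete_records", table="customers")
except ToolNotAllowed as err:
    print(f"blocked: {err}")
```

The design choice worth noting is deny-by-default: the guard checks the allowlist before dispatch, so adding a new tool to the registry does not silently grant it to every agent.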
## AI Agent Security Startups Worth Watching **[HiddenLayer](https://hiddenlayer.com)** - Founded: 2019 - Funding: $50M Series A (2023) - Unique Approach: ML security platform with automated threat detection **[Protect AI](https://protectai.com) (acquired by [Palo Alto Networks](https://paloaltonetworks.com) in 2025)** - Founded: 2022 - Funding: $108.5M - Unique Approach: Secures ML supply chains and DevSecOps integration **[Robust Intelligence](https://robustintelligence.com) (acquired by [Cisco](https://cisco.com) in 2024)** - Founded: 2019 - Funding: $53M - Unique Approach: AI firewall and proactive model validation **[Lakera](https://lakera.ai)** - Founded: 2021 - Funding: $20M Series A (2024) - Unique Approach: Real-time security for generative AI applications **[CalypsoAI](https://calypsoai.com)** - Founded: 2018 - Funding: $23M Series A1 - Unique Approach: Validates AI model safety and continuous monitoring **[Adversa AI](https://adversa.ai)** - Founded: 2019 - Funding: $8M - Unique Approach: Focuses on adversarial robustness and protection against ML model attacks **[Stytch](https://stytch.com)** - Founded: 2020 - Funding: $123M (Series B in 2022) - Unique Approach: Passwordless authentication platform enhancing security for AI applications **Note**: This list is not exhaustive. I'm sure I missed some. Feel free to [reach out to me](https://linkedin.com/in/tonykipkemboi) if you wish to add any. --- ### RBAC for AI Agents: Identity, Permissions, and Access Control - Published: 2025-04-15 - Category: Security - Tags: RBAC, AI Agents, Authentication, Access Control, Identity Management, Enterprise AI - URL: https://tonykipkemboi.com/blog/agent-authentication-rbac As AI agents become digital workers, organizations must rethink identity, access, and permissions for non-human actors. Here's why agent authentication and fine-grained RBAC will define the next era of AI adoption. 
#### Full Content ![Five doors of increasing size, each with a different lock type, representing graduated access levels](/blog/rbac-agents-hero.jpg) As AI agents move from proof-of-concept into production, a new set of challenges is emerging. One of them is identity and access management for these non-human actors. Today, every employee at a company gets a user profile, a set of credentials, and carefully scoped permissions, often managed by sophisticated RBAC (role-based access control) systems. But what about the AI agents that are now reading your emails, updating your CRM, or querying your database? If agents are to become true digital workers, they'll need to be treated like employees: with profiles, audit trails, and, critically, permissions. Otherwise, we risk creating a shadow workforce with no accountability, no oversight, and massive [security risks](/blog/ai-agent-security). ## Why Agent Identity Matters Agents increasingly act on behalf of users, teams, or entire organizations. If agents are **_anonymous_** or **_over-permissioned_**, they become a new vector for data leaks, fraud, and compliance failures. Just as with human employees, we need to know: who did what, when, and why? TL;DR, these are the reasons why agent identity and access management are critical: - **Audit Trails:** Every agent action should be traceable. - **Accountability:** Agents must operate within clear boundaries. - **Compliance:** Regulations may soon require agent identity management.
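To make the "who did what, when, and why" idea concrete, here is a minimal sketch of an append-only audit record for agent actions. The `AuditEvent` and `AuditLog` names, their fields, and the example agent IDs are assumptions for illustration, not part of any particular product or standard. A production system would likely write to durable, tamper-evident storage rather than an in-memory list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class AuditEvent:
    """One traceable agent action (hypothetical schema)."""
    agent_id: str   # who
    action: str     # did what
    resource: str   # to which resource
    reason: str     # why (the task or request that triggered it)
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)  # when
    )


class AuditLog:
    """In-memory, append-only log used only for illustration."""

    def __init__(self) -> None:
        self._events: List[AuditEvent] = []

    def record(self, agent_id: str, action: str, resource: str, reason: str) -> AuditEvent:
        event = AuditEvent(agent_id, action, resource, reason)
        self._events.append(event)
        return event

    def by_agent(self, agent_id: str) -> List[AuditEvent]:
        # Answers "what did this agent do?" during a review.
        return [e for e in self._events if e.agent_id == agent_id]


log = AuditLog()
log.record("sales-agent-01", "read", "crm/leads", "weekly pipeline summary")
log.record("sales-agent-01", "update", "crm/deals/42", "user asked to close deal")
log.record("hr-agent-07", "read", "ats/resumes", "screen new applicants")

for event in log.by_agent("sales-agent-01"):
    print(event.timestamp.isoformat(), event.agent_id, event.action, event.resource)
```

Events are frozen dataclasses so a recorded action cannot be edited after the fact, which is the property an audit trail depends on.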
## Agent Profiles: The New User Accounts Think of an AI agent profile as a digital employee file: - **Unique Agent Identifier:** How the agent is recognized in the system - **Credentials** (API keys, OAuth tokens, etc.): What the agent uses to authenticate to services - **Capabilities:** What the agent is allowed to do - **Owner/Supervisor:** Who created or manages the agent - **Context:** Purpose, current task, environment Agent profiles will enable better management, trust, and lifecycle control (onboarding, offboarding, suspension). ## RBAC for Agents: Roles, Permissions, and Fine-Grained Access Assigning roles and permissions to agents is not going to be a nice-to-have—it will be a necessity. But the bar is even higher than for humans: - **Least Privilege:** Agents should only access what's absolutely necessary. - **Dynamic Permissions:** As agents learn or change roles, their access must update in real time. - **Revocation:** Removing agent access instantly is critical for security. ### Fine-Grained Data Access: Beyond the Row, Down to the Cell In many organizations, access controls are not just at the file or table level—they're at the row or even cell level. For example, a sales agent may only see revenue data for their region, or a healthcare agent may see only certain fields in a patient record. AI agents will need to respect these boundaries: - **Cell-Level RBAC:** Agents should only read/write the specific data they're authorized for. - **Context-Aware Policies:** Access rights may depend on the agent's task, user, or even time of day. - **Auditability:** Every access—especially to sensitive data—must be logged and reviewable. ## The Opportunity: Building the Agent Identity Layer Just as Okta and Auth0 built massive businesses around human identity, there's a coming wave of startups building identity, RBAC, and lifecycle management for agents. We'll see: - Agent directories (who are the agents in my org?) 
- Permission dashboards - Automated onboarding/offboarding - Delegation and escalation workflows ## Challenges and Open Questions - How do you revoke agent access instantly, everywhere? - How do you handle agent-to-agent delegation and impersonation? - What about agents that spawn other agents—who is responsible for their actions? - How do you ensure explainability and transparency as agents become more autonomous? ## Other Open Questions - **User Consent:** How do users grant (and revoke) agents permission to act on their behalf? - **Agent Lifecycle:** What happens to access and data when an agent is retired or replaced? - **Cross-Org Collaboration:** How are permissions managed when agents work across company or department boundaries? - **Human-in-the-Loop:** When should humans be able to override or audit agent actions in real time? - **Privacy:** How do we ensure agents only access the minimum data needed, especially with sensitive info? - **Impersonation Risks:** How do we prevent fake or hijacked agents? - **Regulation:** How will new laws and liability shape agent identity and access? These are just a few of the interesting topics that will shape how we trust and deploy AI agents at scale. We've already dealt with these issues in the human world, and the same principles will apply to agents, but with even more complexity. ## Conclusion As organizations deploy more AI agents, the need for clear identity and access controls will only grow. The best solutions will balance security, flexibility, and transparency—without getting in the way of what makes agents powerful in the first place. --- ### SaaS Isn't Dying—It's Becoming the Toolbox for AI Agents - Published: 2025-02-02 - Category: Industry Insights - Tags: SaaS, AI Agents, Automation, Enterprise AI, API Design - URL: https://tonykipkemboi.com/blog/saas-vs-agents SaaS isn't disappearing—it's evolving into the backend infrastructure that AI agents use to get work done. 
#### Full Content ![A robotic arm reaching into an open toolbox to grab a screwdriver while a human hand rests on the table beside it](/blog/saas-vs-agents-hero.jpg) There's been a lot of talk about AI agents *killing* SaaS. This is not entirely true in my opinion. SaaS isn't disappearing—it's evolving into **the backend infrastructure that AI agents use to get work done**. The real shift isn't about replacing software. It's about **who (or what) is using it**. Right now, humans are the primary users of SaaS tools. Soon, **AI agents will be the primary users**—handling workflows autonomously while humans focus on strategy. But here's the thing: **AI agents don't work without SaaS**. They need software to connect to, APIs to call, and data to pull from. ## AI Agents Aren't Replacing SaaS, They're Becoming the New Users Let's talk about what this actually means. Today, a salesperson logs into: - HubSpot to check leads - LinkedIn to send DMs - Gmail to follow up - Notion to take notes Soon, an AI sales agent will do all of that: - Monitor CRM activity - Draft and send follow-up emails - Auto-update deal statuses - Schedule meetings in Calendly But does that mean HubSpot is dead? Nope. It just means the user is no longer a human clicking buttons—it's an AI agent making API calls. The same pattern will play out across **every industry**. ## Marketing: No More Clicking Around in CRMs Today: - Marketers log into HubSpot, Salesforce, or Mailchimp to launch campaigns. - They analyze performance, tweak copy, and optimize manually. Soon: - An AI agent will auto-adjust campaigns based on real-time data. - It will rewrite ad copy dynamically, adjust budgets, and refine targeting *without human intervention*. But what is it using? **The same SaaS tools—just better**. HubSpot doesn't die—it just becomes **an invisible engine powering AI-driven marketing**. 
## HR & Recruiting: AI Agents Will be Hiring for You Right now, recruiters: - Manually post jobs, screen resumes, and email candidates. - Waste hours scheduling interviews. Soon, an AI hiring agent will: - Auto-post jobs based on hiring needs. - Analyze resumes before a human ever sees them. - Send outreach emails and schedule interviews dynamically. Does that mean Workday, Lever, or Greenhouse are obsolete? No. They just become **tools that AI agents use to recruit at scale**. ## Finance & Ops: AI Agents Will Run Your Books CFO workflows today: - Checking QuickBooks, Stripe, and Expensify for financial insights. - Approving expenses and running forecasts manually. Soon, AI-powered CFO agents will: - Pull financial reports automatically. - Flag unusual transactions for review. - Predict cash flow trends and recommend cost-saving moves. But they're not doing this in a vacuum—they're still calling on **QuickBooks, Expensify, and Tableau**. The difference? **No human is clicking through dashboards anymore**. ## Now, What if AI Agents Build Their Own SaaS? What if AI agents start building their own SaaS—then other AI agents consume that software on behalf of users? The AI agents will be working in **crews**, much like departments in an organization: - One group of AI agents could develop and maintain a customer relationship platform. - Another crew might build a content management system for marketing. - Yet another team could create specialized analytics tools. In this scenario, **AI agents create, integrate, and consume SaaS tools in a fully autonomous cycle**. The hierarchy might mirror modern org structures, with different AI **"departments"** handling specific tasks. This evolution doesn't spell the end of SaaS. It just means that the very fabric of SaaS will be woven by AI. - Even if AI agents build their own tools, the underlying model is the same: **modular, API-driven services**. 
- SaaS products will still be the building blocks, only now they'll be crafted and orchestrated entirely by AI agents. Whether built by humans or AI, these tools must deliver robust, specialized capabilities that are hard to replicate in-house. The value remains in the quality, security, and efficiency of the software—qualities that make these tools indispensable. So, while the players might change, the game stays the same: if your platform isn't built for AI-driven integration and autonomous operation, you're falling behind TBH. ## The Future of SaaS is Invisible The biggest SaaS brands of the next decade will be **the ones you don't even log into**. AI agents will interact with them so seamlessly that they'll disappear into the background. UI will be a thing of the past as AI agents will be able to interact with your software in natural language and not through a UI. This is the shift: SaaS isn't going away. It's just becoming the infrastructure behind AI-driven work. So what should SaaS companies do now? - Prioritize API-first design – make it easy for AI agents to use your product. - Build automation-first features – AI agents will be your next big user base. - Move beyond UI-driven workflows – the future isn't dashboards, it's direct integrations. The question isn't *"Will AI kill SaaS?"* The real question is: **Is your SaaS ready for AI to be its biggest customer?** --- ### You're Leading the AI Revolution - Published: 2025-01-22 - Category: Industry Insights - Tags: Consumer AI, AI Agents, Enterprise AI, AI Adoption, Technology Trends - URL: https://tonykipkemboi.com/blog/consumers-leading-ai-revolution Discover how consumers are outpacing enterprises in AI adoption and why this shift is redefining the future of technology. #### Full Content ![Diverse group of people walking on a sunny sidewalk, each using glowing phones and tablets](/blog/consumers-ai-revolution-hero.jpg) AI is everywhere right now. 
If you've ever asked ChatGPT a random question or played around with a fun avatar generator, congratulations—you're part of the AI revolution and you're leading the charge in AI adoption. What's wild is this: consumers (that's you and me) are outpacing enterprises when it comes to jumping on the AI train. That's a complete flip from most other tech cycles. Usually, the big corporations figure things out first, and we eventually get integrated as part of a product offering. So, why's it different this time? Here's my take: #### 1. AI is Ridiculously Accessible You don't need a PhD, a fat wallet, or some corporate information technology team to get started with AI. Platforms like ChatGPT, Midjourney, and CrewAI are just... there. Free trials, low-cost plans, and easy interfaces mean you can start playing with AI in minutes. #### 2. We're Selfish (in the Best Way) AI tools solve our everyday problems **right now**. Need a workout plan? Ask ChatGPT. Struggling with meal ideas? Ask Claude. Want a cool birthday card design? Use Midjourney or Replicate to generate an image. Want to automate a list of complex tasks? Use CrewAI. Consumers are using AI for fun, creativity, and productivity, while enterprises are still figuring out how it fits into their big-picture strategies. #### 3. Most Contagious Thing Since 'Baby Shark' AI-generated stuff is *everywhere*. Your friend's new profile pic, a viral tweet, or even that hilarious AI-written song—it's all over social media. People see it, think "I want to try that," and boom—another new user. #### 4. Businesses are Stuck in Their Own Red Tape Big companies have way more to think about: compliance, integrating AI into old systems, and making sure customer data isn't leaked everywhere. For the rest of us? We just care if it works. This is not a negative per se as there are industries that need tighter regulations around AI, such as healthcare, finance, and military. #### 5. 
Cost-Effective for Individuals Most AI tools have free plans or cost less than your coffee habit. For enterprises, scaling these tools means bigger costs, more negotiations, and longer approval processes. #### 6. Consumer-First Innovation In past tech cycles, innovation was usually built for companies first. Think about early computers or software: businesses got the shiny new toys, and consumers had to wait. With AI, it's flipped. Companies like OpenAI and Meta are focusing more on consumers. ## So What Does This Mean? We are driving AI innovation right now. We're the beta testers, the explorers, and the ones pushing these tools into the spotlight. Enterprises will catch up (they always do), but for now, the playground belongs to us. Enjoy it. Create. Experiment. And keep leading the way. --- ### Get Started with AI Agents Using CrewAI - Published: 2024-12-12 - Category: Tutorials - Tags: AI Agents, CrewAI, LLMs, Automation, Tutorial - URL: https://tonykipkemboi.com/blog/crewai-quickstart Learn how to build your first AI agent using CrewAI, a framework for creating autonomous AI agents that can work together to accomplish complex tasks. #### Full Content ![A person at a desk orchestrating three monitors, each showing a different task being worked on autonomously](/blog/crewai-quickstart-hero.jpg) As a developer advocate at [CrewAI](https://crewai.com), I get asked a lot about how to get started with building AI agents. In this brief blog post, I'll walk you through the process of getting started with CrewAI and creating your first AI agent. ## What is CrewAI? CrewAI is an innovative framework designed to orchestrate role-playing AI agents.
It allows you to create autonomous AI agents that can:

- Work together in a hierarchical structure
- Share context and information
- Execute complex tasks sequentially or in parallel
- Integrate with various LLM providers

## Setting Up Your Environment

First, you'll need to install CrewAI and CrewAI tools:

```bash
pip install crewai crewai-tools
```

You'll also need to configure your LLM provider. CrewAI supports various options including:

- OpenAI
- Anthropic
- Local models via Ollama
- Google Vertex AI
- Azure OpenAI
- [More here](https://docs.crewai.com/concepts/llms#provider-configuration-examples)

## Creating Your First Agent

There are [various ways](https://docs.crewai.com/concepts/agents) to create an agent, but I'll show you how to create a simple agent in CrewAI.

First, configure your API keys:

```bash
# Get your free API key here: https://serper.dev
export SERPER_API_KEY='your_serper_api_key'
```

Then create a file called **"main.py"** and add the following code:

```python
from crewai import Agent, Task, Crew, Process, LLM
from crewai_tools import SerperDevTool  # ships in the crewai-tools package

# Create an LLM provider
llm = LLM(
    model='o1-preview',
    api_key='your_openai_api_key',
    temperature=0.7
)

# Create a research agent
researcher = Agent(
    role='Research Analyst',
    goal='Conduct detailed research on AI technology trends',
    backstory="""You are an expert research analyst with a focus on AI technology.
    You have a track record of identifying emerging trends and providing actionable insights.""",
    tools=[SerperDevTool()],  # reads SERPER_API_KEY from the environment
    llm=llm,
    verbose=True
)
```

## Defining Tasks

Tasks are what agents need to accomplish:

```python
# Create a task
research_task = Task(
    description="""Analyze the latest developments in AI agents and autonomous systems.
    Focus on real-world applications and emerging trends.""",
    expected_output="An executive summary of comprehensive insights into the current state of AI agent technology",
    agent=researcher,
)
```

## Assembling Your Crew

Now let's put it all together:

```python
# Create the crew with our agents and tasks
crew = Crew(
    agents=[researcher],
    tasks=[research_task],
    process=Process.sequential,
    verbose=True
)

# Kick off the work and print the final report
result = crew.kickoff()
print(result)
```

Run your code with the following command:

```bash
python main.py
```

## What Did We Just Do?

That's it! You've successfully created your first AI agent using CrewAI. You should see a full report of your task in the terminal. You can further customize your agents, tasks, and crew as needed and add more complex workflows.

Another thing to note is that you can use Pydantic models to make sure you get consistent task outputs and agent responses. Watch this [tutorial](https://www.youtube.com/watch?v=dNpKQk5uxHw) for more details on how to use Pydantic models with CrewAI.

## Best Practices to Keep in Mind When Building with CrewAI

1. Give agents clear, specific roles and goals
2. Provide relevant context in task descriptions
3. Use appropriate tools for the task
4. Not all LLMs are created equal; for example, some are not suitable for tool calling
5. Use Pydantic models to ensure consistent task outputs and agent responses

Check out CrewAI [documentation](https://docs.crewai.com/) for more detailed information and advanced usage examples.

**PS**: _I manage the docs for CrewAI. If you have any questions or feedback, don't hesitate to reach out to me on [Twitter](https://twitter.com/tonykipkemboi)._

---

## End of Document

This document was automatically generated and contains the full content of tonykipkemboi.com for AI/LLM consumption. Last updated: 2026-05-18T20:50:36.258Z