Kenneth W. Bingham

AI Reliability · Backend & Platform Engineer

10+ years · open to hybrid and remote

I build the reliability layer around AI: deterministic pipelines, evaluation and verification harnesses, and reproducible, cost-efficient systems. Everything below runs, is tested, and is open for you to re-run yourself. No claim here goes undemonstrated.

Fixing AI's worst failure modes

AI's biggest problems right now are simple to name: it makes things up, it costs too much, and you can't reproduce what it did. Here is what I built for each, with a number you can check and a tool you can run right now, in your browser, no signup.

“AI just makes things up”

~39% → 0% hallucinated output on a tested task

The number-one reason AI stays out of production. I don't promise zero on every task; that promise is the red flag. I ground answers in your data, flag every unsupported claim, and verify before anything ships. The Hardener does the flagging: paste a document and it marks what isn't backed by a source.

grounding · verification pass · let it say “I don't know”

Measure your rate → · Try the Hardener →

“AI is too expensive”

~99.7% fewer tokens on localized queries

Every token is money, and most prompts ship far more than the model needs. I cut the input: the Token Minimizer strips a prompt to what matters, and the dimensional API answers a localized question over a huge nested structure by reading only the slice it needs.

token budgeting · derive-not-store · prompt caching

Try the Token Minimizer live →

“You can't reproduce what it did”

1,593-file repo → one hash-identical artifact · ~85% repeat-input cost cut

Same input, different output - so you can't audit it, evaluate it, or cache it. bfx-ingest turns a codebase into deterministic context: the same input yields the same root hash every run, byte-identical for prompt caching and reproducible for evals.

content-addressed · deterministic · no deps, tested

See bfx-ingest, with a live demo →
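A minimal sketch of the determinism property described above, not bfx-ingest's actual code. It assumes SHA-256 and a simple "sorted relative path plus content digest" layout; the function names `file_digest` and `root_hash` are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def root_hash(root: Path) -> str:
    """Fold (relative path, content digest) pairs into one root hash.

    Paths are sorted and written POSIX-style, so the result does not depend
    on directory-listing order or on the operating system's path separator.
    """
    entries = sorted(
        (p.relative_to(root).as_posix(), file_digest(p))
        for p in root.rglob("*")
        if p.is_file()
    )
    h = hashlib.sha256()
    for rel, digest in entries:
        # A NUL separator keeps a path from running into its digest.
        h.update(f"{rel}\0{digest}\n".encode("utf-8"))
    return h.hexdigest()
```

Because only file contents and relative paths feed the hash, two checkouts with identical files produce the same root hash, and changing a single byte in any file changes it.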

“The codebase rots into duplication”

structural duplication scored, not guessed

No single source of truth means the same value drifts in five places and the model has five things to get wrong. The Dimensional Linter measures structural duplication in your code so you can collapse it to one source - the rule, not the copies.

single source of truth · structural dedup

Try the Dimensional Linter live →

Beyond the AI layer, the same engineering depth, all live:

Reliability out of a model you don't control

I get reliability from stock models by directing them, not hoping. Explicit operating directives, hard constraints, and a verification pass at each step keep a model on track and out of drift. Reduced to one number, that discipline took a verifier's hallucinated output from about 39% to 0% on a tested task while keeping the answer rate high. The result is not a smarter model but a cheaper and more reliable one. I focus my instance of it; the provider's model is untouched.

Since August 2025, on my own initiative, I have gone past using AI into studying how it works and how to optimize it, and I'm building an original framework: dimensional programming, which represents data as derivable geometry so a model reads only what it needs. Demonstrated, not just claimed: a dependency-free API measures a roughly 99.7% token reduction when answering localized questions over a large nested structure. The ideas are mine; every claim is published with a label for how well it's supported.

What I bring

Reliability around an unreliable model

The model is nondeterministic; the system around it must not be. Schema-validated outputs, retry budgets, verifier passes, and evaluation harnesses, so malformed or drifting output never reaches the next stage.
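As a rough illustration of schema validation with a retry budget, here is a minimal sketch. The schema (`answer`, `sources`), the `generate` callable, and the names `validate`, `call_with_budget`, and `BudgetExhausted` are all assumptions for the example, not taken from any project above.

```python
import json
from typing import Callable

# Hypothetical schema: field name -> required Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list}


class BudgetExhausted(Exception):
    """Raised when no attempt produced schema-valid output."""


def validate(raw: str) -> dict:
    """Parse raw model output and check it against REQUIRED_FIELDS."""
    data = json.loads(raw)  # malformed JSON raises JSONDecodeError, a ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), expected):
            raise ValueError(f"field {name!r} missing or not {expected.__name__}")
    return data


def call_with_budget(
    generate: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> dict:
    """Call generate until its output validates or the budget runs out."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return validate(generate(prompt))
        except ValueError as err:
            last_error = err  # invalid output is dropped, never passed on
    raise BudgetExhausted(
        f"no valid output after {max_attempts} attempts: {last_error}"
    )
```

The caller either receives a dictionary that matches the schema or gets an exception; there is no path by which malformed output reaches the next stage.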

AI / LLM

LLM integration, RAG, evals, structured outputs, tool/function calling, model routing, prompt caching, token budgeting; self-hosted local models.

Backend & infrastructure

Node.js, Python, C#/.NET, Java, REST, WebSockets, PostgreSQL, SQL Server, SQLite, Linux, nginx, Docker, CI/CD, zero-downtime deploys.

Cost & latency

Route cheap calls to cheap models, cache byte-identical context, budget tokens, so frontier spend goes only where frontier reasoning is required.
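One way the routing rule could look, as a sketch under stated assumptions: the model names, the set of "simple" task kinds, and the token threshold below are placeholders chosen for illustration, not measured values or real model identifiers.

```python
from dataclasses import dataclass

CHEAP_MODEL = "small-model"        # hypothetical model name
FRONTIER_MODEL = "frontier-model"  # hypothetical model name

# Hypothetical task kinds assumed not to need frontier reasoning.
SIMPLE_KINDS = frozenset({"classify", "extract", "format"})


@dataclass(frozen=True)
class Route:
    model: str
    reason: str


def route(task_kind: str, prompt_tokens: int, token_limit: int = 2000) -> Route:
    """Pick a model from the task kind and the prompt size."""
    if task_kind in SIMPLE_KINDS and prompt_tokens <= token_limit:
        return Route(CHEAP_MODEL, "simple task within token limit")
    if task_kind in SIMPLE_KINDS:
        return Route(FRONTIER_MODEL, "simple task but prompt exceeds token limit")
    return Route(FRONTIER_MODEL, "task kind requires frontier reasoning")
```

Returning the reason alongside the model keeps each routing decision auditable, which makes it easier to check later whether frontier spend went where it was needed.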

Reproducibility & provenance

Content-addressed, hash-verified, deterministic replay, so you can prove exactly what a system did and recreate it on any machine.

The honest discipline

Every claim is demonstrated in tested code and labeled for how well it's supported. The bold parts stay bold; the whole stays honest.

Looking for my next role

Seeking a full-time AI engineering or backend / platform role, open to hybrid and remote. The projects here are personal portfolio pieces, built end to end to demonstrate the work above, not commercial products for sale.