From Brittle to Brilliant: How AI is Revolutionizing Software Evaluation
2025-07-24 00:00:00 -0700

We are rapidly moving toward a world where near-human-level cognition is available at every layer of our software. 🧠 This isn’t a far-off dream; it’s a paradigm shift happening right now, promising sub-500 millisecond reasoning on complex topics and fundamentally changing how we build and validate technology.

[Image: AI evaluation system visualization]

This evolution is perfectly captured in the journey of building evaluation systems for conversational AI. Over the past decade, I’ve built three such systems at three different companies, each representing a distinct era of technology. This is the story of how we went from rigid, brittle, and expensive testing to a new world where evaluation is as dynamic, nuanced, and intelligent as the AI it measures.

The Early Days: Heuristic-Based Evaluation

About 10 years ago, at a startup called Ozlo (later acquired by Facebook), I built the first of these evaluation systems, and it was entirely heuristic-based. This was before modern LLMs, so testing relied on custom databases, keyword matching, and parsing syntax trees. We’d specify an input utterance and context, and the system would check the AI’s response for certain entities or features. The same architectural approach was used again at Facebook to assess its digital assistant.
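To make that style concrete, here is a minimal sketch of what one heuristic check might look like. Everything in it is illustrative: the case format, the `contains_entity` and `contains_keywords` helpers, and the `assistant.respond` call are stand-ins, not the actual Ozlo or Facebook harness.

```python
# Illustrative sketch of heuristic-based evaluation: fixed utterances plus
# inflexible checks against the parsed response. Field names are assumptions.

def contains_entity(response: dict, entity_type: str) -> bool:
    """Check whether the parsed response carries an entity of the given type."""
    return any(e.get("type") == entity_type for e in response.get("entities", []))

def contains_keywords(response: dict, keywords: list[str]) -> bool:
    """Crude keyword match against the response text."""
    text = response.get("text", "").lower()
    return all(k.lower() in text for k in keywords)

# One test case: an input utterance, a context, and a pile of checks that
# "angle around" the real question of whether the answer was any good.
case = {
    "utterance": "what is the Mega Millions jackpot?",
    "context": {"locale": "en-US"},
    "checks": [
        lambda r: contains_entity(r, "lottery_jackpot"),
        lambda r: contains_keywords(r, ["mega millions", "jackpot"]),
    ],
}

def run_case(assistant, case: dict) -> bool:
    response = assistant.respond(case["utterance"], case["context"])
    return all(check(response) for check in case["checks"])
```

Notice that nothing here can say whether the jackpot figure is actually current, or whether a user would be satisfied with the phrasing; the checks are proxies, and proxies are where the trouble starts.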

The problem with this approach is that it’s incredibly brittle and expensive to maintain. You can’t express success in a simple, human way. Instead, you have to “angle around it” with a series of inflexible checks.

You couldn’t just say something like, ā€˜when I ask for Mega Millions current jackpot, it should return the current jackpot of the Mega Millions.’ You can’t express your success criteria in an understandable and human way. In some of these cases, if you can build the evaluation system, then you should be able to build the assistant that it is supposed to test anyway. It’s kind of a chicken and egg problem.

— From the source audio

These systems struggled to validate answers to real-time, dynamic questions, and they were never truly holistic in their assessment. They were a necessary step, but they were a constant, costly engineering effort.

The Shift to AI-Powered Grading

The third time I built an evaluation system, at my current company, the game had changed. The conversational agent itself was powered by modern LLMs, and for the first time, I could incorporate AI into the evaluation process itself.

Initially, this felt uncomfortable. Using an LLM for grading introduced concerns about cost, speed, and non-determinism. What if you get a different evaluation result every time you run it? To build a reliable baseline, I started by running every AI-based evaluation in triplicate and averaging the results to get a stable signal.
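Here is a minimal sketch of that triplicate approach, assuming a generic `call_llm` callable that wraps whatever model client is in use; the prompt wording, the 0–10 scale, and the score parsing are illustrative assumptions rather than the production system.

```python
import re
import statistics
from typing import Callable

# Illustrative grading prompt; the rubric wording and 0-10 scale are assumptions.
GRADER_PROMPT = """You are grading a conversational assistant.
Rubric: the response should directly answer the user's question,
be factually plausible, and leave the user satisfied.

User asked: {utterance}
Assistant replied: {response}

Reply with only a score from 0 to 10."""

def grade_once(utterance: str, response: str,
               call_llm: Callable[[str], str]) -> float:
    """One non-deterministic grading pass: ask the LLM, pull out the number."""
    raw = call_llm(GRADER_PROMPT.format(utterance=utterance, response=response))
    match = re.search(r"\d+(\.\d+)?", raw)
    return float(match.group()) if match else 0.0

def grade(utterance: str, response: str,
          call_llm: Callable[[str], str], runs: int = 3) -> float:
    """Run the grader in triplicate (by default) and average for a stable signal."""
    return statistics.mean(grade_once(utterance, response, call_llm)
                           for _ in range(runs))
```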

Despite the complexity, the benefit was staggering. The expressiveness of the evaluation system improved by orders of magnitude. Suddenly, we could write sophisticated rubrics for the AI to follow, much like a professor giving instructions to a team of TAs. This allowed for deep, nuanced analysis that went far beyond simple keyword matching.

If you invest a little bit of time in that grading rubric, boy can you get deep analysis. Far better than looking for keywords or analyzing in a traditional heuristic fashion… which are almost always proxies for what you really want to measure. You want to measure, was this response good? Would the user be satisfied? Was the user’s intent fulfilled?

— From the source audio

The Breakthrough: Rubric-Driven Evaluation

A recent blog post from the search engine company Exa triggered a major unlock for me. They proposed that their evaluations are LLM prompts: the tests are simply rubrics written in plain language. This amounts to a fourth generation of evaluation system, and it’s a complete paradigm shift. The test itself is just a set of grading guidelines executed by an LLM.
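In that spirit, a test stops being code that pokes at the response and becomes a paragraph of English handed to a grader. The sketch below reuses the same hypothetical `call_llm` helper as above; the rubric text, the 0–10 scale, and the pass threshold are assumptions for illustration, and the rubric itself is the entire test.

```python
# The whole test is this plain-language rubric: no entity checks, no keyword lists.
JACKPOT_RUBRIC = """
When the user asks for the current Mega Millions jackpot, the assistant
should state a current jackpot amount, mention when the next drawing is
if it knows, and not hedge so much that the user never gets a number.
"""

def run_rubric_test(transcript: str, rubric: str, call_llm,
                    pass_threshold: float = 7.0) -> bool:
    """Hand the rubric and the conversation transcript to an LLM grader."""
    prompt = (
        "You are an exacting grader for a conversational assistant.\n"
        f"Grading rubric:\n{rubric}\n"
        f"Conversation transcript:\n{transcript}\n"
        "Reply with only a numeric score from 0 to 10."
    )
    # A real harness would parse the reply more defensively than this.
    score = float(call_llm(prompt).strip())
    return score >= pass_threshold
```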

This approach has clear tradeoffs. Tests are slower and more expensive in terms of compute. But the benefits are transformative: success criteria are finally expressed the way a human would state them, the grader measures what you actually care about (was the user’s intent fulfilled? would they be satisfied?) rather than keyword proxies, and tests no longer shatter every time the assistant’s wording changes.

From Evaluation to Execution: AI as the Runtime

This evolution in testing points to a much larger trend. For years, we’ve operated with a bias to minimize the use of LLMs at runtime due to cost and latency. We use AI to write code, which is then run in a traditional, deterministic way. But as the cost of generative AI plummets and its quality soars, that bias is becoming obsolete.

Is Code Becoming the New Assembly Language?

The logical conclusion is that AI is becoming the new runtime. We are heading toward a future where software systems are built to be run by an LLM, not just evaluated by one. In this world, the code we write today may become like assembly language—an intermediate layer that is generated by a higher-level process and never directly read by humans.
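A purely speculative sketch of what that might look like, again assuming a hypothetical `call_llm` helper: the “program” is a plain-language specification, and the model interprets it on every call rather than executing code a human ever wrote or read.

```python
# Speculative: the specification plays the role of source code, and the LLM
# plays the role of the runtime that interprets it on each request.
EXPENSE_SPEC = """
Given a JSON list of expense records with 'category' and 'amount' fields,
return JSON with the total amount per category, flagging any single
expense over 500 USD.
"""

def run_spec(input_json: str, call_llm) -> str:
    """Treat the spec as the program and the model as its interpreter."""
    prompt = (
        f"Specification:\n{EXPENSE_SPEC}\n"
        f"Input:\n{input_json}\n"
        "Return only the JSON output, nothing else."
    )
    return call_llm(prompt)
```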

Eventually, the code itself might become ephemeral, generated on the fly by an LLM to perform a task and then discarded. This raises fascinating questions for the future of our industry. What does an AI-first programming language look like? And how long will this temporary phase of using LLMs to write code last before we transition to having the LLM be the runtime? The journey of the evaluation system shows us the path, and it suggests the most profound changes are still to come.