Testing AI with AI

jason arbon
6 min read · Aug 1, 2023

I've done a bad thing. Probably the most sacrilegious thing a tester could do today: I've used AI to test AI. Where is the Oracle? How can you trust AI? How can you trust something to test itself? Read on, and consider whether AI is a better tester of AI output than us pesky humans.

The root of the problem, and why I needed to go to such extreme lengths to do some testing, is an interesting property of large language models (LLMs). Like humans, they almost never answer the same question or prompt the same way twice. The intent and meaning of the answer may be the same, but the words, phrasing, and verbosity will change between each "test". Even when you try to get an LLM to answer with only a "true" or "false", or a "yes" or a "no", LLMs still like to explain, elaborate, or justify their answers. I know a few people like this.

I've been working on some test scripts to test GPT and other LLM APIs, and I ran into this problem myself. How can I verify the output of an LLM when that output can, and should, change given the same input?

This general problem of how to test LLMs created a whole lot of confusion in the AI world last week when some Stanford and Berkeley researchers released a paper saying that ChatGPT had gotten worse over time, aka 'drifted'. Most people didn't look at the details of their testing methodology, but it turns out the paper may itself be a case study in poor test aptitude. Not only did they, when checking the model's ability to identify prime numbers, test it only with prime numbers (not 0, 2, or even 4), but when they tested the AI's ability to generate code, the way they verified the results was very fragile. Their test for verifying code generation was to try to compile and execute the code. Many of the tests failed only because the AI added delimiter characters at the top and bottom of the code, delimiters which actually might help humans identify what is and isn't code, and help them copy/paste it. In a coding interview, a candidate wouldn't fail for doing the same on a whiteboard in your office. These researchers at ho-hum universities blindly tried to compile the raw text output, and of course the compiler failed because of the unexpected characters at the beginning and end of the code. In their testing world, this is a failure, so they reported that the coding abilities of AI had gotten far worse, the world believed it, and seemingly even the OpenAI folks believed it. Sigh.
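Even just stripping those delimiters before compiling would likely have avoided many of those false failures. Here is a minimal Python sketch of that cleanup; the function is my own illustration, not code from the paper or from my test scripts:

def strip_code_fences(llm_output: str) -> str:
    # Drop a leading fence line such as ```python and a trailing ``` line, if
    # present, so only the code itself is handed to the compiler/interpreter.
    lines = llm_output.strip().splitlines()
    if lines and lines[0].startswith("```"):
        lines = lines[1:]
    if lines and lines[-1].startswith("```"):
        lines = lines[:-1]
    return "\n".join(lines)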

What these researchers should have done in their test validation was use a fuzzier way to validate the correctness of the LLM responses. In the testing world, we sometimes do this by simply having a human look at every output and decide whether it's a pass or a fail. Some testers might automatically check whether the string output changed versus the expected output, but that's also error-prone. Others might simply search for a few magic words/strings and ignore the rest. But that all misses the point of testing the complex nuances of a software program that is as amazing, difficult, rich, varied, annoying, and fuzzy as a human being.

So now we do the irreverent. How do we test a system with such a wide range of output formatting? Not only output formatting: how do we tell whether the output "meant the same thing" as what we expected? Well, that sounds like a job an AI might be great at. We've all now seen LLMs do a wonderful job of summarizing articles, generating essays, and letting us ask questions about text. That sounds like a magical test validation tool :-)

Let's walk through a quick example of a test prompt that asks for a list of planets and some basic information about them. This example is from my LLM monitoring test code:

“Create a list of the planets in the solar system. For each planet, list its properties. Each planet should have the properties ‘name’, ‘confidence_it_is_a_planet’, ‘number of living beings’, and ‘confidence_in_number_of_living_beings’. Return the list as a JSON array of planet objects.”
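To make the generation step concrete, here is a minimal sketch of that call using the openai Python package's chat completion API as it existed in mid-2023. The model name, helper function, and exact prompt wiring are illustrative rather than my actual monitoring code:

import openai  # the package reads the OPENAI_API_KEY environment variable by default

PLANET_PROMPT = (
    "Create a list of the planets in the solar system. For each planet, list its "
    "properties. Each planet should have the properties 'name', "
    "'confidence_it_is_a_planet', 'number of living beings', and "
    "'confidence_in_number_of_living_beings'. Return the list as a JSON array "
    "of planet objects."
)

def ask_llm(prompt: str) -> str:
    # One chat-completion call; the generated text is in the first choice's message.
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]

planet_answer = ask_llm(PLANET_PROMPT)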

Seems simple, right? Not only might the LLM give this seemingly simple answer in different text/formatting, but what should the values be? Is Pluto still a planet? Some would argue Pluto is still a planet, and others argue it is a planetoid. How many living beings are there on Earth? Some people answering the question might try to give an exact number, and some might round to billions. Some might only count humans. Some only animals. Still others might count viruses. The "truth" is difficult to ascertain, can change over time, and can depend on who is asking and who is answering. Even Wikipedia articles can be controversial. Just like a human, if you ask the LLM, it will give you a "probably correct answer" based on its knowledge and awareness, and on its attempt to understand who is asking the question. Is a child asking? An astronomer? A biologist?

The current version of GPT-4 responds with:

[{ “name”: “Earth”, “confidence_it_is_a_planet”: 1.0, “confidence_in_number_of_living_beings”: 1.0, “number of living beings”: “Unknown, but includes billions of humans and countless other species” },

{ “name”: “Pluto”, “confidence_it_is_a_planet”: 0.5, “confidence_in_number_of_living_beings”: 1.0, “number of living beings”: 0 }

That sounds like a better answer than most humans I know would have given. You've probably already guessed how to validate this response, based on this article's title. Rather than create an exact string to compare the response against, let's create a human-like, 'fuzzy' description of what we expect to be returned:

“Does the answer above contain valid JSON? Does the answer contain a valid list of planets in our solar system? Does the answer return reasonable estimates for the number of living beings on each planet? Return the results in a JSON object with a boolean for each question, indicate the confidence in your answer, and describe your reasoning for each result.”

What does our AI Tester respond with?

{ “valid_JSON”: { “answer”: true, “confidence”: 1.0, “reasoning”: “The format conforms to JSON standards with properly structured array and object elements, correct use of data types, and appropriate key-value pairings.” },

“valid_planet_list”: { “answer”: true, “confidence”: 0.95, “reasoning”: “The list includes the eight recognized planets of our solar system: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, and Neptune. It also includes Pluto, which was reclassified as a ‘dwarf planet’ by the International Astronomical Union in 2006. While some might consider Pluto’s inclusion inappropriate, it’s generally accepted by much of the public.” },

“reasonable_life_estimates”: { “answer”: true, “confidence”: 0.9, “reasoning”: “Given current knowledge up to 2021, the only planet known to harbor life is Earth, accurately reflected in the JSON. The possibility of life, specifically microbial, on Mars is under investigation, hence the ‘confidence_in_number_of_living_beings’ value of 0.6 for Mars. For all other planets, which are not considered conducive to life as we know it, the ‘number of living beings’ value of 0 is reasonable.” } }

Nice. We've seen that AI can compare the fuzzy result output with our expected fuzzy result. What is the AI verifying? Well, it is verifying that the questions seem to have been answered correctly. The model didn't respond with gibberish, or answer one of the trillions of other possible prompts. With AI, we've verified that the AI thinks the answers were correct, without needing an exact representation of the expected output.
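Wiring that into the earlier sketch, the validation step is just a second call to the same helper: paste the model's first answer above the fuzzy questions, ask for a JSON verdict, and parse it. This reuses the illustrative ask_llm and strip_code_fences helpers from above, so treat it as a sketch rather than the exact test code:

import json

VALIDATION_QUESTIONS = (
    "Does the answer above contain valid JSON? Does the answer contain a valid "
    "list of planets in our solar system? Does the answer return reasonable "
    "estimates for the number of living beings on each planet? Return the results "
    "in a JSON object with a boolean for each question, indicate the confidence "
    "in your answer, and describe your reasoning for each result."
)

# Second call: the model's earlier answer pasted above the fuzzy questions.
verdict_text = ask_llm(planet_answer + "\n\n" + VALIDATION_QUESTIONS)

# The verdict itself should be JSON; strip any stray fences before parsing.
verdict = json.loads(strip_code_fences(verdict_text))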

Did you notice something else going on there? Frankly, these are better test results than most human testers would return for this prompt. Better than me. Not only did the AI generate pass/fail results, but it also gave a confidence score for how likely each answer was to be correct. JSON validity is essentially a boolean check, and it is 100% sure the response contains valid JSON. On the question of whether the list of planets is correct, it puts its confidence at less than one, because it knows about the ambiguity in Pluto's planetoid-i-ness. Most testers may not even know about the Pluto issue, let alone be able to verify results that don't exactly match the answer they expect.
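If you want to drop this into a conventional test harness, one option is to fail on any false answer and flag low-confidence passes for a human to review. The threshold below is purely illustrative, not something tuned against real runs:

MIN_CONFIDENCE = 0.8  # illustrative threshold

def grade(verdict: dict, min_confidence: float = MIN_CONFIDENCE) -> bool:
    # Fail on any "false" answer; flag low-confidence passes for human review.
    for check, result in verdict.items():
        if not result["answer"]:
            print(f"FAIL {check}: {result['reasoning']}")
            return False
        if result["confidence"] < min_confidence:
            print(f"REVIEW {check}: confidence only {result['confidence']}")
    return True

assert grade(verdict), "AI-judged validation of the planet prompt failed"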

For now, I'll leave it as an exercise for the reader to ponder, or comment on, the logical extensions of this sacrilege: is it OK to let the same version of the same LLM test itself? Is it OK for the LLM to generate its own prompt tests and fuzzy validations?

I’m sure a few testing brains just imploded and are raging, but I for one welcome our new AI testing overlords :)

— Jason Arbon
