Testing and Fixing an AI Bot
Without warning, while I was driving, Alan Page and Brent Jensen started chatting with the Expert Testers AI bot on the AB Testing podcast. I was suddenly both very curious and very nervous as to what my virtual child might say when I wasn’t around. These two ‘supposed experts’ are not only testing experts themselves and able to judge the bot output, they are also TESTERS, and I knew they’d start testing the little bot live. Good thing my car has lane assist… I learned a bit about the expectations of a chatbot and how to fix/nudge these new AI chatbots to be better, and I’m sharing those lessons here for AI posterity.
Alan started off assuming he was a testing expert. Yeah, cringe :) He asked the bot to contrast him with another expert in the field, and it generated an answer that Alan seemed to think was OK->good. But the bot suggested his style of testing was ‘structured’. Alan and Brent recoiled at that description, and soon realized it might be because the AI knew too much about Alan’s past life at Microsoft, thanks to the book he wrote called “How Microsoft Tests Software”. Since the book, he’s had further revelations on how testing should be done, and those revelations are ‘less structured’. Nice, forgiving users, so far.
This is an interesting aspect of testing AI and knowledge-based systems. In the old days of Google Search, people would almost always first try a Vanity Query. They would of course search for their name. And in Google Maps, people would of course search for their house. When you think about it, this is a tough ‘BVT’/‘acceptance’ test suite: the software has to get it right for every individual using it, on the first try. Worse, this one person is actually the world expert on this very topic! Worse still, people are very particular when any other person, or system, says things about them.
Back to this first live test. Alan and Brent seemed to balk at being described as ‘structured’. Of course, my mind was thinking how awesome it was that the bot did a decent job on everything else, and probably while I was asleep. We live in interesting times and I wouldn’t have thought this possible a few years ago. Alan had asked it about Alan and it delivered a decent answer! Of course, this being AI, the user/tester focused on the nuance, definition, and implications of a single word out of hundreds. Sigh.
My mind immediately started an apology campaign for the bot. Of course, Alan's apology for the bot was probably correct: there is a lot of authoritative information out there about Alan’s structured testing views, in a book written by Alan of all things, and a lot of humans probably think the same! Then my mind added the notion that the bot was also in a ‘compare and contrast’ mode between his thinking on testing and that of another ‘self-declared testing expert’ who advises against test cases, documentation, and automation and appeals to the most base. Basically ‘Wing It’. Any testing expert would be considered super-structured in comparison.
But my testing and user advocate mind finally kicked in and decided, no, if Alan thought it was wrong, it is probably wrong, and at minimum wrong for him. Users shouldn’t have to apologize for the software, especially when it purports to be an expert. Even if Alan was wrong in his interpretation, because he’s too close to and biased on the topic, the ideal bot would have qualified the response, either by noting that the description came from a relative comparison or by noting that it might rest on old data that conflicts with more recent writings and rants that Alan has made. It should have assumed, like any person would, that an expert's more recent statements should probably take precedence.
FIX: I updated the bot prompt to be sure to consider the timeline of each expert’s statements and generally assume later statements are more relevant and representative of their thoughts.
FIX: I also updated the bot prompt to consider whether the statements about an expert are relative to another's in a response, and if so ensure the user is made aware of any question-specific bias in representations of that expert's opinions.
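For readers who want to picture what these two prompt fixes might look like, here is a minimal sketch. The wording of `BASE_PROMPT` and `FIXES`, and the helper `build_prompt`, are all hypothetical illustrations; the actual prompt text used for the bot is not shown in this post.

```python
# Hypothetical wording: an illustration of the two fixes, not the bot's real prompt.
BASE_PROMPT = "You answer questions about software testing experts and their views."

FIXES = [
    "Consider the timeline of each expert's statements, and generally treat "
    "later statements as more relevant and representative of their thinking.",
    "If a description of an expert is relative to another expert in the same "
    "response, tell the user that the characterization is a comparison.",
]


def build_prompt(base: str, fixes: list[str]) -> str:
    """Append each fix as a numbered instruction after the base prompt."""
    lines = [base, "", "Additional instructions:"]
    lines += [f"{i}. {fix}" for i, fix in enumerate(fixes, start=1)]
    return "\n".join(lines)


print(build_prompt(BASE_PROMPT, FIXES))
```

Keeping each fix as a separate numbered instruction makes it easier to add, reword, or remove one later and see which change affected the bot's answers.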
Kinda cool: instead of retraining on a billion examples, reprogramming a bunch of code, or changing a bunch of rows in a database, in this new world of Generative AI, we can, well, generally tell it how we want it to behave, and it will generally do so :)
I saw the next test coming just as the light went green. The people behind me had to honk, but friendly-like. Of course, Alan and Brent had to ask the bot about Brent. Brent is interesting not only because he has a clicky-clacky keyboard, with the microphone placed inches away, and likes to type over Alan’s thoughts and questions, but also because he lives the insular life of many experts at the world's largest software company, nestled in the woods in the Pacific Northwest: he doesn’t produce much data for the AI or anyone off-campus to analyze other than this podcast. Brent hasn’t written a book, hasn’t blogged much in the past few years, hasn’t commented or posted much on LinkedIn, and even apologizes for his social media presence. Worse, Brent thinks he’s a testing expert and of course assumed the AI would also think so! Lolz. He is definitely an expert in testing, but he’s also probably not on many folks' top 100 list because he is far quieter, humbler, and oh yeah, keeps telling the world (and AI) that he’s a Data Scientist, not a Tester! It's a thing at Microsoft: it suddenly became uncool to be a tester, even life-threatening. To make matters worse, he’s a Datta scientist, not a Dayta scientist.
So, when they asked the bot about Brent, I winced in my car. Sure enough, it seemed that the AI had a rough notion of Brent, but not much detail about him or his testing philosophy or expertise. The AI did know that Brent was very focused on fast-feedback loops (something he advocates for a lot), and that he was obviously data-focused. Still, the detail on Brent was much fuzzier than what the AI gave in its responses about Alan. Alan quickly concluded the AI was likely hallucinating, making things up, and pretending to know Brent to make him happy. Because of their RLHF chatbot training, OpenAI's and other LLM chatbots often want to give an answer even when they are unsure or missing data. Brent was flattered that the AI kinda knew him, and started giggling while asking more questions. Brent probably knows he has a much smaller internet footprint than Alan, so he was happy to be considered an ‘expert’, and he’s just naturally very curious.
NOTE: We may never really know the truth, but Alan also didn’t consider that I’d instructed the AI to avoid saying anything that it isn’t very confident in, and that it is far better to say nothing than to hallucinate, make things up, or assume things about one of the testing experts. Maybe Alan would have given the bot more credit if he knew those were explicit instructions. Alan and Brent also didn’t know that Brent wasn’t in the ‘Top 100’ testing experts by default. Hmmm. Cringe? These are some examples of difficult testing/quality questions of the AI era.
FIX: I explicitly told the bot that Brent was a testing expert. I did this because as the curator of the bot, I can apply this bias, and I hope to be invited again back on their podcast. Probably in vain though, as success would be the bot on the podcast again instead of meat-space-me.
FIX: I also pointed to a few of Brent's blog posts and social media and made sure it knew he was the co-creator of this whole ‘Modern Testing’ thing.
FIX: I asked the AI to always include a confidence metric in each response. The % confidence is a measure of how sure the bot is that the response it gave was correct. This helps in two ways. Future users can now tell whether the chatbot, like a human, thinks it knows what it is doing, or doesn’t and is just trying to please with an answer or an educated guess. Just like most humans, I might add. It also has the added benefit of semi-chain-of-thought reasoning: it forces the bot to consider why it is saying what it is saying, and to use that to help ensure what it is saying is correct. (Fancy prompt engineering stuff.)
FIX: I also asked the bot to explain its confidence % when possible. Again, this further helps force the bot to think through the reasoning behind the response in many cases and also gives the user more context on how to judge the correctness of the response and understand how the AI is thinking.
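If you wanted to check these confidence numbers automatically, a minimal sketch might look like the one below. It assumes the bot ends each reply with a line such as "Confidence: 60% - limited public writing". That format, the regular expression, and the helper `extract_confidence` are hypothetical, since the post does not show how the bot actually formats its confidence.

```python
import re
from typing import Optional, Tuple

# Assumed (hypothetical) reply format: a line like
# "Confidence: 60% - limited public writing"
CONFIDENCE_RE = re.compile(
    r"^Confidence:[ \t]*(\d{1,3})[ \t]*%[ \t]*(?:-[ \t]*(.*))?$",
    re.MULTILINE,
)


def extract_confidence(reply: str) -> Optional[Tuple[int, str]]:
    """Return (percent, explanation) from a reply, or None if absent or invalid."""
    match = CONFIDENCE_RE.search(reply)
    if match is None:
        return None
    percent = int(match.group(1))
    if percent > 100:
        return None  # treat out-of-range values as malformed
    explanation = (match.group(2) or "").strip()
    return percent, explanation


reply = "He advocates fast feedback loops.\nConfidence: 60% - limited public writing"
print(extract_confidence(reply))
```

A check like this only confirms that a confidence line is present and well-formed. It says nothing about whether the number is well calibrated.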
Lastly, Alan asked that I work for free and upload transcripts of all his podcasts to the AI bot for him. This is from the person who won’t pay $18/month for podcasting software. I did point the bot to Alan and Brent’s AB Podcast website, but not the transcripts. Transcripts are an interesting data source. Most transcriptions would cost more than $18. More importantly, simple audio transcripts don’t do well, if at all, at separating the different speakers, so I worry the bot might confuse the subtle differences in thinking between Alan and Brent. Worse, the bot might end up assuming the comments of their podcast guests were also theirs. Perhaps Alan and Brent can share a transcript with different voices called out; perhaps they could do a little work?
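To show why speaker labels matter, here is a small sketch of how a transcript with the voices called out could be split up before it is handed to a bot. It assumes a simple "Speaker: text" line format. The format, the placeholder sample, and the helper `group_by_speaker` are hypothetical and not taken from any real transcript.

```python
from collections import defaultdict

# Placeholder transcript in an assumed "Speaker: text" format (not real quotes).
SAMPLE = """\
Host A: placeholder remark about fast feedback loops
Host B: placeholder remark about test automation
Guest: placeholder remark from a guest
Host A: a second placeholder remark
(music)
"""


def group_by_speaker(transcript: str) -> dict[str, list[str]]:
    """Collect each speaker's lines so one person's views are not credited to another."""
    by_speaker = defaultdict(list)
    for line in transcript.splitlines():
        speaker, sep, text = line.partition(":")
        if not sep or not text.strip():
            continue  # skip lines with no speaker label, such as "(music)"
        by_speaker[speaker.strip()].append(text.strip())
    return dict(by_speaker)


for speaker, remarks in group_by_speaker(SAMPLE).items():
    print(speaker, len(remarks))
```

This naive split treats anything before the first colon as a speaker name, so an unlabeled line that happens to contain a colon would be misattributed. A real transcript would need a stricter format.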
So, there you go. We had some testing of the test expert bot, by testing experts. The supposed testing experts didn’t bother taking the time to create JIRA tickets; instead, they live-broadcast and published all the flaws :). Testing in production is awesome! Still, we were able to infer some ‘bugs’ and make some improvements that should make most future interactions better and more contextual.
A thing most traditional testers don't realize about testing and fixing AI-based systems is that you want to keep in mind the possible impact of any change on all other queries/responses. More often than not, simple or obvious fixes cause many more issues in areas you hadn’t anticipated.
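One way to keep an eye on that ripple effect is to re-run a fixed set of queries after every prompt change. The sketch below is only an illustration: the `CASES`, the canned `fake_bot`, and the `run_regression` helper are hypothetical, and in practice `ask_bot` would be replaced by whatever function actually calls the bot.

```python
from typing import Callable

# Hypothetical regression cases: phrases each reply must and must not contain.
CASES = [
    {"query": "Who is Brent Jensen?", "must_include": ["confidence"]},
    {
        "query": "Compare Alan Page with another expert.",
        "must_include": ["confidence", "comparison"],
        "must_exclude": ["as an ai language model"],
    },
]


def run_regression(ask_bot: Callable[[str], str], cases: list[dict]) -> list[str]:
    """Run every case through ask_bot and return a list of failure messages."""
    failures = []
    for case in cases:
        reply = ask_bot(case["query"]).lower()
        for phrase in case.get("must_include", []):
            if phrase.lower() not in reply:
                failures.append(f"{case['query']!r}: missing {phrase!r}")
        for phrase in case.get("must_exclude", []):
            if phrase.lower() in reply:
                failures.append(f"{case['query']!r}: unexpected {phrase!r}")
    return failures


def fake_bot(query: str) -> str:
    """Canned stand-in for the real bot, used only to exercise the harness."""
    return "This is a relative comparison. Confidence: 70% - based on public posts"


print(run_regression(fake_bot, CASES))
```

Keyword checks like these are crude, and generative replies vary from run to run, so a harness like this catches only gross regressions. Reading a sample of the responses is still needed.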
Here is the bot if you want to virtually talk to Brent, Alan, or any testing expert yourself: https://chat.openai.com/g/g-f802Av5xo-expert-testers
Here is the transcript of a quick chat I used to validate the impact of some of these changes: https://chat.openai.com/share/df776487-b183-42fe-a525-d9b1e0ae81f4
Cheers, and thanks for the feedback you two — keep it coming!
— Jason Arbon, TestNerd