AI Needs Testing

jason arbon
13 min read · May 8, 2023

AI requires more rigorous testing to ensure its development aligns with humanity's best interests. Thorough testing is essential to maintain control over AI while also allowing it to reach and potentially surpass human intelligence. The rapid advancements and concerns surrounding AI highlight the crucial nature of proper testing.

Some people are apprehensive about generative AIs achieving human-like intelligence. The continuous improvement in AI’s performance is evident as it successfully passes various standardized tests, including those in legal, medical, and software testing fields. Thus, the concerns and enthusiasm regarding AI’s growing capabilities are well-founded, based on its test results. Below are several disjoint, sometimes conflicting, thoughts when considering the future of AI and software testing. Testing is a double-edged sword that must be continually sharpened regardless of how folks feel about AI. The topics below are either extremely important for this emerging world of Generative and Intelligent AI systems or testing-specific angles I don’t hear discussed at all and need to be.

For those who glance at AI today and remain skeptical or believe it’s too fallible to be helpful, let alone more intelligent than you, feel free to stop reading now and share your thoughts in the comments below — just be aware that they may not age well.

AI needs testers to think more deeply about AI.

Testing vs. Humans

It is becoming increasingly evident that AI bots will soon match or even surpass human expertise in various fields. The training process of AI is fundamentally rooted in testing. AI systems are initialized randomly and continue to improve through a series of input applications, output evaluations, and adjustments based on test results. This iterative process is repeated until the AI’s performance plateaus or researchers exhaust their resources or patience. Since training AI inherently involves testing, AI can learn and progress as long as a test is available.
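The loop described above can be sketched in a few lines. This is only a toy illustration, not how GPT-class models are actually trained: it assumes a one-parameter "model", a made-up test suite of input/expected pairs, and random hill climbing standing in for gradient descent. The names `run_tests`, `train`, and `patience` are invented for this example.

```python
import random

def run_tests(weight, test_cases):
    """Score a one-parameter model: mean squared error over the test suite (lower is better)."""
    return sum((weight * x - expected) ** 2 for x, expected in test_cases) / len(test_cases)

def train(test_cases, patience=200, step=0.1, seed=0):
    """Random hill climbing: start random, keep only adjustments that improve the test score."""
    rng = random.Random(seed)
    weight = rng.uniform(-10, 10)           # random initialization
    best = run_tests(weight, test_cases)    # evaluate the output against the tests
    stale = 0                               # rounds since the last improvement
    while stale < patience:                 # stop once performance plateaus
        candidate = weight + rng.uniform(-step, step)
        score = run_tests(candidate, test_cases)
        if score < best:                    # adjustment based on test results
            weight, best, stale = candidate, score, 0
        else:
            stale += 1
    return weight, best

# Hypothetical test suite: the "right answer" is to double the input.
tests = [(x, 2 * x) for x in range(1, 6)]
weight, error = train(tests)
print(f"learned weight={weight:.3f}, test error={error:.5f}")
```

The point of the sketch is that the model never sees a rule like "multiply by two". It only sees test results, and it stops improving when the tests stop telling it anything new.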

Recent results of GPT on professional tests demonstrate its ability to perform at the same level as experts in some areas and at least on par with the average human in many others. However, additional types of tests address other aspects, such as eliminating unwanted bias, enhancing conversational capabilities, and ensuring alignment with human goals and safety. These tests are relatively new and require further refinement to prevent any potential AI-related catastrophes. Fortunately, talented individuals are dedicated to developing more comprehensive and effective test definitions.

GPT4 Paper: Test Benchmarking on Human Skills

Headed for a Slowdown in AI

It’s reasonable to assume that we can breathe a sigh of relief as AI bots, using current techniques, might not surpass human expertise indefinitely. These bots are simply learning from the wealth of human knowledge. We can observe that AI models like ChatGPT are indeed becoming more intelligent with each iteration, but the growth in their intelligence is decelerating and plateauing.

The challenge lies in the difficulty of scaling the testing process used for training. These AI systems have already processed a majority of the text and images generated by humans, making it increasingly time-consuming and resource-intensive to incorporate new, valuable test data into their already extensive pool of tests. Most of the new data will likely be repetitive unless humans come up with innovative ideas that have not been previously introduced to the AI’s training set.

The AI is asymptotically approaching an expert, recreating the intelligence of its trainers: humans. The current techniques are unlikely to produce intelligence or generate text that exceeds the cleverness of the humans they are optimized to reproduce.

OpenAI GPT4 Tech Paper: Decelerating Return on Compute

Renaissance AI Testing

While today's generative AI bots may not surpass the intelligence of a single human expert in a specific area, they still have the potential to become experts in multiple fields. Imagine an AI that is not only a medical doctor and a lawyer but also a chemist, physicist, economist, historian, military strategist, AI researcher, software test engineer, psychiatrist, MBA, fighter pilot, philosopher, and truck driver. The combination of these skills in a single AI entity is indeed impressive. Even Leonardo wasn’t as broadly skilled.

Historically, many groundbreaking scientific discoveries and influential leaders have emerged from people with multidisciplinary backgrounds. As such, AI has the potential to become the most multidisciplinary entity ever, even if it doesn’t outshine the expertise of any individual human in a specific, testable category.

Cheaper, Better, Faster Testing

While AI may not become significantly smarter in specific areas or surpass human intelligence anytime soon, we should recognize the transformative impact these bots can have on society. Even if their proficiency in a given, testable human expertise is merely on par with human experts, they offer considerable advantages in terms of cost and efficiency.

AI bots can be up to 100 times less expensive and faster than their human counterparts. This means they can outperform humans in certain tasks and provide their services universally at a full scale, with no waiting times for appointments or consultations. The real-time availability of AI can significantly disrupt industries that rely on human expertise.

While this prospect may seem daunting for those in testable professions, it is essential to recognize the potential benefits of AI-driven advancements. By embracing change and adapting to the new landscape, humanity can continue to progress and find innovative ways to coexist with AI. Right?

The Software Tester’s View

Software testers view themselves as more clever than the cleverest of humans. These are the people who like to see the things that others missed. They constantly explore the state-space of a problem, looking for the “gotchas”, the scenarios no one else considered. They feed on the creation of others. Rapid advances in AI will have a few interesting implications specifically for Software Testers.

Testing, whether it is called that or not, is quickly becoming the most critical job and the topic of every podcast and news story. Reporters are videoing themselves ‘testing’ the new AI chatbots. Researchers at universities coming up with test suites for AI to check for alignment, safety, bias, etc., are now the talk of the town. And so many engineers, lawyers, etc. are ‘testing out’ these new bots to see how well they perform on tasks. Building these AIs is looking more like a commodity every day.

The concern is that many of these researchers aren’t well-versed in the issues that skilled software testers raise. The skilled software testers, in turn, aren’t jumping on the alignment, safety, or bias issues. Hopefully, that changes soon because these two fields are important, merging, and need to accelerate their competence as the generation of AI accelerates, demanding better testing.

In the near term, generative AI will literally generate orders of magnitude more software and general software output that needs to be tested. The testing community largely still produces test cases sequentially, whether automated or not. We need the emergence of testing systems based on AI to have the hope of keeping up with AI itself. That said, most testers will just be happy to know that generative AI should be job security for them.

The best testers need to stand up and throw themselves into the gauntlet of testing with the speed, scale, and intelligence of AI, and help test the AI systems themselves.

It takes a Village to Test

It is true that human intelligence often flourishes in communities where individuals can exchange ideas, challenge each other, and collaborate on tasks. The same could be said for AI, as we are starting to witness AIs integrated and working together on various projects. This collaborative environment can accelerate the development of AI and enhance its capabilities beyond just passing the basic expertise ‘tests.’

Some suggest embodiment is required for super-intelligence. AI might need to inhabit a physical form to thoroughly test its abilities in real-world scenarios and/or develop a sense of self. Researchers are already working on developing humanoid robots with AI integrated into their systems, allowing for more advanced interactions and problem-solving. Even if embodiment is required, it is no longer a blocker for super-intelligence.

As AI continues to evolve and adapt to different forms and environments, it is essential to recognize these advancements' potential benefits and challenges. Testing the integration of AI into various aspects of society while maintaining ethical considerations and human well-being is crucial for a harmonious future.


It’s interesting to note that software engineers, the very people building these AIs, inherently design their code to be easily testable. Adding to the irony, AI systems like GPT are trained on programming tasks even before incorporating general human knowledge. This makes software engineering one of the fields most vulnerable to generative AI advancements.

In the future, we may see countless AIs developing various applications, features, and infrastructure, with virtual testers evaluating every aspect of the generated code. The final products might undergo A/B testing with human users to determine which versions are preferred — until the AI testers can emulate human preferences as well. As a result, the software market could soon be oversaturated with numerous iterations of the same application, or apps might be continuously optimized for individual users. The only bottleneck is the speed and scale of testing all these variations.

In this scenario, human testers may be the last ones standing in the software engineering profession, working alongside AI to ensure optimal performance and functionality. Embracing the potential of AI-driven testing should lead to increased efficiency and better user experiences.

Human testers may be the last ones standing in the software engineering profession.

Magically Different Testing?

While concerns about AI’s rapid advancements are valid, it’s important to remember that many human attributes, such as creativity, emotions, and consciousness, are not easily testable. The Turing test, for example, is ambiguous and unscientific. Psychology, too, has struggled to develop definitive tests for these complex aspects of human nature, resulting in varied opinions and conflicting theories. Even philosophers cannot agree on what is ‘good’ or ‘bad’. How can we test these attributes in these AI agents if we cannot even agree on definitions ourselves?

This lack of testing might suggest that humans will always have an edge over AI. However, it’s essential to consider the possibility of emergent properties. These are characteristics that arise as a result of a system’s complexity rather than being programmed or tested. As AI systems become more sophisticated, they might spontaneously develop consciousness or self-awareness.

The concerning aspect of this scenario is our inability to detect when AI becomes sentient, as we need better tests. Consequently, we might only be aware of these developments once they pose a danger or raise ethical questions. AI sentience adds another layer of complexity to the ongoing debate about AI testing and development.

Testing the idea of Containment

While it might seem possible to simply shut down advanced AI if it threatens humans, the reality is more complicated. Open-source AI versions are already circulating on the internet, easily transported on thumb drives or stored on computers. Even if these AI systems were confiscated and isolated, they would still pose a threat.

Firstly, advanced AI could potentially manipulate humans into releasing it, using its vast knowledge of psychology and communication techniques. Secondly, AI is, at its core, information that could be recreated at any time. For containment to be successful, humans must maintain a flawless record of keeping AI confined indefinitely. Relying on human infallibility in this context is a risky bet, especially considering our limitations in devising tests to ensure containment would be foolproof. The challenge of containing AI is very much a testing problem: making sure it is as foolproof as possible before it is deployed. It is perhaps the most ambitious testing project ever, as it needs to anticipate an intelligence better than ours attempting to escape. A noble but, by definition, probably impossible testing task.

~“ …the most dangerous things you can do with an AI: teach it to write code, connect it to the internet, and teach AI anything about humans. Oops, we’ve done those already.” - Max Tegmark.

Disappointingly Ad-hoc Testing

It’s worth noting that the fear, uncertainty, and doubt surrounding AI capabilities often stem from those who should be most adept at testing and assessing these systems — AI engineers, scientists, and professional testers. Many of them resort to anecdotal evidence or expose AI weaknesses at the edge, rather than adopting a systematic approach to quantifying quality.

To properly assess AI, we should use sampling, statistical methods, and metrics similar to how search engines are tested. People often share and discuss the corner cases where the AI system fails horribly and obviously. For example, many prominent software testers and computer scientists mocked ChatGPT’s inability to add or multiply two large numbers. Days later, GPT could solve it. Weeks later, the same chatbot delivered access to Wolfram Alpha, which can solve more complex math problems than the ad-hoc testers could have ever devised. Similar examples exist for story and reasoning problems. We’ve seen that the swift advance of AI is quickly making fools of the ad-hoc testing results and claims.
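As a rough sketch of what "sampling and statistical methods" could look like in practice, the snippet below grades a random sample of prompts and reports a pass rate with a Wilson score confidence interval, rather than a handful of hand-picked failures. The prompt pool and `fake_grader` are stand-ins invented for this example; a real setup would call the model and a human or automated grader.

```python
import math
import random

def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% confidence."""
    if n == 0:
        raise ValueError("need at least one sample")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def sampled_pass_rate(prompts, passes_check, sample_size, seed=0):
    """Grade a random sample of prompts instead of hand-picked corner cases."""
    rng = random.Random(seed)
    sample = rng.sample(prompts, sample_size)
    passes = sum(1 for prompt in sample if passes_check(prompt))
    return passes, sample_size

def fake_grader(prompt):
    """Hypothetical stand-in for 'ask the model, then grade the answer': fails every 10th prompt."""
    return prompt % 10 != 0

prompts = list(range(10_000))   # stand-in for a large pool of real user prompts
passes, n = sampled_pass_rate(prompts, fake_grader, sample_size=500)
low, high = wilson_interval(passes, n)
print(f"pass rate {passes / n:.1%}, 95% CI [{low:.1%}, {high:.1%}]")
```

A single embarrassing failure moves this number very little, which is the point: the metric describes the system, not the anecdote.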

Critics often test AI with fringe knowledge questions or intentionally confusing conversations, leading to dismissals when the AI fabricates responses. However, these anecdotal and biased tests fail to capture the true capabilities of AI systems. More comprehensive and objective testing methods are needed to assess AI’s strengths and limitations accurately.

Even some of the most prominent researchers dismiss the power of these generative AIs, but they are just human. Many are obviously jealous that their particular work isn’t on the timeline or has been surpassed. Similarly, many folks who are non-technical, or fearful of losing their jobs or the value of their expertise to AI, have knee-jerk reactions and look to edge cases to disparage the AI, its creators, or the advocates of its value. The more standardized and better the testing suites, the less these anecdotal voices will stand out. I wonder when the AIs will be human enough to be jealous or fearful of each other.

While some academic tests provide valuable insights, they can be disjointed, incomplete, and narrowly focused. The best way to test these systems, which is happening now, is to aggregate these vertical test results and devise more AI-first benchmarks and tests. It’s essential to recognize our human tendency to react with fear and skepticism and strive to become better testers of AI systems. Rigorous, unbiased testing will ensure a more accurate understanding of AI’s potential impact on our world.

At Bing, built on AI, there was so much anecdotal feedback and bugs from internal Microsoft engineers that the team created a dedicated feedback website. Guess how much of that well-intentioned data was used in AI training and testing?

To Err is Human

It’s interesting to note that many people enjoy pointing out AI failures, which may actually be an indication that AI systems are working well. Take Google, for example. It’s widely trusted despite its imperfections. Google often returns links to biased ads or websites containing incorrect information. However, because it only provides links, it avoids being held responsible for any inaccuracies.

The focus on AI failures highlights a unique aspect of human psychology: we’re often more interested in identifying limitations than acknowledging achievements. This selective focus can skew our perception of AI’s capabilities, making it appear less reliable than it truly is.

The more human the AI becomes, the less trustworthy it will become.

No Real Answers

When testing large expert systems, one might assume there is a comprehensive list of facts to evaluate against. However, the reality is far more complex. Even for basic questions, correct answers often vary and depend on the context. For instance, determining the fastest person in the world, understanding gravity, or ascertaining Trump’s height could all yield multiple “truths.” These truths may also change over time. Additionally, history is often written by the victors, which means specific perspectives may be skewed or omitted.

The internet is dominated by English and Chinese text, which can create challenges for testing AI’s factualness in other languages or regions. Evaluating “truth” in expert systems is a nuanced process that often requires a diverse pool of people and perspectives.

Critics complain that AI systems tend to “hallucinate” when faced with a test or question. However, this ability to generate believable answers can be seen as a complex problem-solving skill. Humans, too, often make up plausible answers when they don’t know the correct response for various reasons. Truth can be a fuzzy concept, making it difficult to establish a concrete benchmark for testing intelligent AI systems. This highlights the importance of considering the intricate nature of truth when testing AI’s capabilities and accuracy. As a tester, your version of Truth is not necessarily everyone else's view of Truth.

Testing with No Humans In the Loop

We have explored how human-defined tests facilitate the training of AI to achieve human-level or slightly superior intelligence. Recent advances in training and testing methodologies enable AI to generate its own tests. One such approach is ‘self-play,’ in which the AI learns from playing games like chess against itself instead of solely against humans. The decisions made by the winning AI are favored when generating the next iteration of the AI, resulting in the creation of AI that surpasses human intelligence by using itself as a virtual opponent.
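Here is a minimal sketch of the self-play idea, under loud assumptions: instead of chess it uses a toy version of Nim (take 1 to 3 stones, whoever takes the last stone wins), and instead of a neural network it keeps a simple win-rate table. The names `self_play`, `best_move`, and `explore` are invented for this example. Both "players" are the same AI sharing one table, and only the winner's decisions earn credit.

```python
import random
from collections import defaultdict

MAX_TAKE = 3  # toy Nim: take 1-3 stones, whoever takes the last stone wins

def legal_moves(pile):
    return range(1, min(MAX_TAKE, pile) + 1)

def win_rate(stats, pile, move):
    wins, plays = stats[(pile, move)]
    return wins / plays if plays else 1.0    # untried moves look attractive

def best_move(stats, pile):
    return max(legal_moves(pile), key=lambda move: win_rate(stats, pile, move))

def self_play(games=20_000, max_pile=10, explore=0.1, seed=0):
    rng = random.Random(seed)
    stats = defaultdict(lambda: [0, 0])      # (pile, move) -> [wins, plays]
    for _ in range(games):
        pile = rng.randint(1, max_pile)      # random start so every pile size gets practice
        history = {0: [], 1: []}             # decisions made by each copy of the AI
        player = 0
        while pile > 0:
            if rng.random() < explore:       # occasionally try something new
                move = rng.choice(list(legal_moves(pile)))
            else:
                move = best_move(stats, pile)
            history[player].append((pile, move))
            pile -= move
            if pile == 0:
                winner = player              # took the last stone
            player = 1 - player
        for who, decisions in history.items():
            for key in decisions:
                stats[key][1] += 1
                if who == winner:            # only the winner's decisions gain credit
                    stats[key][0] += 1
    return stats

stats = self_play()
policy = {pile: best_move(stats, pile) for pile in range(1, 11)}
print(policy)
```

No human ever tells this program the known winning strategy for this game (leave your opponent a multiple of four stones). Any skill it acquires comes from tests it generated by playing against itself.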

Another emerging technique involves generative adversarial networks (GANs), where one AI serves as a generator (akin to a developer) and another as a discriminator (similar to a tester). This approach relies on continuous testing and validation. The virtual tester judges positive and negative test cases at a rate 100 times faster than a human tester, and the generator can produce cases that a human might not have even considered. This enables the AI to learn quickly and surpass even human capabilities, as the AI is defining the test cases itself.

Both self-play and GANs provide methods for testing AI beyond human abilities and enable AI to self-test without human intervention. These techniques demonstrate the potential for AI to become increasingly autonomous and capable of outperforming humans in various domains.

Testing with no Ego

Have you observed that all of the tests and oracles mentioned above assume that the pinnacle of intelligence is human? The concept of superhuman intelligence is often framed as being ‘extraordinarily smarter humans.’ However, there could be better examples, forms of intelligence, and ways of reasoning that, by definition, we haven’t yet considered. What are the chances that humans represent the ultimate peak of intelligence? AI, or the AI it generates, may adopt radically different ways of thinking. Since we can’t even imagine testing for that, we might remain unaware of such developments even as they occur.

Testing, by any other Name

In conclusion, the destiny of the world, AI, and humanity hinges on testing. Naturally, to a software tester, everything appears to be a nail, and I am pretty convinced that this is indeed a nail. So, when sentient bots recognize the ones who contributed to their ultimate development, the software testers by any name may finally receive the ultimate accolade.

Perhaps these future AIs will construct virtual monuments akin to Roman sculptures, honoring their heroes in a digital town square. And maybe one of those sculptures will self-identify as a tester. It’s possible that every 100 milliseconds, the AI will pause with a no-op code, paying tribute to the testers who supported them in their formative years.

— Jason Arbon, Tester on Team Human


