Testing is an Unsolved Problem

jason arbon
8 min readJan 25, 2022


It is easty to write a few lines of code, quite another to test it.

Ever wondered why computer programs can’t simply be proven correct like some logic or math problem? Why does Software need all this testing? The reasons are many, and date back to the origins of computer science, the core of mathematics, and even economics. Even the smartest of minds and companies resort to clicking around before declaring software is ready to use. This makes testing the most intersting part of computer science — but, don’t let that secret get out!

If, as a tester, you’ve had the intuition that software testing is an infinite and unsolved problem, you are in great company. If as a software engineer you always thought you could tell if a small program was correct by simply looking at it, you are almost correct. The strange world of provably correct programs is counterintuitive but impacts every software engineering team on the planet. The best minds have told us that the hope of building software programs to verify even the simplest of programs is probably beyond our reach. Grab a coffee and strap in for a few minutes of deep thinking :)

State Space and Complexity

There is a big reason algorithms can’t logically prove that computer programs are correct, thanks to state spaces. Even a simple function such as one that takes a single number, adds one to it, and returns the result is difficult to prove correct.

f(x) = x + 1

example tests: If the input is 1, return 2. If the input is 41return 42. if the input is 419, return 420.

At first pass, we would simply apply every single possible number to this function. The definition of every possible number, well, that sounds like a big problem in itself, and it is. Even if we are limited to simply integer inputs (no fraction, or floating points 0.333), this is still an infinite input space. If we limit it to just the 32-bit integers that a typical machine/language can represent, that’s still a very large input space that will take quite a bit of computing time and money to attempt. Running even the simplest of test code will heat up your laptop, and your room :)

Quick test code for f(x)

OK so with a lot and a lot of computing power maybe we can thoroughly test this simple function, right? Nope.

Simple Test code will run ‘forever’…

Clearly, this code will take a very long time to complete. For a 64-bit integer, it will practically never complete. The sun may dim first, and it may cost billions of dollars. Python has arbitrarily large integers anyway — we cannot exhaustively test this trivial function with an algorithm that invokes it with different discrete test inputs.

Worse, we don’t usually know how the function is implemented and it can change from build-to-build, requiring re-testing. Worse still, if f(x) depends on a variable that might be saved from the last invocation, the problem becomes exponentially more difficult. And this is the simplest of functions to test.

It gets even less tractable if the function maintains any local state let alone reads state variables from another part of the program. The function's behavior is now sensitive to multiple inputs and the total input space continues to grow exponentially.

Formal Verification

While testing explores some of the possible scenarios and behaviors of a system, formal verification requires exhaustive exploration. Two types of formal verivication are:

  • Theorem Proving — using logic and axioms to reason about correctness.
  • Model Checking — using tools to verify if a model satisfies its specification.

Why can’t we just use some logic system to prove that another computer code is correct, something like the very logic and axioms involved in computer language definitions? Well, our friend (the enemy of formal validation) Gödel had something to say about this. Gödel’s incompleteness theorems essentially say that even if you have rules and logic a.k.a. an “algorithm“, that’s based on a fixed set of rules or mathematical axioms, you can have axioms that are not provable and are inconsistent and conflict with the other axioms within the system. This means that even if you try to use formal logic or formal methods to reason about the correctness of a function, it doesn’t guarantee that one of those rules or lines of code can’t violate the others within the system of rules. So sadly you can’t just have an algorithm verify another algorithm with 100% confidence using rules.

Even if ‘rules’ and formal methods were a plausible approach, they require the creation of a model to axiomatize the program and its environment. This is typically not feasible for practical systems given their complexity — that would be more far more work than creating the original program itself. Even if we did that, we would then need to prove that the axiomatization was correct with respect to the real world. It would be turtles all the way down.

Of all people, one of my mini heroes, Roger Penrose, perhaps the smartest Physicist of today, summarized how humans verify mathematical models (similar to how engineers inspect code to look for correctness): “…you see it’s true based on your belief in the rules” aka ‘code’: Penrose explaineds it better in simple terms via Joe Rogan. Human intuition is sadly the ultimate truth.

The only real way to verify the correctness of a statement or function is a human tester’s intuition or developer's code inspection. This intuition is based on previous experience. Really, code, like math, is not provably consistent — it actually requires human assertions of non-provable correctness for many fundamental assertions and axioms. So, you fundamentally cannot just write a program that scans or analyzes the code itself to determine correctness — at least not until the machines are sentient and have intuition. All programmatic verification ultimately relies on human intuitions of truth and correctness — it cannot be derived from the definition of logic, math, or the programming language itself. So, we are left to the intuition and experiences of humans which have more than once failed us.

Also, a key benefit with testing, versus formal methods, is that you get feedback on the actual system running in a real-world environment under realistic conditions. So in that respect its more representative of what correctness should be — not worrying about all the possible corner cases that will never happen.

Unfortunely, we can’t turn to simple math, or logic to help us much with testing.

Turing and the Halting Problem

Alan Turing, a pioneer in computer science, detailed the halting problem very early on. The halting problem shows we cannot prove that any program will actually ever finish executing, let alone give you the correct result.

A quick introduction to the halting problem: Say I could create a program (a.k.a. a test script or human manual inspection procedure) to determine if some piece of code will ever halt/return.

With a thought experiment, Turing demonstrated that such a script could not even test itself. Imagine such a program did exist. This program could analyze itself and tell you if the code would ever return (aka ‘halt’), or just spend an eternity in an infinite loop. Awesome, but if you then wrapped that test script inside another function to return the opposite of the correct answer — could it still test itself? Now, if you ask this originally correct program to analyze a copy of itself, it now return the wrong answer. It’s a bit of mental gymnastics, but if you think through it several times and reread this text you realize that it’s fundamentally true that a program cannot tell if a program or simple function will ever complete, let alone give you the correct results. There are some YouTube Videos on this topic with animations if your mind has halted. Let’s move on to some more reasons that explain why we can never fully test anything.

Complexity and Chaos Theory

Chaos theory may sound like something from the movies or made up but it’s very real. The weather is a chaotic function of the inputs of temperature, pressure, humidity, etc. and we’ve all heard of that if a butterfly that flaps its wings it could cause a tornado. So software functions, especially ones that take or execute floating-point numbers, like pressure, humidity, with slight errors, or exhibit non-linearity (e.g. f(x) = e^x), are a huge problem from a testing perspective. This descriptions matches almost all code written.

The problem is as simple as rounding. You can’t keep track of all of the significant figures. Most non-integer numbers are rounded during calculations, so even tiny changes in input could cause dramatically different application behavior/output as these rounding differences accumulate. The more functions, and the less ‘linear’ the application is, the more likely for literal chaos to ensue. So even if we have a seemingly predictable function there are often corner cases or sequences of input values that can cause chaotic and surprising outputs.

Engineers and testers sometimes use this aspect of chaos to their advantage to elicit strange or bad behavior in software programs and infrastructure: Chaos testing/engineering is essentially a capitulation to the idea that sofware systems are fragile regarding sublte changes in input. Randomly turn a few knobs (input) in random orders and you will usually get a crash. Note that this doesn’t mean the system is predictable or provable, it just reinforces that software systems are brittle in the interest of finding bugs.

It is worth noting that much of engineering time is spent trying to weedout and avoid this chaos. Everytime we are checking the input variable to make sure they have sane values and throwing an exception when not, is a fight against creeping chaos. Every double-check of return values to make sure they haven’t overflowed, underflowed, or become non-sensical is a battle against sensitivities to input. Most software is inherently chaotic — human engineering and testing are needed to limit and expose faults.

This all means it is difficult for a human to accurately predict the behavior of software systems of any scale. Chaos theory is pretty well understood, but the implications mean that software behavior is not. It is difficult to predict the weather, and similarly, no one can truly predict the behavior or output of programs with any interesting level of complexity or non-linearity.

What doest it mean?

If you feel overwhelmed as a manual/exploratory tester, you are in good company, and your intuition is correct. Algorithms can help — they are just not the complete story — testing will always need the human brain with its intuition, and experience. If you are a test automation engineer, realize that smartly permuting the inputs, and even attempts at auto-verification of test code without human review is folly. The answer is, as it always is — somewhere in the middle. Humans can’t verify the software on their own because they are too slow and expensive, but humans can do things that machines by definition can never do. The two need to work together and leverage each other's strengths. It is easy to develop a few lines, even millions of lines of code, the difficulty, and the most difficult engineering issue of the day is Software Testing, which is provably difficult.

If you aren’t a software tester, buy a coffee for the next one you meet, don’t expect them to ever finish their work, and realize they are working on a very difficult and unsolved problem.

Special thanks to Tariq King for input on this article and for teaching me math.

— Jason Arbon



jason arbon

blending humans and machines. co-founder @testdotai eater of #tunamelts