ChatGPT Sucks for Test Automation
I don't recommend friends use ChatGPT/AI for test automation. This might seem counterintuitive coming from me. For most testing projects, the time spent learning how to practically leverage AI will probably outweigh the benefits. It might even drive you mad. For most test automation engineers, AI will be more of a trap than the promises of traditional test automation ever were, and will likely deliver even lower ROI. That said, I personally enjoy the struggle, and I'm happy to share some scars.
Simply Deceptive
The API appears really simple. In its basic form, it's a function that takes an input string and returns a string. Yes, there are a few optional parameters for things like the temperature (controlling the amount of 'randomness/creativity' in the response), and you can pick which 'model/brain' you want to use, but it's basically a string input and a string output. If you want to see all the optional knobs, check out the docs, but there still aren't all that many parameters of interest to most users.
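For the curious, here's a minimal sketch of that string-in, string-out call, assuming the official openai Python client (v1+); the model name, the ask() helper, and the prompt are just illustrative, not recommendations:

```python
# Minimal sketch: one prompt string in, one response string out.
# Assumes OPENAI_API_KEY is set in the environment; the model name is illustrative.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str, temperature: float = 0.0, model: str = "gpt-4o") -> str:
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,  # lower values mean less 'randomness/creativity'
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(ask("Generate three boundary-value test inputs for an age field."))
```

The later sketches in this post reuse this hypothetical ask() helper.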
This API seems amazingly simple, right? But the problems are just starting. Just like the manual tester who wants to ask ChatGPT to create some test cases or some data variations, the test automation engineer is presented with an endless possibility of text inputs. The AI API, like humans, is ironically the simplest, most intimidating, most powerful, and most dangerous API ever created, all at the same time. Test automation engineers often try a few things at first:
Baby Talk
Just like with ChatGPT, automation engineers can send in a string that asks the AI to generate test code. Sounds like a good idea, but things quickly go downhill from there. The test automation engineer will usually first ask for test code to click a button or enter some text. The API will oblige, but it will either hallucinate selectors or just say 'smart selector goes here'. Same with the URL to use when launching the application. This is not that helpful. The test engineer still needs to figure out what the selector should be (the same brittleness as traditional automation) and fix up this code.
Worse, the generated code will have all the import statements and even initialization of the framework to be used, so that the code could be copy-pasted into a *new* file and executed, but the test developer already has code that doesn't need this boilerplate. Oh, and most tests require some setup code that the engineer has already written somewhere, but the AI doesn't know about it. Though an amazing demo at first glance, the AI gives simple answers to simple questions and doesn't know the context of the engineer's codebase or project. It is often more work to ask for code than it is to simply code it up yourself. Really.
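To make this concrete, the response tends to look something like the hypothetical snippet below; the URL and selector are the AI's inventions, not anything from a real project:

```python
# Hypothetical example of AI-generated test code: it arrives with its own
# imports, its own driver setup, a made-up URL, and a placeholder selector
# the engineer still has to replace.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://your-app-url-here.example.com")  # hallucinated URL
driver.find_element(By.ID, "smart-selector-goes-here").click()  # placeholder selector
driver.quit()
```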
Overcommunicate
Persistent and clever test automation engineers will then try to stuff as much context as they can into the input string, a.k.a. "prompt". They'll stuff the selector values, they will tell it to insert a magic setup function at the top, maybe tell it not to instantiate the framework, and often ask it to add logging functions specific to their execution infrastructure. And, this will take many iterations to get correct — but that persistent automation engineer can eventually coax the AI into doing something close to what they want.
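In practice, that over-communication ends up looking something like this sketch, where the selector, setup helper, and logging function are hypothetical stand-ins for whatever the project actually uses:

```python
# A sketch of a context-stuffed prompt; every project-specific detail here
# (the selector, common_setup, log_step) is a hypothetical placeholder.
prompt = """
Write the body of a Python Selenium test that clicks the checkout button.
Rules:
- Use the CSS selector: button[data-test='checkout'].
- Assume a 'driver' already exists; call common_setup(driver) first.
- Do NOT include imports or framework initialization.
- Log each step with log_step("...") instead of print().
- Return only code, no explanations.
"""
generated_snippet = ask(prompt)  # the hypothetical ask() helper sketched earlier
```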
Success? No, this was simply clicking a single button. Most test automation requires a series of actions and verifications, so the issues above compound, at best linearly: more lines of code means more selectors, more logging, and more coaxing of the AI.
This process has to be repeated for almost every test case written. Yes, things can be templatized, but now we're starting to look like some of the worst test code written in legacy languages :-). At least the legacy languages are super repeatable, super predictable, and super debuggable; with AI, you never quite know what you're going to get. Our test automation hero copies the individual test cases into a really large Java file and runs these test scenarios in production. Was this flow more efficient than simply copying/pasting their test case boilerplate code, or creating Russian dolls of classes to specialize each test case and reuse as much setup and teardown code as possible? I doubt it.
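For what it's worth, the templating usually ends up as something like this sketch (reusing the hypothetical ask() helper from earlier); all the paths, selectors, and titles are made up:

```python
# A sketch of templatizing prompts per test case; the values below are
# hypothetical placeholders, not a real project's data.
PROMPT_TEMPLATE = """
Write only the body of a test function (no imports, no class) that:
1. Calls common_setup(driver).
2. Navigates to {path}.
3. Clicks the element matching the CSS selector {selector}.
4. Asserts the page title contains "{expected_title}".
Return only code.
"""

test_cases = [
    {"path": "/cart", "selector": "button#checkout", "expected_title": "Checkout"},
    {"path": "/login", "selector": "button#signin", "expected_title": "Sign In"},
]

for case in test_cases:
    print(ask(PROMPT_TEMPLATE.format(**case)))
```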
EVAL()
Test automation engineers who fancy themselves clever might think they could just use an eval() function and execute the generated code on the fly, avoiding copy-pasting the code from the AI output into their code files. Those who dare will find that their code may run correctly at first, but put it in a loop, or just wait a few days, and the tests will break. Worse than traditional automation, the tests will break not just because of product changes or intermittent infrastructure and timing issues, but because the AI sometimes likes to return different responses for the same input. This is a test engineer's nightmare. With some basic math, they quickly realize that normal "flakiness" + "more flakiness" = "even more flakiness". What if the temperature is set to zero, a value that should minimize the variability in responses? The engineer still gets different responses from time to time, unpredictably. We will ignore the difficulty of debugging dynamically generated and eval()'d code, which definitely doesn't help either.
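For the record, the shortcut amounts to something like the sketch below (Python's exec() standing in for whatever dynamic-execution mechanism is used, and reusing the hypothetical ask() helper from the first sketch); even at temperature zero, the same prompt can produce different code on different days:

```python
# A sketch of running AI-generated code on the fly. Temperature 0 reduces,
# but does not eliminate, run-to-run variation, so this can pass today and
# break tomorrow with the exact same prompt.
generated = ask(
    "Write Python that clicks the checkout button. Return only code.",
    temperature=0.0,
)
try:
    exec(generated)  # dynamically execute whatever came back; debugging this is its own adventure
except Exception as err:
    print(f"Generated code failed: {err}")
```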
Flakiness
The flakiness from the LLMs doesn't just change the generated code; sometimes the model starts talking like an annoying co-worker while you are trying to program. Just like with ChatGPT, if you ask it for code it will give you the code, but it also likes to try to explain the code, in English! That doesn't compile! The AI also likes to add delimiters to separate the code from these English descriptions. The reader is probably thinking: oh well, parse it out with a regex or 'split()'. But no, sometimes it uses different delimiters: '*****', '-----', etc. At some point, if the generated code gets too long, it will only generate the first part of the code and then just stop. Worse still, if the generated code has helper functions, sometimes those helper functions don't have code in them, assuming a human would know what to put in there or that it must be defined somewhere else. Thanks, GPT.
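A typical (and still imperfect) defense is to scrape the code back out of the chatty response, roughly like this sketch; the delimiters handled here are just the ones mentioned above:

```python
import re

# Built this way to avoid writing a literal triple-backtick inside this example.
FENCE = "`" * 3

def extract_code(response: str) -> str:
    """Best-effort extraction of code from a chatty LLM response."""
    # Prefer a fenced code block if one exists.
    fenced = re.search(FENCE + r"(?:\w+)?\n(.*?)" + FENCE, response, re.DOTALL)
    if fenced:
        return fenced.group(1).strip()
    # Fall back to the ad-hoc separators the model sometimes invents.
    for delimiter in ("*****", "-----"):
        if delimiter in response:
            chunks = [c for c in response.split(delimiter) if c.strip()]
            return max(chunks, key=len).strip()  # guess: the longest chunk is the code
    return response.strip()
```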
OK, enough about code. What if we just want the AI to generate test data? We will write smart code to parse/validate the output and make it return the data in a consistent format, say JSON. It's a trap! If you ask GPT to return the information in JSON format, here is what actually happens (a defensive-parsing sketch follows the list below):
- Sometimes you get perfect JSON. Sometimes.
- Sometimes the JSON will be wrapped in delimiters and explanatory text, just like the code.
- Sometimes the JSON won’t escape quotes or other JSON formatting characters and the parsing breaks.
- Sometimes, when you tell the AI to make sure a property is an integer between 0 and 100 — it sets that property value to “medium”. Yeah.
- Sometimes, the darn AI just returns you a bulleted list: it likes lists. Oh, and sometimes those lists are unnumbered, sometimes numbered, sometimes with and sometimes without a '.' after the number. Sometimes the list has a header, sometimes it doesn't. Sometimes it's a comma-delimited list. Sometimes…
- Sometimes it just says “Sorry, I just can’t do this because X”, where X is any number of strange reasons, and there isn’t a finite list of them, let alone consistent strings that you could search for with a regex or substring. Oh, and if you simply retry the prompt — it works.
- Sometimes the JSON properties will be single-quoted, and standard parsers (JSON.parse(), json.loads()) don't like that.
- Sometimes the JSON is randomly truncated halfway through generation — no ending curly braces, closing quotation marks, etc.
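Defending against all of that ends up looking something like the sketch below: strip the wrapping, try to parse, patch the obvious problems, and retry when the model simply refuses. This reuses the hypothetical ask() helper from earlier; the retry count and the repairs are arbitrary choices, not a recommendation.

```python
import json
import re

def get_json(prompt: str, retries: int = 3) -> dict:
    """Ask for JSON and defend against the failure modes listed above."""
    for _ in range(retries):
        raw = ask(prompt + "\nReturn ONLY valid JSON, with no explanations.")
        # Pull out the outermost braces, ignoring delimiters and chatter around them.
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if match is None:
            continue  # a bulleted list, an apology, or nothing at all: retry
        candidate = match.group(0)
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            # Crude single-quote repair; this will mangle quotes inside values.
            try:
                return json.loads(candidate.replace("'", '"'))
            except json.JSONDecodeError:
                continue  # truncated or otherwise broken: retry
    raise ValueError("Could not coax valid JSON out of the model")
```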
“Sometimes” is not a good thing when you want your test automation to be reliable. You can try to tell the AI not to do these things in a long list of 'don'ts', or even threaten the AI in the prompt, but the AI still loves to be flaky.
Fix my code?
OK, well at least GPT might be great at fixing my test automation code. As GPT has gotten smarter for humans, it has become worse at fixing test code. When you copy a large section of code into ChatGPT, GPT may fix your code, but it will often elide parts with an ellipsis for brevity. Instead of generating a full copy of the fixed code, it likes to show only the parts that changed. This may look pretty, but it means several copy/pastes to get the fixed code back into your project. It also means that sometimes you put a snippet into the wrong spot, or accidentally forget one of the snippets.
Worse, GPT likes to give code fixes that aren't based on the latest library or language versions, probably because it has been trained on old data. And even then, most of the code it has seen was probably written against older versions, simply because less time has passed for people to write Stack Overflow posts about issues in the newer syntax. Tip: tell it the version of every library and language you are using.
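Concretely, that tip amounts to pinning the versions right in the prompt, something like this sketch; the versions and the broken test below are placeholders, and ask() is the hypothetical helper from earlier:

```python
# Hypothetical version preamble; substitute whatever your project actually runs.
VERSION_CONTEXT = (
    "Assume Python 3.12, selenium 4.21, and pytest 8.2. "
    "Only use APIs that exist in those exact versions."
)

broken_test_code = 'driver.find_element_by_id("checkout").click()'  # deprecated Selenium call

fix_prompt = VERSION_CONTEXT + "\nFix this test:\n" + broken_test_code
print(ask(fix_prompt))
```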
Testers are often working ‘full stack’, meaning, a few APIs here, some Python there, and maybe even a little Jinja templating too. When you use ChatGPT, it thinks you are in an engaging continual conversation, not just asking it rudely to fix code snippet after code snippet across different languages. It can easily get confused and think that JavaScript snippet is actually Python syntax or that the comment is HTML.
Still Here?
If you made it this far, you just might be cut out for the pain of building test automation with AI. It might not be sensible, or profitable, but it is admittedly fun. This is just a taste of what you will run into. Most of the work in applying AI to test automation is in figuring out ways around these and many other issues. If you can coax the AI into doing your bidding, it can be amazing.
Such a simple API :)
P.S. …and people wonder why I’ve been less than perfectly pleasant on LinkedIn this past year!?!
— Jason Arbon