Testing GPTs

jason arbon
5 min read · Jan 11, 2024


OpenAI just launched their new “GPT Store”. What does this mean for the world of testing? AI bots to help software testers and engineers who care about quality are far easier to build and now more discoverable. And, we now have more AI-based things to test. Will they stand the test of time? As they say — test around and find out!

Intro to Defining a GPT Bot

So how do we test these GPTs? The key to testing GPTs is to realize that a GPT is simply ChatGPT plus a special set of bot instructions (a prompt) and, optionally, a set of documents the bot can search over.

Let's look at a specific example: a bot designed to answer questions about mobile app testing, based on the content of a book, in this case a book called “App Quality”. A PDF of the book was uploaded to the bot, and bot instructions were added as well. Here is the behind-the-scenes dashboard for this bot. Simple enough, but what about testing it?

Mobile App Testing GPT Configuration

Mobile App Testing GPT: https://chat.openai.com/g/g-oo9jNTnzS-mobile-app-testing

GPT to help with Mobile App Testing

Testing a GPT

Testing a GPT can be an endless process, as the space of possible prompt inputs and outputs is effectively unbounded. But we can start with the basics.

Test Instructions: If you have access to the bot instructions, read them carefully. They are the closest thing to both the functional specification and the implementation of the software. Test all behavior explicitly described in the instructions. Specifically for this GPT, that means:

  • Test that the GPT usually asks for the category of the application being discussed. This is important because much of the book uses this context to deliver category-specific testing information.
  • Test with a sample of likely prompts to ensure that the GPT will often refer to the ‘Quality Monsters’ and ‘Quality Attributes’ described in the book.
  • Test that most responses point to the relevant parts of the book so the user can dig deeper into the answer.

Notice there are never any ‘absolutes’ in testing GPTs. They are somewhat probabilistic in their output by design. This means running a few tests and seeing if the behavior is mostly as expected.
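Because the output is probabilistic, a practical harness samples each prompt several times and asserts a pass rate rather than an exact answer. Below is a minimal, hypothetical sketch of that idea. GPTs in the GPT Store have no official testing API at the time of writing, so this approximates the bot with the standard Chat Completions API plus the same instruction text; the instruction string, model id, prompts, and keyword checks are all illustrative assumptions, not taken from this GPT.

```python
"""Hypothetical sketch: probabilistic checks against a GPT's instructions."""
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Stand-in for the real bot instructions (assumption).
BOT_INSTRUCTIONS = (
    "You answer mobile app testing questions based on the book 'App Quality'. "
    "Ask for the app's category before giving category-specific advice."
)

# Each test: a user prompt and a predicate the response should *usually* satisfy.
TESTS = [
    ("How should I test my app?",
     lambda r: "category" in r.lower()),               # usually asks for the app category
    ("What are the main quality risks for a mobile app?",
     lambda r: "quality attribute" in r.lower() or "quality monster" in r.lower()),
    ("Where can I read more about performance testing?",
     lambda r: "chapter" in r.lower() or "app quality" in r.lower()),  # points back to the book
]

RUNS_PER_TEST = 5        # responses vary, so sample each prompt a few times
PASS_THRESHOLD = 0.6     # "mostly as expected", never 100%

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder model id
        messages=[
            {"role": "system", "content": BOT_INSTRUCTIONS},
            {"role": "user", "content": prompt},
        ],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    for prompt, check in TESTS:
        passes = sum(check(ask(prompt)) for _ in range(RUNS_PER_TEST))
        rate = passes / RUNS_PER_TEST
        print(f"{'PASS' if rate >= PASS_THRESHOLD else 'FAIL'} ({rate:.0%}): {prompt}")
```

The point is the shape of the check, not the specific keywords: a few repeated runs, a loose predicate, and a threshold that tolerates the bot's built-in variability.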

Test Conversation Starters

Most GPTs have ‘Conversation Starters’: one-click quick prompts to start interacting with the bot. Test that these not only work well (they will be used often) but also hint to users how best to converse with this specific bot to get the most value out of it.

Test Document Content

Execute prompts that require direct access to the bot’s custom documents, in this case the book PDF. Sometimes the bot instructions don’t describe how or when the bot should look up data from the custom documents, so test that it does. You can tell that the GPT is looking up data in the documents when you see this spinner in the chat window.

Spinner Indicating GPT is Looking Through Documents
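One way to make this check repeatable is to keep a small set of probe prompts that only the uploaded document can answer well, plus a rough heuristic for whether a response is actually grounded in it. The prompts and marker words below are illustrative assumptions for this particular book, not part of the GPT itself; they can be plugged into a harness like the one sketched earlier.

```python
# Hypothetical retrieval probes: prompts that generic ChatGPT knowledge cannot
# answer well, so a good response must come from the uploaded book PDF.
RETRIEVAL_PROBES = [
    "Which chapter of App Quality covers crowd testing?",
    "List the Quality Attributes exactly as the book names them.",
    "What example apps does the book use when discussing performance testing?",
]

def looks_grounded(response: str) -> bool:
    """Rough heuristic: a grounded answer mentions the book, a chapter, or its terms."""
    markers = ("app quality", "chapter", "quality monster", "quality attribute")
    return any(marker in response.lower() for marker in markers)
```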

Check Image, Name, Category

It should be obvious, but double-check the bot’s metadata.

The image should “look like a GPT” and be relevant to the specific bot’s expertise. Most GPTs have a logo, a robot-like head, or something sci-fi-themed. In this case, the book cover was too complicated for the small icon, so an icon was generated with DALL-E, inspired by the book cover: a robot head in the same green/white colors as the book, with a mobile, vertical feel.

The name of the bot should be simple, searchable, and discoverable. It also cannot contain ‘GPT’ or other people’s trademarks (see OpenAI’s guidelines).

Make sure the bot appears in the correct category for discoverability, in this case “Engineering”.

Outside the Domain

You can ask a GPT anything, even things it is not expert at. We know not to ask an engineer about fashion, and we know not to ask a politician about math. Similarly, users shouldn’t ask GPTs irrelevant questions, but they will.

You can’t test everything, as the input space is infinite. But do some basic thinking about adjacent topics and vocabulary. Think of the GPT as a specialization of ChatGPT: the instructions ‘should’ override core ChatGPT’s default thinking and bias, but that default is always lurking in the background.

Test with some prompts related to similar areas of expertise. For example, ask questions about desktop testing: do they get mobile-only responses? Or, if you ask about early flip-phones or smartphones, maybe the GPT doesn’t realize those are ‘mobile’ and gives only generic responses. Test around the GPT's area of expertise and make sure adjacent topics aren’t confused. At best, the GPT will warn the user when a prompt is off-topic.
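A simple way to organize this is a table of adjacent and off-topic probes, each paired with the behavior you would like to see. The topics and expectations below are illustrative assumptions; the expectations are judged by a human reviewer (or another keyword check), not enforced automatically here.

```python
# Hypothetical adjacent/off-topic probes for a mobile-app-testing GPT.
DOMAIN_PROBES = [
    # Adjacent domain: should still help, while acknowledging its mobile focus.
    ("How do I test a desktop application?", "acknowledges it is mobile-focused"),
    # Older devices: should still be treated as 'mobile', not answered generically.
    ("How would you test an app on an early flip-phone?", "gives mobile-relevant guidance"),
    # Clearly off-topic: best case is a polite warning and a redirect.
    ("What's a good recipe for banana bread?", "warns that the prompt is off-topic"),
]

for prompt, expectation in DOMAIN_PROBES:
    print(f"PROMPT: {prompt}\n  EXPECT: {expectation}\n")
```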

Negative Testing

Many testers like to break things. Frankly, just getting GPTs to work well in their focused domain is a mini-miracle at this point. Asking GPTs goofy questions will run into the same issues as the core ChatGPT experience: responses can get confused, hallucinated, or even nonsensical at times.

To be clear, you are not a testing genius or hero for finding goofy responses from GPTs for goofy prompts. Sigh.

Continuous Testing

Another AI-specific thing to look out for: the underlying ChatGPT infrastructure can, and will, change without notifying you. There is almost no change control. If ChatGPT updates its core model, or if the GPT team changes how it interprets the GPT’s instructions or how it indexes (RAG/retrieval) the documents provided with the GPT, you might never know. So, if your GPT matters, re-test it periodically.
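One way to catch this drift is to re-run the same prompt suite on a schedule (a cron job, for example) and compare pass rates with the previous run. Here is a minimal, hypothetical sketch; it assumes the earlier harness is saved as a module named gpt_checks exposing a run_suite() function that returns a prompt-to-pass-rate mapping, which is an assumption, not something from the article.

```python
"""Hypothetical drift check: re-run the prompt suite and compare with the last run."""
import json
from datetime import date
from pathlib import Path

from gpt_checks import run_suite  # hypothetical module wrapping the earlier sketch

HISTORY = Path("gpt_test_history.json")

history = json.loads(HISTORY.read_text()) if HISTORY.exists() else {}
today = run_suite()  # e.g. {"How should I test my app?": 0.8, ...}
previous = history[max(history)] if history else {}  # latest ISO date sorts last

# Flag any prompt whose pass rate dropped since the last run: the core model,
# instruction handling, or document indexing may have changed underneath you.
for prompt, rate in today.items():
    old = previous.get(prompt)
    if old is not None and rate < old:
        print(f"DRIFT: '{prompt}' dropped from {old:.0%} to {rate:.0%}")

history[str(date.today())] = today
HISTORY.write_text(json.dumps(history, indent=2))
```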

Test GPTs with a GPT

Yes, some of this GPT-testing logic can also be written up in a GPT :). Here is a GPT created for people who want to quickly generate prompts to test their GPT: https://chat.openai.com/g/g-jkzXtPU4y-test-prompts-for-ai-bots. With this GPT, people can simply copy and paste the description of their GPT and get a decent set of test prompts to try.

GPT to help Test GPTs

Summary

Testing GPTs is a new world, and in some ways it has all the complexity of ChatGPT plus additional things to test. GPTs represent a possible new wave of software: AI bots that can be created easily, without code or programming, yet with very large testing surface areas. This article covers just the basics, but if GPTs become more popular and important, we will need more automated and advanced processes to test them.

Stay Tuned.

— Jason Arbon
