TestOps: ChatGPT
I overheard some nerds talking about the ‘flakiness’ of the OpenAI GPT APIs in San Francisco last week. Most of the public energy in the LLM world is debating the ‘intelligence’ and ‘truthiness’ of GenerativeAI/LLMs, but not enough is spent on the operational readiness of these systems. Developers building on top of these LLMs are dealing with many old-school DevOps issues such as API latency, reliability/errors, versioning, and variation across prompts. Most of the angst seems real, but anecdotal, so of course I spent a little time this weekend testing.
If the data is upsetting in any way, I hope it motivates you to create far better methods, sampling, and analysis. I just know that I couldn’t find much online, and given the quality of other papers published recently on LLM Drift, I figured this Sunday evening’s effort definitely met the bar to share :)
I wrote a simple Python script to monitor the basic operational aspects of OpenAI’s APIs. It periodically calls the chat API and varies the calls across a few prompts, a few temperatures, and a few versions, measures the latency, tracks errors, and records the responses.
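For readers who want to try something similar, here is a minimal sketch of that kind of monitoring loop. It is not the actual script: the prompts, model names, record fields, and the call_model function are all hypothetical placeholders, and a fake call stands in for the real API so the sketch runs offline.

```python
import itertools
import time

# Hypothetical sweep values; the post does not list its actual prompts or models.
PROMPTS = ["Say hello.", "Is 17 a prime number?"]
TEMPERATURES = [0.0, 1.0]
MODELS = ["model-a", "model-b"]


def run_sweep(call_model, prompts=PROMPTS, temperatures=TEMPERATURES, models=MODELS):
    """Call `call_model` once per combination; record latency, error, and response."""
    records = []
    for model, temperature, prompt in itertools.product(models, temperatures, prompts):
        started_at = time.time()          # wall-clock timestamp for later analysis
        start = time.perf_counter()       # monotonic clock for the latency itself
        response, error = None, None
        try:
            response = call_model(model=model, prompt=prompt, temperature=temperature)
        except Exception as exc:          # record the failure instead of stopping the sweep
            error = type(exc).__name__
        records.append({
            "started_at": started_at,
            "model": model,
            "temperature": temperature,
            "prompt": prompt,
            "latency_s": time.perf_counter() - start,
            "error": error,
            "response": response,
        })
    return records


def fake_call_model(model, prompt, temperature):
    """Stand-in for a real chat API call (an assumption, so the sketch runs offline)."""
    if model == "model-b" and temperature == 1.0:
        raise ConnectionError("simulated failure")
    return f"{model} answered: {prompt}"


records = run_sweep(fake_call_model)
for r in records:
    print(r["model"], r["temperature"], f"{r['latency_s']:.6f}s", r["error"])
```

Swapping fake_call_model for a function that wraps a real chat API call, and running run_sweep on a timer, would give roughly the kind of periodic sampling described above.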
The latency of OpenAI’s API calls seems to vary more than I would have guessed.
Note the perf bar charts below have an x-axis of time series samples and a y-axis of latency for responses. The charts are normalized to a maximum value of 2X the standard deviation of each individual series, which cuts off extreme outliers to make the overall texture of the variation visible. Read: each chart has its own y-axis scale, so the charts aren’t proportional to each other.
There seems to be significant and consistent variation in the texture of latency across different versions of GPT. It generally seems that GPT-3.5-turbo-16k is the fastest, with GPT-3.5 about 1.5–2X slower, and GPT-4 2–5X slower. The relative latency seems to be pretty consistent across versions, given a fixed prompt. And the overall pattern of latency being correlated with the length of the response both makes sense and agrees with some other posts on the net. The standard deviation also seems to be about half the arithmetic mean (average). This seems like a lot of variation for the same, repeated, simple call.
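The "standard deviation is about half the mean" observation is a coefficient of variation of roughly 0.5. A small sketch of how that ratio can be computed per series follows; the latency values are made up for illustration and are not measurements from this post.

```python
import statistics


def summarize(latencies):
    """Return mean, sample standard deviation, and their ratio (coefficient of variation)."""
    mean = statistics.mean(latencies)
    stdev = statistics.stdev(latencies)
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}


# Made-up latencies in seconds for one (model, prompt, temperature) series.
sample_latencies = [1.0, 2.0, 3.0, 2.0, 2.0]
summary = summarize(sample_latencies)
print(f"mean={summary['mean']:.2f}s stdev={summary['stdev']:.2f}s cv={summary['cv']:.2f}")
```

Comparing the cv value across models and prompts is one way to check whether the variation is consistent, independent of how slow each model is in absolute terms.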
Most of the time that is… Some queries seem to exhibit strange inversions of this latency pattern. Some prompts containing ‘chain of thought’-like complexity are actually far faster on GPT-4 (up to 10X!). For comparison, two versions of a similar question about prime numbers have very different performance numbers.
This result is curious. Maybe it is the result of some ‘memorization’ in GPT-4 from changes in training data/methods, or maybe GPT-4 is somehow optimized for parallelization of chain of thought prompts? Strange.
During monitoring runs, the two most common errors were openai.error.APIConnectionError and openai.error.RateLimitError.
The APIConnectionError seems to happen sporadically, regardless of the network I’m on. It is reproducible across home networks, coffee shop networks, cell networks, and with VPN on and off. It seems this might be an internal connection error within their infrastructure. Not sure, but it does mean that catching this error and retrying by default is a *must* to make things even close to operational, and the retries add to the end-to-end response time for calling apps.
The rate limits seem to be inconsistently/sporadically enforced and variable. I don’t quite have a handle on this, but this exception is encountered even when running the monitoring service ‘slowly’. It ‘feels’ like the rate limit is a function of the user-level rate limits, but also a function of load in general on the OpenAI system. Anecdotally it seems that these errors occur more frequently at specific times of day and days of the week, but I need more sampling to know if that pattern is real. As for coping, I simply add exponential backoff to the calls, but again, this adds quite a bit of additional latency and makes real-time apps painful for end users.
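For anyone who wants a starting point, here is a minimal sketch of retrying with exponential backoff. It is not the code from my script: the function name, parameters, and delay values are illustrative assumptions, and the built-in ConnectionError stands in for the two openai.error exceptions named above so the example runs without the library or a network.

```python
import random
import time


def call_with_backoff(fn, retryable=(ConnectionError,), max_retries=5,
                      base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call fn(); on a retryable error, wait base_delay * 2**attempt (capped, plus jitter) and retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_retries:
                raise  # out of retries: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% random jitter so many clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))


# Demo: a hypothetical call that fails twice and then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("simulated connection error")
    return "ok"


waits = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky_call, sleep=waits.append)
print(result, [round(w, 2) for w in waits])
```

With the library version that exposes openai.error, the retryable tuple would presumably hold openai.error.APIConnectionError and openai.error.RateLimitError. The waits list makes the cost visible: every retry adds its delay directly to the end-to-end latency the calling app sees.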
Will look into more analysis of error frequencies. I hesitate to share them as I don’t have an equal sample across all prompts yet, or enough samples to catch some of the larger patterns of error timing. There is a decent chance these error rates could be prompt or other variable-dependent as well.
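One simple way to check the time-of-day hunch is to bucket calls by hour and compare error rates per bucket. The sketch below assumes each recorded call has a Unix timestamp and an error name (or None); those field names and the sample records are hypothetical, not real data.

```python
from collections import defaultdict
from datetime import datetime, timezone


def error_rate_by_hour(records):
    """Return {hour_of_day_utc: fraction of calls in that hour that raised an error}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in records:
        hour = datetime.fromtimestamp(record["started_at"], tz=timezone.utc).hour
        totals[hour] += 1
        if record["error"] is not None:
            errors[hour] += 1
    return {hour: errors[hour] / totals[hour] for hour in sorted(totals)}


# Made-up records: two calls in hour 0 (UTC) and two calls in hour 1.
sample_records = [
    {"started_at": 0, "error": None},
    {"started_at": 60, "error": "RateLimitError"},
    {"started_at": 3600, "error": None},
    {"started_at": 3660, "error": None},
]
rates = error_rate_by_hour(sample_records)
print(rates)
```

The same grouping could be keyed on day of week, prompt, or model to see whether error rates depend on those variables, once there are enough samples in every bucket.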
Thoughts. Help.
A quick open question I have is what temperature values to use when sampling. I’m currently using TEMPERATURES_TO_TEST = [0.0, 0.1, 1.0, 1.5]. This is a simplistic, uninformed sample of temperatures, and I don’t even know what the default is for the public ChatGPT user interface. Responses at a temperature of 2.0 seem to be consistently nonsensical. Any suggestions on better values to use? I do want to keep the list short because each added value linearly increases the time and data for the monitoring service.
What do you suggest would be the most important or interesting variables to add to the sampling? More LLMs? Azure hosted? What do you think of this early data? If this type of data would be useful to folks, please let me know — I could make the data and/or scripts public. We could also band together to build some ‘real’ monitoring on these LLM APIs to help make them better and know what issues to expect when building on top of them. Let me know.
If you want to see what these DevOps-y results look like over a longer period of time, click that like button and subscribe.
It will take more than a village to productionize these new LLMs for the world, but we’ll get there!
— Jason Arbon