AB Testing Podcast: AI (Part 1)

jason arbon
23 min read · Jan 31, 2024

Here is a transcription of Alan, Brent, and me geeking out on AI and the future of testing and quality, on the AB Testing Podcast. It has been passed through AI three times, so any errors or bad ideas are likely the fault of AI — not me, Alan, or Brent.

*Jason Arbon* I’ll make a bet. I’ll bet that both of you are in a QA-titled role in 24 months. Welcome to AB Testing Podcast, your modern testing podcast. Your hosts, Alan and Brent, will be here to guide you through topics on testing, leadership, agile, and anything else that comes to mind.

*Alan Page* Now, on with the show. So far, very, very good, very, very happy. Please don’t let me down. That’s where we are. Zencastr saved me, though, because I said some things I’m glad weren’t published.
You know what, Jason, you’re clever enough. You could immediately go hack the Zencastr server and just remove all that stuff. Maybe a mega hack. Maybe hack Zencastr. Maybe. Well, it’s good to have you here.
We’re going to do, I think I put it in our prediction show, we’re going to do more guests here because, one, the guests are more interesting than us. And two, more importantly, I am so tired of Brent.

*Brent Jensen* Totally. No, it’s good to have guests because they bring a fresh perspective until we figure out the next thing that’s important to us to talk about. No, I always have important things to talk about.

*Alan Page* In fact, I am practically incapable of talking about things that aren’t important. Okay. I’m happy to be a filler. I can’t think of the last time I said something that wasn’t important. Sure. Yeah.
So, Jason, your team’s listening to this one. Is that what’s going on? You know what? It’s interesting. When I was at my “previous” company, a lot of people listened to my podcast or read my blog. I don’t know if anybody listens to my podcast at the new place.
They do read, because I post it on LinkedIn all the time; we’re connected there. They don’t know how much I’m drawing from my daily life to inspire some of the topics, but we’ll get more into that later. This is not about me.
This is about Jason. Jason is very interested in fine art and the variants of meatballs in the Italian market. So we’re going to go into all of those subjects deeply. Jason, just as a reminder, kind of tell people what you’re up to, and then I’ll ask from my list of questions that will get us kind of diving into this thing.

*Jason Arbon* What I’m up to now: I’ve actually taken a year off and focused on applying AI to software testing, like, exclusively and voraciously. It’s been wonderful, because the AI robots do what you tell them to do, mostly.
I’m trying to figure out how to keep experimenting and see what works and what doesn’t, as quickly as possible. I think I’ve narrowed it down to a few kind of themes, but I’ve been experimenting with AI for human manual testers, exploratory testers, to figure out how I could help them.
Also kind of full automation, like everyone’s dream, which is: how do you get an AI testing bot to just test the application and APIs without a human involved? So I’ve been focusing on researching that.

*Alan Page* Super cool stuff. I have lots of questions to dig in there, but I had a conversation, was it yesterday? I was talking to one of my employees. I was in Los Angeles and we were talking about AI. It’s always a weird question to bring up with somebody, like, what do you think of AI?
Because you never know what people are going to say, right? He didn’t know me well enough to know. I said the usual stuff you’ve heard me say, but something I’ve maybe said on the podcast is referring to something Steven Johnson calls the adjacent possible.
Meaning, all of a sudden we’ve made an innovation that enables us to build a whole new world of innovation. And I think generative AI is that. And I don’t think many, if any, folks, despite all the cool things people have done, have really made that leap that I know is inevitably going to happen.
So I’m glad to hear, you know, I think you have a chance to be one of those people trying to figure out what is this adjacent-possible innovation we can make. So I want to talk more about that, but I also want to talk more about testers and how AI replaces, enhances, helps testers.
You had an interesting LinkedIn post. Was that today or yesterday? I just saw it today. Jason’s looking at me like he didn’t post something. I posted something about the 737 Max. I don’t think that’s what you’re talking about.
*Mixed* No, no, no. Which one was it? Hold on. Hold on. Hold on. Oh, was it the quote, my prediction that people who have left the testing field will come back to it? Yes. Yes. I want you to talk more about that.

*Alan Page* That’s interesting. That was just clickbait. I don’t believe you yet, but convince me. All right. Next question, Alan. Next. No, I just figured, like, testers were getting tired of me and they’re scared of AI, so I just appealed to them.
*Mixed* Yeah, yeah. That’s what I get for skimming articles. I actually think they’re not sarcastic; they’re actually very genuine. But I think it’s probably not the reason people would expect, even though I didn’t really outline those reasons.

*Jason Arbon* One is, first of all, what you were saying a second ago about the adjacent possible, I think, is very right. But people are not thinking AI-first. They’re thinking about how to modify existing processes and software, or using AI to generate Java code, instead of actually just having AI run it.
So I think there are people that haven’t kind of gone through that transition yet, which is normal when there are technology transitions. But when it comes to what I said, it was a prediction about the testers that have left the field as they got further into longer careers.
You guys have talked about this before too: there’s also a salary cap in testing, to be frank. And so people left it just for more money. Let alone, like we were talking about before we started this.

*Alan Page* Sometimes people just get sick of testing because it can become mundane or boring or exhausting in its own way. I think one is just that the general pressures are coming on.

*Jason Arbon* AI is genuinely helping in the software engineering world. Like, Brent’s been talking about this too, where you can just write better code, and not just better code but more code. So there’ll be more stuff to be tested.

*Jason Arbon* But I also think, here’s a little story where I’m not the point of the story. Brent’s going out great. When I went from Microsoft to Google, I skipped my first week of Noogler training because I thought I was too cool for that.
So I went to the search building. I just walked around with my new badge because I wanted to see the holy land. In this one building that worked on search, everyone’s title was search quality engineer.
And I think the reason is that they weren’t building traditional software. They were building AI and very high-scale, algorithmic-type software. Really, what it is: it’s very easy to write some code, and then you let 10 or 100 thousand machines, or a million machines, process it.
The hard problem is testing it. How do you know that it’s of high quality? Especially when you think about a search engine, and what we’ll come back around to, is that the hard problem with a search box is people can type anything they want into it.
It sounds familiar, right? It sounds a lot like ChatGPT. Google, in the search story, though, is still constrained, because it only returns things that already exist on the internet. The problem with ChatGPT is it can return anything, things that either don’t exist or things that are made up.
Quality is the hardest problem in software engineering, particularly in AI. I think it’s not just that people will come back to testing to do traditional testing. I think it’s that a lot of those engineers that left testing for PM or dev will be far more valued, because they have that testing context with that engineering or PM background.
And they can be monsters, incredibly important, when software goes through this transition to AI-first, because you need a search-quality-like testing approach for that software. I see where you’re coming from.
*Mixed* I spent time in Bing, right? Oh, I need to interrupt for the editor: Bing is like Google. You can use it to Google things. It’s from Microsoft. Anyway, go on, Brent. I’m just currently pausing and wondering in what episode anyone is going to view that horse as already dead and no longer worth kicking.
When Bing is known as a search engine, that’s when I’ll do it. Oh, okay. Hopefully you’re paying attention to the news. Anyway, moving on, not doing the tangent. So one of the things that I learned when going to Bing, it was interesting, because this was the same thing.

*Brent Jensen* Bugs did not matter in Bing, because they were not helpful. There’s a machine learning algorithm that’s constantly learning the best way, given the context, to present answers.
And not only that, but we have thousands of streams underneath it that are changing sort of the input to that same model. So that’s one of the reasons why you might go to Bing and type in a search query, then hit refresh and get an entirely different experience, and hit refresh again and get an entirely different experience.
What ended up being valuable there? Testers that went over and thought, oh, my job is to find bugs, got frustrated, because the bugs all got resolved won’t-fix. What was valuable was to help us diagnose and find the pattern.
So to Jason’s point of view, I don’t know that I agree with specifically testing, but that one important skill that I think underlies testing, which is the ability to structurally diagnose, that will be critical, causing Jason to say I am full of shit.
*Mixed* Alan will always do that. I worked at Bing too, like, on the search engine. I don’t know if you knew that. Yeah, before I went to... I was there during MSN Search and the transition to Bing, which nobody wanted the logo’d wear for, for some reason.

*Jason Arbon* That was before me. Yeah, yeah. So I totally get that. I hit that. I personally went through that transition, from a functional kind of testing perspective to one where bugs were uninteresting, right?
You know this: there was a page that we shared internally, I think it’s still around, where people could file bugs. You could actually do a search inside of Microsoft, and it would run the search on Bing and Google, and then you could say A or B, which was better, and then you could type in why.
*Mixed* I don’t know if you knew that. I remember that. I remember that. That was still there when I left, at least a couple of years afterwards. I haven’t tried the URL in forever. Yeah, so guess why it was built?

*Jason Arbon* It was because all these functional engineers at Microsoft wanted to file a bug: here’s the search, here’s the link, fix the code, right? It was just a honeypot so that people would type those things in, and we never looked at the data, because it was not useful, like what Brent said.
*Mixed* It was actually just a DDoS against people filing bugs. What Brent said made me think of something. I’m going to add more fuel to the fire before I forget, because I’m old and I forget things. It’s when you talk about how AI is going to help testers and help testing, and then Brent’s comment around, I forget what he said, but there’s a wide definition of what testing is.

*Alan Page* There’s a weird thing that Brent and I have talked about here, where Brent and I care about quality, and the testing we do is in the search of quality, versus there’s a school of folks who are focused on the craft of testing as an end result, almost.
Do you think AI helps? Which school is AI going to be more advantageous for? The people who just want to do better testing, for the sake of testing? Or the people who want to use testing to find higher product quality?

*Jason Arbon* That’s a leading question; you know the answer. It’s the people that search for quality at the end of the day. But not to the world. And why is that? Because, like what Brent was saying, you can’t actually just go in and fix, like, rewire the search engine, right?

*Alan Page* And then it’ll be good tomorrow. It’s dynamic, it’s data driven. And we can’t possibly understand the nets, right? They’re doing all that ranking. So the end purpose has to be quality, and particularly how you quantify quality.

*Jason Arbon* Like, you might remember, Brent, there’s NDCG, normalized discounted cumulative gain. It’s a way to measure: you actually have to quantify quality so that you can give feedback to the AI training systems that this is a better or worse build than yesterday’s, right?
So instead of saying, like the old world, I’ve written 100 test cases, right, 94% of them happen to have passed, and these are the ones I happened to have written today.
That quantification doesn’t work in AI. You have to actually measure a statistically significant quantification of quality. You have to measure it, actually get it down to, like, a floating point, right?
So yeah, the shift in how we build software, from traditional Java or C# kind of functional coding to AI, means that you focus on the end game, which is quality, and figure out how to quantify it.
Then you have to figure out how to measure it, and you have to figure out how to feed that back into the system. So it’s very much ends versus means, where traditional software is focused on the means to an end.
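(For anyone who hasn’t bumped into NDCG: here’s a minimal sketch of the idea in Python, with made-up relevance judgments, showing how a whole ranked result page collapses into one floating-point quality score you can track from build to build.)

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: each result's human-judged relevance is
    # discounted by the log of its rank, so good results near the top count most.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances):
    # Normalize by the DCG of the ideal (perfectly sorted) ordering,
    # giving a score between 0.0 and 1.0 for the whole result page.
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Relevance (0-3) of the top five results for one query, in returned order.
print(round(ndcg([3, 2, 3, 0, 1]), 2))  # ~0.97: close to the ideal ordering
```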

*Alan Page* So I totally agree with what you’re saying. I want to share a little story here. This has nothing to do with AI, but maybe AI helps here. So I did the thing that you do from time to time, Jason. Brent, you don’t do it because you have better things to do in life. I responded to a thread on LinkedIn, just because I was bored, not because I felt like they were gonna listen to me.
There was a thread around, someone made some comment about asking customers to test. Actually, I brought it up because it seemed like three or four different posts where they think that continuous delivery, or no dedicated testers, means you’re asking customers to test, and by that they mean you’re asking customers to enter bug reports.
I just added a point of clarity around that, because generally we’re not; we’re getting feedback and data from customers that allows us to understand how they’re using the product and if they’re seeing failures, then adjusting, either backing out or making changes, and we can automate a whole bunch of that.
I want to pause there, because I think that’s an area where AI can actually help us out a lot. It can help; you’ve got to define quality, and I think we can use AI to help define that quality criteria.
Here’s my application, here’s what it does, help me understand what are some metrics I can monitor in production to let us know if customers are finding value in this. It can help us zero in on the right set of engagement metrics, the right set of error metrics, to look at to make sure we’re building the right product from a customer value point of view.
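(A rough sketch of that “here’s my app, suggest metrics to monitor” step, assuming an OpenAI-style chat API; the app description, model name, and prompt are illustrative, not anything from the show.)

```python
from openai import OpenAI  # illustrative; any chat-style LLM API would do

client = OpenAI()

app_description = (
    "A coffee-shop website: customers browse the menu, customize drinks, "
    "place pickup orders, and pay online."
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[
        {"role": "system",
         "content": "You help teams define quality criteria for their products."},
        {"role": "user",
         "content": f"Here's my application: {app_description}\n"
                    "Suggest five engagement metrics and five error metrics to "
                    "monitor in production, and explain how each one signals "
                    "whether customers are finding value."},
    ],
)
print(response.choices[0].message.content)
```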
That said, I jumped in and said that stuff, and somebody had the most interesting answer. He said, when we get into using engagement metrics to evaluate a feature, we’re talking about software from a product standpoint.
True. What I’m missing here is, and I’m way off in the weeds here before I get to the actual question, because from a product standpoint, customers don’t want well-tested software. I gave a talk on this like 20 years ago.
I had this fake box of software saying 90% code coverage, over 97% of tests passed. Those aren’t the bullet points you put on the box. The bullet points on the box are the value you give people. Yeah, from a quality perspective, we need to think about things from a product perspective.
And then, Brent, you’ll love this, from principle number one. They said testing is about more than evaluating the profitability of a feature. It’s about finding all the ways the software could be used that result in the company losing money due to user attrition or reputational damage.
It’s the last line of defense before the company does something stupid they could have prevented. So, going back to something Brent and I had said, I don’t think this is a tester responsibility. Functional correctness is a responsibility of the developer.
So what I understand is there’s a shift here, and I don’t care when the shift happens; it goes back to your post about the late adopters. There are going to be people testing last. Hey, Brent, did you get a cat?

*Brent Jensen* I’ve had a cat, and she has decided that it is time for petting. What I want to get an idea on, because I don’t know... I have a good idea how we’d use AI to help define those metrics at the very beginning.

*Alan Page* But how do you think we can use AI to make sure that we’re testing the right things? That’s the question I want to get to; I took 20 minutes to ask it. We in test... I’ve been doing this long enough to know that a lot of the time, and reading between the lines in the replies, it’s there too.
We over-test. We test too much. We test things that customers will actually never do. We spend a lot of time trying to hold products up because we find some edge case that’s actually never going to be seen.
And these are smart testers doing this stuff. They just don’t know when to stop. How can we use AI to optimize? Like you talked about making manual testers better. How can we use AI to optimize that test selection so we’re testing just enough to make sure that we’re delivering value to customers?

*Jason Arbon* Yeah, I think it goes back to incentives: management incentives, personal goals and incentives. Testers are rewarded for creating test cases and having dashboards, so they just make more. They’re not as closely tied in with the business or product, or even user satisfaction or happiness; even though they claim to be emulating that, there’s no real tie to it in money or rewards or recognition.
So I think fundamentally it’s a change in those incentives for testers to… So I agree with your premise, and I think the only way to change it is to change their incentives, but I don’t think that’s possible.
Adding to all that, I think, what do you say, it goes back to developers and product managers, who really need to own that ultimately, because I don’t think testers can or will adapt. And specifically, like you’re saying, I think it’s all going toward, I forget the Microsofty, Microsoft-speak term for it, but it’s analytics, right?
What is it, Brent? What’s the catchphrase right now? What are you using at Microsoft? It’s not instrumentation, it’s telemetry. Telemetry. So I think it’s all about telemetry. I think, literally, I don’t know if you guys have followed this, but there’s OpenTelemetry.
So, specifically how AI can help: there’s a new OpenTelemetry protocol. Just for the people who are listening: it used to be that you purchased Splunk, or you purchased some specific vendor’s software, and you logged to it, and then they rendered it back to you and did analytics and alerting and warning, like New Relic and these things.
But they’ve standardized that layer. So, finally, now there’s standardization in that layer. I think the vendors didn’t really want to do it; they’re kind of being dragged into it.
But now there’s standardization in that layer. There’s a standard schema for all that telemetry data. And that means not just that you can write a single kind of AI analytics thing against everyone’s data, but that you can now compare data.
So you can compare one coffee shop’s website data with another coffee shop’s website data, because the problem is baselining. So you can have alerts, but, like you’re saying, is there a bug? Is it important to the business?
Does it cause drama? Is it ever going to be encountered again? Is it a one-off? And that’s where I think AI can do the analytics, particularly if it has access to the data from other similar sites and historical data.
So I think OpenTelemetry will help open that up a bit, because there’ll be more incentive to build common tools for analytics, and then you’ll be able to share kind of that summarized data across these different verticals to get a baseline of quality.
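(To make the “standard schema” point concrete, here’s a minimal sketch using the OpenTelemetry Python SDK; the meter name, route, and values are illustrative, and the console exporter just stands in for whatever OTLP-compatible backend, Splunk, New Relic, or anything else, you actually ship to.)

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    ConsoleMetricExporter,
    PeriodicExportingMetricReader,
)

# Export to the console for this sketch; swap in an OTLP exporter to send the
# same data, in the same schema, to any compatible vendor.
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter("coffee-shop-site")

# Named per OpenTelemetry's HTTP semantic conventions, so every vendor's
# tooling already knows what this histogram means.
latency = meter.create_histogram(
    "http.server.request.duration",
    unit="s",
    description="Time to serve a request",
)
latency.record(0.42, {"http.route": "/menu"})
```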

*Brent Jensen* OpenTelemetry does a fantastic job in terms of, if nothing else, data democratization. There are a couple of projects that I am working on within Microsoft today that leverage OpenTelemetry. And it’s just, I can’t wait for that to finish landing, right?
Because a big part of my effort is often just data cleaning or conforming to schema. And I’m like, oh, it will not at all make me upset if that part of my team’s job is just eliminated. Right. Then you need to do the smart stuff, right?

*Jason Arbon* And more value, right.

*Alan Page* I want to double down on my comment from a few minutes ago, listening to you talk, Jason, and you too, Brent. I think at Microsoft we had written so much telemetry, there was so much internal knowledge, that just writing good telemetry, or reasonably good telemetry, was common. I’m sure it was the same at Google. What I’ve discovered since then is that it’s difficult for companies to learn; there’s no guide. Well, I guess there are guidebooks. They don’t read them. They’re not very good at figuring out what to measure and how to look at it. I think that’s a great place for it. I think just that definition, it’s hard, but it’s teachable. I think AI can help bridge that gap.

*Jason Arbon* I totally agree with this thought, but I think it still falls into the category of, like, we hand-wave and AI goes here and then something magic comes out. I think that’s the feeling on a lot of AI tools.
*Mixed* It’s just magic. See, it’s hard? Let’s use AI. It’ll look at that and just be magic, exactly, and a partner will write some charts about it. Actually, let me double down on that real quick. There’s real data here: I did some business work with one of those analytics companies, which I guess should stay anonymous, but I went there and talked to them about their dashboards for their telemetry and analytics.

*Jason Arbon* And they said, and they make, you know, they’re worth billions of dollars, but they said no one looks at their dashboards. And that’s because it’s not actionable, and they don’t know if it’s good or bad.
So you have a chart wiggling around over time, right, but you don’t know what the right thresholds are, like you’re saying now. And no one knows how to set that; no one knows what the high-water marks and low-water marks should be.
Or whether it’s good enough. But I think, kind of back to what Brent’s saying, if everyone has this common schema for representing that data, then you can start benchmarking not just against historical time series of your own data, but against other vendors and other verticals, anonymized, based on just general-purpose metrics, to see, you know, are you the worst coffee shop website on the planet in terms of latency, for example, right?
Then you know you should make it better. It’s a long way of saying, I think it’ll also enable a relative evaluation of whether it’s good or bad. So you just throw that data in, and it looks at it all and says, ah, you’re really sucking in this area, but you’re really doing well in this one.
Congratulations. Like, you don’t have a lot of crashes in production in the JavaScript, but you have a lot of latency compared to everyone else, right, slow-loading pages. So then people can know what to do with those.
Those charts are finally actionable. I think that might come with OpenTelemetry, with a combination of some analytics and AI/ML.
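(A toy sketch of that baselining idea, with made-up numbers: compare your own page-load latency against an anonymized peer group that reports the same metric schema, and only flag the chart as actionable when you sit well outside the peer range.)

```python
from statistics import mean, stdev

# Hypothetical p50 page-load times (seconds) from peer coffee-shop sites
# sharing the same anonymized, general-purpose metric.
peer_latencies = [1.1, 0.9, 1.4, 1.2, 1.0, 1.3]
my_latency = 2.6

z = (my_latency - mean(peer_latencies)) / stdev(peer_latencies)
if z > 2:
    print(f"{z:.1f} standard deviations slower than the peer baseline: actionable.")
else:
    print("Within the normal range for this vertical.")
```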

*Brent Jensen* It will. And one of the things, like, years ago, Alan and I had a long series of discussions around dashboards and KPIs. And it was right when we were kind of in our honeymoon phase of our love for Eric Ries. And I’ll just say, if you’re listening to this podcast and you don’t know the difference between actionable and vanity metrics, pick up Eric Ries’s book.
Understand that, because that’s stage one before you even begin to go down the path that Jason’s talking about. Also, if you read the book, you’ll see where Brent and I stole most of our ideas from.
A good portion. I mean, I stole from the Poppendiecks. I stole from Leffingwell. Yep. And I have shared my screen, right, so listeners will need to imagine there’s a screen in front of them.
Yeah, yeah. So I went to Jason’s expert-tester spinoff of ChatGPT. I don’t actually know what that should actually be called, but we’ve talked about this before. It’s called a GPT. And it’s super cool.
Yeah. And I’m like, okay, not only did I ask it to generate a KPI to evaluate the output of an LLM, I didn’t even tell it the context. And I said, do it as if you are Alan Page. And I just scanned through it.
I’m not gonna bore our audience with this, but I scanned through it and I’m like, all right, some of them, I think, are reasonable. It gave a long list that I do think is Alan; the categories, I think, are definitely Alan.
All right, it even includes diversity and inclusivity. Now, whether or not the actual measures are right, I think it did a reasonable job of not only saying what to measure, but how to implement it. Now, I only scanned through it.
And this is a problem I face on a daily basis. So I’ll go back and read it and go, okay, well, an Alan LLM actually helped me unlock a solution to a problem I’m chasing. But yeah, I think things are kind of heading in that direction.
The thing is, I’m trying to pair this with what we talked about at the very beginning, Jason’s hypothesis that these experts will come back. Jason, does that not potentially contradict, or how do you see this happening at the same time as, sort of, the AI-first approach?
Because the way I see those people coming back, in my mind, it would be what they call the centaurs. It’s basically those people working alongside LLMs who are gonna nail this problem for us. Whereas you’re saying AI first.

*Jason Arbon* So one is the mechanics, like what you’re doing right now with that GPT. The standard OpenTelemetry stuff is more just tactics, but for this part, imagine you have your JSON blob of your high-level OpenTelemetry metrics, right?
You just pass it to this bot, and the bot can come back and say, you know, oh, for a coffee shop; you tell it, this is a coffee shop, these are my core metrics, and it might come back with a reasonable assessment and tell you what you should probably be focusing on.
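(A minimal sketch of that step, again assuming an OpenAI-style chat API; the metrics blob, model name, and “experienced quality engineer” persona are illustrative stand-ins for the kind of emulated-expert bot discussed next.)

```python
import json

from openai import OpenAI  # illustrative; any chat-style LLM API would do

client = OpenAI()

# Hypothetical high-level numbers summarized from an OpenTelemetry backend.
metrics_blob = {
    "site.type": "coffee shop",
    "http.server.request.duration.p95_seconds": 3.8,
    "browser.page_load.p50_seconds": 2.6,
    "client.js.error_rate": 0.002,
    "checkout.abandonment_rate": 0.41,
}

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[
        {"role": "system",
         "content": "Act as an experienced software quality engineer "
                    "reviewing production telemetry."},
        {"role": "user",
         "content": "This is a coffee-shop website. Here are my core metrics:\n"
                    + json.dumps(metrics_blob, indent=2)
                    + "\nWhat should I focus on first, and why?"},
    ],
)
print(response.choices[0].message.content)
```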
That’s kind of an early step in this process. But guess who, or what AI or person, would be best to analyze those, like we did with this GPT with an emulated Alan? Like, this is a very loose approximation of Alan.
Like, if Alan wanted to spend a few hours, or I’m happy to work with him on it, we could build an Alan bot that deeply encodes a lot of his thoughts. And I’ll make this bet. Are you ready for this wager?
I asked you guys what kind of mood you wanted me to be in, but you had your chance. I will make a bet, not all of my… cash, but I’ll do a thousand dollars or whatever if you want. But anyway, I’ll make a bet.
I’ll bet that both of you are in a QA-titled role in 24 months. Because guess who’s best at testing that system? It’s not the engineer working on it. Brent’s all worked up now, because it’s a ladder thing.
But you’ll be paid more. And I also have a question. But the best thing to analyze that is an Alan and an Alan bot, or a Brent and a Brent bot, not the engineers, not the PMs that are out there today.
And definitely not the tester. So, just before you... I know your antibodies are already out. Yeah. Brent’s head’s about to shoot off his body. Yeah. Yeah. I’ve seen this progression over the years, by the way, because what is the progression of every Microsoft tester? To go to management and then somehow escape, get an escape pod to PM or dev.
But I’m telling you, just building at Google, with the smartest PhDs in search and the best engineers on the planet working on the most profitable, insanely profitable software on the planet: we were all called search quality engineers.

*Alan Page* When Microsoft acquired Softimage, and then ignored it for like 10 years until the end, one of the things I learned about them, 25 years ago when they were acquired, was their leveling system: dev, senior dev, principal dev, fellow, QA.
Because they were the systems thinkers at the top, who knew so much about the system, they were in charge of quality. Now, it’s a different world, because I don’t know if QA would be in the title, but I could see a path where it’s in the role, because, for whatever reason, QA, test, SDET, quality, whatever, has been so bastardized and beaten down across the industry.
It’s not even the right title for that. I fully believe there will be people doing exactly as you describe, and Brent and/or I could very well be part of that. It’s like the AI is my pet and I’m teaching it how to do this thing for everybody.
But I don’t think we’ll be able to call the role a QA role, because of how badly it’s been smeared across the industry. But I think the problem is that that role is emergent. And so if you look at this long tail, or you look at integration over, like, four, five, six, seven years, and you assume things aren’t static.

*Jason Arbon* What’s going on is that the AI is starting to write the AI, right? That’s starting to happen now, whether you want to admit it or not. Guess what isn’t automated? The evaluation, the testing of it. What will happen is, Brent’s going to go, like, no, I’m still going to be using SQL Server.

*Brent Jensen* But no, I’m not. I’m going to push back and say, no, the AI is also evaluating the AI. I don’t think that’s a hole. I completely disagree with you on that one. How does GPT-4 evaluate or test it today?
Oh, by a small army of metrics from human beings. But you know the language around an AGI, and I believe AGIs already exist. This stuff is already happening. What has happened to you since we last talked, Brent?
What happened to me? You were telling me that stuff will never be thinking or sentient, and that it’s a stochastic parrot. Oh, I’m still on the page that, yeah, we need to do whatever the hell we can do to prevent this from happening.
That part hasn’t changed. But just because I’m fighting, what is the old Greek myth, the poor dude that had to push the boulder up the hill forever? Just because I’m that guy, yeah, it doesn’t mean my opinion on that has changed.
Right. But back to, specifically, the title thing, which I think is what the revulsion comes from, and the catharsis. Go ahead. And Brent, stop playing with your desk. I walked... I thought you told them already.

*Jason Arbon* We know testing was a dirty word and quality was a dirty word, and this debate has been going on since, like, 2010, right? But I’m telling you, I walked into the Google search building, and guess what the titles of the engineers were?

*Alan Page* Search quality engineers. And we’re going to stop there for now. We’ll pick it up again next week with episode 194. This has been episode 193 of the AB Testing Podcast.
