AI Testing for Video Streaming

If you are human, chances are you stream video several hours a day. Streaming is now a core experience in many familiar top apps, next-generation game systems such as Xbox Project xCloud and Google Stadia, and cutting-edge streaming apps such as MLB, Disney+ and HBO Max. Testing streaming quality is so difficult a topic that one of the last Google Test Automation Conferences (GTAC) tried to focus on the issue, to little avail. In the old days, I worked on crude video quality testing myself for set-top boxes at Microsoft; I've seen the frustration of testing video quality at YouTube and on ChromeOS, and I've even empathized privately with several top video and cable companies in the past year. Testing streaming video quality is a difficult but important problem. The good news is that AI is here now to help.

Testing Today

The 'best' test for video quality today is literally asking humans to watch the video and rate its quality on a scale of 1 to 10. Seriously, in this day of modern tech, that is the real oracle for video quality. When I worked at uTest/Applause, prominent streaming companies would pay us to have testers on different devices, networks, apps, and operating system versions around the world all stream the same video at the same time, and we'd collect these numbers for them, for a non-trivial fee of course. This approach is called the Mean Opinion Score (MOS), to make it sound as formal and scientific as possible. It sounds like a joke, even a 'hack', but it is the best way to test streaming quality today.
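
The arithmetic behind MOS is as simple as it sounds: collect a rating from each viewer and average them. A minimal sketch, where the 1-10 scale and the sample ratings are illustrative:

```python
# Minimal sketch of a Mean Opinion Score (MOS) calculation.
# The 1-10 scale and the sample ratings are illustrative.

def mean_opinion_score(ratings):
    """Average a list of per-viewer quality ratings."""
    if not ratings:
        raise ValueError("need at least one rating")
    return sum(ratings) / len(ratings)

# Ratings for the same clip from testers on different devices/networks.
ratings = [8, 7, 9, 6, 8, 7]
print(mean_opinion_score(ratings))  # → 7.5
```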

Not only is this human approach an expensive and slow way to test streaming video quality, it doesn't scale. When network engineers at streaming companies tweak their networks or change their decoders, or manufacturers build new devices, or carriers build new networks, they can't do a full human pass as often as they'd like. The traffic on the network, the variety of devices, even the types of videos make it difficult to measure the user experience, and engineering teams need to understand how their work impacts video quality.

The user experience of streaming is critical to the businesses behind the apps. There can be some trouble signing up for an app like Google Stadia, and users might suffer quietly. If the search feature in Netflix isn't as smart as it could be, users will forgive and forget once they start streaming content. The streaming content is the 'app'. If a video game stream is jerky or has artifacts, users don't just cancel their subscriptions, they take the time to post recordings of the poor video quality on YouTube. In modern apps, the quality of the stream is the quality of the app. One large streaming company shared that the biggest correlation with canceled subscriptions was poor streaming quality. Even if the company has the best content, if the stream is janky, users will leave. Any failure in video quality removes users from the core experience they expect from the app, and most of their time is spent in video playback, not playing around with account settings.

Why so difficult?

Testing video quality is difficult because there are many ways a video stream's quality can degrade: network bandwidth, device CPU/GPU, even different proprietary encoders and decoders. It's not well known, but YouTube encodes every video in many different resolutions and encodings to make the experience as good as possible across a large matrix of devices and network conditions. Even with all this effort, we've all seen playback issues on YouTube.

State of the Art Automation

Modern test automation for video quality consists of code that combs through every frame of a video during playback, under various network, device, and encoder conditions, and 'looks' through the image pixel patterns for known failure modes. This all sounds like a great idea, but video quality can fail in many ways, and its full impact on the end user is difficult to quantify.

The failure modes of video streaming range from simple blank screens, to horizontal and vertical lines, to freezing frames and low frame rates; and we've all seen artifacts that turn a rapidly changing part of the screen into a mess of Lego blocks. But knowing how all these functional issues detract from the user playback experience as measured by humans can be difficult. What if the issues happen during the credits versus the middle of an action sequence? What if multiple failure modes occur at the same time? Automation today can find some of the issues, but it is difficult to map these failure modes back to the ultimate oracle of quality: the end-user experience.
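
A hypothetical sketch of the kind of per-frame check this traditional automation runs, flagging blank frames and frozen (repeated) frames. Frames are modeled here as flat lists of 0-255 luma values; the threshold and failure labels are invented for the example:

```python
# Sketch of frame-level failure detection (assumed, not a real product's code).
# A frame is a flat list of 0-255 luma values.

def detect_failures(frames, blank_threshold=10):
    """Return (frame_index, failure_mode) pairs for blank/frozen frames."""
    failures = []
    prev = None
    for i, frame in enumerate(frames):
        if max(frame) < blank_threshold:
            failures.append((i, "blank"))        # near-black frame
        elif prev is not None and frame == prev:
            failures.append((i, "frozen"))       # identical to previous frame
        prev = frame
    return failures

clip = [
    [120, 130, 125],  # normal frame
    [120, 130, 125],  # identical to previous -> frozen
    [0, 2, 1],        # near-black -> blank
    [90, 100, 95],    # normal frame
]
print(detect_failures(clip))  # → [(1, 'frozen'), (2, 'blank')]
```

This illustrates the core limitation called out above: the detector can list failure modes, but nothing in it says how much a frozen frame during the credits actually hurts the viewer's experience.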

Subjective measures of quality are made even more difficult by context. Take a particular video sequence, say a scene in a movie with people dancing. The video could be riddled with 'artifacts' in the background sky or buildings, but people may not notice because these are already 'noisy' areas of the screen. The human eye is also likely focused on the people, not the sky or the buildings, so a human may not notice the issues at all and give the video segment a perfect 10/10 MOS score. To be great at subjective quality, test automation would have to understand all this, and score it!
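
One way to think about this: the same artifact severity should be penalized more when it falls on a region viewers are actually watching. The region names, severities, and attention weights below are all invented for the sketch:

```python
# Hypothetical illustration: artifacts hurt perceived quality more in
# salient regions (the dancers) than in noisy backgrounds (sky, buildings).
# All numbers here are invented for the sketch.

def perceived_penalty(artifacts, saliency):
    """Weight per-region artifact severity (0-1) by viewer attention (0-1)."""
    return sum(a * s for a, s in zip(artifacts, saliency))

regions   = ["dancers", "sky", "buildings"]
artifacts = [0.0, 0.6, 0.5]   # heavy artifacts, but only in the background
saliency  = [0.8, 0.1, 0.1]   # viewers are watching the dancers
print(round(perceived_penalty(artifacts, saliency), 2))  # → 0.11
```

Despite severe background artifacts, the weighted penalty is tiny, matching the 10/10 score a human would likely give this scene.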

Example: Human Focus Areas in Video Scenes

Training AI Bots for Streaming Quality

So, how can we get test automation to replicate this subjective human MOS score? AI comes to the rescue:

  • Gather/generate many video clips and human MOS scores
  • Train a ‘deep’ neural network to classify video clips into categories such as ‘cartoon’, ‘romcom’, ‘action’, etc. as each has its own special quality attributes.
  • Train deep neural networks to know which objects in the video streams are interesting; we call this 'Scene Analysis'
  • Train deep neural networks to identify common streaming failure modes, especially with interesting objects
  • Train neural networks to map all of the above network outputs to the corresponding human MOS scores
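
The cascade above can be sketched as stubs, with each function standing in for one of the trained deep networks. The names, outputs, and the toy penalty combination are all assumptions, not the real system:

```python
# Illustrative skeleton of the multi-network cascade described above.
# Each stage is a stub standing in for a trained deep network.

def classify_category(clip):
    """Stage 1: video category ('cartoon', 'romcom', 'action', ...)."""
    return "action"

def scene_analysis(clip):
    """Stage 2: interesting objects in the stream."""
    return ["person", "car"]

def detect_failure_modes(clip):
    """Stage 3: failure modes, weighted toward interesting objects."""
    return {"freeze": 0.1, "blockiness": 0.3}

def predict_mos(category, objects, failures):
    """Stage 4: map upstream outputs to a human-like MOS score.
    A toy penalty model stands in for the final trained network."""
    penalty = sum(failures.values())
    return max(1.0, 10.0 - 10.0 * penalty)

clip = "stream.mp4"  # placeholder input
score = predict_mos(classify_category(clip), scene_analysis(clip),
                    detect_failure_modes(clip))
print(score)  # → 6.0
```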

We've been building AI bots to automatically watch streaming videos and replicate MOS scores. At first, we created our own small set of internal video benchmarks and worked with crowd-sourced human judges to generate our training data. The initial results were promising, but not yet magical: a Spearman correlation of 0.65 between what our neural networks predicted and the raw MOS scores. The closer the bot scores (red) are to the human scores (blue), the better.

Initial Predicted vs Actual MOS scores for Various Spiderman Recordings
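
Spearman correlation, the metric used throughout these results, measures how well the bot's ranking of clips agrees with the human ranking, regardless of the absolute scores. A simple pure-Python version (tied values get arbitrary order in this sketch):

```python
# Spearman rank correlation between predicted and human MOS scores.
# Simple version: tied values get arbitrary rank order.

def spearman(xs, ys):
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    # rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

bot   = [7.1, 5.2, 8.9, 6.0, 4.3]  # illustrative predicted MOS
human = [7.0, 5.5, 9.0, 6.5, 4.0]  # illustrative human MOS
print(spearman(bot, human))  # → 1.0 (identical rankings)
```

A score of 1.0 means the bot orders the clips exactly as humans do; 0.65 means the rankings agree only loosely, which is why the early results were "promising, but not yet magical."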

We later added more sophisticated features to our models that were aware of video categories. We tested videos such as The Good Doctor with video degradation across different network connection speeds using the Hulu app. This improved the Spearman scores to near 0.78 for most videos, a solid improvement.

Predicted vs Measured MOS for scenes within "Good Doctor" on Hulu

Training new deep neural networks with 'scene analysis' on video segments of The Good Doctor, Chicago Fire and Modern Family, the bots collectively showed Spearman scores of almost 0.89! Pretty amazing :) Our AI bots were now almost on par with the MOS scores humans give the videos. At this point, the AI-based MOS score bots can reliably be used in place of humans!

Predicted vs Actual MOS score for Good Doctor With Scene Awareness

Being perfectionist data nerds, we worked further still to verify the stability of our multi-deep-network cascade and training pipelines for the AI streaming quality bots on larger, independent, academic sets of videos with MOS scores, such as those from the University of Texas at Austin's Laboratory for Image and Video Engineering.

The bots held up and even performed better on some categories such as ‘Asian Fusion’ with a score of 0.93.

Predicted MOS for “Asian Fusion” Video Category

In testing, even the worst-performing categories, such as 'CGI videos', still had correlations of 0.75 and were definitely useful in real-world testing applications.

Predicted MOS for the CGI Video Category

The bots also perform less predictably on videos of cartoons, an area we are actively researching. This type of video may require additional specialized training and features.

Interestingly, the less ‘human’ and real-world the video, the worse the human-emulating bots are at determining video quality.

AI To the Rescue

Applying AI to the problem of streaming video quality is a great step forward compared with traditional manual or automated testing! With AI providing an automated way to replicate human MOS judgment scores, engineers building, designing, testing and monitoring video quality can now get near-real-time measurements of true video quality scores, not just detections of specific failure modes.

Naive automation and even naive AI models fared poorly, but this is a great example of how analyzing the characteristics of the data, building a large corpus of real-world and generated training data, breaking the problem into sub-AI problems, and combining creative AI models, training, and architecture can solve almost any testing and quality problem with AI.

Ad: If you are interested in great AI-based test automation for streaming video quality in your app or website, or want to collaborate by sharing training and testing sets, email me at jason at

— Jason Arbon, CEO / Tester

