Great thoughts on all this.
I love the thinking about extra signals being useful. It is very likely true, but as with all data, the proof is in measuring whether those signals actually add value :)
To answer the question of how state is defined: this simple model tries to approximate state as the user sees it. If you ask a human tester ‘what page are you on in the app?’, their answer is the state label, e.g. ‘search page’, ‘login’, ‘cart’, ‘video playback’, etc. The trained neural networks map the elements and screenshot features to likely state labels, and that works well on almost every application, just as a human can eyeball a new app and say ‘that’s the login screen’. This is just one part of the puzzle in having bots test apps; more to come, as you pointed out (what input to apply, etc.). This was the ‘hello world’, but we’ve found we can do amazing things never before possible with just this simple model ;)
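To make the screen-to-label mapping concrete, here is a toy sketch in the spirit of that idea. It uses a simple nearest-centroid classifier over hand-crafted feature vectors in place of the trained neural networks described above; every feature, vector, and label here is a made-up illustration, not the actual model:

```python
# Toy illustration of the state-labeling idea: map simple, hand-crafted
# features of a screen (element counts, keywords) to a user-visible
# state label like 'login' or 'cart'. The real system uses trained
# neural networks over screenshots and DOM features; this
# nearest-centroid stand-in only sketches the mapping. All feature
# names, vectors, and labels are hypothetical.
from math import dist

# Hypothetical feature vectors per screen:
# (num_text_inputs, num_password_inputs, num_product_tiles, has_search_box)
TRAINING = {
    "login":  [(1, 1, 0, 0), (2, 1, 0, 0)],
    "search": [(1, 0, 8, 1), (1, 0, 12, 1)],
    "cart":   [(0, 0, 3, 0), (0, 0, 5, 0)],
}

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

CENTROIDS = {label: centroid(vs) for label, vs in TRAINING.items()}

def predict_state(features):
    """Return the state label whose training centroid is closest."""
    return min(CENTROIDS, key=lambda label: dist(features, CENTROIDS[label]))

# A username + password form looks like a login screen:
print(predict_state((1, 1, 0, 0)))   # -> login
# A grid of tiles plus a search box looks like a search results page:
print(predict_state((1, 0, 10, 1)))  # -> search
```

A real implementation would of course learn the features themselves from pixels and DOM structure rather than hand-coding them, which is exactly what makes the neural-network version generalize across apps.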
As for your Tesla analogy, which is very appropriate: the basic approach detailed in the post, using only screenshot and DOM as training input, is comparable to Tesla building Autopilot using only camera and radar data, which can deliver pretty good lane-following and parking behavior (roughly Level 2 or 3 autonomy). Like a Tesla’s camera and radar, the screenshot and DOM have the great attributes of being easily human-debuggable and easily labeled signals. More importantly, they are often the only signals relevant to the human users we are working to approximate: my Dad doesn’t know or care what is happening on the SQL server or the network, or which platform libraries the app is calling. And often, neither do most manual black-box testers, who are very useful. That said, the current implementation @appdiff does use many more signals, as you suggest :) This blog was meant as a general introduction and kept things simple in order to share the basic approach.
As for machines ever having a ‘sixth sense’, I think it is entirely possible, and soon. A great example is the ML work around disease diagnosis or evaluation of legal cases. Often, the signals and nuances we as humans pride ourselves on applying to problems are a poor proxy for the same ‘intuition’ encoded in a neural network trained on more patient/legal data than any human could possibly have read. Even if the machine can’t talk to the patient, or hear the passion in a plaintiff’s voice, the machines are making ‘better’ and arguably fairer judgements than their human counterparts. Maybe these other signals are noise, or maybe they only help humans because humans lack the reams of training data that deliver stronger signals than the fuzzy ones we think our decisions and intelligence are based on.
You also made a great observation about ‘wrong labels’ possibly signaling that an app is relatively confusing or poorly designed. We actually do use that metric to alert humans to check out odd parts of apps, just as you predicted. I would just add that in practice these abnormalities are often attempts at innovation by app teams (e.g. more login sequences these days have two screens instead of one: a screen for the username and a separate screen for the password). The networks might flag the first instances of this as abnormal, but they are deliberate features/improvements :)
Great thoughts and questions, love it; you obviously know your stuff as well as or better than we do :) Keep it coming! Heck, implement a smarter version with all these signals and share!