How do people fix their wonky software? They reboot it.
Right out of school, I worked as a software tester on a cable set-top box. I was supposed to test things like the ability to change channels and whether websites were shrunk down to fit on a TV. We were planning to ship millions of these boxes, but what if they had bugs and didn’t work correctly? I asked around and discovered that the average technician call, needed when a reboot wouldn’t fix the problem, would cost 25+ bucks, with each call destroying a year’s profit.
I realized the most important thing to test was that those boxes could reliably reboot. Despite my assignment, and without permission, I was now on a mission to make sure reboots worked reliably so we could save the world. Set-top boxes run 24/7 and accumulate a lot of state that could be buggy; rebooting would fix almost all issues, as long as the reboot itself worked reliably. This obviously had to be the top testing priority.
I talked with one of the firmware engineers hidden down the hall. No one ever spoke to him except at lunch. He also didn’t seem to realize most people ran combs through their hair. And of course, I was perfectly normal. After I shared my fears of failed reboots, we talked about how to automate this testing. But how could a test case execute across reboots? The first attempt was to get a custom build that let the system rest for a bit to settle in after a boot, and then called the reboot() function. Only one line of code at the right spot and Bam! We were able to get a set-top box to reboot forever. Nerd cool.
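For readers who want to picture that loop, here is a minimal sketch in Python. It is not the firmware change itself, which was a single reboot() call in a custom build. The names run_boot_cycle, state_file, and settle_seconds are hypothetical, the reboot hook is a stand-in, and the on-disk boot counter is an assumption added to show how a test can keep track of itself across reboots.

```python
import json
import tempfile
import time
from pathlib import Path


def run_boot_cycle(state_file, settle_seconds, reboot, sleep=time.sleep):
    """Run once per boot: record the boot, let the system settle, then reboot.

    All names here are hypothetical; `reboot` stands in for the device's
    real reboot call.
    """
    path = Path(state_file)
    # State must live on persistent storage, because memory is wiped by each reboot.
    state = json.loads(path.read_text()) if path.exists() else {"boots": 0}
    state["boots"] += 1
    path.write_text(json.dumps(state))
    sleep(settle_seconds)  # give the system time to settle in after booting
    reboot()               # on a real device this call would not return
    return state["boots"]


if __name__ == "__main__":
    # Simulate three boots on a desktop with a stand-in reboot that does nothing.
    with tempfile.TemporaryDirectory() as tmp:
        state_file = Path(tmp) / "reboot_test.json"
        for _ in range(3):
            count = run_boot_cycle(state_file, 0, reboot=lambda: None)
        print(f"completed boot cycles: {count}")
```

In a setup like this, a box that hangs simply stops incrementing the counter, so the number left in the file shows how many cycles it survived.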
At the end of the day, I cannibalized the rare prototype device I was supposed to be testing on, and let the reboot test run overnight. After a few reboot cycles, I headed home. In the morning, it was hung! I didn’t know it would actually catch anything, let alone on the first attempt. This was a tester’s dream, a priority-zero bug, and they just kept coming. But I also hit a problem. Our caveman-looking firmware engineer needed my machine to debug it! I borrowed my cube-mates’ machines when they weren’t using them overnight, and stole some from the lab that I noticed weren’t hooked up yet. Yes, this was finding great bugs. Yes, people were happy and impressed. But the best feeling was knowing that I was adding value while I walked, drove, and slept. This was probably the best ‘bug return on investment’ ever. What if I had followed the assigned testing plan?
Later, we created many variations, like full flash reboots and fast reboots. We also connected debug cables to the serial ports to make grabbing crash info super easy and to catch the bugs in the act. Test nerd heaven.
If your cable box ever stopped working, and you rebooted it to fix the problem, you’re welcome!
— Jason Arbon, Tester