Thursday, February 16, 2023

 

Why I don’t Myth the Old Days (or there’s no accounting for bug fix costs)


This is a commentary on a portion of Mark Curphey’s blog post “On the left, on the right, and wiggle in the middle”.

It’s not that I disagree with Curphey’s overall message that “shift-left is a dangerous urban myth”, but I think perhaps both he and Laurent Bossavit, whose Gist post he quotes, might not be considering the proper context of the presumed myth that “it's 100x cheaper to fix bugs in development than it is in production”. Specifically, I think the citation that Pressman used in his 1981 book was for a completely different era of software development, and I think that makes a significant difference that is not being accounted for. Ever since then, because it benefits ROI marketing hype for certain companies, the figure has been constantly pulled out of context and has taken on a life of its own.

In a nutshell, though, when Pressman originally wrote it (for the worst case scenarios, at least) it probably was close to being in the right ballpark. That doesn’t excuse companies from still repeating it today (and Mark rightfully calls them out), but I don’t doubt those figures were close to being correct back in 1981. So I think there’s a bigger picture that is being missed here.

Let me explain. And for those of you who are not as old as dirt, as I am, let me give you some history.

In the early 1980s, the hardware was archaic by today’s standards and the waterfall methodology for software development was the only game in town. If you were lucky, you got to work on a machine that had 16-bit addressing and either Version 7 or PWB Unix. At (then AT&T) Bell Labs at the time, it was not uncommon for 20+ developers to share a Unix system on a DEC PDP-11/70 (or worse) and connect to it at 9600 baud using DEC VT-100 terminals. Programming was generally done in C or assembly language (or often a mixture) using the ‘ed’ text editor. Compiling the Unix kernel from scratch would take 3 or 4 hours in single-user mode, and much longer on a fully loaded system. The only debugger at the time was ‘adb’, and it only displayed code in the native PDP-11 assembly language. But perhaps most importantly, all software was distributed on 9-track tape.

If you put all those things together, it’s not hard to see how the cost of catching a bug early in the development process could easily be a factor of 10 or more lower than catching and fixing it post-release.

The first project I was on at Bell Labs was called AMARC. (I think AMARC was an acronym for “Automated Messaging and Accounting Center”.) Like most other projects at Bell Labs, AMARC was on a two-year release cycle. When we did a regular AMARC release, the release would be written to 9-track tapes and delivered, often by special courier, to the central offices of most of the Baby Bells (officially known as Regional Bell Operating Companies, or RBOCs). (AMARC collected long distance billing information, and charging for long distance at that time was one of the primary ways that AT&T funded Bell Labs R&D.)

But occasionally a bug surfaced that was serious enough that AMARC had to ship emergency patches. I remember one such patch. AMARC was a proprietary custom duplex real-time operating system (this was before DMERT, for those of you old enough to remember that) that ran on dual PDP-11/70s. There was a bug that was causing AMARC to crash, and when its paired mate rebooted the crashed processor, the pair got stuck in an infinite cycle of reboots: when the freshly rebooted processor came back up, it caused the other one to crash, and then that one would reboot in turn. So there was this wicked cycle of endless reboots until both were shut down. But the bigger problem was that AMARC recorded its billing data (up to 2 hours’ worth) on 9-track tapes. When the machine crashed, that tape would get trashed. So that required an emergency patch. (If I recall correctly, the Bell System called those emergency patches Broadcast Warning Messages, or BWMs for short. Whatever the actual term was, I will henceforth refer to them as BWMs in this post.)

The BWMs were often generated using ‘adb’ in write mode (-w), which allows patching binaries. Back in the day, AMARC would allocate patch space by writing large sections of NOP instructions, which would later be filled in with the actual fix; a jump instruction would then jump to the proper spot. So even if the bug was in some C code (which it often was), the patch would be written in assembly code, and then adb would be used to patch the actual AMARC baseline binary (all the patches were done in octal, by the way!) to create a new point release of AMARC that became part of the BWM package.
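For anyone who has never seen that style of binary patching, here is a rough, modern-day sketch in C of what a BWM patch amounted to. To be clear, this is a hypothetical illustration only: the file name, offsets, and instruction words below are all made up, and the real work was done by hand with adb in octal, not with a little program like this.

    /* Hypothetical sketch of BWM-style binary patching; not actual AMARC code.
     * Idea: the shipped binary reserves a block of NOPs ("patch space").
     * The patch fills part of that space with the corrected code, then
     * overwrites the buggy instruction with a jump into the patch space;
     * the fix ends with a jump back to the original flow. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    /* The offsets and instruction words below are made-up examples. */
    #define BUGGY_INSN_OFFSET   01234L   /* file offset of the bad instruction (octal) */
    #define PATCH_SPACE_OFFSET  07400L   /* file offset of the reserved NOP block */

    /* Write n 16-bit words at the given file offset. */
    static int poke(int fd, long off, const uint16_t *words, int n)
    {
        if (lseek(fd, off, SEEK_SET) < 0)
            return -1;
        return write(fd, words, n * sizeof(uint16_t)) ==
               (ssize_t)(n * sizeof(uint16_t)) ? 0 : -1;
    }

    int main(void)
    {
        int fd = open("amarc.bin", O_RDWR);     /* hypothetical baseline binary */
        if (fd < 0) { perror("open"); return 1; }

        /* 1. Fill part of the NOP patch space with the corrected code.
         *    Illustrative PDP-11 words: mov #2,r0 followed by a jmp back. */
        uint16_t fix[]  = { 012700, 000002, 000167, 0171000 };
        if (poke(fd, PATCH_SPACE_OFFSET, fix, 4) < 0) { perror("write fix"); return 1; }

        /* 2. Overwrite the buggy instruction with a jump into the patch space. */
        uint16_t hook[] = { 000167, 006000 };   /* illustrative jmp into patch area */
        if (poke(fd, BUGGY_INSN_OFFSET, hook, 2) < 0) { perror("write hook"); return 1; }

        return close(fd) == 0 ? 0 : 1;
    }

The whole point of shipping those NOP sections in the baseline was that a fix never changed the size or layout of the rest of the binary: the only edits were filling in reserved space and overwriting the bad instruction with a jump, so every other address stayed exactly where it was.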

The creation of a BWM patch alone was a very error-prone and tedious process. Depending on how complex the accompanying BWM installation instructions were, AMARC management would actually put several Bell Labs engineers on a flight along with their precious 9-track BWM cargo, which they would hand-carry to the RBOC central offices, where they would assist the RBOC in installing it.

So it wouldn’t surprise me in the least if those worst case scenarios added another cost factor of 2 or 3 at a minimum. There were a lot of labor costs involved with those BWMs, as well as travel-related expenses. It was a different day than today, when software is delivered online. Most people forget how much of a pain it was to install software from DVDs, and some may even remember floppy disks, but 9-track tapes were worse. There were reported cases where the engineers would get to a central office and couldn’t install a tape because it turned out the read/write heads on the TU10 tape drive were out of alignment, so they had to send a new tape. (They did try reading the tape on a different drive than the TU10 that had recorded it, to ensure that those heads were within spec.)

Now if a problem like the one I described had been caused by a major design error instead of some coding error, it would have been even more costly, especially if you account for all the lost revenue whenever AMARC crashed and lost up to 2 hours of long distance billing data.

I can’t speak to the IBM figures that Roger Pressman cited because I’m not aware of their development processes or how their revenue stream worked. But if you consider all the additional expenses and all the lost revenue in these worst case scenarios, it’s certainly plausible that a bug found early in the design process could cost 100x or so less to fix than one discovered post-release. Multiply the factor of 10 or more from that slow development environment by the extra factor of 2 or 3 for BWM logistics and you are already at 20x to 30x; pile on the lost billing revenue and the rework from a design-level error and 100x is not hard to reach. It’s a very different world that we live in today.

So, in my mind, if Pressman is to be blamed for anything, it’s that he never bothered to update those figures for present-day development processes. Unfortunately, once outdated hyperbole like this gets started, it takes on a life of its own, so yeah, there’s definitely been damage done.

One last thing... I remember reading and discussing a proprietary Bell Labs Technical Memorandum with a different project team somewhere around 1985-1987 or so. The supervisor of the test team for the Network Control Point (NCP) project, Dennis L. McKiernan [yeah, the same one who has authored many fantasy series novels], brought it to our attention. The details are fuzzy, but I think the author of that Technical Memorandum studied the #5ESS project (which was Bell Labs’ biggest project at the time in terms of lines of code) and reported a more modest factor of 20 between the cost of bugs found and fixed early in software development versus those fixed post-release. I don’t recall the name of that TM’s author, but if you know any old hats from Bell Labs (besides me), they might remember. (Some of the surviving dinosaurs from that era at Bell Labs had much larger brains than I. :) I don’t think the report made it into the BSTJ, but it might have; if we could find the author’s name, we may be able to dig it up, especially if it was ever published there.


-kevin “Mr. TL;DR Dinosaur” wall

P.S.- I wish I could say that the situation of intellectual laziness has improved for those citing statistics in IT / computer science projects, but even papers written by CS professors in academia frequently omit so many details that the experimental results are hard to verify and the experiments are seldom repeatable. (There are some exceptions, but they are far from the norm, especially when it comes to software engineering practice and metrics. Maybe that would make a nice topic for a future Crash Override blog post. I would attempt it, but this one completely wore me out, and it takes a lot longer for dinosaur batteries to recharge. ;-)
