First off, everyone should read the report Jim Gray wrote in 1985 about Tandem NonStop:
http://www.hpl.hp.com/techreports/tandem/TR-85.7.pdf
There’s a sort-of-CliffsNotes summary of the talking points here:
http://mononcqc.tumblr.com/post/35165909365/why-do-computers-stop
There’s a separate but related rant by Richard Cook, “How Complex Systems Fail”:
http://www.ctlab.org/documents/How%20Complex%20Systems%20Fail.pdf