When we buy a house, we also buy insurance on the house. Catastrophic events can not only leave the homeowner without shelter, but can also be financially ruinous. In the event of a catastrophe, the homeowner has to respond quickly to contain any problems, and be able either to make repairs herself or to mobilize people and materials to fix the damage. Repairs require building materials and a wide variety of skills (electrician, mason, carpenter, decorator). Most homeowners don't have the construction skills necessary to perform their own repairs, so they would have to set aside large capital reserves as a "rainy day fund" to pay for them. But homeowners don't have to do this, because we have homeowner's insurance. In exchange for a little bit of cash flow, insurance companies make policyholders whole in the event of a catastrophe: they provide temporary shelter, and they supply the capital and even the expertise to make repairs in a reasonable period of time.
When we build software, we don't always buy insurance on the software. For software we build, we underwrite the responsibility for making sure it is technically sound and functionally fit, secure and reliable, and that it works in the client and server environments where it needs to operate. As builder-owners, we carry that responsibility for the entire useful life of the software. The obligation extends to all usage scenarios we will encounter, and to all environmental changes that could impair the asset. If regulation changes and we need to capture additional data, if a nightly data import process chokes on a value we hadn't anticipated, if our stored procedures mysteriously fail after a database upgrade, if the latest release of Firefox doesn't like the JavaScript generated by one of our helper classes, it's our problem to sort out. There is nobody else.
In effect, we self-insure the software assets we create. When we build software, we underwrite the responsibility for all eventualities that may befall it. Self-insuring requires us to retain people who know the technology, configuration and code; the integration points and functionality; the data and its structures; and the business and its rules. It also requires us to keep enough people that we are resilient to staff turnover and loss, and that we can be responsive during periods of peak need (the technology equivalent of a bad weather outbreak). Things may be benign most of the time, but in the event of multiple problems we must have enough knowledgeable people to respond quickly so that the business continues to operate.
The degree of coverage we take out is a function of our willingness to invest in the asset to make it less susceptible to risk (preventative measures), and of our willingness to spend on retaining people who know the code and the business, whose job is to perpetuate the asset and nothing else (responsiveness measures). Together these determine the premium we are willing to pay to self-insure.
In practice, this premium is a function of our willingness to pay, not of the degree of risk exposure we are explicitly willing to accept. The distinction matters because this is often an economic decision made in ignorance of actual risk. Tech organizations are not particularly good at assessing risks, and usually take an optimistic line: software works until it doesn't. If we're thorough, previously unforeseen circumstances are codified as automated tests to protect against a repeat occurrence (a sketch of what that looks like follows below). If we're not, we fix the problem and assume we'll never have to deal with it again. Even when we are good at categorizing our risks, we don't have much data to shed light on our actual exposure, since most firms don't formally catalogue system failures. The reference data we do have is spurious: just as a driver's accident history excludes near misses, our assessments tend to be highly selective. And just as a home inspector can miss the conditions that will put water in the basement, our experts will misjudge the combination of events likely to impair the software (who in 2006 predicted the rise in popularity of the Safari browser on small screens?). On top of it all, we can live in a high-risk world yet lead highly fortunate lives in which risks never materialize. Good fortune dulls our risk sensitivity.
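To make the "codified as automated tests" point concrete, here is a minimal sketch of turning a production surprise into a permanent, executable check. The import routine, the offending value, and the names are all hypothetical stand-ins, not any particular firm's code; the point is that the nightly-import failure mentioned earlier stops being folklore and becomes part of the asset.

```python
# Hypothetical regression test: after a nightly import choked on an
# unanticipated blank value, we codify that exact case so it cannot
# silently recur. parse_import_row is a stand-in for the real routine.
import unittest


def parse_import_row(raw):
    """Toy import parser: expects 'account_id,amount' with a numeric amount."""
    account_id, amount = raw.split(",")
    # The original code assumed the amount was always numeric; blank values
    # from the upstream feed caused the nightly job to abort.
    return account_id.strip(), float(amount) if amount.strip() else 0.0


class NightlyImportRegressionTest(unittest.TestCase):
    def test_blank_amount_does_not_abort_the_import(self):
        # The exact shape of data that broke the import in production.
        account_id, amount = parse_import_row("ACCT-1234, ")
        self.assertEqual(account_id, "ACCT-1234")
        self.assertEqual(amount, 0.0)


if __name__ == "__main__":
    unittest.main()
```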
The result is that the insurance premium we choose to pay is based largely on conjecture and feeling, not on any objective assessment of our vulnerability. Most people in tech (and outside it) are not really cognizant that we're self-insuring, of what we're self-insuring against, of the responsibility that entails, or of the catastrophic risks it poses. Any success at self-insuring software assets has less to do with thoughtful decision making than with luck. If operating conditions are benign and risks never manifest, our premium looks appropriate, even like a luxury. If, on the other hand, dozens of impairments hit the asset and we haven't paid a premium for protection, our self-insurance decision looks reckless.
Insuring against operating failures is difficult to conceptualize, more difficult to quantify, and even more difficult to pay for. We struggle to define future operating conditions, and the most sophisticated spreadsheet modeling in the world won't shed useful light on our real risk exposure. Willingness to pay a premium typically comes down to a cost-based narrative: how few people are we willing to keep to fix things? This minimal-cost approach is risk ignorant. A better first step in self-insuring is to switch to an outcome-based narrative: what are the catastrophes we must insure against, and what is the income-statement impact should they happen? This measures our degree of self-insurance against outcomes, not costs.
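One way to make the outcome-based narrative concrete is to enumerate candidate catastrophes, put a rough annual likelihood and income-statement impact against each, and set the resulting exposure next to the premium we are actually paying (the fully loaded cost of the people we retain). The scenarios and figures below are entirely hypothetical illustrations; the mechanics, not the numbers, are the point.

```python
# Hypothetical sketch of an outcome-based self-insurance assessment.
# Each scenario carries a rough annual probability and an estimated
# income-statement impact; the probability-weighted sum is our expected
# annual exposure, which we compare to the "premium" we pay by retaining
# people to respond. All names and figures are made up for illustration.

scenarios = [
    # (description, annual probability, income-statement impact in $)
    ("Regulatory change forces capture of new data",    0.30, 400_000),
    ("Database upgrade breaks stored procedures",       0.10, 250_000),
    ("Browser release breaks generated JavaScript",     0.20, 150_000),
    ("Nightly import failure corrupts reporting data",  0.15, 600_000),
]

expected_exposure = sum(p * impact for _, p, impact in scenarios)

annual_premium = 350_000  # e.g. two retained developers, fully loaded

print(f"Expected annual exposure:      ${expected_exposure:,.0f}")
print(f"Annual self-insurance premium: ${annual_premium:,.0f}")
if annual_premium < expected_exposure:
    print("Premium is below expected exposure: we are underinsured.")
else:
    print("Premium covers expected exposure under these assumptions.")
```

Even a crude table like this forces the conversation onto outcomes: it makes explicit which catastrophes we are choosing to absorb, and whether the people we retain are a considered premium or simply the smallest number we could get away with.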