Admiral Hyman Rickover (1900-1986), the “Father of the Nuclear Navy,” was controversial. He taught a whole generation of the best Engineers. Some of my thoughts on this.
Admiral Hyman Rickover (1900-1986), the “Father of the Nuclear Navy,” was one of the most successful, and most controversial, public managers of the 20th century. His accomplishments are the stuff of legend. For example, in three short years, Rickover’s team designed and built the first nuclear submarine, the Nautilus, an amazing feat of engineering given that it required developing the first nuclear reactor to power a ship. The Nautilus not only transformed submarine warfare, but also laid the groundwork for a whole fleet of nuclear aircraft carriers and cruisers (which were also built by Rickover and his team).
Admiral Rickover passed in 1986, but his impact reached far beyond his grave.
I trained in Rickover’s Nuclear Power Program. I went through Nuclear Power School in Orlando, Florida in 1986 and Nuclear Prototype in Idaho Falls, Idaho in early 1987. I served on the USS Richard B. Russell (SSN-687) from 1987 to 1991. I qualified as Reactor Operator/Shutdown Reactor Operator (RO/SRO) and Engineering Watch Supervisor (EWS). That experience shaped my whole way of looking at both Engineering and Engineering Management. I am who I am today because of those experiences and that training.
I was the on-watch Reactor Operator when this photograph was taken. USS Richard B. Russell (SSN-687) circa 1990
Rickover’s Seven Rules
The Admiral had Seven Rules. Here are my notes on how they apply equally to running a software or SaaS business, with some reflections on my current gig.
You must have a rising standard of quality over time, and well beyond what is required by any minimum standard.
“Good enough” is not good enough. NOTE: I did not say be perfect, and I certainly did not say not to strive for “good enough to ship.” Those are truths of software. But you must strive to INCREASE your quality over time. And that requires constant, unwavering effort and leaders who will demand that things get better. And better. And better.
People running complex systems should be highly capable.
Embedded systems and cloud-based distributed systems are complex by their nature. I know; I’ve built them for my whole career, and I lead all the software efforts at BrightSign, with both embedded Linux and a large cloud footprint (and yes, we are hiring). The folks building and running these must be highly capable. That’s worth a whole posting by itself, but I do cover some of my thoughts on this topic in my personal management README.
Supervisors have to face bad news when it comes and take problems to a level high enough to fix those problems.
This is where a LOT of problems happen. Bad news is like cheap wine: it does NOT age well. Leaders have to be courageous in the face of bad news. You must appropriately escalate that bad news - to the right level. This is where JUDGEMENT comes in. Leaders must have the good judgement to know what problems to escalate, and how high. A P3 bug needs one level. Discovering a potential security exploit in your code base is a very different level.
You must have a healthy respect for the dangers and risks of your particular job.
Fire in a submarine is terrifying. It uses all the oxygen and the smoke is toxic. Fire drills are the most common drill submariners do. Every week. Sometimes every day.
Running a SaaS is not physically dangerous. Since my submarine days I have not had to worry about fighting a fire, plugging a flooding pipe, dealing with a steam rupture, or dealing with radioactive contamination. Let alone keeping electricity generation and propulsion running in a war-fighting situation. But building software does have risks and dangers. If you don’t catch a critical bug in the embedded firmware, you could lose millions of dollars of revenue or profit when fixing it. If you miss a security vulnerability, your company could get hacked and lose everything. And the risks go down from there.
Many times in my career I have seen a lack of appropriate respect for these dangers and risks. Some places have been really lucky. Some have had those risks turn into disaster. The absolute beginning point of dealing with these is to have a healthy respect for them. In submarines it was not just respect, but paranoia and obsession. It was truly a case where what you did not know could kill you. I may not die from software risks, but I don’t want my company to.
Training must be constant and rigorous.
On submarines, we trained constantly. This is a “wet trainer” exercise done in a special training facility. Practice is critical.
This also gets missed. Teams just assume they can troubleshoot. They assume that since they built it, they know the system. You remember the saying about “assume,” right? This has manifested for me quite recently and brutally reminded me that you only know you can do something if you have actually done it recently. There is no substitute for practice, and that means actually troubleshooting and repairing problems in your software or SaaS. AWS does “Game Days” and I want to look at those again closely. We are also putting a lot of effort into creating disposable copies of production for testing. I plan to use those same copies to inject faults into the system and have my team troubleshoot them. PRACTICE. There is no substitute.
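To make the fault-injection idea concrete, here is a rough sketch of what a drill harness could look like. Everything in it is hypothetical: the “environment” is just a dictionary standing in for a disposable copy of production, and the service names and fault catalog are made up for illustration. It is not how we actually do it, only the shape of the idea: pick a fault the team does not know about in advance, break the copy, and hand it over.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a disposable copy of production:
# service name -> True if healthy, False if broken.
Environment = Dict[str, bool]


@dataclass(frozen=True)
class Fault:
    name: str                                # label kept secret from the team during the drill
    inject: Callable[[Environment], None]    # function that breaks the environment


def make_environment() -> Environment:
    """Build a fresh, fully healthy copy (illustrative service names)."""
    return {"api": True, "database": True, "queue": True}


def break_service(service: str) -> Callable[[Environment], None]:
    """Return an injector that marks one service as unhealthy."""
    def _inject(env: Environment) -> None:
        env[service] = False
    return _inject


# Made-up catalog; a real one would come from your own incident history.
FAULT_CATALOG: List[Fault] = [
    Fault("api-down", break_service("api")),
    Fault("database-down", break_service("database")),
    Fault("queue-down", break_service("queue")),
]


def run_drill(rng: random.Random) -> Tuple[Fault, Environment]:
    """Create a fresh environment and inject one randomly chosen fault."""
    env = make_environment()
    fault = rng.choice(FAULT_CATALOG)
    fault.inject(env)
    return fault, env


def unhealthy_services(env: Environment) -> List[str]:
    """What the troubleshooting team should eventually report back."""
    return sorted(name for name, healthy in env.items() if not healthy)


if __name__ == "__main__":
    fault, env = run_drill(random.Random())
    print("Team diagnosis should be:", unhealthy_services(env))
    print("Injected fault was:", fault.name)
```

The design choice that matters is that the drill runner, not the team, knows which fault was injected, so the exercise tests troubleshooting rather than memory.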
All the functions of repair, quality control, and technical support must fit together.
I’m thinking about this a lot as I ponder how to best organize our Engineering resources. I’m rebuilding our QA effort to focus heavily on automation - but especially hiring Software Developers in Test (SDET) instead of just classic QA automation. And I’ve invested in building Swagger/OpenAPI specs and trying to use our own libraries to build our test canaries - and have been doing that work from within our Partner Integration Engineering team. The folks who support our technology partners should be writing code to use the same APIs. I find myself wondering where the line is between “Development” and “QA” and “Partner Support.” Rickover taught me that these all must fit together.
The organization and members thereof must have the ability and willingness to learn from mistakes of the past.
This. This. This. We have started to use incident.io to manage our incidents live, and to have the right data available for a post-mortem. But that’s not enough. Recently we had an issue that took too long to troubleshoot and fix. We did a post-mortem. But have we done all the right review and training, all the way up through management, to ensure that we won’t make the same mistakes? Probably not. And that’s my fault. That’s my job. Those kinds of things don’t happen by accident, and it means I have another important thing on my plate this week.
The bottom line is that what you see regularly is just the tip of the iceberg. Your status reports, Jira backlog, Grafana dashboards, whatever - those are all just the surface. The parts you need to worry about are all the things you cannot see easily. Knowledge levels. Skill levels. The degree to which your second-tier Engineers can troubleshoot things (because your top-tier folks may not always be available). Whether your database backups can actually be recovered. Regional failover. Key rotations. Those are all hard to inspect unless you ACTUALLY PRACTICE THEM. I find myself thinking about this a lot.
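As one small illustration of “actually practice them,” here is a minimal sketch of a backup restore drill. It assumes SQLite purely so the example is self-contained (the table and data are invented, and a real drill would restore from your actual backup artifacts into a scratch instance). The point is that the check restores the data and inspects the result, instead of trusting that a backup job reported success.

```python
import sqlite3


def make_source() -> sqlite3.Connection:
    """Create a small in-memory database to stand in for production (invented data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customers (name) VALUES (?)",
        [("alice",), ("bob",), ("carol",)],
    )
    conn.commit()
    return conn


def restore_drill(source: sqlite3.Connection) -> dict:
    """Copy the source into a scratch database, then verify the copy is usable."""
    restored = sqlite3.connect(":memory:")
    try:
        # Connection.backup copies the whole database into the target connection.
        source.backup(restored)

        # Check 1: SQLite's own structural check on the restored copy.
        integrity = restored.execute("PRAGMA integrity_check").fetchone()[0]

        # Check 2: the restored copy holds the same number of rows as the source.
        count_sql = "SELECT COUNT(*) FROM customers"
        source_rows = source.execute(count_sql).fetchone()[0]
        restored_rows = restored.execute(count_sql).fetchone()[0]
    finally:
        restored.close()

    return {
        "integrity": integrity,
        "source_rows": source_rows,
        "restored_rows": restored_rows,
        "ok": integrity == "ok" and source_rows == restored_rows,
    }


if __name__ == "__main__":
    print(restore_drill(make_source()))
```

A row count is a weak check on its own; it is here only to show that the drill should compare restored data against something known, on a schedule, not just once.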
Admiral Rickover had great lessons to teach (regardless of whether you considered him abrasive or abusive). Those lessons resonate with me 25 years after I was first exposed to them. And they apply to software and SaaS just as much as they do to Nuclear Power. I would argue that they apply to your business too, no matter what that might be. Some things are just eternal truth.
Rest in Peace Admiral. And thank you. And thanks to all the fine Submarine Officers and Chiefs that carried those lessons onward to the following generations.