Disposability - the Missing ility

24 Oct 2020, 00:00

cloud / devops

You will get the most bang for the buck from cloud by destroying it.

Bold Statement

That’s right: you will get the most bang for the buck from cloud by destroying it. Disposability may be the most important of the “ilities” if you care about your overall development velocity. I talk about this a little bit in my Cloudy DevOps post but I don’t think I hammered the point well enough. Certainly not in my day job, where I found out this week that we are living with pain around that.

What is Disposability?

Simply put: the ability to easily and completely destroy a set of cloud resources you have deployed - and then automatically recreate them. We focus so much on automating the CREATION, but disposability is all about being able to easily destroy it again. In AWS you can do this with CloudFormation using Stacks and StackSets. Terraform is a similar tool and is my preferred solution. But regardless of the tool you use, automating both the creation and the destruction is a core cloud best practice.
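
To make that concrete, here is a minimal sketch of a create-use-destroy cycle driven from Python. It assumes the Terraform CLI is on your PATH and that the working directory holds a valid configuration. The names `terraform_command` and `create_then_destroy` are mine, purely for illustration, and the `run` parameter exists only so the cycle can be exercised without touching real infrastructure.

```python
import subprocess
from typing import Callable, List


def terraform_command(action: str) -> List[str]:
    """Build a non-interactive Terraform command line for apply or destroy."""
    if action not in ("apply", "destroy"):
        raise ValueError(f"unsupported action: {action}")
    # -auto-approve skips the confirmation prompt; -input=false disables
    # interactive prompts for missing variables.
    return ["terraform", action, "-auto-approve", "-input=false"]


def _run(cmd: List[str], workdir: str) -> None:
    """Default runner: execute the command in workdir, raising on failure."""
    subprocess.run(cmd, cwd=workdir, check=True)


def create_then_destroy(
    workdir: str,
    use: Callable[[], None],
    run: Callable[[List[str], str], None] = _run,
) -> None:
    """Create the resources, do some work, and always tear them down."""
    run(["terraform", "init", "-input=false"], workdir)
    try:
        run(terraform_command("apply"), workdir)
        use()
    finally:
        # Runs even if apply or use() fails, so a partial apply is cleaned up.
        run(terraform_command("destroy"), workdir)
```

The point of the `finally` is that destruction is not an afterthought: it is part of the same unit of automation as creation.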

It’s funny, actually, to talk about “serverless” since it’s the epitome of disposability. Consider AWS Lambda, for example. You literally set certain code to execute on a certain event, and the compute resources are created, run your code, and are disposed of when done. Serverless embodies disposability.

Benefit to Velocity - Decouple!

It’s easy to understand that cost control is a huge benefit. Teams can spin up cloud resources and when they are done with them, perhaps for the day, they can dispose of them. You don’t get charged for resources that no longer exist.

But the biggest benefits may come from a net increase in development velocity. If you can deploy and destroy atomic units of your solution automatically then the team responsible for that unit can work independently. They can build and test that unit with no dependency on any other team. More importantly, the other services that they depend on can likewise be deployed BY THAT TEAM in a development environment for testing.

An example, only slightly contrived: you have three teams. One is the front-end browser app, one is the application software team, and one handles the database and the ingesting and indexing of the data. The front-end team are already used to a model where they can get a clean runtime environment merely by closing a tab. The app team, however, can get substantial value from treating the entire data layer as a disposable object.

Assume you can run a script that instantiates the whole data layer automatically, and that by passing in a parameter the script can populate the database to a known state (last production backup, clone of a running system, load of a specific test dataset, whatever). The App team can then spin up a database, run a set of tests against their code, and then just destroy that database. They can also automate this and run a set of tests against a specific set of known database states. That too is a best practice, as part of a CI/CD pipeline. But if you cannot easily dispose of the resources and then re-create them, none of this works.
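
Here is a rough sketch of what that looks like from the App team's side. To keep it runnable anywhere, a throwaway SQLite file stands in for the real data layer; in practice the setup and teardown would call your infrastructure scripts instead. The `SEEDS` table, the `users` schema, and the function names are all hypothetical.

```python
import contextlib
import shutil
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical named database states. A real script might restore the last
# production backup or load a specific test dataset instead.
SEEDS = {
    "empty": [],
    "two_users": [("alice",), ("bob",)],
}


@contextlib.contextmanager
def disposable_data_layer(seed: str):
    """Create a throwaway database in a known state, then destroy it."""
    workdir = Path(tempfile.mkdtemp(prefix="data-layer-"))
    conn = sqlite3.connect(workdir / "app.db")
    try:
        conn.execute("CREATE TABLE users (name TEXT NOT NULL)")
        conn.executemany("INSERT INTO users (name) VALUES (?)", SEEDS[seed])
        conn.commit()
        yield conn
    finally:
        # Teardown always runs, whether the tests pass, fail, or blow up.
        conn.close()
        shutil.rmtree(workdir)


def count_users(conn: sqlite3.Connection) -> int:
    """Stand-in for the application code under test."""
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


# Run the same check against each known state, disposing of the data
# layer after every run.
for seed, expected in [("empty", 0), ("two_users", 2)]:
    with disposable_data_layer(seed) as conn:
        assert count_users(conn) == expected
```

Each test gets a fresh data layer in exactly the state it asked for, and nothing is left behind for the next run to trip over.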

If your teams can do this, you don’t need to maintain a “staging” environment. You can create one on demand. You don’t need to have teams bottle-necked waiting for a certain database load to be restored by the Ops team for testing, or for a newer version of another micro-service to be deployed so you can test your implementation of their new API version.

This is literally decoupling the development process the way that micro-services decouple system design. Monolithic software systems can bog down from one poorly implemented part. We decouple our code to solve that. We should also decouple our development processes, and Disposability is the key way to enable that. This is a concrete example of “thinking like a Software Engineer” - only applied to your overall workflows not just to your code.

Accidental Gravity Wells

My day job fell into a gravity well around all this, totally accidentally. We are giant believers in the “Infrastructure as Code” approach, and we keep our Terraform scripts in GitHub. We also use Terragrunt to help us keep our code DRY and to modularize it. We enforce a code review process so that changes cannot be merged without a successful approval. We isolate the humans from the running of the code using Atlantis. This operational discipline is a good thing.

But. There’s almost always a but. Atlantis is great for production, but for development it adds a wrinkle. Atlantis manages the Terraform state, so the “right” way to dispose of what it created (to do a “terraform destroy”) also requires a pull request and a code review. The change to the scripts is only one line, but then the code repo master branch has that one-line change. I guess this is not a bad thing, since master then reflects the actual state of the infra. But it just feels wrong. It’s not as easy as it seems like it should be. If we were simply using raw Terraform in a main.tf it would be a lot easier, but we do a lot of custom modules and wrap it in Terragrunt. So it’s less than ideal.

And that matters, since it looks like making it harder to dispose of cloud resources resulted in teams not actually disposing of things, which in turn meant the processes never got decoupled. When you are transforming a development organization to agile cloud processes, you can’t make the new way harder than the old one. If you do, teams just fall into the gravity well of doing things the way they always have. And that’s really hard to spot unless you are in the trenches with the teams. You may not even see it in your cloud costs. It will show up later in slower overall velocity. And that’s bad.

Conclusion

Just moving your infra to cloud is what I call “Cloud 1.0” and if that’s all you do that’s great. But if you really want to get the actual benefits of cloud then embrace Disposability. There’s a learning curve and some up-front investment, but it will pay off downstream in vastly improved development velocity as well as product quality. And it then enables you to do work on all the other “ilities” too. And in the spirit of “thinking like a Software Engineer”, none of this is “done done” unless it’s been tested. And that means creating and destroying in at least a few accounts and ensuring you haven’t left any orphaned resources behind.
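
One way to test for orphans, sketched below, is to snapshot the resource inventory before the cycle and diff it against the inventory afterwards. Everything here is an assumption for illustration: `list_resources` is a hypothetical callable that returns the IDs currently in the account (for example from a tag-filtered query against your provider), and `create` and `destroy` wrap whatever automation you use.

```python
from typing import Callable, Iterable, Set


def find_orphans(before: Set[str], after: Set[str]) -> Set[str]:
    """Resources that exist after the cycle but did not exist before it."""
    return after - before


def orphans_after_cycle(
    list_resources: Callable[[], Iterable[str]],
    create: Callable[[], None],
    destroy: Callable[[], None],
) -> Set[str]:
    """Run one create/destroy cycle and report anything left behind."""
    before = set(list_resources())
    try:
        create()
    finally:
        # Attempt the teardown even if creation failed partway through.
        destroy()
    return find_orphans(before, set(list_resources()))
```

An empty result is what you want. Anything else is a resource your destroy step does not know about, and it is much cheaper to find that out in a sandbox account than on the monthly bill.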

Of course, if you are jumping past VM and k8s deployments towards serverless you are already doing disposability anyway. That’s a topic for another day.