Thoughts on CI/CD

01 Mar 2022, 00:00

technical / SDLC / software development / CI/CD

Once again I am thinking about the overall software development/deployment life cycle, which brings me back to CI/CD.

I should be building out a second MQTT server and setting it up as a failover. I’ve written before about MQTT being a SPOF. But I am afflicted with the curse of the software developer: laziness. What I really should do is solve it correctly, forever, with a means to automatically deploy it. I need it in TWO locations, and if it’s useful to me it probably is useful to others, so I should automate it. But then I get busy in my day job and then I write blog posts and I still have not done it. Not to mention re-writing some IPTABLES rules so I can more easily see my remote cameras. But I digress.

Speaking of the day job, I have similar problems there. I’m writing a ton of sample code for some cool new things (no leaks here!). I prefer Terraform over CloudFormation. But both are what I would call “first generation” IaaS tools. All they really do is deploy infrastructure. A “second generation” tool would be the AWS Cloud Development Kit (CDK). Its new v2 with constructs is quite nice. It even has some minimal CI capability, especially in the Amazon Lambda Node.js Library, where you can bundle your TypeScript into the Lambda. That’s been very helpful, especially as an easier way to use TypeScript for Lambdas. Don’t worry, there’s a whole pile of example code and blogs coming around that. Patience, and wait a few weeks for a certain trade show… I’m especially interested in the new CDK Pipelines that let you build CD pipelines for CDK scripts. Those move what is a developer tool (‘cdk init’) to something you can really use in production.

But it leaves a lot to be desired. All of that works for certain kinds of cloud workloads. But not all.

On-Premise Environments

CDK works for an AWS cloud environment, but can’t work for my on-prem environment. That’s not totally true either: there is the AWS CodeDeploy Construct Library that I really should try. But that won’t work easily for my environments since the whole VPN thing is complicated. Both my ISPs operate on private address space, so doing a VPN to a cloud is complicated. I’ll try to blog about that another time. Plus, I don’t really want to do a cloud-based deployment for stuff that runs totally on-premise. It’s just introducing another dependency. So CDK isn’t an especially good solution for my home problems.

Cloud Native

If I was going to do things “right” for the MQTT problem, I would set up a three-node Kubernetes cluster and let it do the work for me. I have not tried it, but there are examples of config files and even a Helm chart to get started.

But hoo boy, now we dive down the rabbit hole. How to manage that? Go all the way to GitOps? Not familiar? GitOps is the operational practice that says NOTHING in your deployment is valid unless it got there through a git check-in, usually of a Helm chart. For the newbies, the Linux Foundation has a course that is probably generic enough. I spoke this morning to a great engineer who is now leading a DevOps team at a startup, and they are all in on GitOps. They use ArgoCD and their developers love it. Great GUI. We went that route at Cisco when I was there too, at least in the group I was last in. Flux would probably be more my style, since I am a command line person more than a GUI person. Both have come far in the last few years.
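To make the GitOps idea concrete, here is a minimal sketch of the reconcile step: compare what git says should be deployed against what is actually running, and plan the changes needed to close the gap. This is not how ArgoCD or Flux are implemented. The function name, the release names, and the idea of reducing each release to a single version string are all assumptions made for illustration.

```python
# Hypothetical sketch of a GitOps-style reconcile step.
# "desired" stands in for what is checked into git; "live" for what is running.

def plan_sync(desired: dict[str, str], live: dict[str, str]) -> list[tuple[str, str]]:
    """Return the (action, release) pairs needed to make live match desired."""
    actions: list[tuple[str, str]] = []
    for name, version in sorted(desired.items()):
        if name not in live:
            actions.append(("create", name))   # in git, but not deployed yet
        elif live[name] != version:
            actions.append(("update", name))   # deployed, but drifted from git
    for name in sorted(live):
        if name not in desired:
            actions.append(("delete", name))   # deployed, but never checked in
    return actions


# Illustrative data only: the release names and versions are made up.
desired_state = {"mqtt-broker": "v2", "camera-proxy": "v1"}
live_state = {"mqtt-broker": "v1", "debug-shell": "v1"}
plan = plan_sync(desired_state, live_state)
```

The last loop is the part that makes it GitOps rather than plain deployment: anything running that did not arrive through a check-in gets planned for removal.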

But What About Cloud-Cloud?

Back to my day job and CDK. It’s not Kubernetes (k8s for short). In fact, in all of 2020 as a leader of Solution Architects helping literally thousands of customers, I only saw a few customers using k8s. My advice for anyone writing new software would definitely be to consider k8s, especially if you have the developers who can embrace it. But most customers were early in their cloud journey and they were buying at least some of their software. Or they were migrating to cloud and at least some of the software was “as a service.”

In those scenarios, the “Cloud Native” solutions can’t help a bit. And they don’t deal with IaaS at all. That would be a problem for someone wanting to build major communications software using my day job product (the Amazon Chime SDK), especially the telephony parts. The allure of our offering is that you can run a script and create real phone numbers in the cloud that you can then programmatically control. But those are cloud resources, and “cloud native” tools like ArgoCD/Flux/Helm/whatever can’t help at all. Same thing if you need to create S3 buckets, for example. The CD tooling can deploy the SOFTWARE, but not the infra. CDK can do both, but cannot do k8s. Both have problems if you also want to deploy on-premise.

In short, there isn’t a good tool that works for both.

But What About Cloud Services?

Again, a day job problem. Under the hood, cloud services are web services (usually but not always REST services). Software that consumes those services is configured with the “production” endpoint (URL) and stuff just works. Some new services are in beta (or gamma), and the way those are exposed is through a different “endpoint” which often is “allow-listed” somehow to limit who can use it. This is extremely common in any non-trivial system composed of microservices. The difference between “environments” (alpha, beta, gamma, stage, production, whatever you call it) is often the endpoint you use.

So, a proper CI/CD system would also keep track of the “environment” and have a way to store the endpoints for different services, so that they could be discovered by software. There’s software that enables this. Consul and Zookeeper come to mind. K8s uses etcd for this. So what, now I have to roll that out too, in order to do a proper staged pipeline of CD? Hmmm. Lots of extra work there, and for darn sure not something I would do in a small on-premise network.
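The core of what I am asking for is small. Here is a minimal in-memory sketch of an endpoint store keyed by environment and service. It is nowhere near what Consul, Zookeeper, or etcd do (no replication, no watches, no auth), and the class name, method names, service name, and example.com URLs are all hypothetical.

```python
# Hypothetical sketch: a per-environment endpoint store that software can query.

class EndpointRegistry:
    """Maps (environment, service) pairs to endpoint URLs."""

    def __init__(self) -> None:
        self._endpoints: dict[tuple[str, str], str] = {}

    def register(self, environment: str, service: str, url: str) -> None:
        """Record the endpoint a service exposes in one environment."""
        self._endpoints[(environment, service)] = url

    def resolve(self, environment: str, service: str) -> str:
        """Look up an endpoint; fail loudly rather than guess another environment."""
        try:
            return self._endpoints[(environment, service)]
        except KeyError:
            raise LookupError(
                f"no endpoint for {service!r} in {environment!r}"
            ) from None


# Illustrative data only: these URLs are placeholders, not real endpoints.
registry = EndpointRegistry()
registry.register("beta", "telephony", "https://telephony.beta.example.com")
registry.register("production", "telephony", "https://telephony.example.com")
beta_url = registry.resolve("beta", "telephony")
```

Failing on a missing entry is a deliberate choice in this sketch: silently falling back to the production endpoint from a beta stage is exactly the kind of mistake a staged pipeline is supposed to prevent.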

In short, the tools don’t support this key need.

Counter-Arguments, and Answers (which may be just more questions)

“Traditionalist Cloud Practitioners” (ha, have we done cloud long enough for those folks to really exist yet?) would argue that except for small companies you need “separation of concerns” and that the Infra team needs to be different from the software team, blah blah blah. Sure. The skill sets are different. But if the software developers are not intimately familiar with the infra they need, and more importantly the security limitations (IAM rules, in AWS speak), there’s a natural gap that leads to friction and misunderstandings. If not during development, definitely during an incident. The infra folks need to understand what the software is doing too, for the same reasons. Bosses should care deeply, since the ability to increase your feature velocity is directly connected to the most important (for speed) “ility” there is: disposability. I’ve written about that before. So you see, the whole idea that we can use CI for the developers and CD for the Infra/Ops team is nonsense if you want to go fast, test better, and increase your confidence in reliability. They need to work together. In fact, they need three things to be “DevOps” in my opinion. Read about that here.

But what about Jenkins? Well, it’s getting long in the tooth. How are you going to deploy Jenkins itself? And it’s just CI. It does not touch CD, let alone IaaS deployment.

But Terraform! Yes. Aside from the fact that it’s its own language, it is amazing. It does IaaS deployment very well. But it can’t touch the other parts.

Practical Implications on Agile Software Development

Building software is hard. Operating software is harder. Doing it well is harder still. Doing it at scale is herculean. Adopting agile methods has proven to help, though. But that’s a process for how humans work. I especially admire the ebook GitOps 2.0 by CodeFresh. I’ve not yet looked at their product, but in the book they eloquently talk about the need for “feature observability.” In a practical software shop, management will ask “how long will it take to add feature X?” Agile can answer that, usually. But there’s a lack of tooling to answer “which code release has feature X in it?” and “at which stage in our pipeline is feature X?” This is because features tracked in the agile tools are not tagged to actual CI builds, and CI build tags are not visible in tools that show what’s in various pipeline/environment stages. I see this in my day job all the time - and at every shop I’ve been a part of. Remember, “done done” means “in production, doing real things for real customers.” Anything short of that and the feature is not done. But the developers THINK it’s done, since it passed their tests and the first stage of your pipeline. Tracking that last bit is a gap most shops don’t have a good solution for.
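The two questions above are easy to answer once features are tied to builds and builds are tied to stages. Here is a minimal sketch under some loud assumptions: build numbers increase monotonically, history is linear (no reverts or cherry-picks), and a feature is present in every build at or after the one where it first landed. The stage names, function names, and build numbers are all made up for illustration.

```python
# Hypothetical sketch of "feature observability" across pipeline stages.

PIPELINE = ["alpha", "beta", "gamma", "production"]  # assumed stage order


def stages_with_feature(first_build: int, deployed: dict[str, int]) -> list[str]:
    """List the stages whose deployed build already contains the feature.

    first_build is the CI build number where the feature first landed;
    deployed maps each stage name to the build number running there.
    """
    return [
        stage
        for stage in PIPELINE
        if stage in deployed and deployed[stage] >= first_build
    ]


def is_done_done(first_build: int, deployed: dict[str, int]) -> bool:
    """A feature is 'done done' only when production is running it."""
    return "production" in stages_with_feature(first_build, deployed)


# Illustrative data only: which build each stage is currently running.
deployed_builds = {"alpha": 45, "beta": 42, "gamma": 42, "production": 40}
feature_x_stages = stages_with_feature(42, deployed_builds)
feature_x_done = is_done_done(42, deployed_builds)
```

With this data, a feature that first landed in build 42 has reached alpha, beta, and gamma but not production, so the developers think it is done and the customers have never seen it. The hard part in a real shop is not this lookup. It is getting the agile tool to record the first build number at all.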

So What Is the Answer?

I’m not sure yet. And it’s stalling me, since I really should get my home MQTT servers to be redundant, and I can’t stomach doing it with IP-address takeover and configuring my Ubuntu boxes manually. I’ve been spending so much of my time on CDK, as well as learning how to develop services in TypeScript. My instinct is that the ultimate answer will be written in Go, but that might just be wishful thinking. Here are some things that I wish I had in such a tool:

works on premise and in cloud (best: able to support multi-cloud)
supports containers, because that’s how software should be packaged these days
supports k8s as an option, not as a requirement - should support running with systemd for isolated services
has a secrets and configuration store that can be auto-discovered (suitable for on-prem or cloud)
optionally can plug into monitoring systems, or at least alerting systems, for deployment failures

Conclusion

Wishful thinking still, I guess. There is no CI/CD ring to rule them all. But if you know of tools I should be aware of, drop me a note. Twitter or LinkedIn, or old-fashioned email at gherlein herlein.com. No spam please.

And since the perfect is the enemy of the good enough to ship, I suspect I just need to do a three-node k8s cluster and a Helm chart and be done with the MQTT problem. We’ll see how long it takes me to get to that!