Prometheus Alert Manager is an “AP” system, in “CAP” terms. How we made it more Consistent, including a forked repo of the Alert Manager code.
First of all, this is not new. My good friend and former colleague David Wang spoke at CloudSphere in Krakow, Poland in late 2019. The recording is: CloudSphere: Impedance Matching Legacy Apps to Prometheus Monitoring by Greg Herlein & David Wang.
I’m setting up Prometheus again for some home monitoring and I realized I’d never blogged about it, really, beyond a post that merely had a link to the recording. The overall problem was super interesting so it’s worth a few notes.
Christmas 2018 (actually December 26): my phone rang. It was the head of Cisco Advanced Services in Japan, and he wanted to know if I knew anything about “SRE.” I told him that I managed a team of SREs, trying to bring an SRE mindset to the core software team in Cisco Customer Experience (CX), so yes, in fact, I think I did. Cisco had a customer in Japan deploying a huge cloud-native, fully virtualized 5G mobile phone network and their CTO was asking if Cisco could share our perspective and help them organize an SRE team. Long story short, I was on a plane to Tokyo the next week and earned the customer’s trust. A few of my team and I were loaned to the effort to help this strategic customer build out an SRE function.
A key part of the problem space was that the monitoring system they used was great for SaaS applications. And the way the network was being deployed was more SaaS-like. The whole goal was to build it all out as “networking as a service.” But after digging deep with the teams and the technology, we found that there was a huge gap between the low-level networking and how the monitoring system expected to hear alerts.
A lot of the system was deployed using Cisco Network Services Orchestrator (NSO), and network systems still generate a lot of alerts using SNMP. The SNMP term for an alert is “trap.” Most traps are really low level, like “signal lost on line card 22” type low level. Generally a modern Network Monitoring System (NMS) collects and aggregates those low-level traps into something more actionable. But this new network was being built in a completely new, modern way, and was not using any existing NMS. They were custom-modifying a monitoring system used for SaaS instead. To make things more complicated, there were at least three vendors providing key software for 5G network functionality, all of whom had their own systems that sent varying levels of aggregation.
A key difference, too, if you think about it, is that Prometheus uses a pull model. It “scrapes” data from HTTP endpoints and reacts. The whole SNMP trap model is a “push” model: sending UDP messages to a target destination. It’s sad, but a lot of systems that use SNMP cannot even use an FQDN for the target. The failure model for networking gear assumes the worst, and does not want to rely on getting a DNS reply.
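To make the push/pull mismatch concrete, here is a minimal sketch of a bridge between the two models. It is not the code from this project: it simply counts incoming UDP datagrams per sender (without decoding them as SNMP) and exposes the counts in the Prometheus text format for scraping. The metric name, the ports, and the function names are all hypothetical.

```python
import socket
import threading
from collections import Counter
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical metric name, chosen for this illustration only.
METRIC = "bridge_push_events_total"

counts = Counter()
lock = threading.Lock()


def record_event(source: str) -> None:
    """Count one pushed event, keyed by the sender's address."""
    with lock:
        counts[source] += 1


def render_metrics() -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = [f"# TYPE {METRIC} counter"]
    with lock:
        for source, n in sorted(counts.items()):
            lines.append(f'{METRIC}{{source="{source}"}} {n}')
    return "\n".join(lines) + "\n"


def udp_listener(host: str = "0.0.0.0", port: int = 9162) -> None:
    """Push side: receive datagrams and count them (payload is ignored)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    while True:
        _, (addr, _) = sock.recvfrom(65535)
        record_event(addr)


class MetricsHandler(BaseHTTPRequestHandler):
    """Pull side: Prometheus scrapes this endpoint on its own schedule."""

    def do_GET(self):
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)


def main() -> None:
    """Run both sides: UDP listener in the background, HTTP in the foreground."""
    threading.Thread(target=udp_listener, daemon=True).start()
    HTTPServer(("0.0.0.0", 8000), MetricsHandler).serve_forever()
```

Calling `main()` starts the bridge. The pushed events are held in memory until the next scrape, which is the essence of adapting a push source to a pull collector.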
This really was a case of impedance mismatch between the networking world and the SaaS world.
Summary of the Solution
We designed and wrote a system called the “Observability Framework.” That’s my fault. I’m not that creative with names. Internally we gave it a code name of “Kusanagi”, named after a legendary Japanese sword. The system was based around Prometheus, running in Kubernetes (k8s). It accepted inputs as SNMP traps, logs via remote syslog, and metrics using the standard Prometheus scraping mechanism. Scrapes were easy, and often we’d just deploy node exporter with Ansible onto key machines. Logs went into an Elasticsearch cluster, and we ran periodic queries for known issues and then generated alerts.
The novel thing we did was to use Prometheus AlertManager (AM) as the interface between our software and the SaaS monitoring system. This gave us aggregation and muting with a known, testable, documented interface. Metrics, logs, and SNMP traps would all map to a set of defined “alerts” delivered to the SaaS monitoring system by AM.
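As an illustration of what “mapping to a defined alert” can look like, here is a minimal sketch that builds an alert and posts it to Alertmanager. It assumes the Alertmanager v2 HTTP API (`POST /api/v2/alerts`, a JSON list of alerts with `labels`, `annotations`, `startsAt` and an optional `endsAt`); the post does not say which mechanism the framework actually used. The helper names, label values, and base URL are hypothetical.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_alert(name, severity, instance, summary, starts_at, ends_at=None):
    """Build one alert in the shape the Alertmanager v2 API accepts."""
    alert = {
        "labels": {"alertname": name, "severity": severity, "instance": instance},
        "annotations": {"summary": summary},
        # Timezone-aware datetimes serialize to RFC 3339 timestamps.
        "startsAt": starts_at.isoformat(),
    }
    if ends_at is not None:
        # Setting endsAt in the past marks the alert as resolved.
        alert["endsAt"] = ends_at.isoformat()
    return alert


def post_alerts(base_url, alerts):
    """POST a list of alerts to Alertmanager; returns the HTTP status code."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

For example, a trap like “signal lost on line card 22” might become `build_alert("LinkSignalLost", "critical", "router-1", "signal lost on line card 22", datetime.now(timezone.utc))`, sent with `post_alerts("http://localhost:9093", [alert])`. Whatever the source (metric, log query, or trap), everything downstream sees the same alert shape.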
There were a few tricky parts. One was getting k8s running in an all-IPv6 environment. Another was dealing with the incoming traps. We pumped them through snmptrapd and output text that piped through Beats into Elastic. We had to write a little code to do some custom processing of some of the traps because they were weird, but that was a corner case, really.
The big thing to solve was making AlertManager HA. This was especially critical since we used chaoskube to randomly kill pods. This ensured constant testing that our solution was reliable. But it also ensured that periodically we had an AM that was completely blind to all the alerts that recently happened.
Things may have changed in the last few years. That’s something I’ll dig into as I set it all up again. But for our solution, we discovered that the clustering feature for AM uses a gossip protocol to share alerts. Any alert showing up at one AM is duplicated at another. This provided “AP” coverage, but did not cover “C.” If you rebooted an AM, or started a new one, it would not have state. The new AM would be missing some alerts. That created a problem.
Our duplicate AMs would send alerts to the SaaS Monitoring system, and it was configured to clear alerts/tickets if the system recovered. There were many automated responses that would get triggered by the ticket, run a script/tailf/something to cause a recovery, and then wait for the condition to clear. In a complex, cascading fault recovery the duplicate AMs and SaaS Monitoring systems might be partitioned. Since some of these conditions were actual network problems (it was a “network as a service” after all) we needed to have at least one AM detect that the problem had been cleared and “clear” the alert. We actually needed more consistency. But CAP theorem says you cannot have all three.
So we cheated a bit. We wrote the code to have AM use etcd to store the alert state. Every AM would put its state there (if that alert had not already been stored). Each AM would then periodically sync its alert state to etcd, and on startup, would pull all alerts into memory. We submitted a PR for this, but the Prometheus team did not see it as a priority. It probably wasn’t: it was just something this particular use case needed. So we had to fork the code, unfortunately.
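The three behaviors described above (store an alert only if it is not already there, periodically sync local state, and load everything on startup) can be sketched as follows. This is not the forked Alertmanager code, which was written against real etcd; `SharedStore` is a hypothetical in-memory stand-in for etcd, and the class and method names are invented for illustration.

```python
class SharedStore:
    """In-memory stand-in for etcd (hypothetical, for illustration only)."""

    def __init__(self):
        self._data = {}

    def put_if_absent(self, key, value):
        """Store value only if key is new; return True if it was written."""
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def put(self, key, value):
        """Unconditionally write value under key."""
        self._data[key] = value

    def items(self):
        """Return a snapshot copy of everything in the store."""
        return dict(self._data)


class AlertReplica:
    """One alert-manager replica that keeps its alert state in a shared store."""

    def __init__(self, store):
        self.store = store
        # On startup, pull all known alerts into memory, so a fresh or
        # restarted replica is not blind to what happened before it came up.
        self.alerts = store.items()

    def receive(self, fingerprint, alert):
        """Record a new alert locally and store it if nobody has yet."""
        self.alerts[fingerprint] = alert
        self.store.put_if_absent(fingerprint, alert)

    def sync(self):
        """Periodic sync: write the local alert state out to the store."""
        for fingerprint, alert in self.alerts.items():
            self.store.put(fingerprint, alert)
```

With this shape, killing a replica (as chaoskube would) and starting a new one against the same store gives the new replica the full alert set immediately, instead of waiting for gossip to repopulate it.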
Reading over the current AM docs, I suspect that the same gossip sync is still in place. I’ll figure that out in the coming weeks as I set it all up again. My requirements are certainly not for HA, but I do want reliability. My monitoring is mostly my local infrastructure at two locations. I’m still on the fence about the pull vs push model for metric collection, as I’ve talked about in my posts on MQTT and control in general. But I am coming around to it. At the end of the day you have to have a proper inventory of “things” on the network anyway. You need a means to do service discovery anyway. So hook into that and poll things. Prometheus Exporters are simple and fairly bulletproof.
If you have read this far, you should have an appreciation for the complexities of building something to “observe” a complex, multi-vendor, multi-technology platform. The CAP theorem has very, very real implications for how you design and build systems.
Credit Where Due!
The team that did the heavy lifting for all this is among the very top engineers I have ever known. Credit to David Wang, Josh Dotson and Hang Xi. Not only did they put up with me, they worked a million hours, never complained (much), and delivered. More than delivered. I’m proud to have worked with them.