What happens when IO fails?

A couple weeks ago, when my ingress controller traefik didn’t restart correctly, my blog & all other services went down. In a quick knee jerk reaction to ensure my services stayed up, I did a quick down && up on docker-compose, but it took a while. Specifically, it timed out when trying to bring up the container! How could docker timeout? What was happening? What is this random error traefik is throwing when it attempts to start? So many questions, and all the time in the world to investigate. So I buckled my 5 point harness and got ready. Checking the server stats we were running close to the edge of our 4GB of memory and running out of swap, there was legitimately no “room” for traefik to start up. 4GB of completely active memory. I know what I’m doing, I’ll tear down the whole stack and start it up again. Within 7 minutes we are back up and running. Mail is flowing in, my blog is serving traffic and all my build lights are green. But all was not right…

This server you are reading this on, (unless you are reading the blog post on my feed), hosts a few apps and has been very stable since my initial purchase. I’ve talked in previous blog posts about moving to a local (in country) VPS host, rather than hosting in the cheapest country possible. They have been very accommodating with my requests, and I could easily recommend them. As a fellow Systems Admin, the last thing you want to be dealing with when all hell breaks lose, is every client ringing you/flooding your support inbox, asking a heap of questions. I know oh-too-well. I’ve even purchased more servers at the behest of my bank account. /endplug

7 days later

Unfortunately, I had my first real issue. The node I had my server on was attacked by Thor’s hammers itself. Beaten to a pulp until it chugged along like a coal burning steam engine: slowly. I experienced a RAID, SAN or some sort of IO failure. This was not originally apparent as I had been wrangling with OOM-killer on a regular basis over the last week. I even went down the rabbit hole investigating why mailcow started using a new cron container called ofelia. Why was it using 250MB of precious memory. That is my memory, give it back insert picture of Gollum here! So unbeknown to me to the lacking of IO, I started to provision an extra swap file, while I investigated this new cron container. dd, mkswap, swapon and I realise I have a problem that is not something that I can fix. Business savvy people would refer to this as “degraded performance” of my node. Translated: everything is on fire!

When I was creating the swap file, I used dd if=/dev/zero of=/var/swap.file bs=1k count=1024000 to provision a 1GB swap file. This was taking an arduously long time, when in reality should have taken at maximum 5 seconds. As I had another server with the same host I did a little testing:

Server 1:

dd if=/dev/zero of=/var/swap.file.1 bs=1024 count=1024000
111746+0 records in
111746+0 records out
114427904 bytes (114 MB, 109 MiB) copied, 31.4687 s, 3.6 MB/s
(Cancelled after 30 seconds...)

Server 2:

dd if=/dev/zero of=/var/swap.file.1 bs=1024 count=1024000
1024000+0 records in
1024000+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 4.83522 s, 222 MB/s

Yep, that is an IO issue, an abysmal maximum of 3.6MB/s throughput. Resulting in docker containers timing out on boot, mail delivery being delayed 5+ hours. But I was still getting mail, and my blog was intermittently reachable. After lodging a quick support ticket I continued to investigate. Degraded IO, appears to seriously effect docker like no tomorrow. Adding/setting the environment variable COMPOSE_HTTP_TIMEOUT=360 allows docker-compose to take forever to start containers. So instead of containers failing to start, they had at least started. On top of that my static blog was intermittently was going down and up. Something that I could not fathom, a static site was going down? I’m sorry, this isn’t some Wordpress site, or SPA in javascript relying on multiple micro services. This is a highly optimised and lean blog.

My blog is served by a nginx container, behind traefik doing the SSL termination. But the html code is just sitting on the root volume. It seemed unlikely that just the traffic to this blog is maxing out my IO. Definitely not 3-4MB/s worth of traffic. After removing my build-server woodpecker to another host and migrating everything except my mail-server and blog it stilled showed signs of inconsistent behaviour. Further investigation into actually what our mail-server is doing would have been needed if I was going to be stuck with a maximum of 4MB/s throughput. With anti-spam, anti-virus, caldav, carddav, and all the incoming mails it seems mailcow can use close to or at least 4MB/s of throughput. Especially with in-memory logs and those associated crons… (yes I’m still looking at you cron…). Finally I had a solid theory. But my host was very helpful and migrated the VPS to an unaffected node. Resulting in the smooth buttery experience you are getting now. Unfortunately thus cutting my investigations short, but 250MBps of throughput results in a whole lot of not caring anymore.

Other cloud(s)

I’ve had a few days to think about this issue, and how if I had another provider could it have happened? Big cloud providers automagically notify and give you the ability to stop-start to migrate nodes that have failed. I believe this use-case could also happen in AWS or any other cloud where your underlying volume (EBS) runs out of IO “credits”. Before GP3, IO credits were proportional to the size of your drive. So if you had a 300GB drive, you would have 600 IOPS. (Note: this is an example, and is not guaranteed to be correct, I’ve already moved to GP3 and dismissed GP2 into the ether). If you scaled up to 1TB you would get around 1000 IOPS, but if you had 30GB you would only get a measly 300 IOPS. If you needed a huge amount of IOPS you would need to extend your drives to absurd amounts. Having a 1TB OS drive that has 970GB free just so you can have super high IOPS, would cost you $100/month ($0.10/GB). With enough logging, you could exhaust your 8GB worth of IOPS credits. Finally AWS heard our cries and made GP3 available. This allows you to have 3000 IOPS as a baseline for no extra cost. I can guarantee that you probably have not even thought about your EBS credit IOPS, go to cloudwatch now, and setup an alert! Because if you are hitting your IO limit, you are going to have to create an incident report on your “degraded service”.

The Conclusioning

If I had my time again, I think I would investigate more into the inner workings of why docker was acting the way it was. How I would monitor the slow decline and failure could be an interesting followup blog post, but without access to the hardware I could only investigate from “inside” the failing box. I think adding a sidecar docker container for monitoring would be a start but when you are already in a failing system, adding extra complexity never sounds like a great idea. Looking back now, I believe I actually misdiagnosed ofelia as the cause of contention. Most likely it was the last straw that broke the camel’s back as it was the newest container that mailcow was using. The huge IO issues inadvertently made all container operations take forever. Simple things such as delivering mail, updating clamav signature database or serving html files.