I’ve become quite a fan of AWS’s EBS General Purpose (gp2) SSD storage. Gp2 volumes offer great price/performance for Tableau Server, and I’ve pretty much gotten to the point where I tell people “Just use gp2 volumes and you can’t go wrong”.
Well, for every rule, there’s an exception…which I ran into last week.
I was executing a long-running load test (about 12 hours per run), and I noticed this:
About 8 hours in, the CPU on nearly ALL of the machines I was watching dropped significantly, with an associated drop in network activity.
(Aside: For reasons unknown to me, the generic disk-based metrics presented by CloudWatch never seem to populate for me – other EBS metrics are just fine, though…so pay no attention to Disk Reads / Operations / Writes, etc.)
“That’s odd”, I thought – I chalked this up to a noisy neighbor on the same host, even though EBS is supposed to protect me from that (from an IO point of view). So I re-ran the test immediately…and saw the exact same thing again, ruling out noisy neighbor issues. Wondering if I was dealing with the “burst bucket” associated with gp2, I stopped my testing for about 30 minutes (to let it “re-charge” a tiny bit), then fired things up again:
One of my machines seemed to be happy for a minute, then dropped back into the same pattern. So I tried again, watching for disk queueing across all of my machines. See if you can guess WHICH box was my problem child, then hit the break:
…If you guessed Worker 1, you win the prize! Go celebrate for a moment. Really, do it…you’re awesome and people don’t tell you often enough. Now, continue on.
Lesson learned: poor IO characteristics on a single machine could (but won’t always) strangle the whole deployment if you’re not careful. Why? Because Worker 1 was running the only instances of my Repository, Data Engine and VizPortal for the whole cluster. You already know the first two of these services need (and use) their IO 🙂
When the gp2 SSD volume gave up the ghost, Tableau couldn’t get in and out of our PostgreSQL database, Extract files, or OS temp folders of the file system quickly enough, slowing everything waaaay down.
I replaced the gp2 volume on Worker 1 with a nice EBS Provisioned IOPS (PIOPS) volume and configured it to deliver 1500 IOPS come hell or high water. I left everything else alone on all the other machines…and lo and behold, all was well. No more “mystery throttling”.
Not Tableau’s fault. My fault.
So what happened? I’m going to copy interesting bits of an AWS blog post verbatim…it explains things. Bold is mine. Thanks to Jeff Barr for this write up.
General Purpose (SSD) volumes are designed to provide more than enough performance for a broad set of workloads all at a low cost. They predictably burst up to 3,000 IOPS, and reliably deliver 3 sustained IOPS for every GB of configured storage. In other words, a 10 GB volume will reliably deliver 30 IOPS and a 100 GB volume will reliably deliver 300 IOPS.
Each newly created SSD-backed volume receives an initial burst allocation that provides up to 3,000 IOPS for 30 minutes.
…the IOPS load generated by a typical relational database turns out to be very spiky. Database load and table scan operations require a burst of throughput; other operations are best served by a consistent expectation of low latency. The General Purpose (SSD) volumes are able to satisfy all of the requirements in a cost-effective manner.
- Each token represents an “I/O credit” that pays for one read or one write.
- A bucket is associated with each General Purpose (SSD) volume, and can hold up to 5.4 million tokens.
- Tokens accumulate at a rate of 3 per configured GB per second, up to the capacity of the bucket.
- Tokens can be spent at up to 3000 per second per volume.
- The baseline performance of the volume is equal to the rate at which tokens are accumulated — 3 IOPS per GB per second.
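The rules above are simple enough to sketch in a few lines of Python. This is my own toy model, not AWS code: the constants are the published gp2 numbers from the list, and the steady one-second loop is an approximation of how the credit bucket behaves.

```python
# Toy model of the gp2 I/O credit ("token") bucket described above.
# Constants are AWS's published gp2 figures; the per-second loop and
# steady workload are my own simplifications.

BUCKET_CAPACITY = 5_400_000   # max I/O credits a gp2 volume can hold
BURST_IOPS = 3_000            # max spend rate (the burst ceiling)

def baseline_iops(size_gb):
    """Credits accrue at 3 per configured GB per second."""
    return 3 * size_gb

def simulate(size_gb, demand_iops, seconds):
    """Run a steady workload and return the credits left in the bucket."""
    tokens = BUCKET_CAPACITY  # new volumes start with a full bucket
    for _ in range(seconds):
        tokens = min(BUCKET_CAPACITY, tokens + baseline_iops(size_gb))
        spend = min(demand_iops, BURST_IOPS, tokens)
        tokens -= spend
    return tokens
```

With a 60 GB volume and a steady demand of 360 IOPS, the model drains the bucket in about 5,400,000 / (360 - 180) = 30,000 seconds, a little over eight hours – right in line with what I saw.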
Essentially, the D: drive on Worker 1 used up the gp2 “burst bucket” of 5.4M tokens after about 8 hours. When it did, it could no longer burst to 3,000 IOPS. Instead it could only deliver (3 IOPS × 60 GB drive size) = 180 IOPS – NOT enough to do what the OS needed it to.
Once I “ran out of fast”, the 60 GB volume also couldn’t re-generate tokens quickly enough to “get fast again” because it was so small. If I had used a bigger drive, I might have been OK. For example, if you do the math, a 1 TB (1,000 GB) drive re-generates 3,000 tokens every second – the same rate at which they can be spent – meaning I would never run out of tokens!
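Putting rough numbers on that (my arithmetic, not an official AWS formula): the time until a gp2 volume throttles is just the bucket size divided by how much faster you spend credits than you earn them.

```python
# Back-of-envelope estimate of when a gp2 volume falls back to baseline.
# Constants are AWS's published gp2 numbers; the steady-demand workload
# is a simplifying assumption on my part.

def seconds_until_throttled(size_gb, demand_iops,
                            bucket=5_400_000, burst_iops=3_000):
    baseline = 3 * size_gb                # refill rate: 3 credits/GB/second
    spend = min(demand_iops, burst_iops)  # can't spend faster than the burst cap
    if spend <= baseline:
        return float("inf")               # refill keeps up with demand forever
    return bucket / (spend - baseline)

# A 60 GB drive hammered at the full 3,000 IOPS burst would last only
# 5,400,000 / (3,000 - 180) seconds, roughly half an hour; my bucket
# lasting ~8 hours suggests a gentler average draw.
# A 1 TB (1,000 GB) drive refills at 3,000 credits/second, the same
# rate they can be spent, so it never throttles.
```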
Summary: the gp2 SSD “burst bucket” is a limited resource. If you think you’ll run out of it, consider using a PIOPS volume instead.