As a “Friday afternoon fun project” I decided to tinker around a little bit with Tableau on Amazon Web Services EC2. Then bad Russell arrived in my head and kept on asking questions.
A weekend later, I find that I’ve stood up and banged on about 20 distinct EC2 instance configurations. I installed Tableau Desktop & Server on each one and compared what I saw. I’m tired but vaguely satisfied. God knows what my bill from Amazon will be. Oh well.
In this two-parter, I’ll let you know what I found. We’ll look at Tableau Desktop Performance first, then deal with Tableau Server.
The Goal
Understand HOW Tableau behaves on EC2.
Not the Goal
Create prescriptive guidance on the instance type and storage subsystem you should stand up for various sizes of Tableau Server. As always, this “depends”, and you need to do your own due diligence.
What I did
I decided that my “base” dashboard should be an interesting one. I used a ~220M row extract (about 5 GB on the file system). The dashboard itself contains two sheets, each of which needs to touch all rows in the extract:
- Viz One plots the trade price of a couple hundred equities across time (quarter) and trade size
- Viz Two plots the trade price for two securities across time on a daily basis
On my main workstation I see the following results:
- Cold Load: 17 sec
- Cached Load: 15.4 sec
My Mac (jealous much? …a very early beta):
- Cold Load: 24 sec
- Cached Load: 11 sec
Windows Running on Mac (Parallels):
- Cold Load: 16.25 sec
- Cached Load: 14.55 sec
Interesting / semi-germane info:
Why do I break down cold vs. cached loads? Well, there is an IO cost to load my extract up into memory. Once it’s there, it stays there unless knocked out by something else.
This means I can literally “touch” an extract so that it’s mapped into memory, close Tableau entirely and walk away, then re-launch Desktop and reload my viz a few minutes later… there’s a very good chance I’ll see a significant improvement in load time, even though a completely different process mapped the file into memory at some point in the past. This is pretty important, particularly if your disks are slow and that initial load of data into memory takes a while.
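By the way, you don’t need Tableau itself to do that warming – any process that streams the file end-to-end will pull it into the Windows file cache. Here’s a minimal Python sketch of the idea; the extract path is just a placeholder, so point it at your own file:

```python
# Warm the OS file cache by streaming the extract end-to-end.
# EXTRACT_PATH is a placeholder - point it at your own extract file.
EXTRACT_PATH = r"C:\Extracts\trades_220M.tde"
CHUNK_SIZE = 8 * 1024 * 1024  # read 8 MB at a time to keep memory use modest

def warm_file_cache(path: str) -> int:
    """Read the whole file once so Windows keeps its pages in the file cache."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            total += len(chunk)
    return total

if __name__ == "__main__":
    gb = warm_file_cache(EXTRACT_PATH) / (1024 ** 3)
    print(f"Touched {gb:.1f} GB - the next load should come mostly from RAM.")
```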
So let’s look at some data. You see 10 EC2 instances below, ranging from two to eight cores. I configured each machine with anywhere from 0 to 4,000 provisioned IOPS for storage (click the previous hyperlink to read about what an IOPS means to AWS, or click here to read about IOPS in a more general sense). All tests were run on Windows Server 2012 Standard.
Click, CTRL-Click, or SHIFT-Click to compare one instance to another in terms of how quickly I was able to render the test dashboard.
The run on the FAR right took a whopping 82 seconds to render (uncached) on a 2-core, low GHz machine with poor IO – 300 IOPS:
I used the free ATTO Disk Benchmark tool to test throughput on this drive, and look at how pathetic it is. We can be pretty sure that the big extract is taking a long, long time to load into RAM based on these read numbers.
But what about once it’s cached? The pain just keeps coming. Since we only have 2 cores and they are relatively low frequency, things still stink:
As you can see above, our friend is grouped in a cluster of readings taken from 1.8 and 2.0 GHz CPUs. This group collectively has the longest cached render time.
Let’s shift gears. Where did we do well?
You can see that in general, instances running storage with the most IOPS win:
#3 is a machine running 4000 IOPS – it was the champ in terms of rendering speed. Here’s what disk throughput looks like:
We’re getting about 3x the throughput vs. 300 IOPS.
Group #1 contains machines with either 4 or 8 cores running at 2.8 GHz and between 1200 and 1500 IOPS.
Throughput, 1500 IOPS:
…and 1200 IOPS:
Group #2 is where things really start getting interesting. Note that this cluster of readings is taken from machines with 0 provisioned IOPS! How can 0 be better than 300, 1200, or 1500?! Well, this disk isn’t a provisioned IOPS volume at all. It’s an AWS standard EBS volume, and it behaves quite differently.
Let’s let the test tell the story:
While the write performance is just about as poor as the 300 IOPS machines, look at that read! It’s actually superior to the transfer rate from our 4000 IOPS rig. Since we’re only reading the extract into RAM as part of the rendering process, this suits us quite nicely.
I wouldn’t want to use this standard volume to act as a temp drive or to write extracts on, but it’s just dandy for reading.
What have we learned?
Disk matters. Having poor IO is just like skipping leg day at the gym. Only bad things can happen.
Until you have enough IO to read extracts into RAM quickly, your CPU will effectively stall, waiting on the disk. Here’s what I mean:
Note that when this “un-cached” dashboard was rendered, the CPU wasn’t working all that hard. Our disk queue was at about 0.361, indicating that the disk was the one doing the work.
Now, look at the same machine rendering a “cached” report:
The CPU is working ~3x as hard on this puppy – and look at the disk: not a whole lot going on there right now.
If your disk is fast enough, your CPU works as hard as it can for you – but if you can’t keep it “fed” with data, then it just sits there.
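If you want to watch that trade-off in something other than Task Manager, a few lines of Python will give you a rough version of the same picture. This is just a sketch – it assumes the third-party psutil package is installed, and all it does is sample CPU utilization and disk read throughput once a second while you load the viz:

```python
# Rough CPU-vs-disk sampler: start it, then trigger the dashboard load in Tableau.
# Assumes the third-party psutil package is installed (pip install psutil).
import psutil

def sample(duration_sec: int = 30, interval_sec: float = 1.0) -> None:
    prev = psutil.disk_io_counters()
    for _ in range(int(duration_sec / interval_sec)):
        cpu_pct = psutil.cpu_percent(interval=interval_sec)  # blocks for one interval
        cur = psutil.disk_io_counters()
        read_mb = (cur.read_bytes - prev.read_bytes) / (1024 ** 2)
        prev = cur
        print(f"CPU {cpu_pct:5.1f}%  |  disk reads {read_mb:7.1f} MB/s")

if __name__ == "__main__":
    sample()
```

On an un-cached load you should see the read column spike while the CPU loafs; on a cached load it’s the other way around.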
This next part shouldn’t be hard to understand – once we have the data we need loaded into memory, he who has the fastest CPU wins:
Group 1 represents our 2.8 GHz CPUs. Note how they all cluster together right around 15-17 seconds.
Group 2 is a single box with a 2.0 GHz CPU. Render time = ~30 sec
Group 3 is our stinky 1.8 GHz cores. Render time = ~30 sec
Let’s summarize what we learned and how to apply it:
- If you’re using extracts, you must have decent read performance from your disk
- Having good write performance is important when creating extracts
- A relatively low number of EBS IOPS won’t buy you much performance
- To get a rough estimate of throughput, use ATTO Disk Benchmark (a quick-and-dirty read check is sketched just after this list)
- An EBS volume with provisioned IOPS can guarantee a consistent-ish number of reads and writes. However, you may just do better to use a standard volume which delivers pretty good read performance at no extra cost
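About that throughput estimate: ATTO is still the easy button, but if you just want a quick sanity check on sequential reads, something like this rough Python sketch will do. The test file path is a placeholder, and be aware that if the file is already sitting in the OS cache the number will be flattering:

```python
# Quick-and-dirty sequential read test - a sanity check, not a replacement for ATTO.
# TEST_FILE is a placeholder; point it at a large file on the volume you care about,
# ideally one you haven't touched recently (otherwise the OS cache skews the result).
import time

TEST_FILE = r"D:\Extracts\trades_220M.tde"
CHUNK_SIZE = 8 * 1024 * 1024  # 8 MB reads, roughly the large transfer sizes ATTO reports

def sequential_read_mb_per_sec(path: str) -> float:
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            total += len(chunk)
    return (total / (1024 ** 2)) / (time.perf_counter() - start)

if __name__ == "__main__":
    print(f"Sequential read: {sequential_read_mb_per_sec(TEST_FILE):.0f} MB/s")
```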
Now, it’s the bonus round. When you provision an instance, it comes with some root storage and “Instance Storage”. Here’s the c3.xlarge machine I’ve been using a lot in my testing:
Instance Storage is ephemeral storage that only sticks around as long as the image is running. If you stop the image, anything you saved on those two disks (in this case, two 40 GB SSD drives) goes away. A quick reboot of the OS is fine – but turning the machine OFF means that storage is wiped.
What if I kept my big 5 GB extract on the C: drive and copied it to the Y: or Z: drive on the machine above when I wanted to do work… would that be fast? It damn well would be:
Ready? Boom!
The highlighted run above was sourced from the Z: drive. I arbitrarily “tagged” it as 4000 IOPS, but it’s obviously going faster, as it completed in 24 seconds vs. the 32.30-second run directly to its right (the prior “champ”).
The 24-second run used 27% of CPU vs. only 18% of CPU on the 32.3-second run – an example of good disk allowing the CPU to run the way it wants to. (Keep in mind we’re using the same CPU here.)
We can see it came in FIRST across all 2.8 GHz CPUs, actually:
#1 – The run backed by SSD
#2 – The run backed by a 4000 IOP volume
#3 – The run backed by a standard volume
When all is said and done, #1 is actually closer to “cached” execution time than to its “un-cached” brethren. Pretty nice!
To reiterate: since the storage I used for this final test is ephemeral, it probably wouldn’t be a good permanent solution. If I wanted to get SSD-like performance, I’d need to do something like the following in AWS:
- Create 2+ 4000 IOPS volumes for my instance
- Use Windows Server Storage Pools to create a RAID 0-style stripe across the 2+ volumes (the volume-creation half of that recipe is sketched below)
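For what it’s worth, here’s what the volume-creation half might look like via the boto3 SDK. Treat it as a sketch: the instance ID, availability zone, and device names are placeholders, it assumes working AWS credentials, and the RAID 0-style striping itself still happens inside Windows, not through the AWS API.

```python
# Sketch: create two 4,000-IOPS (io1) EBS volumes and attach them to a running instance.
# Assumes boto3 is installed and AWS credentials are configured; IDs below are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

INSTANCE_ID = "i-0123456789abcdef0"   # placeholder instance ID
AVAILABILITY_ZONE = "us-east-1a"      # must match the instance's AZ
DEVICES = ["xvdf", "xvdg"]            # placeholder device names

volume_ids = []
for _ in DEVICES:
    vol = ec2.create_volume(
        AvailabilityZone=AVAILABILITY_ZONE,
        Size=400,            # GB; sized generously to stay within the IOPS-to-size ratio
        VolumeType="io1",    # provisioned IOPS volume
        Iops=4000,
    )
    volume_ids.append(vol["VolumeId"])

# Wait until both volumes are ready, then attach them to the instance.
ec2.get_waiter("volume_available").wait(VolumeIds=volume_ids)
for vol_id, device in zip(volume_ids, DEVICES):
    ec2.attach_volume(VolumeId=vol_id, InstanceId=INSTANCE_ID, Device=device)

print("Attached:", ", ".join(volume_ids))
```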
That would be uber-speedy, but I’m getting tired and have decided to stop.
Next article: “So what?! How fast can I make Server go?”