The Intro
The AWS EC2 c4 instance type has been around for a while now, and I finally got some time to compare the performance of Tableau Server running on a c3 vs. c4.
Of course, I used this as an excuse to leverage TabJolt – I have a huge TabJolt crush right now.
I’m not going to write tons about high-level differences between the c3 and c4 because you can do this sort of due diligence stuff yourself.
That said, if you want to “go deep”, here are two great reads:
- http://research.gigaom.com/report/generational-performance-comparison-amazon-ec2s-c3-and-c4-families/
- http://blog.cloudability.com/aws-c4-power-worth-price/
From a Tableau Server point of view, here are the big benefits of the c4 as I see them:
- Better/faster CPU – Tableau Server bottlenecks on CPU, so this is important
- EBS Optimized by default
EBS optimization gives your machine dedicated bandwidth to its EBS volumes (including Provisioned IOPS volumes), and guaranteed disk throughput is a good thing when it comes to Tableau Server. On a c3 you pay a bit extra per hour to add this capability:
By the time you pay extra for EBS optimization on a c3, you’re already paying more per hour than you would on a c4…so this is a no-brainer as far as I’m concerned.
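If you want to sanity-check that math for yourself, here's a back-of-envelope sketch in Python. All three dollar figures below are hypothetical placeholders, not real AWS rates – plug in the current On-Demand prices for your region and instance size:

```python
# Hourly cost: c3 + EBS-optimization surcharge vs. c4 (EBS optimized free).
# All prices are HYPOTHETICAL placeholders -- check current AWS pricing.
C3_HOURLY = 1.680        # assumed c3 On-Demand rate
C3_EBS_SURCHARGE = 0.100 # assumed per-hour fee to add EBS optimization
C4_HOURLY = 1.675        # assumed c4 rate; EBS optimization is included

c3_total = C3_HOURLY + C3_EBS_SURCHARGE
print(f"c3 + EBS optimized: ${c3_total:.3f}/hr")
print(f"c4 (EBS included):  ${C4_HOURLY:.3f}/hr")
print(f"c4 saves:           ${c3_total - C4_HOURLY:.3f}/hr")
```

The point isn't the exact numbers – it's that the surcharge pushes the c3 past the c4's price while buying you a slower machine.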
The test I couldn’t do
The very first thing I wanted to do was test each of the vizzes in my workload (defined here) by running them one at a time on a “16 core” c3 or c4 machine. I wanted to see what came back faster.
Tests I’ve done in the past suggest that with a relatively low number of concurrent users, “the bigger the box, the faster the viz returns”. Beyond about 30-40 concurrent users this is no longer the case.
Running this test proved to be impossible. The c3.8xlarge gives me the equivalent of 16 physical cores, but the c4.8xlarge delivers 36 vCPUs…or about 18 cores of goodness. There was no way I could do an apples-to-apples comparison on a “big box”.
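The core math here is just the standard hyper-threading conversion – on these instance families an EC2 vCPU is a hyperthread, so divide by two to approximate physical cores. A quick sketch:

```python
def physical_cores(vcpus: int, threads_per_core: int = 2) -> int:
    """Approximate physical cores from EC2 vCPUs (vCPUs are hyperthreads)."""
    return vcpus // threads_per_core

print(physical_cores(32))  # c3.8xlarge: 32 vCPUs -> 16 cores
print(physical_cores(36))  # c4.8xlarge: 36 vCPUs -> 18 cores
```

Since 16 != 18, the two “big boxes” can never be licensed and loaded identically.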
So, I scrapped the “How fast can this viz run in a vacuum” angle and instead just started doing load testing with two configurations:
- 2 Instances with 8 cores each = 16 cores
- 4 Instances with 4 cores each = 16 cores
These two configs match the 2 x (8 Cores) v2 and 4 x (4 Cores) v1 implementations I detailed at the end of this blog entry.
For each of the two setups above I executed a light and a heavy load test, ramping from 10 to 260 concurrent users by adding 10 users every 10 minutes…just like I’ve been doing with TabJolt over the past month or so.
All I did during my tests was switch the instance type from c3 to c4 and back again. I used the exact same disk volumes with the same EBS provisioned IOPS (1500).
Top Line Results
(Standard disclaimer: These are my results with my specific workload. Yours will be different. Use this information to begin informing your decisions…but ultimately you need to test this stuff yourself.)
No surprises: c4 better. c3 worser.
Based on the type of workload and the configuration I was running, I saw a minimum 10% increase in tests per second (TPS) versus the c3. At the high end, my c4s delivered 22%+ more TPS. I observed a lower overall error rate in my tests on the c4 configurations as well.
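For reference, the TPS deltas quoted throughout this post are plain percentage gains of c4 throughput over c3. A trivial helper (with made-up inputs, not my actual measurements) shows the arithmetic:

```python
def pct_gain(new_tps: float, old_tps: float) -> float:
    """Percentage improvement of new throughput over old."""
    return (new_tps - old_tps) / old_tps * 100.0

# Hypothetical TPS numbers purely to demonstrate the formula:
print(round(pct_gain(1.22, 1.00), 1))  # a 22% bump prints 22.0
```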
Let’s go to the tape and start with the light load tests.
In the images below, I refer to the c4 machines by their processor family – the Haswell V3. So, V3 = a c4 instance type and V2 = the older Ivy Bridge V2 processors used by a c3 instance. Got it?
As you can see, the V3 (c4) configurations delivered more successful TPS at scale regardless of how I split my 16 cores up. Both the 4 x (4 Core) and 2 x (8 Core) configurations beat their older cousins.
Next, let’s compare the 2 x (8 Core) configurations head-to-head:
That big spike in TPS at the end of the V3 test looks suspicious to me – especially taking into account the error rate spiking to about 0.5%.
Note the big difference in error rate between these two.
How about the 4 x (4 Core) light configurations?
Again, V3 / c4 wins – this time with an even lower error rate than we saw on the 2 x (8 Core) configuration. No ugly spikes, just a nice clean even line.
Moving along, let’s focus on the “Interact” workload that TabJolt executes. Each time a test is fired, there are multiple steps – like rendering the viz, then applying several filters…or selecting some marks.
There’s a distinct “shift up” on the V3 (c4) tests in terms of TPS. That’s more tests being run successfully. Note that all tests are completing “to the left” of the 15 second line. Not so for the V2 / c3 tests.
Confusion!
This next bit is interesting. As usual, I also ran a limited number of “read-only” tests – TabJolt simply renders a report and goes away, with no follow-up transactions fired. We therefore hit the cache a lot more and get higher throughput.
For reasons completely unknown to me, the V2 / c3 machines actually out-performed the V3 / c4 configurations in terms of TPS. These tests also completed faster.
No clue why this is occurring. Since this is a head-scratcher, I’d normally re-test to confirm. But I didn’t. Sorry.
When directly comparing the 2 x (8 Core) configurations, you see more samples executed at a lower response time and error rate pretty much across the board.
Same thing holds true for 4 x (4 Core) machines, except the spread looks slightly wider.
This next bunch is the good stuff. We can easily see the (increased) TPS the V3 / c4 delivers versus the V2 / c3 on a per-concurrent user basis. About 10.5% of awesome for the 2 x (8 Core):
…and 11.27% for the 4 x (4 Core) rigs:
Let’s get Heavy
Let’s repeat, but with our heavy load. We see some differences in behavior here.
For example, V3 / c4s don’t hold the “top two positions” anymore in terms of TPS and low response time:
The V2 / c3 – 2 x (8 Core) isn’t hot compared to everything else:
…but it’s pretty clear that at least one of the V2 / c3 configurations is beating the V3 / c4 – 2 x (8 Core) rig.
The c4 – 4 x (4 Core) is our champ, but the c3 – 4 x (4 Core) comes in second.
Our V3 / c4 boxes both outperform overall – note the upshift on “Interact” tests vs. the c3:
It appears that because of the lower error rate on the V3 / c4, we’re able to do better on the “read only” test than we did under the light load…but the V3 isn’t significantly outperforming the older V2 / c3 here, either.
No surprises when we directly compare configurations, either. 2 x (8 Core):
However, the V3 – 4 x (4 Core) error rate looks marginally higher.
I bet that if I re-ran this I might see a cleaner result. But…well, you know. I didn’t. This is quick and dirty testing, after all.
And finally, “return of the good stuff”. Here we have the c4 2 x (8 Core) out-performing the c3 by 16%:
The 4 x (4 Core) does even better – nearly 23% better:
That’s it, race fans. Buy those c4 instances!