Introduction to the Introduction
This is the “setup” post. It is uninteresting, as setup posts can sometimes be. Suffer through it anyway.
Introduction
Over the last few weeks, I’ve been spending a bunch of time with our friend the Backgrounder. Thus far I’ve invested ~150 hours running tests on EC2 to get a feel for how the backgrounder behaves and for potentially optimal deployment patterns. Even though I focused mostly on Amazon Web Services EC2, much of this stuff applies to on-premises installs of Tableau Server, too.
I’m interested in understanding things like:
- What is the “best” instance type for extract refreshing?
- What is the optimal number of backgrounders to deploy on a machine?
- Should I deploy my backgrounders on a big, monolithic “monster backgrounder” node, or multiple, smaller nodes?
Before you invest a bunch of time reading this post, take a second to think about “where you are” with extract refreshing. If you’re mostly satisfied with what your default configuration is delivering, stop now. Take the blue pill – you’re good. Seriously.
If you run tons of backgrounders, deal with long refresh queues, or are trying to “stretch things out to the max”, then the red pill might just be for you. That said, you’re also going to discover that the world is run by intelligent AIs and you’re actually just a battery. Be warned that you’re not going to get any black-and-white answers here and may actually come away with more questions than you arrived with. But that’s fine since good questions make the world go around.
Other disclaimers:
- I don’t speak to Subscriptions here, sorry. It’s not that I don’t care, but they add too many variables into the mix
- I didn’t play with incremental refreshes because creating systems to ADD data to existing data sources would take me time I didn’t want to invest
- Because a lot of this stuff is based off work I did to answer “internal questions”, I sometimes use really weird machine configurations that you probably won’t consider. That’s fine, the goal is to learn
- I assume you can dedicate at least 8 cores to the backgrounder. All of my testing was done with 16 cores or more
First, define as clearly as you can why you’re about to mess with this stuff in the first place. What’s your goal as administrator? Do you want to…
- Drive the highest TPS (transactions per second) regardless of what this does to the runtime of the extracts? Are you focused simply on getting as many extracts “done” as you can?
- Try to make refreshed extracts available as quickly as possible after their scheduled extract time?
It’s likely you’ll land somewhere between these two extremes, and your approach will differ based on where you land.
Russell repeats himself again.
If you’ve ever read this blog before, you’ll already be familiar with the mantra “It’s the workload, stupid.” (No, not you.) Everything you’re going to see here is influenced by the workload being tested. Therefore your exact results will always be different because your workload is different. In posts like this I always say, “Don’t take these results as gospel”, and I’ll say it again. In fact, some of the machine configurations I use are just plain wacky. They’re wacky enough that I hope they’ll force you to do some testing on your own to augment what you read here.
The Workload
Workbooks and Extracts
I do this work by myself, without hundreds of (user-created) workbooks with different data sources I can lean on as “guinea pigs”. So, I created seven prototype “test subject” workbooks. They all connect to live instances of PostgreSQL and use one of three or four different sample databases. Some of these databases are small, some are large. Some have complex schemas, some are simple:
The basic sample:
| Type | Approx Records | Approx Columns | Run Time | Notes |
| --- | --- | --- | --- | --- |
| x-small | 300K | 25 | 24-30s | |
| small | 500K | 60 | 40-45s | |
| medium | 800K | 35 | 50s | Complex-ish query |
| large | 8.5M | 35 | 6.5m | Complex-ish query |
| x-large | 13M+ | 35 | 10-11m | Complex-ish query |
| xx-large | 64M+ | 35 | 45m+ | Complex-ish query |
| xxx-large | 89M+ | 44 | 62m+ | Complex-ish query |
I tested each workbook in a vacuum to see how long it would take to refresh when no other jobs were running. Then I used a little bit of Python and the Tableau Document API to point X copies of each workbook at one (of four) HAProxy load balancers.
Each load balancer fronts a fleet of 15-ish RDS instances. All-up, my test cluster has about 1400 workbooks using ~50 distinct database servers. The 50+ RDS instances spread across 4 load balancers (mostly) guarantee that the data tier and network throughput wouldn’t bottleneck or slow the extract refresh process too much.
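In case you’re curious about the repointing step, here’s a minimal sketch of the idea using the Tableau Document API (tableaudocumentapi). The workbook file name, copy count, and HAProxy hostname below are placeholders, not my actual test harness:

# Sketch only: clone a prototype workbook N times and repoint each copy's
# PostgreSQL connection at an HAProxy endpoint via the Tableau Document API.
# 'xsmall_prototype.twb', COPIES, and the hostname are placeholder values.
from tableaudocumentapi import Workbook

LOAD_BALANCER = 'haproxy-1.example.com'
COPIES = 50

for i in range(COPIES):
    wb = Workbook('xsmall_prototype.twb')
    for datasource in wb.datasources:
        for connection in datasource.connections:
            connection.server = LOAD_BALANCER   # send this copy's queries to the LB
            connection.port = '5432'
    wb.save_as('xsmall_copy_{0}.twb'.format(i))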
When you view the results, you’ll need to be able to decipher the following secret code:
Windows-15-1000-2*m4.2xlarge-12BG
<platform>-<refresh jobs per minute>-<total number of scheduled jobs>-<number of ec2 instances> * <ec2 instance type and size> - <total # backgrounders running>
Translated:
- We ran two m4.2xlarges as backgrounder-only workers. Each backgrounder machine was running six backgrounder processes for a total of 12
- Fifteen new extract refresh jobs were submitted to Tableau Server per minute, for a total of one thousand jobs over the course of the test
- The test was performed on Windows
Windows-11-1000-2*c4.4xlarge-8BG
Above, we’re running two c4.4xlarge EC2 instances as backgrounders. Each machine hosts four backgrounder processes for a total of eight. On Windows, eleven jobs were submitted per minute until we reached one thousand.
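If your eyes glaze over at that kind of encoding, here’s a trivial helper (purely illustrative, not part of the test harness) that unpacks a label into its parts:

# Illustration only: split a test label like "Windows-15-1000-2*m4.2xlarge-12BG"
# into its components, exactly per the decoder ring above.
def decode_test_name(name):
    platform, jobs_per_minute, total_jobs, instances, backgrounders = name.split('-')
    instance_count, instance_type = instances.split('*')
    return {
        'platform': platform,
        'jobs_per_minute': int(jobs_per_minute),
        'total_jobs': int(total_jobs),
        'instance_count': int(instance_count),
        'instance_type': instance_type,
        'total_backgrounders': int(backgrounders.rstrip('BG')),
    }

print(decode_test_name('Windows-15-1000-2*m4.2xlarge-12BG'))
# {'platform': 'Windows', 'jobs_per_minute': 15, 'total_jobs': 1000,
#  'instance_count': 2, 'instance_type': 'm4.2xlarge', 'total_backgrounders': 12}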
Extract Refreshing Pattern
Most of the tests followed the same basic patterns:
- Tons of refreshes on the x-small and small workbooks
- Some refreshes on the medium workbooks
- Few refreshes on the large and x(xx, xxx)-large workbooks
I chose this workload spread based on what I’ve observed in Tableau Online. We have lots of small extract refreshes that often finish in thirty seconds to a minute, and a few “monster” refreshes that take over an hour. Interestingly, those few long-running refreshes often consume a LOT of backgrounder time vs. the small extracts (of which there are more).
Below you can see the relative proportions of different flavored extract refreshes:
In terms of raw count, x-small and small refreshes make up ~80% of the activity in this test. After that, 7% are large, 6% x-large. xx-large and medium workbooks come in at 3% each, and that little sliver of light blue at the far right represents xxx-large extract refreshes at 0.05%.
What’s interesting is that the xx-large workbooks make up 42% of total backgrounder execution time. My x-small workbooks only utilize 12.33% of the cycles during this test.
Therefore, if I’m an admin trying to “stretch” my backgrounder dollar, it quickly becomes clear that I want to cut down on xx-large executions: 3% of executions soaking up 42% of the actual work my machine does isn’t a good thing.
How Do I Measure All This Stuff?
The background_jobs table in the workgroup database gives you just about everything you need:
- Created At: when the extract job landed in the queue
- Started At: when the extract job actually started executing (“Queue Wait Time” = Started At - Created At)
- Completed At: when the extract job completed (“Run Time” = Completed At - Started At)
- Job Name: you care about the Refresh Extracts and Increment Extracts rows
- Finish Code: 0 = Completed, 1 = Failed
- Progress: -1 = Pending, 100 = Finished (check Finish Code for success vs. failure), anything in between = Running
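For example, a query along these lines gives you queue wait time and run time per completed job. This is a sketch: I’m assuming the created_at / started_at / completed_at column names here, so double-check them against the workgroup schema for your version of Tableau Server.

# Sketch: compute queue wait time and run time for completed extract jobs.
# Assumes created_at / started_at / completed_at columns in background_jobs.
import psycopg2

conn = psycopg2.connect("host=<hostname> port=8060 dbname=workgroup "
                        "user=<tableau or read-only> password=<password>")
cur = conn.cursor()
cur.execute("""
    SELECT job_name,
           created_at,
           started_at - created_at   AS queue_wait_time,
           completed_at - started_at AS run_time
    FROM background_jobs
    WHERE job_name IN ('Refresh Extracts', 'Increment Extracts')
      AND finish_code = 0
    ORDER BY created_at
""")
for job_name, created_at, queue_wait, run_time in cur.fetchall():
    print(job_name, created_at, queue_wait, run_time)
cur.close()
conn.close()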
If you’re feeling really nutty, you can even measure the length of the queue during your tests and save that info off somewhere. Here’s some Python that does just that:
import psycopg2
import time

# Source database & query which monitors the job queue (Tableau repository)
connStringSource = "host=<hostname> port=8060 dbname=workgroup user=<tableau or read-only> password=<password>"
sqlSource = """
SELECT job_name,
       CASE
           WHEN progress = -1 THEN 'Pending'
           WHEN progress < 100 THEN 'In Process'
           WHEN finish_code = 0 THEN 'Success'
           ELSE 'Error'
       END AS status,
       COUNT(job_name),
       now()
FROM background_jobs
WHERE job_name IN ('Refresh Extracts', 'Increment Extracts', 'Subscription Notifications')
GROUP BY job_name,
         CASE
             WHEN progress = -1 THEN 'Pending'
             WHEN progress < 100 THEN 'In Process'
             WHEN finish_code = 0 THEN 'Success'
             ELSE 'Error'
         END,
         now()
"""

# Target database to host results
connStringTarget = "host=foo.bar.rds.amazonaws.com port=5432 dbname=<some_database> user=<user> password=<pw>"
sqlTarget = "INSERT INTO backgrounder_activity (job_name, status, status_count, measurement_time) VALUES (%s, %s, %s, %s)"

while True:
    # Grab a snapshot of the job queue from the repository
    sourceConnection = psycopg2.connect(connStringSource)
    sourceCursor = sourceConnection.cursor()
    sourceCursor.execute(sqlSource)
    rows = sourceCursor.fetchall()
    sourceCursor.close()
    sourceConnection.close()

    # Write one row per (job_name, status) combination to the target database
    targetConnection = psycopg2.connect(connStringTarget)
    targetCursor = targetConnection.cursor()
    for row in rows:
        targetCursor.execute(sqlTarget, (row[0], row[1], row[2], row[3]))
    targetConnection.commit()
    targetCursor.close()
    targetConnection.close()

    print('Data Recorded')
    time.sleep(60)

# -- CREATE TABLE STATEMENTS:
# CREATE TABLE public.backgrounder_activity
# (
#     id integer NOT NULL DEFAULT nextval('backgrounder_activity_id_seq'::regclass),
#     job_name character varying(50) COLLATE pg_catalog."default",
#     status character varying(50) COLLATE pg_catalog."default",
#     status_count integer,
#     measurement_time timestamp without time zone,
#     CONSTRAINT pk_id PRIMARY KEY (id)
# )
# WITH (
#     OIDS = FALSE
# )
# TABLESPACE pg_default;
#
# ALTER TABLE public.backgrounder_activity
#     OWNER to postgres;
#
# -- Index: idx_id
# -- DROP INDEX public.idx_id;
# CREATE UNIQUE INDEX idx_id
#     ON public.backgrounder_activity USING btree
#     (id)
#     TABLESPACE pg_default;
#
# -- Index: idx_job_status
# -- DROP INDEX public.idx_job_status;
# CREATE INDEX idx_job_status
#     ON public.backgrounder_activity USING btree
#     (job_name COLLATE pg_catalog."default", status COLLATE pg_catalog."default")
#     TABLESPACE pg_default;
#
# ALTER TABLE public.backgrounder_activity
#     CLUSTER ON idx_job_status;
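Once that loop has been running for a while, reading the queue history back out of backgrounder_activity is just another query. Something like this works (same caveat: it’s a sketch against the table created above):

# Sketch: read back the recorded extract-refresh queue depth over time
# from the backgrounder_activity table populated by the script above.
import psycopg2

conn = psycopg2.connect("host=foo.bar.rds.amazonaws.com port=5432 "
                        "dbname=<some_database> user=<user> password=<pw>")
cur = conn.cursor()
cur.execute("""
    SELECT measurement_time, status, SUM(status_count)
    FROM backgrounder_activity
    WHERE job_name = 'Refresh Extracts'
    GROUP BY measurement_time, status
    ORDER BY measurement_time
""")
for measurement_time, status, job_count in cur.fetchall():
    print(measurement_time, status, job_count)
cur.close()
conn.close()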
So that’s it, campers. Here’s what you can expect over the next few posts:
- Part II: What sort of EC2 instances should I use as backgrounder hosts and why?
- Part III: Should I scale horizontally or vertically?
- Part IV: “Other Stuff” that might be interesting (we’ll see if this is necessary)