COPY Into Redshift With Ease, With Flare-up!

Redshift is Amazon’s “petabyte-scale data warehouse solution.” Under the hood it’s a custom, distributed version of Postgres 8.0 whose provisioning and configuration Amazon manages on your behalf, as it does with RDS. What’s terrific about Redshift is the combination of price, speed, and the abstraction that SQL provides on top of a high-capacity, distributed store.

One of Redshift’s most interesting features is also the trickiest to execute.

Motivation


The Redshift COPY command is the means by which data is bulk-loaded into Redshift. It parallelizes loading data from S3, EMR, DynamoDB, or other remote hosts. Postgres aficionados will notice that it is different from the Postgres COPY command and, as such, has additional error conditions present only in the Redshift documentation.

Briefly, why not INSERT? Redshift is optimized for querying, not for insertion, so the INSERT command is more of an exceptional use case and “might be prohibitively slow.”

Execution

This operation cannot be kicked off with a call to the Redshift API, whose responsibilities are the provisioning and configuration of Redshift clusters. COPY is executed by the SQL engine and, as such, a database connection is required.
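
To make that concrete, here’s a minimal sketch of the do-it-yourself route using the Ruby pg gem. The endpoint, database, table, bucket, and column list are the same placeholders used in the example later in this post; the credentials are elided, and the exact COPY options will vary with your data source.

    require 'pg'

    # Placeholder connection details -- substitute your own cluster and credentials.
    conn = PG.connect(
      host:     'flare-up-test.cskjnp4xvaje.us-west-2.redshift.amazonaws.com',
      port:     5439,
      dbname:   'dev',
      user:     'my_redshift_user',
      password: ENV['REDSHIFT_PASSWORD']
    )

    # The COPY statement itself; REGION and CSV are only two of many options.
    conn.exec(<<~SQL)
      COPY hearthstone_cards (name, cost, attack, health, description)
      FROM 's3://slif-redshift/hearthstone_cards_short_list.csv'
      CREDENTIALS 'aws_access_key_id=...;aws_secret_access_key=...'
      REGION 'us-east-1'
      CSV;
    SQL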

If you wish to automate COPY, your options are:

  • Amazon Data Pipeline: We’ve found this to be arcane and difficult to use; not our first choice. Some day, it may be.
  • Postgres Client: You could install the Postgres client on your jobflow control machines (we use Azkaban) and wire scripts together to handle the call to the Postgres CLI (psql).

…at which point you’d also have to handle three classes of errors:

  1. SQL errors constructing the COPY command (e.g. forgetting ‘quotes’ around the OPTIONS portion of the command).
  2. Errors validating COPY preconditions (e.g. cannot access source data).
  3. Errors during the COPY operation itself (e.g. the 2,000th row is missing a column).

Each of these errors is reported differently:

  1. A PG::SyntaxError exception.
  2. A PG::InternalError exception.
  3. Population of the STL_LOAD_ERRORS Redshift table.
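
If you go the pg-gem route sketched above, the first two classes surface as Ruby exceptions, roughly like this (the exception classes are the pg gem’s; the handling itself is just illustrative):

    begin
      conn.exec(copy_sql)   # copy_sql is the COPY statement from the earlier sketch
    rescue PG::SyntaxError => e
      # Class 1: the COPY statement itself didn't parse.
      abort "COPY statement is malformed: #{e.message}"
    rescue PG::InternalError => e
      # Class 2: Redshift rejected the COPY (e.g. the source data is inaccessible).
      # A mid-load failure also raises here, but the details land in STL_LOAD_ERRORS.
      abort "COPY failed: #{e.message}"
    end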

What’s difficult about the STL_LOAD_ERRORS reporting is that it’s out of band. If COPY fails during the import, you must break your flow: go from diagnosing the step failure in your jobflow tool (likely by examining STDOUT/STDERR) to finding your SQL tool, pulling down the Redshift credentials, and querying a table in Redshift. Implicitly this means you must remember the name of that table, its schema, and how its contents should be interpreted.
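
For reference, the out-of-band query looks something like this; the columns below are a subset of those documented for STL_LOAD_ERRORS, and the connection is the same placeholder as before:

    # Pull the most recent load errors, newest first.
    res = conn.exec(<<~SQL)
      SELECT starttime, filename, line_number, colname, err_reason
      FROM stl_load_errors
      ORDER BY starttime DESC
      LIMIT 10;
    SQL

    res.each { |row| puts row.inspect }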

Flare-up!

Flare-up is a gem-based command-line utility that wraps access to the Redshift COPY command. Because it’s distributed via Rubygems, installing it also installs the pg gem and its native extensions, relieving you of having to install the Postgres client yourself.

Here’s a sample invocation, with each part annotated below.

> flare-up                                                      \
    copy                                                        \
    s3://slif-redshift/hearthstone_cards_short_list.csv         \
    flare-up-test.cskjnp4xvaje.us-west-2.redshift.amazonaws.com \
    dev                                                         \
    hearthstone_cards                                           \
    --column-list name cost attack health description           \
    --copy-options "REGION 'us-east-1' CSV"                      
  1. Time to flare up!
  2. copy subcommand, leaving room for others (e.g. flare-up errors to show load errors?)
  3. Data source
  4. Your Redshift endpoint
  5. Destination database name
  6. Destination table name
  7. Optionally specified list of columns in the CSV; not all data sources require this
  8. REGION 'us-east-1': my S3 bucket is in a different region than my Redshift database
  9. CSV: the contents of the data source are a CSV

Amazon’s done a great job of providing a great deal of flexibility in the COPY command, and the documentation is excellent as well, loaded with examples.

We hope Flare-up helps you work with Redshift! Let us know of any suggestions or issues at the project page.