If a tree falls in the woods...

At Sharethrough, many demand-side platforms (DSPs) bid into our exchange via real-time bidding (RTB). In turn, our adserver logs events as they occur.

The fast-moving nature of this log data inevitably brings us to contemplate two topics that are often intertwined: real-time data and dashboards.

It’s straightforward enough to pipe your data into Kinesis or Kafka and have it available in real time. But whenever someone asks you for Real-Time Data (especially in dashboard form), be sure to ask: “Can you do anything with this data in real time? What decision can you make in real time with this real-time data?”

For example, if requests into your adserver dropped 50% in the last ten minutes, you need to act. But observing that, say, fewer people clicked on an ad in the last ten minutes is less actionable. And vanity metrics are never needed in real time, or in any dashboard.

In our case, we needed to address a real problem: as we work with more partners and support more features, more events flow through our pipeline. We needed to know What is happening and Why it’s happening.

We’re pretty good with the What part. We use Librato to capture metrics about What is happening. For example:

  • How many requests are we getting?
  • What’s the load average on our servers?

These metrics give us the health of our network. What they don’t give us is a picture of Why things are happening:

  • Why is a bidder receiving errors when trying to bid into our system?
  • Why isn’t a bidder winning?

What we needed was log analysis across various services and event streams, with a bias towards recency. We had tooling that enabled engineers to write Spark code against S3 log buckets, but the size of our cluster – and the fact that it was shared across all of engineering – meant that execution times varied wildly. So we decided to build our own stripped-down CQRS system.

Prototyping

Our first task was to get the event stream into a data store that would let us query hundreds of millions of rows of logs on the order of tens of seconds instead of tens of minutes (remember, we’re only keeping recent data). Redshift seemed the natural choice given those requirements and the fact that our tech stack is heavily AWS. In fact, most of our logs were already landing in S3 and Kinesis. We then had to figure out how to get data from S3/Kinesis into Redshift. Given the tooling AWS provides for integration between its services, we decided to use Lambda.

Our first iteration:

  1. Logs ship to S3 in near real-time
  2. S3 triggers an SNS notification
  3. SNS notification triggers a Lambda
  4. A Python Lambda parses the raw logs and sends the records to Kinesis Firehose (sketched below)
  5. Kinesis Firehose buffers the records on S3 before periodically issuing a Redshift COPY to load them into Redshift
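
To make steps 3 and 4 concrete, here is a minimal sketch of what that Lambda might look like. The delivery stream name, the newline-delimited-JSON log format, and the parse_line helper are assumptions for illustration rather than our production code; the boto3 calls are the stock S3 and Firehose APIs.

    import gzip
    import json
    from urllib.parse import unquote_plus

    import boto3

    s3 = boto3.client("s3")
    firehose = boto3.client("firehose")

    # The delivery stream name is an assumption for this sketch.
    DELIVERY_STREAM = "adserver-events"

    def parse_line(line):
        """Hypothetical parser: keep only the fields the Redshift table expects."""
        raw = json.loads(line)
        return {
            "request_id": raw.get("request_id"),
            "event_type": raw.get("event_type"),
            "occurred_at": raw.get("timestamp"),
        }

    def handler(event, context):
        # SNS wraps the original S3 notification in its message body.
        for sns_record in event["Records"]:
            s3_event = json.loads(sns_record["Sns"]["Message"])
            for s3_record in s3_event["Records"]:
                bucket = s3_record["s3"]["bucket"]["name"]
                # S3 keys arrive URL-encoded in event notifications.
                key = unquote_plus(s3_record["s3"]["object"]["key"])

                # Fetch the newly written log object.
                body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
                if key.endswith(".gz"):
                    body = gzip.decompress(body)

                # Assume newline-delimited JSON; one Firehose record per log line.
                records = [
                    {"Data": (json.dumps(parse_line(line)) + "\n").encode("utf-8")}
                    for line in body.decode("utf-8").splitlines()
                    if line.strip()
                ]

                # PutRecordBatch accepts at most 500 records per call.
                for i in range(0, len(records), 500):
                    firehose.put_record_batch(
                        DeliveryStreamName=DELIVERY_STREAM,
                        Records=records[i:i + 500],
                    )

Firehose then takes care of buffering on S3 and the periodic COPY into Redshift (step 5), so the Lambda itself stays stateless.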

This worked great. Getting it working took a day, and after some Redshift tuning, it met our latency requirements.
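
The Redshift tuning was mostly table design. As a rough, hypothetical example (the table, columns, and connection details are invented for illustration; Redshift speaks the Postgres wire protocol, so psycopg2 works as a client), putting the distribution key on the request id keeps a request’s events together, and a sort key on the event timestamp lets recency-biased queries skip old blocks:

    import os

    import psycopg2

    # Schema and key choices are illustrative only, not our production tables.
    ddl = """
    CREATE TABLE IF NOT EXISTS bid_events (
        request_id   VARCHAR(64),
        event_type   VARCHAR(32),
        bidder       VARCHAR(64),
        occurred_at  TIMESTAMP
    )
    DISTKEY (request_id)   -- co-locate all events for a request on one slice
    SORTKEY (occurred_at); -- recency-biased queries can prune old blocks
    """

    conn = psycopg2.connect(
        host=os.environ["REDSHIFT_HOST"], port=5439, dbname="logs",
        user="etl", password=os.environ["REDSHIFT_PASSWORD"],
    )
    with conn, conn.cursor() as cur:
        cur.execute(ddl)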

Extending our MVP

Now that we had one event stream in, we needed to load several other event streams from various services into Redshift. Programmatically, adding event streams was simple, but scaling the Lambda ETL presented new challenges: we share our Lambda service limit with other teams within engineering, and other services had stricter SLAs than our ETL did. So we shifted to something we’re pretty familiar with: Scala.

Instead of relying on Python Lambdas triggered by SNS, we switched to a Scala worker pulling off of SQS. This took more lines of code than the Lambda (parsing JSON in Scala is nowhere near as painless as it is in Python), but in the end it worked just as well. (We did end up keeping a scheduled Lambda that expunges old data from the Redshift tables.)
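
The retention piece is small. Here is a sketch of the shape that scheduled Lambda might take; the table names, the seven-day window, the column names, and the fact that it’s written in Python here are all assumptions for illustration:

    import os

    import psycopg2

    # Table names and the retention window are assumptions for this sketch.
    TABLES = ["bid_requests", "bid_responses", "impressions"]
    RETENTION_DAYS = 7

    def handler(event, context):
        conn = psycopg2.connect(
            host=os.environ["REDSHIFT_HOST"], port=5439,
            dbname=os.environ["REDSHIFT_DB"],
            user=os.environ["REDSHIFT_USER"],
            password=os.environ["REDSHIFT_PASSWORD"],
        )
        with conn, conn.cursor() as cur:
            for table in TABLES:
                # Keep only recent data; everything older gets expunged.
                cur.execute(
                    f"DELETE FROM {table} "
                    f"WHERE occurred_at < DATEADD(day, -{RETENTION_DAYS}, GETDATE())"
                )
        # A follow-up VACUUM (run outside a transaction) would reclaim the space
        # and keep the sort order tight.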

That’s where we sit now: given a unique request, we’re able to trace the life of that request through our system. While we’re still adding and discovering more use cases for this, we feel better knowing we have a way to investigate a Why in minutes instead of hours.
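
As an example of what tracing the life of a request looks like in practice, here is a hedged sketch of the kind of query we can now run. The table and column names are hypothetical, and the connection is opened the same way as in the earlier sketches:

    def trace_request(conn, request_id):
        """Pull every event logged for a single request, in time order."""
        # UNION the per-service event tables into one timeline keyed on request_id.
        sql = """
            SELECT 'bid_request'  AS source, occurred_at, detail FROM bid_requests  WHERE request_id = %s
            UNION ALL
            SELECT 'bid_response' AS source, occurred_at, detail FROM bid_responses WHERE request_id = %s
            UNION ALL
            SELECT 'impression'   AS source, occurred_at, detail FROM impressions   WHERE request_id = %s
            ORDER BY occurred_at
        """
        with conn.cursor() as cur:
            cur.execute(sql, (request_id, request_id, request_id))
            return cur.fetchall()

Because the data is recent and the tables are keyed on the request id, a query like this comes back in seconds rather than the minutes a Spark job over S3 used to take.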

Circling back

Building a service isn’t sufficient. To know the service is working, you have to monitor it. Sometimes monitoring metrics isn’t sufficient. You have to be able to ask the service Why things are happening. We decided that building and maintaining a system to answer those questions was better than spending engineering time answering each question ad hoc.

Sometimes knowing that the tree fell isn’t sufficient; you have to know why it fell. And to do that, you need to be able to look at everything happening around all the trees.

What’s next?

Maybe we’ll build a dashboard on top of it!