Elk-ing All the Way
Sharethrough scales its web infrastructure horizontally to cope with demand. This means our logs live across several servers, both for our decisioning engine and for our front-ends. We looked at different technologies for consolidating and aggregating log data and settled on the ELK stack: Elasticsearch, Logstash, and Kibana.
Getting Started
We rely heavily on Chef Supermarket for package recipes, and in the case of ELK, installing a log collector and forwarders using community cookbooks is straightforward. However, Supermarket hosted only version 0.3.10 of the Elasticsearch cookbook when we wanted 0.3.11. We documented the workaround for installing the latest Elasticsearch here.
We use grok to tokenize each log line into named fields on the collector. Grok ships with a rich set of predefined patterns, but getting a line parsed correctly can still be a daunting task. The grok debugger is a great help here: it lets us experiment with and validate different filter combinations, so we collect the right data before rolling them into production.
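To make the idea of tokenizing a log line concrete, here is a minimal Python sketch that does roughly what a grok pattern does: it matches a line against a regular expression with named groups and returns the fields. This is not our Logstash configuration. The log layout (an Apache "combined" style line), the field names, and the sample line are assumptions chosen for illustration.

```python
import re

# Assumed log layout: Apache "combined" style. The real front-end format
# may differ; the field names below are illustrative only.
LOG_PATTERN = re.compile(
    r'(?P<clientip>\S+) \S+ \S+ '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<verb>[A-Z]+) (?P<request>\S+) HTTP/(?P<httpversion>[\d.]+)" '
    r'(?P<response>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def tokenize(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Hypothetical sample line, not taken from real traffic.
sample = (
    '203.0.113.7 - - [10/Oct/2014:13:55:36 -0700] '
    '"GET /assets/missing.js HTTP/1.1" 404 512 '
    '"http://example.com/" '
    '"Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X)"'
)
fields = tokenize(sample)
```

As with a grok filter, a line that does not fit the pattern yields no fields (`None` here), which is why it pays to validate patterns against real samples before shipping them.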
Reducing Load, Increasing Performance and Happier Partners
With our newfound visibility, we identified a surge of 4xx status codes within a 20-minute interval. Further investigation revealed that a publisher was requesting a nonexistent resource during that window each night. Because no cache time was set on our CDN for 4xx responses, we reconfigured the CDN to mitigate the impact of large-scale 4xx demand. This kind of surge can happen when a large publisher has issues with their integration, and with ELK we were able to identify the publisher and work with them to correct it.
We also discovered that customizing CDN settings for other classes of static resources makes better use of the CDN: with a higher cache time, we reduced the number of front-end requests by 75% (!).
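The kind of aggregation that surfaces such a surge can be sketched in a few lines of Python: count 4xx responses per 20-minute bucket and per publisher. We did this in Kibana rather than in code, so treat this purely as an illustration. The event shape (`timestamp`, `status`, `publisher`) and the sample events are hypothetical.

```python
from collections import Counter
from datetime import datetime

BUCKET_MINUTES = 20


def bucket_start(ts):
    """Round a datetime down to the start of its 20-minute bucket."""
    return ts.replace(
        minute=ts.minute - ts.minute % BUCKET_MINUTES,
        second=0,
        microsecond=0,
    )


def count_4xx(events):
    """Count 4xx responses keyed by (bucket start, publisher)."""
    counts = Counter()
    for event in events:
        if 400 <= event["status"] < 500:
            key = (bucket_start(event["timestamp"]), event["publisher"])
            counts[key] += 1
    return counts


# Hypothetical parsed events; the field names are assumptions.
events = [
    {"timestamp": datetime(2014, 10, 10, 2, 5), "status": 404, "publisher": "pub-a"},
    {"timestamp": datetime(2014, 10, 10, 2, 15), "status": 404, "publisher": "pub-a"},
    {"timestamp": datetime(2014, 10, 10, 2, 25), "status": 404, "publisher": "pub-a"},
    {"timestamp": datetime(2014, 10, 10, 2, 10), "status": 200, "publisher": "pub-a"},
    {"timestamp": datetime(2014, 10, 10, 2, 7), "status": 403, "publisher": "pub-b"},
]
counts = count_4xx(events)
```

Sorting the result with `counts.most_common()` puts the noisiest publisher and time window at the top.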
Ad Hoc Insight
Sharethrough provides an iOS SDK to our publishers, and we can use ELK as a quick, ad hoc tool for a high-level understanding of our audience (though we primarily rely on Spark for this). Grouping user agents shows iOS is the dominant platform. Our Android SDK is in active development, and when it comes online we’ll rely on ELK to measure Android adoption by stacking it against the iPhone UAs.
ELK has only been in place a short time and it’s already boosted our operational visibility and, more importantly, shortened the time it takes for engineering to understand and respond to operational issues. We’re constantly finding new insights thanks to the simplicity of Kibana’s method of aggregating log data.
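Grouping user agents by platform can be sketched as below. This is a simplified illustration, not how Kibana does it. The substring checks are a rough heuristic, and the sample user agent strings are made up for the example.

```python
from collections import Counter


def platform_of(user_agent):
    """Classify a user agent string with a rough substring heuristic."""
    if any(token in user_agent for token in ("iPhone", "iPad", "iPod")):
        return "iOS"
    if "Android" in user_agent:
        return "Android"
    return "other"


def platform_counts(user_agents):
    """Count requests per platform."""
    return Counter(platform_of(ua) for ua in user_agents)


# Hypothetical user agents, shortened for readability.
user_agents = [
    "Mozilla/5.0 (iPhone; CPU iPhone OS 8_0 like Mac OS X)",
    "Mozilla/5.0 (iPad; CPU OS 8_0 like Mac OS X)",
    "Mozilla/5.0 (Linux; Android 4.4.4; Nexus 5)",
    "Mozilla/5.0 (Windows NT 6.1; WOW64)",
]
totals = platform_counts(user_agents)
```

Comparing `totals["iOS"]` against `totals["Android"]` over time gives the same stacked view described above.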
If you’re interested in digging for data nuggets with us, we’re hiring!