MLConf 2013 Roundup

This past Friday 11/15, Sharethrough Engineering attended the 2013 MLConf here in San Francisco where Netflix, Twitter, Yelp and others presented on large-scale ML trends and challenges.

Read on for thoughts of the day by a few members of the team - Michael Jensen, Michael Ruggiero and Ryan Weald!

Michael R.‘s Thoughts

For me a real highlight was the talk “Big Data Lessons in Music” given by Eric Bieschke, Chief Scientist at Pandora. Keeping algorithmics to a minimum, he explored the importance of choosing the right metric: “How you judge experiments shapes where you are headed; choose the wrong measuring stick and you wind up in the wrong place.”

Jake Mannix of Twitter discussed content-based approaches to recommendation systems, particularly as an antidote to cold-start problems (e.g. a new ecommerce site that doesn’t yet have ratings). I’m not sure I would want to employ that kind of strategy, since latency would rocket skyward as the item set grows, but he pointed out that Twitter has a cold-start problem with new users (whose graph is small), and that their recsys employs a hybrid of content-based and collaborative filtering-based recommendations.
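To make the hybrid idea concrete, here is a minimal sketch of one way to blend the two signals so that a new user leans on content-based scores and an established user leans on collaborative filtering. This is not Twitter's method: the function name `hybrid_score`, the linear ramp, and the `ramp` threshold of 20 interactions are all assumptions made for illustration.

```python
def hybrid_score(content_score, collab_score, n_interactions, ramp=20):
    """Blend two scores, trusting collaborative filtering more as history grows.

    `ramp` is a hypothetical threshold: the number of interactions after which
    the collaborative score is trusted completely.
    """
    # alpha is 0.0 for a brand-new user and reaches 1.0 once history >= ramp.
    alpha = min(n_interactions / ramp, 1.0)
    return (1 - alpha) * content_score + alpha * collab_score


# A cold-start user is scored purely on content similarity.
print(hybrid_score(content_score=0.8, collab_score=0.2, n_interactions=0))
# A user with a long history is scored purely on collaborative filtering.
print(hybrid_score(content_score=0.8, collab_score=0.2, n_interactions=50))
```

A linear ramp is only the simplest choice; a real system could just as well learn the blend weight from data.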

Quoc V. Le, of Google and Stanford, gave an interesting, though slightly schematic, overview of Google’s DistBelief, which performs unsupervised learning across tens of thousands of cores. Quoc described its uses in image recognition and voice search, but I suspect that these applications look pretty tame compared to what Google plans to do with it.

Michael J.‘s Thoughts

Netflix

Netflix’s previous business model was all about DVD rentals by mail, so users were much more picky about what they put in their queue, and found a lot of value from awesome recommendations. The cost of a bad recommendation was high, as users would have to wait several days to get the DVD, watch it, return it and rate it.

Now Netflix is more about streaming: with 40 million users, 5 billion hours of video were streamed in Q3 of 2013. The bottleneck of delivering DVDs in the mail is gone, and Netflix now gets about 5 million ratings every day. Users now make impulsive decisions about what to watch, and are happy to abandon content if it isn’t to their liking. Because users can watch many different pieces of content with relatively little investment, the “search space” of movies and TV shows can be explored much more quickly.

Modern Netflix recommendations are all about grouping content by similarity across many vectors. User behaviour can be a strong indicator of how other people consume media (other people who watch Parks and Recreation also watched The Office), but tagging content with metadata can lead to very interesting categories for recommendations. The Netflix home screen now features many rows of content, grouped by similarity. Genres are generated from pools of tags, leading to collections such as “Independent Comedies with a Strong Female Lead”.
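The “other people also watched” signal can be sketched as simple co-occurrence counting. This is a toy illustration, not Netflix's system: the function `also_watched` and the sample viewing histories are made up for the example.

```python
from collections import Counter


def also_watched(histories, title, top_n=3):
    """Rank titles by how many users watched them alongside `title`."""
    counts = Counter()
    for watched in histories:
        if title in watched:
            # Count every other title in this user's history once.
            counts.update(t for t in set(watched) if t != title)
    return [t for t, _ in counts.most_common(top_n)]


# Hypothetical viewing histories, one list per user.
histories = [
    ["Parks and Recreation", "The Office", "30 Rock"],
    ["Parks and Recreation", "The Office"],
    ["Parks and Recreation", "30 Rock", "The Office"],
    ["House of Cards", "The Office"],
]

print(also_watched(histories, "Parks and Recreation", top_n=1))
```

Tag-based grouping would work along the same lines, with shared metadata tags taking the place of shared viewers.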

Netflix is probably the gold standard for recommendations right now, with LinkedIn’s “People you may know” a close second. Their powerful system provides such good recommendations that only 25% of views are started from a search, the other 75% coming from a piece of recommended content.

Yelp Recommendations

In the Yelp iOS app, the “nearby” tab used to be a facade for search. A small project team was broken off with about three engineers and some front end developers to work on the new Nearby tab.

Unlike at Netflix, context is very important: a user’s location or the time of day will eliminate 95-99% of Yelp’s business database as valid results.

In the initial version, the team had very little ML experience. They knew they would have all of Yelp’s users and data on day one with no gradual rollout! They had a small team, and they were hoping this product would be long lived, so they had to build for the future. The following principles were distilled:

  • There was a big data retrieval problem (95-99% of data is useless for each request), and database choices were limited by the domain (fast geo search, time-of-day filtering, etc.)
  • Build for what you have, but plan for expansion
  • Goal is a great product, not a benchmark

The architecture works by fanning out search requests to systems called Experts. Each Expert will apply its algorithms and provide zero or more recommendations. One example is the “Liked by Friends” expert, which uses user data to find what your friends like and recommends those results, with location and time-of-day filtering applying to all experts.

The system aggregates the results from each Expert and makes a “wise” decision for this user by ranking the Experts’ recommendations. For example, if you’re traveling to a new city and the “Liked by Friends” expert suggests a coffee shop, you will be very likely to take the recommendation, since it’s unexpected.
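The fan-out and aggregation described above might look roughly like the sketch below. It is not Yelp's code: every name here (`Business`, `Recommendation`, `eligible`, `popular_nearby`, the per-expert `weights`) is hypothetical, and ranking by a weighted score is just one plausible way to make the final decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Business:
    name: str
    city: str
    opens: int   # opening hour, 0-23
    closes: int  # closing hour, 1-24


@dataclass(frozen=True)
class Recommendation:
    business: str
    expert: str
    score: float  # the expert's own confidence (hypothetical 0-1 scale)


def eligible(businesses, city, hour):
    """Context filter applied before any expert runs: location and time of day."""
    return [b for b in businesses if b.city == city and b.opens <= hour < b.closes]


def liked_by_friends(candidates, user):
    liked = user.get("friend_likes", set())
    return [Recommendation(b.name, "liked_by_friends", 0.9)
            for b in candidates if b.name in liked]


def popular_nearby(candidates, user):
    # Hypothetical fallback expert: every eligible business gets a modest score.
    return [Recommendation(b.name, "popular_nearby", 0.5) for b in candidates]


def recommend(businesses, user, experts, weights, limit=3):
    """Fan out to each expert, keep the best weighted score per business, rank."""
    candidates = eligible(businesses, user["city"], user["hour"])
    best = {}
    for expert in experts:
        for rec in expert(candidates, user):
            weighted = weights.get(rec.expert, 1.0) * rec.score
            if rec.business not in best or weighted > best[rec.business][0]:
                best[rec.business] = (weighted, rec)
    ranked = sorted(best.values(), key=lambda pair: pair[0], reverse=True)
    return [rec for _, rec in ranked[:limit]]


businesses = [
    Business("Blue Door Coffee", "Portland", 7, 15),
    Business("Night Owl Bar", "Portland", 18, 24),
    Business("Corner Bakery", "Portland", 6, 14),
    Business("Mission Tacos", "San Francisco", 10, 22),
]
user = {"city": "Portland", "hour": 9,
        "friend_likes": {"Blue Door Coffee", "Mission Tacos"}}
weights = {"liked_by_friends": 2.0, "popular_nearby": 1.0}

for rec in recommend(businesses, user, [liked_by_friends, popular_nearby], weights):
    print(rec.business, "via", rec.expert)
```

Because each expert is a plain function that returns zero or more recommendations, new experts can be added without touching the ranking step.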

Learnings:

  • Solve your own problem (e.g. Netflix ML won’t help at Yelp)
  • Build for what you have, plan for the future
  • Internal iterations pre-launch are very useful (dogfooding).

Ryan’s Thoughts

The majority of ML presented was being applied to recommendation systems in various forms. Three themes emerged:

  1. First was the increased interest in so-called “deep learning”, meaning neural networks with more than 3 hidden layers. This was by far the most popular topic in the more academic track of talks. In particular, there was a focus on how to implement the distributed computing needed to train these huge neural networks on large data sets.
  2. The second major theme was the “productionalization” of recommendation systems. This focused on picking the right metrics to optimize for and choosing algorithms that were understandable in production. Scott Triglia from Yelp talked about how they choose simpler algorithms that can be more easily composed and monitored, which allows complexity to be introduced iteratively without creating a black box. Both Pandora and Netflix discussed the importance of finding the right error function to optimize your recommendations towards. There was a particularly impactful comment from Xavier Amatriain of Netflix: “Social popularity is the baseline for all recommendation systems, and if you can’t beat ‘most popular’ then you need to go back to the drawing board.” He also cited evidence that popularity alone was not a great recommendation for them: they got a 20x improvement when they moved from pure popularity to personalized recommendations.
  3. The third and final theme was the importance of alternative infrastructure for data processing. This was not talked about too directly, other than in the Spark talk, but it was an undertone throughout. Most companies talked about building ETL pipelines using Hadoop and then building and training models using some other system that has more natural iterative programming abstractions than Hadoop.
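
The “beat most popular” test from the second theme can be sketched as a small evaluation harness. This is an illustration under assumptions, not anything shown at the conference: the functions `most_popular` and `hit_rate`, the hit-rate metric itself, and the toy data are all invented for the example.

```python
from collections import Counter


def most_popular(train, user, k):
    """Baseline: the k most-watched titles overall that `user` has not seen."""
    counts = Counter(t for titles in train.values() for t in titles)
    seen = set(train.get(user, []))
    ranked = [t for t, _ in counts.most_common() if t not in seen]
    return ranked[:k]


def hit_rate(recommender, train, held_out, k):
    """Fraction of users whose held-out title appears in their top-k list."""
    hits = sum(1 for user, title in held_out.items()
               if title in recommender(train, user, k))
    return hits / len(held_out)


# Toy data: what each user watched, plus one held-out title per user.
train = {
    "ann": ["A", "B"],
    "bob": ["A", "B"],
    "cat": ["A", "C"],
    "dan": ["B"],
}
held_out = {"ann": "C", "bob": "D", "cat": "B", "dan": "A"}

baseline = hit_rate(most_popular, train, held_out, k=1)
print("most-popular hit rate:", baseline)
```

Any personalized recommender with the same `(train, user, k)` signature can be passed to `hit_rate`; if its score does not exceed `baseline`, it is back to the drawing board.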