How Spotify ran the largest Google Dataflow job ever for Wrapped 2019

Frederic Lardinois

In early December, Spotify launched its annual personalized Wrapped playlist with its users' most-streamed sounds of 2019. Wrapped has become a bit of a tradition, but for 2019, it also gave users a look back at how they used Spotify over the last decade. Because this was quite a large job, Spotify gave us a look under the covers at how it generated these lists for its ever-growing number of free and paying users.

It's no secret that Spotify is a big Google Cloud Platform user. Back in 2016, the music streaming service said publicly that it would move to Google Cloud, and in 2018, it disclosed that it would spend at least $450 million on Google Cloud infrastructure over the following three years.

It was also in 2018, for that year's Wrapped, that Spotify ran what was then the largest job ever run on Google Cloud Dataflow, a service the company had started experimenting with a few years earlier. "Back in 2015, we built and open-sourced a big data processing Scala API for Apache Beam and Google Cloud Dataflow called Scio," Spotify's VP of Engineering Tyson Singer told me. "We chose Dataflow over Dataproc because it scales with less operational overhead and Dataflow fit with our expected needs for streaming processing. Now we have a great open-source toolset designed and optimized for Dataflow, which in addition to being used by most internal teams, is also used outside of Spotify."

For Wrapped 2019, which includes the annual and decadal lists, Spotify ran a job that was five times larger than in 2018 -- but it did so at three-quarters of the cost. Singer attributes this to his team's familiarity with the platform. "With this type of global scale, complexity is a natural consequence. By working closely with Google Cloud’s engineering teams and specialists and drawing learnings from previous years, we were able to run one of the most sophisticated Dataflow jobs ever written."

Still, even with this expertise, the team couldn't just iterate on the full data set as it figured out how to best analyze the data and use it to tell the most interesting stories to its users. "Our jobs to process this would be large and complex; we needed to decouple the complexity and processing in order to not overwhelm Google Cloud Dataflow," Singer said. "This meant that we had to get more creative when it came to going from idea, to data analysis, to producing unique stories per user, and we would have to scale this in time and at or below cost. If we weren’t careful, we risked being wasteful with resources and slowing down downstream teams."

To handle this workload, Spotify not only split its internal teams into three groups (data processing, client-facing and design, and backend systems), but also split the data processing jobs into smaller pieces. That marked a very different approach for the team. "Last year Spotify had one huge job that used a specific feature within Dataflow called 'Shuffle.' The idea here was that having a lot of data, we needed to sort through it, in order to understand who did what. While this is quite powerful, it can be costly if you have large amounts of data."

This year, the company's engineers minimized the use of Shuffle by using Google Cloud's Bigtable as an intermediate storage layer. "Bigtable was used as a remediation tool between Dataflow jobs in order for them to process and store more data in a parallel way, rather than the need to always regroup the data," said Singer. "By breaking down our Dataflow jobs into smaller components -- and reusing core functionality -- we were able to speed up our jobs and make them more resilient."
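To make the idea concrete, here is a minimal Python sketch of the general pattern Singer describes, not Spotify's actual code. It assumes a hypothetical in-memory `KeyedStore` standing in for Bigtable and made-up play events. Each batch of events writes its counts directly to the row for a user, so no step has to regroup the full data set by key, and a later job reads one row per user.

```python
from collections import defaultdict

# Hypothetical play events as (user_id, track_id); not real Spotify data.
EVENTS = [
    ("u1", "trackA"), ("u2", "trackB"), ("u1", "trackA"),
    ("u1", "trackC"), ("u2", "trackB"), ("u2", "trackA"),
]


class KeyedStore:
    """In-memory stand-in for a keyed store such as Bigtable (assumption:
    rows are addressed by key and each row maps column -> counter)."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(int))

    def increment(self, row_key, column, amount=1):
        self.rows[row_key][column] += amount

    def read_row(self, row_key):
        # .get avoids creating an empty row as a side effect of reading.
        return dict(self.rows.get(row_key, {}))


def count_plays(events, store):
    """First job: write each play to its user's row; no global regrouping."""
    for user_id, track_id in events:
        store.increment(user_id, track_id)


def top_tracks(store):
    """Second job: read one row per user and pick the most-played track.
    Ties are broken alphabetically because candidates are sorted first."""
    top = {}
    for user_id in store.rows:
        row = store.read_row(user_id)
        top[user_id] = max(sorted(row), key=row.get)
    return top


store = KeyedStore()
# Two batches stand in for workers that process the events independently.
count_plays(EVENTS[:3], store)
count_plays(EVENTS[3:], store)
print(top_tracks(store))
```

Because the increments commute, the batches can run in any order, or side by side, and the rows end up the same. That is the property that lets a keyed store take over work a shuffle would otherwise do.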

Singer attributes at least a part of the cost savings to this technique of using Bigtable, but he also noted that the team decomposed the problem into data collection, aggregation and data transformation jobs, which it then split into multiple separate jobs. "This way, we were not only able to process more data in parallel, but be more selective about which jobs to rerun, keeping our costs down."
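The selective-rerun idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Spotify's pipeline: the `Pipeline` class, the stage functions and the sample records are all hypothetical. Each stage stores its output under a name, so changing the final transformation only reruns that stage and reuses the stored aggregation.

```python
from collections import Counter


class Pipeline:
    """Hypothetical runner that keeps each stage's output so later
    stages can be rerun without recomputing earlier ones."""

    def __init__(self):
        self.outputs = {}
        self.run_counts = Counter()

    def run_stage(self, name, fn, *input_names, force=False):
        # Reuse the stored output unless a rerun is explicitly requested.
        if name in self.outputs and not force:
            return self.outputs[name]
        inputs = [self.outputs[n] for n in input_names]
        self.outputs[name] = fn(*inputs)
        self.run_counts[name] += 1
        return self.outputs[name]


def collect():
    # Made-up (user_id, minutes_listened) records.
    return [("u1", 3), ("u2", 5), ("u1", 4)]


def aggregate(records):
    totals = Counter()
    for user_id, minutes in records:
        totals[user_id] += minutes
    return dict(totals)


def transform(totals):
    return {u: f"{u} listened for {m} minutes" for u, m in totals.items()}


def transform_v2(totals):
    # A revised story format; only this stage needs to run again.
    return {u: f"{m} minutes streamed" for u, m in totals.items()}


pipeline = Pipeline()
pipeline.run_stage("collect", collect)
pipeline.run_stage("aggregate", aggregate, "collect")
pipeline.run_stage("transform", transform, "aggregate")
stories = pipeline.run_stage("transform", transform_v2, "aggregate", force=True)
print(stories, dict(pipeline.run_counts))
```

In this sketch the collection and aggregation stages each run once while the transformation runs twice, which is the kind of saving Singer points to when he talks about being selective about which jobs to rerun.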

Many of the techniques the engineers on Singer's teams developed are currently in use across Spotify. "The great thing about how Wrapped works is that we are able to build out more tools to understand a user, while building a great product for them," he said. "Our specialized techniques and expertise of Scio, Dataflow and big data processing, in general, is widely used to power Spotify’s portfolio of products."