This talk is a deep dive into how Facebook converted a gigantic Hive batch-processing job that consumed 6,000 CPU-days into a Spark job that runs with one quarter of the CPU at one quarter of the latency. To accomplish this, we made numerous stability and performance improvements to Apache Spark, tuned configurations, and optimized our business logic.
Nominated as one of the top 10 Apache Spark blog posts of 2016, this talk describes the experiences and lessons learned while scaling Spark to replace one of Facebook's Hive workloads. Examples include taking one of the existing pipelines and migrating it to Spark to enable fresher feature data and improve manageability. This work led to major reliability improvements, including making the PipedRDD handle fetch failures gracefully and making cluster restarts less disruptive. Performance optimizations were also made as part of the migration, such as reducing shuffle write latency, which yielded a CPU improvement of up to 50% for jobs writing a large number of shuffle partitions.
Shuojie Wang is a Software Engineer at Facebook, where he works on feed data, video delivery, distributed cron-like systems, and Facebook's Super Computer. Previously, he was an instructor at the China Welfare Institute Children's Palace, where he taught National Olympiad in Informatics courses at both the high-school and middle-school levels.