Spark Summit West is always well attended, and this year was no exception. Data engineers, data scientists, programmers, architects and technology enthusiasts descended on San Francisco’s Moscone Center earlier this month to learn all about the latest developments with Apache Spark™ and its massive ecosystem.
The complexity of analytics use cases and data science was a dominant theme throughout this year’s event. The keynote by the CEO and co-founder of Databricks, Ali Ghodsi, highlighted some of the challenges of implementing large-scale analytics projects. Ghodsi discussed how the continued growth of Apache Spark has resulted in myriad innovative use cases, from churn analytics to genome sequencing. These applications are difficult to develop for three reasons: they often involve siloed teams of different domain experts, their complex workflows take too long to get from data access to insight, and the infrastructure is costly and difficult to manage.
AI, ML, DL
Data scientists like to explore data by transforming massive datasets and by building large-scale machine learning models. If you’re looking to experiment with machine learning and deep learning, Spark is as good a platform as any to start with. It continues to attract the most interest from academia and open-source developers.
Andy Feng and Lee Yang from Yahoo presented “TensorFlow on Spark: Scalable TensorFlow Learning on Spark Clusters.” TensorFlowOnSpark is a new framework that enables easy experimentation with algorithm designs and supports scalable training and inferencing on Spark clusters. It supports all TensorFlow functionality, including synchronous and asynchronous learning, model and data parallelism, and TensorBoard. It provides architectural flexibility in how data is ingested into TensorFlow (pushing vs. pulling) and in the network protocol used for server-to-server communication (gRPC or RDMA). Its Python API makes integration with existing Spark libraries such as MLlib easy.
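To make the synchronous, data-parallel mode mentioned above more concrete, here is a minimal sketch in plain Python. It does not use TensorFlow, Spark, or the TensorFlowOnSpark API; the function names (`shard_gradient`, `sync_step`, `train`) and the toy one-parameter linear model are assumptions made purely for illustration. Each "worker" computes a gradient on its own shard, and the gradients are averaged before one shared update is applied.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair


def shard_gradient(w: float, shard: List[Sample]) -> float:
    """Mean gradient of the loss 0.5 * (w*x - y)^2 over one worker's shard."""
    return sum((w * x - y) * x for x, y in shard) / len(shard)


def sync_step(w: float, shards: List[List[Sample]], lr: float) -> float:
    """One synchronous step: wait for every worker, average, update once."""
    # On a real cluster each shard gradient would be computed in parallel.
    grads = [shard_gradient(w, shard) for shard in shards]
    avg_grad = sum(grads) / len(grads)
    return w - lr * avg_grad


def train(shards: List[List[Sample]], lr: float = 0.02,
          steps: int = 100, w: float = 0.0) -> float:
    """Repeat synchronous steps from the initial weight w."""
    for _ in range(steps):
        w = sync_step(w, shards, lr)
    return w


# Toy data drawn from y = 3x, split into two equal-sized shards.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[0:4], data[4:8]]
w_final = train(shards)
```

Because the shards are the same size, the averaged gradient equals the full-batch gradient, so the synchronous run behaves like single-machine training. An asynchronous variant would instead apply each worker's gradient as soon as it arrives, trading that equivalence for less waiting.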
Jason Dai and Radhika Rangarajan discussed BigDL, which is a distributed deep learning framework for Apache Spark recently open sourced by Intel. BigDL helps make deep learning more accessible to the big data community by allowing them to continue using familiar tools and infrastructure to build deep learning applications. With BigDL, users can write their deep learning applications as standard Spark programs, which can then directly run on top of existing Spark or Hadoop clusters.
HPC for Spark
Apache Spark workloads typically keep persistent data in memory, and that data is frequently accessed over the network, so network I/O performance is a critical component of Spark systems. The performance characteristics of HPC systems, such as high bandwidth, low latency and low CPU overhead, offer an excellent opportunity to accelerate Spark by increasing network throughput.
Costin Iancu (Lawrence Berkeley National Laboratory) and Nicholas Chaimov (University of Oregon) presented findings from their research on porting Apache Spark to the Cray® XC™ line of supercomputers. Their talk focused on the scalability bottleneck caused by the global file system present in all large-scale HPC installations. Using two techniques (file open pooling and mounting the Spark file hierarchy in a specific manner), they were able to improve scalability from O(100) cores to O(10,000) cores. This is the first result at such a large scale on HPC systems, and it had a transformative impact on research, enabling their colleagues to run on 50,000 cores.
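The general idea behind file open pooling can be sketched as follows: keep a bounded pool of already-open file handles so that repeated reads of the same file do not each issue a fresh open() call, which on a global file system typically means a round trip to a metadata server. This is a minimal Python illustration under stated assumptions, not the presenters' implementation; the class name `FileHandlePool`, its `capacity` parameter, and the file name `shuffle_0.data` are hypothetical.

```python
import os
import tempfile
from collections import OrderedDict


class FileHandlePool:
    """Bounded LRU pool of open read-only file handles (hypothetical helper)."""

    def __init__(self, capacity: int = 64):
        self.capacity = capacity
        self._handles = OrderedDict()  # path -> open file object, oldest first
        self.opens = 0                 # how many real open() calls were issued

    def read(self, path: str, offset: int, length: int) -> bytes:
        handle = self._handles.get(path)
        if handle is None:
            # Pool miss: pay for one real open(), then remember the handle.
            handle = open(path, "rb")
            self.opens += 1
            self._handles[path] = handle
            if len(self._handles) > self.capacity:
                # Evict and close the least recently used handle.
                _, oldest = self._handles.popitem(last=False)
                oldest.close()
        else:
            # Pool hit: mark the handle as most recently used.
            self._handles.move_to_end(path)
        handle.seek(offset)
        return handle.read(length)

    def close(self) -> None:
        for handle in self._handles.values():
            handle.close()
        self._handles.clear()


def demo():
    """Read one file in five chunks and report how many opens were needed."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shuffle_0.data")
        with open(path, "wb") as f:
            f.write(b"0123456789")
        pool = FileHandlePool(capacity=2)
        chunks = [pool.read(path, i, 2) for i in range(0, 10, 2)]
        opens = pool.opens
        pool.close()
        return chunks, opens
```

In this toy run the five chunk reads share a single open() call. Without pooling, each read would open and close the file, multiplying the metadata traffic that the talk identifies as the bottleneck.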
Srivatsan Krishnan and Zhongyue Nah presented Intel’s design and implementation of FPGA as a supplement to vcores in Spark YARN mode to accelerate SparkML applications on the Intel Xeon+FPGA platform. In particular, they have added new options to Spark core that provide an interface for the user to describe the accelerator dependencies of the application. The FPGA information in the Spark context is used by the new APIs and the DRF policy implemented on YARN to schedule the Spark executor on a host with Xeon+FPGA installed. Experimental results from ALS scoring applications, which rely on accelerated general matrix-matrix multiplication (GEMM) operations, demonstrate that Xeon+FPGA improves FLOPS throughput by 1.5× compared to a CPU-only cluster.
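ALS scoring maps onto matrix-matrix multiplication because the predicted score for a user and an item is the dot product of their latent factor vectors, so scoring every user against every item is the product of the user-factor matrix and the transposed item-factor matrix. The plain-Python sketch below shows that equivalence along with the conventional 2·m·n·k FLOP count for such a product. The function names and the tiny factor matrices are made up for illustration, and the sketch says nothing about how the FPGA kernel itself is implemented.

```python
from typing import List

Matrix = List[List[float]]


def gemm_scores(user_factors: Matrix, item_factors: Matrix) -> Matrix:
    """scores[u][i] = dot(user_factors[u], item_factors[i]), i.e. U x V^T."""
    return [
        [sum(a * b for a, b in zip(u, v)) for v in item_factors]
        for u in user_factors
    ]


def gemm_flops(m: int, n: int, k: int) -> int:
    """Multiply-add count for an (m x k) by (k x n) product."""
    return 2 * m * n * k


# Three users and two items, each with two latent factors (toy values).
users = [[1.0, 0.5], [0.0, 2.0], [1.5, 1.0]]
items = [[2.0, 1.0], [0.5, 0.5]]

scores = gemm_scores(users, items)
flops = gemm_flops(len(users), len(items), len(users[0]))
```

With millions of users and items, this single dense product dominates the scoring cost, which is why offloading it to an accelerator can raise overall throughput.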
My recommended talks
Videos and, in some cases, slides are available for these and other sessions from Spark Summit West 2017:
“Databricks,” Ali Ghodsi and Greg Owen, Databricks (video and slides)
“BigDL: Bringing Ease of Use of Deep Learning For Apache Spark,” Jason Dai and Radhika Rangarajan, Intel (video)
“Apache Spark on Supercomputers: A Tale of The Storage Hierarchy,” Costin Iancu (LBNL) and Nicholas Chaimov (University of Oregon) (video)
“Accelerating SparkML Workloads on The Intel Xeon+FPGA Platform,” Srivatsan Krishnan and Zhongyue Nah, Intel (video)
“Speeding Up Spark with Data Compression on Xeon+FPGA,” David Ojika (University of Florida) (slides and video)
“Scaling Genetic Data Analysis with Apache Spark,” Jonathan Bloom and Timothy Poterba (Broad Institute of MIT and Harvard) (slides and video)
“Needle in the Haystack—User Behavior Anomaly Detection for Information Security,” Ping Yan and Wei Deng, Salesforce.com (video)