2017 Spark Summit East Reveals Progress but Not Disruption

This was a conference to marvel at, and not just for the lineup. How many gigs start with a New England Patriots Super Bowl victory parade and end with a quasi-blizzard? Boston’s “ambiance” aside, engaged attendees sank their teeth into a feast of tech, much of it centered on real-time performance of analytics workflows, especially in the context of the latest buzz: machine learning and deep learning (ML and DL), along with AI. It’s hard to believe we’re in the third wave of AI! I started contributing during the second wave in the late 1980s, doing research with neural networks and heuristic search algorithms to auto-sort packages for shipping companies. In the latest wave, tech like Apache® Spark™ plays a significant role on the software side of building performant systems.

With Spark having already proven that you can realize revolutionary speedups by running in-memory with fewer lines of code, much of what this conference showcased was evolutionary (rather than disruptive) progress in the Spark ecosystem. Targeting 10x-plus performance improvements at scale for in-memory workflows translates directly to higher-performance infrastructure for streaming and database software.

In his keynote, Cotton Seed, who leads the software team at the Broad Institute for Hail, a performant genomic variant analysis platform, focused on the software infrastructure features that have enabled his team’s success to date. In 2016, Seed’s team used the Urika platform to both increase productivity 3x and scale up data throughput more than 4x. Hail can now process nearly 200,000 whole human exomes in under a day on the system.

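The Databricks training sessions gave a deep dive into Spark 2. An important adjustment coders must make is the shift from DataFrames to Datasets. In the Scala API, Spark 2.0 unified the two, so a DataFrame is now simply a Dataset of Row objects, while typed Datasets add compile-time type checking.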
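To make the DataFrame-to-Dataset adjustment concrete, here is a minimal sketch in Scala against the Spark 2.x API. It is not taken from the training sessions; the `Reading` case class, the column names, the sample values, and the local master setting are all assumptions made purely for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type, used only for this illustration.
case class Reading(sensor: String, value: Double)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; a real cluster would set the master elsewhere.
    val spark = SparkSession.builder()
      .appName("dataset-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Untyped view: in Spark 2.x a DataFrame is an alias for Dataset[Row].
    val df: DataFrame = Seq(("a", 1.5), ("b", 42.0)).toDF("sensor", "value")

    // Typed view: columns are matched to the case class fields by name.
    val ds: Dataset[Reading] = df.as[Reading]

    // The lambda is checked at compile time; a misspelled field would not compile,
    // whereas a misspelled column name in a DataFrame expression fails only at runtime.
    val high: Dataset[Reading] = ds.filter(r => r.value > 10.0)

    high.show()
    spark.stop()
  }
}
```

The trade-off worth noting is that typed lambdas are opaque to Spark's query optimizer, so column-based expressions can still be the better choice when the optimizer needs to see inside a filter.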

True to form, the Apache Spark community held an evening meetup. SnappyData’s Jags Ramnarayan led off with a look at the SnappyData store (on GitHub). He was followed by Ted Malaska, a contributor to many Apache projects and currently at Blizzard Entertainment, who shared some hacks and tricks to make you a Spark rock star.

The Cray team had great conversations with industry thought leaders including Mike Gualtieri (@mgualtieri) with Forrester Research; Nik Rouda (@nrouda) at ESG; James Curtis (@jmscrts) at 451 Research; Mike Matchett (@smworldbigdata) with the Taneja Group; and Doug Henschen (@DHenschen) with Constellation Research.

A couple of Cray’s new partners — Algebraix Data and Lightbend — were in attendance. Algebraix has already demonstrated massive query speedups on the Urika platform. Similarly, Lightbend has proven that running thousands of virtual machines on the Urika system is far more effective than using a public cloud.

On the surface, a seemingly cloud-centric ecosystem continues to burgeon around Apache Spark. CDOs are focusing on key data culture strategies: cloud migration, deployable data science assets, unified governance (across all platforms and data types), and people (with the skills required to build resilient critical-path pipelines). Even so, most will agree the race to the public cloud is misguided for several reasons. In particular, the stiff penalties that violations of the EU GDPR might incur mean that trusting a third party with your data amplifies risk. Moving your business-critical assets and workflows seamlessly also takes time. The real challenge is wise migration to hybrid clouds, probably a multi-year journey when you properly manage the risks specific to the data flowing through your business.

Cray continues to work with datacenters to help you migrate to hybrid clouds. I addressed the genesis of Cray’s strategy to help you start your cloud migration during my talk, which featured Cray’s Urika®-GX analytics platform.

You can view videos and some PDFs of the presentations here.
