The Law of the Instrument (aka Maslow’s Hammer) is an over-reliance on a familiar tool — or, as my grandfather used to say “when the only tool you have is a hammer, every problem begins to resemble a nail.”
I’m sure this principle comes to mind as many professionals within the scientific computing community are being asked to investigate the appropriateness of Apache Hadoop® — today’s most familiar Big Data tool. When used inappropriately, and incorporating technologies not suited for scientific Big Data, using Apache Hadoop may indeed feel like wielding a cumbersome hammer. But when used appropriately, and with a technology stack that’s specifically suited to the realities of scientific Big Data, Hadoop can feel like a Swiss Army knife — a multipurpose tool capable of doing a wide range of things. Of course, whether Hadoop feels like a Swiss Army knife or not depends not only on the experience level of the user, but also on whether it’s designed and implemented for scientific Big Data. And scientific Big Data is different from the Big Data much of the rest of the world is dealing with. Here’s why:
Hammering against scientific data files
Scientific Big Data is often about big, hairy source files — not just the aggregated click streams or web content that Hadoop was originally designed for, but huge hierarchical files generated via computationally intensive processes. While Hadoop can analyze these files, about 99 percent of the world’s Hadoop implementations do so with a level of efficiency akin to cracking open a bottle of vino with a 5 pound sledgehammer — you can use it, but there’s a lot of waste.
Apache Hadoop is beautifully simplistic by design and was originally intended for small chunks of data that are aggregated into larger files and then analyzed in their entirety for every join or query. This approach is very useful when you’re looking at getting user sentiment across millions of Twitter feeds, and it can even be very useful for scientific Big Data applications where huge volumes of sensor data are aggregated into one volume. But it breaks down when you’re analyzing a 30-terabyte seismic or weather model file within a scientific Big Data application. Its difference is expressed in the delta between analyzing every block of that 30-terabyte file and crunching only the 30 megabytes you’re actually interested in.
Apache Hadoop can indeed feel like a blunt and inappropriate hammer when this inefficient process of analyzing unnecessary blocks of data is repeated dozens or hundreds of times a day. This is a case where random access to files is necessitated, and frankly, that’s not in HDFS’s (the Hadoop Distributed File System) wheelhouse.
Fortunately, by its inherent Swiss Army knife-like design, Hadoop can be modified to do more than just hammer. By leveraging the goodness of MapReduce and underlaying HDFS with a POSIX compliant file system such as Lustre, organizations can bypass areas of the file they know won’t be of interest and analyze large hierarchical files more efficiently. While many would argue this approach isn’t Hadoop but MapReduce on a POSIX-compliant file system, it should be done in a way that doesn’t impede the operations of the other tools in the Swiss Army knife (aka the MapReduce ecosystem) to do their jobs. In other words, the storage has to be presented to MapReduce and its constituents as if it was HDFS, even though another file system lies underneath it. And yes, that’s one of the ways we are looking at for designing our Apache Hadoop solutions here at Cray®.
Leveraging or replacing scientific storage
Hierarchical source file complexities aside, standard Apache Hadoop implementations aren’t the best fit for many scientific environments for other reasons. For one, these organizations already have standard file systems that they deal with on a regular basis and their job scheduling and job workflow are built around Posix file access. By introducing HDFS, if it is not closely tied to these workflows, these organizations are being asked to adopt a filesystem that doesn’t easily fit in this environment. Let’s look at the impact of introducing HDFS-only in an environment with 1 petabyte of scientific data.
Using HDFS for 1 petabyte of scientific data:
- Extract, Transform, Load (ETL). Every block of data has to be extracted from its source file system, transported over the wire, and ingested into HDFS. This method is computationally expensive and will likely consume a great deal of I/O. Suffice it to say that once the data has been copied, there will be a reluctance to update it.
- Initial Duplication. Assuming 1 petabyte lives on the native POSIX-compliant file system, an additional petabyte will now have to live within HDFS.
- HDFS Duplication. HDFS will expect to make two additional copies of each block of data for availability. That’s another 2 petabytes of storage.
- Combined Impact. If you’re following along, the net of this is a petabyte of ETL and three petabytes of spinning storage. That’s 1,000 3-terabyte disks with a lot of associated infrastructure overhead. Oh, and that ETL process will be a bit of a bear as well.
Using a POSIX-compliant file system for 1 petabyte of scientific data:
- Ideally, one could mount their POSIX compliant volume, with MapReduce.
Yes, you’ll likely generate a heck of a lot of data on the MapReduce side. You’ll likely also be dragging a lot of data that is actually very suitable for HDFS such as streaming sensor data. And some scenarios will require the ingest all of scientific data into a volume that’s dedicated to Hadoop, but Cray is working to ensure that you won’t have to. And once it’s there, you won’t be obligated to take on a 200 percent availability overhead tax as the file system you’re using will likely require 15-20 percent RAID parity which most organizations find appropriate for even their most mission-critical data. At least, that’s how Cray is designing Apache Hadoop solutions for scientific Big Data.
Using HPC as a Swiss Army knife
Yes, your HPC environment (certainly your Cray HPC environment) can excel with both analytical and compute-intensive workloads. Furthermore, with some modifications to the Apache Hadoop stack and supporting infrastructure (things Cray has been working on), it can Hadoop far more efficiently and even cost effectively than running those jobs on ad-hoc distributed infrastructure. We’ve been making significant strides in this area, and we continue to improve. But beyond the benefits of one Cray cabinet being able to do the work of distributed gear on, say, five distributed cabinets, there are some compelling reasons organizations might choose to have both scientific compute and Hadoop workloads on the same infrastructure.
By leveraging the same infrastructure for both types of workloads, organizations have the flexibility of adapting their infrastructure and workflows to suit the needs of their project at a specific point in time. They can better collaborate across disciplines to address grand challenges that aren’t so easily placed in one bucket or another. If done right, they can adaptively manage disparate workloads and workflows in parallel (aka resource management and job scheduling). And they can more easily share budgetary and staffing resources among departments, instead of fighting for them across fiefdoms.
Embracing Hadoop as a Swiss Army knife
It’s not lost on me that the analogy of a Swiss Army knife is probably abhorrent to many in the HPC community. Whereas a Swiss Army knife is a jack-of-all-trades and master of none, HPC is about applying the best possible tools for a specific job, often at the cost of greater expense or complexity. And I believe many HPC professionals look at technologies like Apache Hadoop, heck specifically at Hadoop, with the disdain that one would when being asked to take a Ronco pocket fisherman to a Bass tournament.
Yes, Hadoop has some perceived and even some real performance issues. It was initially conceived in the service provider world where huge staffs maintain thousands upon thousands of cheap white boxes by spending their days applying the DevOps equivalent of duct tape and bailing wire with Perl scripts and Ruby. It’s somewhat green, and still requires a great deal of fiddling to get right. And it’s out of the box design seems to be a full 180 degrees off of anything HPC stands for. But I believe it’s going to have a prominent role in your datacenter in the not too distant future.
History repeats itself — almost 20 years ago I was “THE Linux competitive guy” for a UNIX company. Linux was somewhat rough around the edges and organizations were having difficulty getting it to reliably scale to meet anything close to mission-critical expectations, but still . . . Linux felt inevitable to me. Hadoop has that same feel today. Looking at the Hadoop stack, it seems to do the work of about a dozen commercial software companies. Both public and private institutions are embracing it, and it’s getting a lot of traction with industry vendors. Fortunately, the framework is flexible enough to be adapted to suit the needs of a wide range of environments. Yes, Hadoop is coming. And much like the alien substance in the 1958 sci-fi film “The Blob,” it’s intent on ingesting your scientific data. And yes, Cray is dedicating resources to ensure it meets the needs of organizations looking to leverage Hadoop to scientific big data.