Apache Hadoop for Scientific Big Data – Maslow’s Hammer or Swiss Army Knife?

The Law of the Instrument (aka Maslow’s Hammer) is an over-reliance on a familiar tool — or, as my grandfather used to say, “when the only tool you have is a hammer, every problem begins to resemble a nail.”

I’m sure this principle comes to mind as many professionals within the scientific computing community are being asked to investigate the appropriateness of Apache Hadoop® — today’s most familiar Big Data tool. When used inappropriately, and incorporating technologies not suited for scientific Big Data, using Apache Hadoop may indeed feel like wielding a cumbersome hammer. But when used appropriately, and with a technology stack that’s specifically suited to the realities of scientific Big Data, Hadoop can feel like a Swiss Army knife — a multipurpose tool capable of doing a wide range of things. Of course, whether Hadoop feels like a Swiss Army knife or not depends not only on the experience level of the user, but also on whether it’s designed and implemented for scientific Big Data. And scientific Big Data is different from the Big Data much of the rest of the world is dealing with. Here’s why:

Hammering against scientific data files

Scientific Big Data is often about big, hairy source files — not just the aggregated click streams or web content that Hadoop was originally designed for, but huge hierarchical files generated via computationally intensive processes. While Hadoop can analyze these files, the vast majority of the world’s Hadoop implementations do so with a level of efficiency akin to cracking open a bottle of vino with a 5-pound sledgehammer — you can use it, but there’s a lot of waste.

Apache Hadoop is beautifully simple by design and was originally intended for small chunks of data that are aggregated into larger files and then analyzed in their entirety for every join or query. This approach is very useful when you’re gauging user sentiment across millions of Twitter feeds, and it can even be very useful for scientific Big Data applications where huge volumes of sensor data are aggregated into one volume. But it breaks down when you’re analyzing a 30-terabyte seismic or weather model file within a scientific Big Data application. The inefficiency is the delta between analyzing every block of that 30-terabyte file and crunching only the 30 megabytes you’re actually interested in.

Apache Hadoop can indeed feel like a blunt and inappropriate hammer when this inefficient process of analyzing unnecessary blocks of data is repeated dozens or hundreds of times a day. This is a case that requires random access to files, and frankly, that’s not in the wheelhouse of HDFS (the Hadoop Distributed File System).

Fortunately, by its inherent Swiss Army knife-like design, Hadoop can be modified to do more than just hammer. By leveraging the goodness of MapReduce and replacing HDFS with a POSIX-compliant file system such as Lustre, organizations can bypass areas of the file they know won’t be of interest and analyze large hierarchical files more efficiently. Many would argue this approach isn’t Hadoop at all but simply MapReduce on a POSIX-compliant file system; the key is to do it in a way that doesn’t impede the other tools in the Swiss Army knife (aka the MapReduce ecosystem) from doing their jobs. In other words, the storage has to be presented to MapReduce and its constituents as if it were HDFS, even though another file system lies underneath it. And yes, that’s one of the ways we are looking at designing our Apache Hadoop solutions here at Cray®.
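As a rough illustration of the idea, stock Hadoop can already be pointed at a POSIX mount through its local file system connector. A core-site.xml fragment along these lines (the Lustre mount point is a hypothetical path) makes MapReduce read and write through the mounted volume rather than HDFS:

```xml
<!-- core-site.xml fragment: make a POSIX mount the default file system.
     /mnt/lustre is a hypothetical mount point for illustration. -->
<property>
  <name>fs.defaultFS</name>
  <value>file:///mnt/lustre</value>
</property>
```

This bare-bones approach loses HDFS’s locality hints, which is exactly why integrated solutions aim to present the POSIX store to MapReduce as if it were HDFS rather than just swapping the URI.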

Leveraging or replacing scientific storage

Hierarchical source file complexities aside, standard Apache Hadoop implementations aren’t the best fit for many scientific environments for other reasons. For one, these organizations already have standard file systems that they deal with on a regular basis, and their job scheduling and job workflows are built around POSIX file access. By introducing HDFS that isn’t closely tied to those workflows, these organizations are being asked to adopt a file system that doesn’t easily fit their environment. Let’s look at the impact of introducing an HDFS-only approach in an environment with 1 petabyte of scientific data.

Using HDFS for 1 petabyte of scientific data:

  • Extract, Transform, Load (ETL). Every block of data has to be extracted from its source file system, transported over the wire, and ingested into HDFS. This method is computationally expensive and will likely consume a great deal of I/O. Suffice it to say that once the data has been copied, there will be a reluctance to update it.
  • Initial Duplication. Assuming 1 petabyte lives on the native POSIX-compliant file system, an additional petabyte will now have to live within HDFS.
  • HDFS Duplication. HDFS will expect to make two additional copies of each block of data for availability. That’s another 2 petabytes of storage.
  • Combined Impact. If you’re following along, the net of this is a petabyte of ETL and three petabytes of spinning storage. That’s 1,000 3-terabyte disks with a lot of associated infrastructure overhead. Oh, and that ETL process will be a bit of a bear as well.
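The back-of-the-envelope math in the list above can be checked in a few lines, using the figures as stated in the post (1 PB of source data, HDFS’s default of two extra replicas, 3-terabyte disks):

```python
# Sanity check of the HDFS storage math described above.
source_pb = 1                  # data already on the native POSIX file system
etl_pb = source_pb             # every block is copied (ETL'd) into HDFS
extra_replicas = 2             # HDFS default replication of 3 => 2 extra copies

hdfs_total_pb = etl_pb * (1 + extra_replicas)   # storage consumed inside HDFS
disk_tb = 3
disks = hdfs_total_pb * 1000 // disk_tb         # PB -> TB, then 3 TB disks

print(hdfs_total_pb)  # 3  (petabytes of spinning storage in HDFS)
print(disks)          # 1000 (3-terabyte disks)
```

So the 1 PB of ETL plus 3 PB of HDFS-resident storage in the bullet list checks out.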

Using a POSIX-compliant file system for 1 petabyte of scientific data:

  • Ideally, one could simply mount their POSIX-compliant volume and run MapReduce against it directly.

Yes, you’ll likely generate a heck of a lot of data on the MapReduce side. You’ll also likely be dragging along some data that is actually very well suited to HDFS, such as streaming sensor data. And some scenarios will require ingesting all of the scientific data into a volume that’s dedicated to Hadoop, but Cray is working to ensure that you won’t have to. And once it’s there, you won’t be obligated to take on a 200 percent availability overhead tax; the file system you’re using will likely require 15-20 percent RAID parity, which most organizations find appropriate for even their most mission-critical data. At least, that’s how Cray is designing Apache Hadoop solutions for scientific Big Data.
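The overhead comparison works out roughly like this; the 18 percent parity figure below is an assumed midpoint of the 15-20 percent range quoted above, not a measured number:

```python
# Rough comparison of availability overhead per petabyte of data:
# HDFS-style 3x replication vs. a parity-protected POSIX file system.
data_pb = 1.0

hdfs_extra_pb = 2.0 * data_pb     # two extra replicas => 200% overhead
raid_extra_pb = 0.18 * data_pb    # ~18% parity (assumed midpoint of 15-20%)

print(f"HDFS extra storage: {hdfs_extra_pb:.2f} PB")   # 2.00 PB
print(f"RAID extra storage: {raid_extra_pb:.2f} PB")   # 0.18 PB
```

That order-of-magnitude gap in protection overhead is the “tax” the post is describing.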

Using HPC as a Swiss Army knife

Yes, your HPC environment (certainly your Cray HPC environment) can excel with both analytical and compute-intensive workloads. Furthermore, with some modifications to the Apache Hadoop stack and supporting infrastructure (things Cray has been working on), it can run Hadoop far more efficiently and even more cost-effectively than running those jobs on ad hoc distributed infrastructure. We’ve been making significant strides in this area, and we continue to improve. But beyond the benefits of one Cray cabinet being able to do the work of, say, five cabinets of distributed gear, there are some compelling reasons organizations might choose to have both scientific compute and Hadoop workloads on the same infrastructure.

By leveraging the same infrastructure for both types of workloads, organizations have the flexibility of adapting their infrastructure and workflows to suit the needs of their project at a specific point in time. They can better collaborate across disciplines to address grand challenges that aren’t so easily placed in one bucket or another. If done right, they can adaptively manage disparate workloads and workflows in parallel (aka resource management and job scheduling). And they can more easily share budgetary and staffing resources among departments, instead of fighting for them across fiefdoms.

Embracing Hadoop as a Swiss Army knife

It’s not lost on me that the analogy of a Swiss Army knife is probably abhorrent to many in the HPC community. Whereas a Swiss Army knife is a jack-of-all-trades and master of none, HPC is about applying the best possible tools for a specific job, often at the cost of greater expense or complexity. And I believe many HPC professionals look at technologies like Apache Hadoop (heck, at Hadoop specifically) with the disdain one would feel when asked to take a Ronco Pocket Fisherman to a bass tournament.

Yes, Hadoop has some perceived and even some real performance issues. It was initially conceived in the service provider world, where huge staffs maintain thousands upon thousands of cheap white boxes by spending their days applying the DevOps equivalent of duct tape and baling wire with Perl scripts and Ruby. It’s somewhat green, and still requires a great deal of fiddling to get right. And its out-of-the-box design seems to be a full 180 degrees off of anything HPC stands for. But I believe it’s going to have a prominent role in your datacenter in the not-too-distant future.

History repeats itself: almost 20 years ago I was “THE Linux competitive guy” for a UNIX company. Linux was somewhat rough around the edges, and organizations were having difficulty getting it to reliably scale to meet anything close to mission-critical expectations, but still . . . Linux felt inevitable to me. Hadoop has that same feel today. Looking at the Hadoop stack, it seems to do the work of about a dozen commercial software companies. Both public and private institutions are embracing it, and it’s getting a lot of traction with industry vendors. Fortunately, the framework is flexible enough to be adapted to suit the needs of a wide range of environments. Yes, Hadoop is coming. And much like the alien substance in the 1958 sci-fi film “The Blob,” it’s intent on ingesting your scientific data. And yes, Cray is dedicating resources to ensure it meets the needs of organizations looking to apply Hadoop to scientific Big Data.


  1. weitao cong says

    As I just read this article, I believe that Apache Hadoop is definitely an excellent solution for scientific big data. It would be greatly appreciated if you could share some information about the project. I really want to know whether the solution could be widely and practically used in HPC, including computational materials, CAE, seismic and weather, etc. Sincerely, send me email: [email protected] or [email protected].

  2.
    I think it’s a shame to pick on HDFS so much. That replicated data comes out of commodity JBOD storage, not enterprise-priced RAID-0, RAID-1, or RAID-5 storage, so the 3x replication argument only holds up if you forget about that in-storage bit duplication, whether it’s RAID-5 or erasure coding.

    That replication also means that each 256MB block of a large file is on 3 servers where the data is local (where the MR engine tries to schedule the work), and is spread across two racks, so the loss of a rack/ToR switch isn’t going to lose access to the data.

    If you don’t want that much replication and you do have some compelling need to keep it on a “legacy” POSIX FS, well, stick it on with 2x replication and you still get two nodes, two racks.

    Importing the data shouldn’t be that hard either: it’s what MapReduce is made for. Indeed, the distcp app is designed for massively parallel copying of data from one Hadoop cluster to another. It is what you use to move a few PB over the wire, whether the wire is a 10 GbE trunk to the next rack or something to a remote site. In fact, distcp is so effective at doing this (scheduling its work on the source nodes, obviously) that you need to throttle it back on the long-haul links unless you want to lose the entire off-site bandwidth of your datacentre.

    Finally: why keep the original data in a premium HPC FS? If it’s bulk data coming off an experiment, look at the cost/TB of storing it and consider how much extra storage capacity HDFS-managed disk storage gives you for the same money. Then start putting your cold data into HDFS…

  3.
    Interesting comments about the potential for storage being shared between HPC and Hadoop workloads. You talk about how HDFS, despite being “the” storage engine of Apache Hadoop, isn’t the only choice out there. I’d be curious to know your thoughts about the computation side of things. MapReduce, as you mention, is perhaps the best-known part of Hadoop, but it’s no longer the only execution framework being deployed in a “Hadoop” cluster. Although not technically part of Hadoop itself, products like Impala, Tez and Spark are all advancing the way relational queries, graph/stream processing, machine learning, etc. are done within the so-called “Hadoop ecosystem”. Do any of these frameworks strike you as being especially good candidates for colocation with traditional HPC workloads, like the seismic and weather data you mention?

    • Mike Boros says
      Outstanding question!

      At this phase, our interest is in MapReduce rather than execution alternatives. Now, having said that, there are other things in the works, things I’m not at liberty to publicly discuss at this time. But I can say that Cray is committed to improving the computational performance of MapReduce as well as I/O and storage. In other words, stay tuned… 😉
