You say HPC & Hadoop®, I say big fast data
While the “big” part of the big data moniker is what garnered all the headlines, organizations using big data systems like Hadoop in meaningful ways are coming to discover that size alone isn’t enough. Though I’ll use the term Hadoop in this discussion, I’m really referring to anything in the MapReduce ecosystem, as well as a host of alternative “big data” systems. There are a few reasons why organizations need to think about being fast before they get big.
Reason 1: Getting big is the easy part
Go big, or stay relational… or even flat, for that matter. Technologies like Hadoop aren’t interesting for their ease of use; they’re arguably prohibitively complex at this point in time. For IT veterans who’ve spent years as gatekeepers, weighing the cost of adding data to exorbitantly expensive proprietary databases, there’s a freedom in being able to throw terabytes (even petabytes) into a system without having to migrate the database, spend months in negotiations with vendors over how much growing that database or warehouse will cost, or wonder whether the repository is even theoretically capable of handling the volume of data. So it’s naturally tempting to stand up a Hadoop cluster. And it’s not really rocket science to begin populating that cluster with volumes of data that might have made other repositories grind to a screeching halt. But getting big is the easy part.
Reason 2: When you build it, they will come…
…and they’re bringing loads of junk with them. Once word is out that you’ve stood up an almost infinitely scalable system (that is the Hadoop hype), your constituents are going to want to throw data into it that you’ve never dreamed of. They’ll want to bring operational data that wasn’t even important enough to store anywhere before. They’ll bring new and exciting data sources that they never expected to be able to analyze before, eager to chuck them into this new system capable of dealing with “variety.” Heck, they’ll even want to incorporate massive amounts of public and/or third-party data that they’ll expect to be able to add into the mix. They’ll expect all of that data to be retained until their great grandkids can use it. And one more thing — eventually, they’ll want it to be current (aka real time). Still, while adding all of that data introduces significant challenges (e.g., ETL, security, etc.), it can be done.
Reason 3: Eventually, they’ll want their own set of keys…
…along with the freedom to use them whenever they want. Knowledge is power, and today’s knowledge comes from data. Try as you might, you will not succeed in maintaining the role of gatekeeper to such power indefinitely. The best you can hope for is to stall other groups for as long as it takes them to cut an easy-to-use set of keys, but eventually everyone will want the ability to mine the system themselves. Remember the days when running a query against a data warehouse required developer chops? Now, a marketing dweeb (I am one) can bang against a warehouse from Excel. While MapReduce analytics is still the domain of the “data scientists,” a new wave of “for dummies” tools is at least promising to level the playing field. Look at how fast the old BI vendors are lining up to adapt their user-friendly analytics tools to the Hadoop market. In the long run, resistance is futile. Everyone will eventually demand direct access to your shiny new Hadoop system. And if you thought that managing the workloads of a few relatively well-behaved data scientists was tough, just wait until you see the chaos that a marketing team or group of graduate students can throw at a system.
Reason 4: They’ll insist on making a few, and then many, “improvements”
Again, if you build it, they will come — and they’ll be bringing a lot of new and “exciting” use cases with them. So, prepare for the hit in complexity that some of the more advanced pieces of the MapReduce ecosystem promise: advanced Hive™ and HBase™ capabilities, lots of third-party software integration, GraphLab workloads, machine learning with Mahout, and more. All of this will add a great deal of complexity and demand performance above and beyond the simple reporting that your organization might have originally envisioned. Your Pig scripts will grow from Post-it note-sized snippets to novellas faster than you can say “predictive analytics.”
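To make the baseline concrete: the “simple reporting” an organization typically envisions at the start amounts to little more than a map step that emits key/value pairs and a reduce step that aggregates them. The toy sketch below (plain Python standing in for a real Pig, Hive, or MapReduce job; the log lines and function names are invented for illustration) shows how small that starting point is — everything in this section is about what piles on top of it.

```python
from collections import defaultdict

def map_phase(records):
    """Mapper: emit a (key, 1) pair for every word in every record."""
    for line in records:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    """Reducer: sum the emitted counts per key."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

# Hypothetical operational log lines standing in for HDFS input.
logs = ["error timeout", "error retry", "ok"]
report = reduce_phase(map_phase(logs))
# report == {"error": 2, "timeout": 1, "retry": 1, "ok": 1}
```

A real cluster distributes the map and reduce phases across many nodes and shuffles intermediate pairs between them, but the analytic logic of a basic report is this thin — which is exactly why graph processing, machine learning, and multi-stage Pig pipelines feel like such a step change in complexity.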
Reason 5: It will have to perform…
…because now that you’ve built it, they’ve started using it. So, now you’ll be faced with: exploding data volumes, increased workloads, dramatically increased user access, insane levels of algorithmic complexity… But wait, there’s more. Now that it’s become important to them, they’ll raise their service level expectations. When they run a job against it — a job that they’ll likely expect to be able to run 24/7/365 — they’ll expect results fast. Then, once you’ve gotten that nailed, they’ll likely want to run twice the jobs with results in half the time. OK, while much of this isn’t necessarily Hadoop-specific, there are some elements of that particular software stack that will make this especially painful.
Reason 6: Being big is the hard part
Your big data solution, whatever that winds up being, will become difficult after it’s big – and this is as much a political issue as it is a technical one. Prior to the promise of big data, a department typically “owned” a repository. They’d budget for it, stand it up, and manage it appropriately. If another department wanted something from that repository, they’d take a number and get it when it was convenient. Yes, data warehouses bent that mold a bit, but there was still some order to the madness along with considerable perceived costs associated with storing and managing data in the warehouse.
Now, technologies like Hadoop promise near limitless data volumes (again, that’s the hype), along with the perception of being extremely cheap storage. I’ve heard analysts say things like “Hadoop will be as or more strategic than their ERP systems to the Fortune 100” and I believe them. In an age where data is power, it isn’t usability or performance that makes Hadoop interesting, but the sheer scale that makes it an irresistible and unstoppable force. So, once an instance is stood up and has been populated with massive amounts of raw data, everyone will want a piece of it. They will all see it as strategic and valuable. If you’re unlucky enough to be the primary “owner” of a big data system, their lofty expectations for its performance and stability will likely be your problem.
HPC Hadoop is about being fast and big
In light of this reasoning, which I wholeheartedly subscribe to, I’m excited about the prospects for HPC with big data. Clearly, technologies like Hadoop are designed to get big by throwing lots of mediocre hardware at the problem. But once that system becomes big, it will have to be fast. For any big data system, being fast will strain storage, compute, and I/O. And the complexity introduced by using some of the more exotic widgets in the Hadoop/MapReduce ecosystem will stress compute, storage, and I/O in some exciting and unpredictable ways. There are other reasons why treating Hadoop, or any big data system, as an HPC solution makes sense. Management and administration, security, energy costs, maintenance, and more are all easier in a true HPC environment than with ad-hoc infrastructure cobbled together with duct tape and baling wire. While these ad-hoc solutions can be easy to grow large, the headaches really start once these systems are big.
Chicken or the egg — the big data dilemma
I became a chicken owner after buying my kids a couple of chicks one Easter afternoon. Soon, these grown-up birds were cutting through our fences and getting eaten by neighbors’ dogs. Eventually, I wound up mending fences and building a proper coop, but that was only after lots of headaches and a few losses that were painful to my kids. I learned a lot about planning for the inevitable — chickens roam and bird-dogs kill chickens — in the process. It strikes me that there are some similarities between my dabbling in poultry and an organization’s Hadoop endeavors.
Performance WILL become an issue as the value of an operational big data solution increases. Constituents WILL want to incorporate more data into it, and they WILL eventually want to interact with it directly. They WILL insist that some of the more exotic Hadoop/MapReduce capabilities be rolled out. And they WILL expect a high level of performance. The level of their expectations will increase in a manner that is roughly proportional to the volume of data that lives in the system.
So, the only real dilemma is whether to haphazardly deal with your unruly chicken after it grows up, or to plan for its inevitable behavioral challenges before it hatches. Either way, getting big is the easy part.