When big data practices come to your organization, it's all about location, location, location.
I've heard recently from a bunch of big-data-related vendors that are all vying to gain from your sure-to-grow big data footprint. After all, big data isn't about minimizing your data set, but making the best use of as much data as you can possibly manage. That's not a bad definition of big data if you are still looking for one. With all this growing data, you will need a growing data center infrastructure to match.
This big data craze really got started with Apache Hadoop and its Hadoop Distributed File System (HDFS), which unlocked the vision of massive data analysis based on cost-effective scale-out clusters of commodity servers using relatively cheap locally attached disks. Hadoop and its ecosystem of solutions let you keep and analyze all kinds of data in its natural raw low-level form (i.e., not fully database structured), no matter how much you pile up or how fast it grows.
The problem is that once you get beyond, err, data science projects, old familiar enterprise data management issues return to the forefront, including data security, protection, reliability, operational performance and creeping Opex costs.
While Hadoop and HDFS mature with each release, there are still a lot of gaps when it comes to meeting enterprise requirements. It turns out that those commodity scale-out clusters of direct-attached storage (DAS) might not actually offer the lowest total cost of ownership when big data lands in production operations.
It all boils down to where all that enterprise big data will live. We certainly don't want to make more copies of big data than we have to -- moving, copying, backing up, and replicating big data is, well, a big job. We do need to manage it as securely and carefully as, if not more so than, smaller disparate databases that don't hold as much detailed information. If we base critical business workflows on new big data processes, we'll need all of it to be operationally resilient and performant. And we certainly can't break the bank.
New housing options for big data
For efforts based on multiple petabytes, core Hadoop on physical DAS might still make the most sense, since the associated high level of corporate investment in expertise and operations is justifiable. But there can be big storage issues when native HDFS hits the data center. First, the default scheme is to replicate all that data in triplicate, after moving it from wherever it was generated. HDFS is generally optimized for big block I/O, which leaves out transactional or interactive solutions. Downstream use usually means copying data out again. And while there are native snapshots, they aren't fully consistent or point-in-time.
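To put rough numbers on that copy problem, here is a minimal back-of-the-envelope sketch. It assumes the source copy is retained after loading, that every downstream export is a full copy, and that compression, erasure coding and scratch space are ignored. The function names and the 100 TB figure are purely illustrative and not taken from any vendor tool; the default of three replicas corresponds to the HDFS `dfs.replication` setting.

```python
def hdfs_raw_capacity_tb(logical_tb: float, replication_factor: int = 3) -> float:
    """Raw disk consumed inside HDFS for a given amount of logical data.

    Assumes plain block replication (no erasure coding, no compression).
    """
    if logical_tb < 0:
        raise ValueError("logical_tb must be non-negative")
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return logical_tb * replication_factor


def total_footprint_tb(
    logical_tb: float,
    replication_factor: int = 3,
    source_copies: int = 1,
    downstream_copies: int = 0,
) -> float:
    """Total storage across the organization for one data set.

    Hypothetical model: copies kept where the data was generated,
    plus the replicated HDFS copy, plus full copies exported for
    downstream use.
    """
    if source_copies < 0 or downstream_copies < 0:
        raise ValueError("copy counts must be non-negative")
    in_hdfs = hdfs_raw_capacity_tb(logical_tb, replication_factor)
    outside_hdfs = logical_tb * (source_copies + downstream_copies)
    return in_hdfs + outside_hdfs


if __name__ == "__main__":
    logical = 100.0  # illustrative data set size in TB
    print("Raw HDFS capacity (TB):", hdfs_raw_capacity_tb(logical))
    print("Total footprint (TB):", total_footprint_tb(logical, downstream_copies=1))
```

Under these assumptions, a data set that is copied in from its source, replicated three ways and exported once ends up occupying five times its logical size, which is the kind of multiplier the external-storage approaches aim to reduce.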
For these and other reasons, enterprise storage array vendors have created clever HDFS adaptations, which allow Hadoop computing to leverage external storage. This, of course, seems mad to some open source big data pundits. But to many enterprise IT folks, it provides a great compromise: no high-maintenance storage or new storage paradigm to accommodate, but with some cost.
Vendors such as EMC with Isilon, which provides a remote HDFS interface to the Hadoop cluster, are doing a brisk business. Protection, security and other concerns are handled as they would be for any other data in Isilon. Another benefit is that data in external storage can often be accessed with other protocols (like Network File System, NFS), supporting workflows and limiting data movement and total copies needed within an organization. NetApp is also on this train, with a big data reference architecture that marries a combination of its storage solutions directly into the Hadoop cluster.
Another option worth mentioning is that of virtualizing big data analytics. Both the analytical compute and storage nodes can be hosted virtually. VMware and Red Hat/OpenStack have Hadoop virtualization solutions. Still, virtually hosted HDFS data nodes don't solve enterprise storage concerns. An intriguing startup called BlueData has a new option here. It virtualizes the compute side of Hadoop and then enables an enterprise to hook up existing data sets from whatever they have -- storage-area network or network-attached storage -- accelerating and translating it to HDFS under the covers. In this way, big data analytics can drop into a data center without any data movement, new storage infrastructure, new data flows or data management changes at all.
While most Hadoop distributions hew close to the Apache open source HDFS (which I see as software-defined storage for big data), MapR took a different approach. It essentially recognized that Hadoop needed an enterprise-featured storage service, and so built its own HDFS-compatible storage layer under Hadoop. The MapR version is fully capable of transactional I/O with snapshots and replication support, and natively supports other protocols like NFS. It's also pretty darned performant, which helps MapR deliver enterprise operations intelligence applications: operational decision-support solutions that depend on big data history and real-time information together. Thinking along similar lines, IBM has baked its high-performance-computing GPFS storage APIs into its Hadoop distro as an alternative to HDFS.
A few other interesting solutions can help tame data challenges. One is Dataguise, a data security startup that can practically and efficiently secure Hadoop big data sets with some unique IP that automagically recognizes and globally masks or encrypts sensitive material across a big data cluster. Waterline Data Science is a new entrant that promises to automatically inventory whatever is in those petabytes of HDFS, which is especially interesting if you go down the path of landing all your data files into a Hadoop data lake or hub. Going in another direction, Pneuron offers a product for data that is not necessarily big but "broad": it helps quickly build business applications that leverage data in place, even when that data is scattered across many sources and locations.
Obviously, with fewer copies, your data is centralized and available to support multiple kinds of analysis. And the better protected and reliable the whole operation is, the more value you'll be able to recognize. And that means that the obvious vanilla HDFS on commodity disk may not be the best big data solution for the enterprise data center.
If you've been holding Hadoop out of the data center for management or enterprise storage reasons, it might be a good time to take a look at some of these new options. You don't want to be left out of the big data game.
About the author:
Mike Matchett is a senior analyst and consultant at Taneja Group.