Small World Big Data
Published: 14 Apr 2015
One of the biggest themes in big data these days is data lakes.
Available data grows by the minute, and useful data comes in many different shapes and levels of structure. Big data (i.e., Hadoop) environments have proven good at batch processing of unstructured data at scale, and useful as an initial landing place to host all kinds of data in low-level or raw form in front of downstream data warehouse and business intelligence (BI) tools. On top of that, Hadoop environments are beginning to develop capabilities for analyzing structured data and for near real-time processing of streaming data.
The data lake concept captures all analytically useful data onto one single infrastructure. From there, we can apply a kind of "schema-on-read" approach using dynamic analytical applications, rather than pre-built static extract, transform and load (ETL) processes that feed only highly structured data warehouse views. With clever data lake strategies, we can combine SQL and NoSQL database approaches, and even meld online analytical processing (OLAP) and online transaction processing (OLTP) capabilities. Keeping data in a single, shared location means administrators can better provide and widely share not only the data, but also an optimized infrastructure with (at least theoretically) simpler management overhead.
The smartest of new big data applications might combine different kinds of analysis over different kinds of data to produce new decision-making information based on operational intelligence. The Hadoop ecosystem isn't content with just offering super-sized stores of unstructured data, but has evolved quickly to become an all-purpose data platform in the data center.
Filling data lakes
The data lake concept centers on landing all analyzable data sets of any kind in raw or only lightly processed form onto the easily expandable scale-out Hadoop infrastructure. Instead of forcing data into a static schema and loading an ETL'd set into a structured database, essentially filtering, aggregating, and in general losing detail, a Hadoop-first approach enhances analysis agility by enabling the analyst to create new views of the data, or new data "schemas," on demand.
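As a rough illustration of schema-on-read, the sketch below keeps raw records untouched and projects them onto a view only when an analyst asks for one. It uses plain Python rather than any Hadoop tool, and the sample events, field names and the `read_with_schema` helper are all hypothetical.

```python
import json

# Hypothetical raw events, kept in the lake exactly as they arrived.
RAW_EVENTS = [
    '{"ts": "2015-04-14T10:00:00", "user": "u1", "page": "/home", "ms": 120}',
    '{"ts": "2015-04-14T10:00:05", "user": "u2", "page": "/cart", "ms": 340}',
    '{"ts": "2015-04-14T10:01:00", "user": "u1", "page": "/cart"}',
]


def read_with_schema(lines, schema):
    """Project each raw record onto a schema chosen at read time.

    `schema` maps an output field name to (source key, converter, default).
    Keys missing from a record fall back to the default instead of failing.
    """
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, (key, convert, default) in schema.items():
            row[field] = convert(record[key]) if key in record else default
        yield row


# Two different views over the same raw data, defined on demand.
CLICKSTREAM_VIEW = {
    "user": ("user", str, None),
    "page": ("page", str, None),
}
LATENCY_VIEW = {
    "page": ("page", str, None),
    "latency_ms": ("ms", int, None),
}

clicks = list(read_with_schema(RAW_EVENTS, CLICKSTREAM_VIEW))
latency = list(read_with_schema(RAW_EVENTS, LATENCY_VIEW))
```

Neither view required reloading or reshaping the stored events, and the third event, which lacks a latency value, is still available to the clickstream view instead of being filtered out at load time.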
The data going into a lake might today consist of machine-generated logs and sensor data (e.g., Internet of Things), low-level customer behavior (e.g., Website clickstreams), social media, collections of documents (e.g., e-mail and customer files), and geo-location trails. There could also be images, video and audio, and even the more structured enterprise resource planning, customer relationship management and other OLTP data useful for integrated analysis.
For operational intelligence, we must bring to bear more traditional OLAP/BI/reporting, real-time fluid perspectives, large statistical analysis and many kinds of machine learning. For maximum value, we need to feed the derived intelligence to the people making decisions, wherever they are, interactively.
When actively used, the data lake becomes much more than a large collection of data. It can become the master golden record repository that all downstream applications and analysis refer back to and from which insights are derived.
To data lakes and beyond
Many big data vendors have eagerly proposed ideas similar to the data lake concept under different names (e.g., data pond, data ocean, data refinery). Hortonworks, for example, in a blog post encourages IT to "collect everything" so users can dive in anywhere with flexible access.
The more data you decide to keep and process, the more IT resources you'll need to build around that lake, whether it's more infrastructure, big data-savvy staff, licenses, or deep services and support. It's up to you to decide if one or more data lakes really make sense for your data center. A core shift is coming to enterprise data processing, and the Hadoop ecosystem (looking less like the original narrow MapReduce-based Hadoop every time we turn around) presents a compelling and broad response.
For example, Project Myriad, sponsored by MapR, is helping combine cluster management and scheduling for two kinds of work on one infrastructure: distributed big data workloads in the Hadoop/YARN style, and longer-running container-type (e.g., Docker) data center workloads such as Web, app, virtualization and database servers. Looking down the road, not only will data lakes aggregate much enterprise data into one place, but many enterprise workloads will also coalesce back onto the same scale-out infrastructure.
Making data lakes work requires more than just redirecting data flow to the Hadoop cluster. Experienced IT folks might object to creating a single point of failure for an entire organization. Certainly along with the data lake comes a host of concerns about putting all that data in one place and then relying on it as a golden master. Availability, data protection, and business continuity/disaster recovery (BC/DR) are only a few of the issues.
Perhaps the first problem with gathering so much data in one place is figuring out how to avoid getting stuck in the data swamp. You'll need to keep track of what information you have, where it came from, what versions might exist, how accurate it has proved to be, and how long it will be useful or relevant. It's helpful to know who or what other application might have used or found each data set useful, and to what purpose.
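The bookkeeping described above can be sketched as a small in-memory catalog. This is only an illustration under assumed names: `CatalogEntry`, `Catalog` and every field and data set below are hypothetical, and a real lake would keep this metadata in a persistent, shared service.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class CatalogEntry:
    """Metadata for one version of one data set in the lake."""
    name: str
    source: str            # where the data came from
    version: int
    landed_on: date
    retain_until: date     # how long it is expected to stay relevant
    accuracy_note: str = "unverified"
    used_by: list = field(default_factory=list)  # consumers and purposes


class Catalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[(entry.name, entry.version)] = entry

    def latest(self, name):
        versions = [e for (n, _), e in self._entries.items() if n == name]
        if not versions:
            raise KeyError(name)
        return max(versions, key=lambda e: e.version)

    def record_use(self, name, consumer):
        """Note which application used the newest version of a data set."""
        self.latest(name).used_by.append(consumer)

    def expired(self, today):
        """Entries past their retention date: candidates for disposal."""
        return [e for e in self._entries.values() if e.retain_until < today]


catalog = Catalog()
catalog.register(CatalogEntry("clickstream", "web-logs", 1,
                              date(2015, 1, 1), date(2015, 3, 1)))
catalog.register(CatalogEntry("clickstream", "web-logs", 2,
                              date(2015, 4, 1), date(2016, 4, 1)))
catalog.record_use("clickstream", "churn-model")
stale = catalog.expired(date(2015, 4, 14))
```

Even a record this small answers the swamp questions: what exists, where it came from, which version is current, who relies on it, and what is due for disposal.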
A second problem with big data aggregation is security. The original Hadoop had bare-bones security. Any data scientist with access to the cluster could access all its petabytes of data at will. A corporate data lake needs access, audit and authorization processes.
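A minimal sketch of what "access, audit and authorization" could mean in practice: every read is checked against a grant table and logged, whether or not it is allowed. The `GuardedLake` class, the user names and the data set names are hypothetical, and this does not represent Hadoop's own security mechanisms.

```python
from datetime import datetime, timezone


class GuardedLake:
    """Wraps data set access with an authorization check and an audit trail."""

    def __init__(self, grants):
        # grants maps a user name to the set of data sets that user may read.
        self._grants = grants
        self.audit_log = []

    def read(self, user, dataset):
        allowed = dataset in self._grants.get(user, set())
        # Record every attempt, including denied ones, before acting on it.
        self.audit_log.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "dataset": dataset,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{user} may not read {dataset}")
        return f"handle:{dataset}"  # stand-in for a real data handle


lake = GuardedLake({"analyst_a": {"clickstream"}})
handle = lake.read("analyst_a", "clickstream")

denied = False
try:
    lake.read("analyst_a", "crm")
except PermissionError:
    denied = True
```

Logging the denied attempt as well as the successful one is the point of the design: the audit trail shows who tried to reach which data, not just who succeeded.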
There is promise here. Working solutions are available today, such as one from Dataguise that can automatically mask personally identifiable information (PII) in both Hadoop and NoSQL databases. Still, with a big data lake, a devious analyst might be able to recreate sensitive data by marrying information from formerly widely disparate sources. There is still a need for thorough IT data governance and oversight, with training on the organization's approved uses of the assembled data.
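To make the idea of masking concrete, here is a deliberately simple pattern-based sketch. It is not how Dataguise or any other product works; the two patterns (e-mail addresses and U.S. Social Security numbers in the 3-2-4 digit format) and the `mask_pii` helper are assumptions chosen for illustration, and real PII detection needs far broader coverage.

```python
import re

# Illustrative patterns only; production PII detection covers many more
# formats and usually combines patterns with context and dictionaries.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_pii(text):
    """Replace recognizable PII in free text with placeholder tokens."""
    text = EMAIL.sub("<EMAIL>", text)
    text = SSN.sub("<SSN>", text)
    return text


masked = mask_pii("Contact jane.doe@example.com, SSN 123-45-6789.")
```

Masking individual fields does not address the re-identification risk noted above: two data sets that are each harmless can still identify a person once they are joined.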
Some technologies make these lakes data-aware: tagging and metadata reminiscent of object stores, native automatic search, and even "data aware" storage. One approach from BlueData leverages an organization's existing enterprise storage, making it appear "virtually" as a native Hadoop Distributed File System (HDFS). IT can maintain existing data protection and management best practices while supporting big data analytics.
IBM takes another approach, touting its data refinery variation on the data lake as a way to bring big data out to the enterprise, baking in some automated transformations. The company claims this approach directly provides the typical business analyst with a governed, on-demand big data set.
As a good example of broad integration, HP Haven can first conduct key transformations leveraging IDOL modules (formerly Autonomy).
A data lake should provide a number of fundamental capabilities:
1. Host a centralized index of the inventory of data (and metadata) that is available, including sources, versioning, veracity and accuracy.
2. Securely authorize, audit and grant access to subsets of data.
3. Enable IT governance of what is in the data lake and assist in enforcing policies for retention and disposition (and, importantly, tracking PII and PII precursors).
4. Ensure data protection at scale, for operational availability and BC/DR requirements.
5. Provide agile analytics into and from the data lake using multiple analytical approaches (i.e., not just Hadoop) and data workflows.
Big data bang for the buck
The main problem with big data projects is too much data and not quite enough information. Big data analytics should help inform critical decision-making. Building a data lake will certainly increase the amount of data available, and enable a new kind of analytical agility over the old traditional data warehouse. But if approached correctly it should also enable multiple new kinds of integrated analysis in a shared data-mining environment. Knowledge of the data itself should grow as the lake is fished, increasing its value to the organization.
Businesses might get to answer new questions that no one previously thought to ask, either because of the complexity or because the capability simply wasn't there before.
MIKE MATCHETT is a senior analyst and consultant at Taneja Group.