Jun 30, 2015

The Data Lake as an Exploration Platform
The data lake is an attractive use case for enterprises seeking to capitalize on Hadoop’s big data processing capabilities. This is because it offers a platform for solving a major problem affecting most organizations: how to collect, store, and assimilate a range of data that exists in multiple, varying, and often incompatible formats strung out across the organization in different sources and file systems.

In the data lake scenario, Hadoop serves as a repository for managing multiple kinds of data: structured, unstructured, and semistructured. But what do you do with all this data once you get it into Hadoop? After all, unless it is used to gain some sort of business value, the data lake will end up becoming just another “data swamp” (sorry, couldn’t resist the metaphor). For this reason, some organizations are using the data lake as the foundation for their enterprise data exploration platform.

Think of the data lake as an enterprise-wide repository where all types of data can be stored in Hadoop, before any formal definition of requirements or schema, for the purposes of operational and exploratory analytics. This is typically not possible with today’s relational data warehousing and analytics infrastructures, for three reasons: traditional (relational) databases require a predefined schema, unstructured data is difficult to integrate, and storing very large data sets in such environments is costly.
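To make the "store first, define the schema later" idea concrete, here is a minimal sketch in plain Python rather than Hadoop itself. The record formats, the field names (`user`, `amount`, `channel`), and the `read_purchases` helper are all hypothetical, chosen only to illustrate the principle: raw data is kept exactly as it arrived, and a schema is applied only when an analyst reads it.

```python
import csv
import io
import json

# Hypothetical raw records, stored exactly as they arrived.
# Nothing was validated or reshaped when they were written.
RAW_STORE = [
    ("json", '{"user": "alice", "amount": "19.99", "channel": "web"}'),
    ("csv", "bob,5.00"),
    ("json", '{"user": "carol", "amount": 42}'),
]


def read_purchases(raw_store):
    """Apply a (user, amount) schema at read time.

    Records that do not fit the schema are skipped, not deleted:
    they remain in the raw store for other analyses.
    """
    rows = []
    for fmt, payload in raw_store:
        if fmt == "json":
            record = json.loads(payload)
            user, amount = record.get("user"), record.get("amount")
        elif fmt == "csv":
            # Assumed column order for this illustration: user, amount.
            fields = next(csv.reader(io.StringIO(payload)))
            user, amount = fields[0], fields[1]
        else:
            continue  # unknown format: leave it untouched in the store
        if user is None or amount is None:
            continue
        rows.append((user, float(amount)))
    return rows


purchases = read_purchases(RAW_STORE)
total = sum(amount for _, amount in purchases)
```

A different analysis could read the same raw store with a different schema, for example one that keeps `channel`, without reloading or restructuring any data. That flexibility is what a relational warehouse with a predefined schema gives up.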

With the data lake, unstructured and structured data are loaded into Hadoop in their raw native format. In contrast to your typical enterprise (SQL-based) data warehouse, the Hadoop-based data lake is for the storage and analysis of huge amounts of “new” big data types that typically do not fit well in the relational data warehouse alongside more traditional enterprise data sources. In short, the data lake is designed to store very large files while providing high-throughput access for big data applications, such as those involving high-resolution video; scientific analyses; medical imaging; large backup data; social media sentiment analysis; event streams; Web logs; and mobile/location, RFID scanner, and sensor data.

All this data offers insights into user behavior, purchasing patterns, machine interactions, process efficiencies, consumer preferences, market trends, and more. The purpose of the data lake exploration platform is to let analysts use Hadoop as a giant “big data analytics sandbox,” where they can conduct all sorts of iterative, investigative analyses to brainstorm new ideas and devise possible new analytic applications. Depending on the company and the business or industry, such applications can range from dynamic pricing, e-commerce personalization, and automated network security systems to real-time facial analytics meant to identify suspects in crowds.




Brian Dooley

Brian J. Dooley is a Senior Consultant with Cutter Consortium's Data Analytics & Digital Technologies practice. He is an author, analyst, and journalist with more than 30 years' experience in analyzing and writing about IT trends.

