Today’s discussions of Big Data analytics almost invariably center on Hadoop and the set of complementary projects around it that aid in the storage, management, and processing of very large data sets. These projects include Pig, Hive, Cassandra, HBase, Avro, Chukwa, Mahout, and ZooKeeper. Hadoop projects frequently use Hive and Pig, while the NoSQL databases HBase and Cassandra provide database platforms for many Hadoop projects.
While Hadoop has gained recent attention, it’s neither the only solution nor the first. Problems involving Big Data have been around for a long time, particularly in scientific computing, and many solutions have been found for specific problem types within areas such as high-performance computing (HPC) and grid computing.
As the Hadoop/MapReduce ecosystem evolves, the RDBMS-based data warehousing environment has also been developing to meet the challenges of Big Data analytics. This development has included various methods of incorporating Hadoop and MapReduce, as well as the recent emergence of specialized analytic RDBMSs designed to handle Big Data problems. From the HPC space, Message Passing Interface (MPI) and Bulk Synchronous Parallel (BSP) provide parallel programming capabilities for complex algorithms on large data sets and have been in use for many years. Newer capabilities modeled on Google’s Dremel, Pregel, and Percolator are also being deployed, with open source implementations becoming available to companies with an urgent requirement for Big Data analysis.
Big Data analytics provides new capabilities but is only of value if it can be combined with other analytic and BI solutions. Since previous BI and analytics solutions have relied upon SQL and RDBMSs, integration with SQL is important. This integration has ranged from the creation of SQL-like or SQL-extended query languages to the use of Hadoop as a “super ETL” utility that extracts data for insertion into data warehouses. Recent mergers and acquisitions have highlighted this strategy, most notably Teradata’s acquisition of Aster Data and EMC’s acquisition of rival Greenplum.
In addition, vendors have adopted numerous strategies for accommodating SQL and MapReduce at the same time, including embedding SQL in MapReduce applications (Greenplum), adding traditional BI capabilities on top of Hadoop (IBM, Pentaho, Jaspersoft), providing a Hadoop connector for an RDBMS (Aster Data), and layering SQL on top of Hadoop (Hive). Including MapReduce in analytic RDBMS platforms potentially offers some of the best of both worlds.
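To make the relationship between SQL and MapReduce concrete, the following is a minimal sketch of how a simple SQL aggregation can be expressed as map, shuffle, and reduce steps. It is an illustration only and is not how Hive or any vendor product actually compiles queries; the table name `orders`, its columns, and all function names are hypothetical, and everything runs in a single process rather than on a cluster.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a table named "orders".
ORDERS = [
    {"region": "east", "amount": 10},
    {"region": "west", "amount": 5},
    {"region": "east", "amount": 7},
]


def map_phase(rows):
    """Emit one (key, value) pair per row, as a mapper would."""
    for row in rows:
        yield row["region"], row["amount"]


def shuffle_phase(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values for each key, as a reducer implementing SUM would."""
    return {key: sum(values) for key, values in groups.items()}


def total_by_region(rows):
    # Equivalent in spirit to:
    #   SELECT region, SUM(amount) FROM orders GROUP BY region
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(total_by_region(ORDERS))
```

In a real deployment the map and reduce steps would run in parallel across many machines, and the framework would handle the shuffle; a SQL layer spares the analyst from writing these functions by hand.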
Another important approach to Big Data is the creation of purpose-built processing appliances. We see IBM taking this path through its acquisition of appliance vendor Netezza and its development of Watson, which recently appeared as a “contestant” on the TV game show Jeopardy!. These Big Data analytics appliances are basically “HPC in a cloud” devices that specialize in performing a range of analytic tasks, with processing efficiency built in at the hardware level. EMC has been active in this area through its acquisition of Greenplum, and HP is also active here, along with a number of smaller specialty firms.
Data volumes have soared in recent years, growing to include vast amounts of unstructured data that can provide useful insights into the difficult problems that confront businesses in the 21st century. As data volumes have grown, however, conventional methods of storage and analysis are increasingly being challenged. The limits of conventional data warehousing methods based on RDBMSs have been exceeded in some cases, and new methods need to be found to extract meaningful information from enormous and often unstructured data stores.
Big Data analytics confronts issues that affect companies of all sizes, not just enterprises with large data processing departments. Cost pressures have driven the adoption of open source and cloud computing solutions, often in combination. Hadoop has emerged as the most important processing solution, but many others are available depending on the problem at hand. For data storage, retrieval, and accommodation of analytics for massive-scale data sets, NoSQL database systems have risen in importance. These systems, too, have been available for some time, but they have become especially significant in the emerging Big Data analytics field as cloud-based processing brings these solutions down to a level where they can be used by smaller businesses, by departments, and for occasional queries.
While new solutions have emerged, it’s important to note that traditional RDBMS and BI vendors have moved to integrate Big Data offerings with their own products. These integrated offerings make it possible to address many Big Data problems without the difficulties of integration and customization.
Cloud computing and SaaS are also important elements of the growing movement toward Big Data analytics. There are now several vendors operating specifically in this area, providing solutions to special problems and offering lower-cost alternatives to in-house processing.