As the Internet of Things (IoT) becomes a reality, the connected devices, machines, and processes of the consumer, business, and industrial worlds are expected to generate a massive volume of data. In short, the more devices and machines that get connected, the more data will be generated.
Deriving business value from this massive data reservoir will require big data storage and analysis technologies that can scale to meet the constantly increasing demands placed on organizations. These technologies include:
- NoSQL file systems
- NoSQL databases
- High-performance relational analytic and in-memory database appliances
- Hybrid relational databases with embedded MapReduce
- Streaming analytics systems
All of these technologies provide varying capabilities for managing and analyzing sensor and other data associated with IoT applications and services. That said, a key point to keep in mind is that none of them on its own currently offers an all-encompassing solution that can meet every IoT application requirement. Consequently, I recommend you consider these technologies as complementary.
NoSQL data stores like Hadoop and NoSQL databases (Cassandra, MongoDB, etc.) are frequently used to support the operational side of big data processing (e.g., capturing, storing, initial processing, and sorting). NoSQL technologies are proving popular for managing and storing extremely large volumes of sensor and other machine-generated data streaming from connected devices and machines.
High-performance analytic databases (e.g., IBM Netezza, HP Vertica, SAP HANA, Microsoft SQL Server PDW, and Teradata) typically handle in-depth analytic processing. They are particularly suited to applications that must combine the analysis of machine data with the analysis of other forms of enterprise data (e.g., billing, customer profiles, mobile usage) to support comprehensive analytic requirements.
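To make the idea of combining machine data with enterprise data concrete, here is a minimal sketch that joins meter readings to customer profiles and totals usage per billing plan. It uses Python's built-in `sqlite3` module purely as a stand-in for an analytic database; the table names, column names, and sample values are hypothetical and do not reflect any vendor's schema.

```python
import sqlite3


def usage_by_plan(readings, customers):
    """Join sensor readings to customer profiles and total usage per plan."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    # Hypothetical schema: one table of machine data, one of enterprise data.
    conn.execute("CREATE TABLE readings (device_id TEXT, kwh REAL)")
    conn.execute("CREATE TABLE customers (device_id TEXT, plan TEXT)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", readings)
    conn.executemany("INSERT INTO customers VALUES (?, ?)", customers)
    rows = conn.execute(
        """
        SELECT c.plan, SUM(r.kwh) AS total_kwh
        FROM readings AS r
        JOIN customers AS c ON c.device_id = r.device_id
        GROUP BY c.plan
        ORDER BY c.plan
        """
    ).fetchall()
    conn.close()
    return rows


# Illustrative sample data (made up for this sketch).
readings = [("m1", 1.5), ("m2", 2.0), ("m1", 0.5), ("m3", 4.0)]
customers = [("m1", "basic"), ("m2", "premium"), ("m3", "premium")]
result = usage_by_plan(readings, customers)
print(result)  # [('basic', 2.0), ('premium', 6.0)]
```

The same join-and-aggregate pattern is what an analytic appliance would run at far larger scale; only the engine changes, not the shape of the query.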
Hybrid relational databases with embedded MapReduce functionality (Teradata, Greenplum, etc.), which are still a fairly new development for most organizations, are also attractive for comprehensive analytic applications that require combining machine data analysis with other forms of enterprise analysis. Their appeal is that they provide a unified platform that combines the familiarity of SQL with the power of the MapReduce programming model. This hybrid architecture also supports the processing of different types of data regardless of where the data resides.
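For readers less familiar with the MapReduce programming model, the following is a simplified, single-process sketch of its three phases (map, shuffle, reduce) applied to sensor readings. It is an illustration of the model only, not of how any hybrid database implements it; the function names and sample readings are assumptions made for this example. The comment at the end shows the SQL query that expresses the same computation, which is the kind of equivalence hybrid platforms build on.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical job: find the maximum temperature reported by each sensor.
def max_temp_mapper(record):
    sensor_id, temperature = record
    yield sensor_id, temperature


def max_temp_reducer(sensor_id, temperatures):
    return max(temperatures)


records = [("s1", 20.5), ("s2", 18.0), ("s1", 23.1), ("s2", 17.2)]
result = reduce_phase(shuffle(map_phase(records, max_temp_mapper)), max_temp_reducer)
print(result)  # {'s1': 23.1, 's2': 18.0}

# The equivalent SQL, assuming a table readings(sensor_id, temperature):
#   SELECT sensor_id, MAX(temperature) FROM readings GROUP BY sensor_id;
```

In a real cluster the map and reduce functions run in parallel across many nodes, and the shuffle moves data over the network; the logic per key stays the same.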
Streaming analytics systems (e.g., IBM InfoSphere Streams, Amazon Kinesis, and Google Cloud Dataflow, the last of which is still under development) are suited to applications in which real-time analysis of high-volume data feeds is essential, such as medical life-support monitoring, telecom, intelligence gathering, and transportation.
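The core idea behind streaming analytics, evaluating each reading as it arrives rather than after it has been stored, can be sketched with a sliding-window average. The class below is a hypothetical, single-machine illustration and does not use the API of any of the products named above; the window size, threshold, and sample values are assumptions chosen for the example.

```python
from collections import deque


class SlidingWindowMonitor:
    """Tracks the average of the most recent readings and flags threshold breaches."""

    def __init__(self, window_size, threshold):
        # deque with maxlen discards the oldest reading automatically.
        self.window = deque(maxlen=window_size)
        self.threshold = threshold

    def observe(self, value):
        """Record one reading; return (window average, alert flag)."""
        self.window.append(value)
        average = sum(self.window) / len(self.window)
        return average, average > self.threshold


# Hypothetical feed, e.g., heart-rate readings from a patient monitor.
monitor = SlidingWindowMonitor(window_size=3, threshold=100.0)
alerts = [monitor.observe(reading)[1] for reading in [90, 95, 100, 110, 120]]
print(alerts)  # [False, False, False, True, True]
```

Because only the last few readings are kept in memory, the monitor can run indefinitely over an unbounded feed, which is the property that distinguishes streaming systems from store-then-analyze approaches.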
These pairings of technology and use case are not exact; however, they do give some idea of the types of applications for which the various technologies are being used. (Note: I examine these technologies in relation to their use for IoT applications in my report “A Brave New Connected World: The Internet of Things and the Rise of Small Sensors and Big Data Analysis.”)
Finally, some organizations will choose to build their own systems for managing and analyzing the vast amounts of data generated by connected applications. Others will opt for solutions from commercial providers (e.g., Splunk, Amazon, Microsoft). Regardless of which option they choose, organizations should seek to deploy applications in a cloud environment in order to take advantage of the flexibility, scalability, and performance offered by cloud-based architectures and services. These include capabilities for publishing APIs and Web services, which facilitate the exchange and integration of machine data with enterprise data and add analytic capabilities to enterprise applications.