It still amazes me how many enterprise data warehousing/business intelligence (DW/BI) projects struggle, often to the point of paralysis, with the “Inmon/Kimball” debate. This impasse revolves around whether a DW/BI program should insist upon routing all information through a complex, third normal form (3NF) data layer or take it straight to a user-intelligible star schema repository from which it can be reported more or less directly. It’s easy to fault the 3NF layer not only for more than doubling the complexity, expense, and data latency of a DW/BI project, but also for being of zero direct value to the project sponsors and their stakeholders. On the other hand, projects that deliver data immediately to star schemas can quickly become complex themselves as the scope of the warehouse grows. When the conformed stars scale out, they too end up necessitating enormous reengineering efforts whenever the underlying business requirements change.

Truth is, neither of these architectural paradigms can be agile, because both result in inflexible juggernauts that defy economical impact analysis when new features are needed. Both approaches grow into installations so inscrutable and fragile that it becomes more economical to solve new requirements with new applications than to update the existing assets. Both the Inmon and Kimball approaches leave companies struggling with “legacy” warehouses.

One of the most promising solutions to this juggernaut problem revolves around data architectures that go beyond our traditional 3NF. Warehousing teams have achieved more robust designs when they push their target schemas past fourth and fifth normal forms into a variety of “hypernormal forms” (HNF). These forms include “data vault” modeling, associative strategies, and even C.J. Date’s new sixth normal form, in which every attribute of a source schema is effectively shredded into something akin to key-value pairs when stored in the target system. My recent year of fieldwork revealed that, indeed, hypernormalized approaches yield warehouse designs that are far more robust in the face of changing requirements.
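To make that shredding idea concrete, here is a minimal Python sketch. It is purely illustrative: the record layout, the entity name, and the shred_record helper are my own assumptions, not part of any particular HNF toolset. It decomposes one wide source row into narrow key/attribute/value rows, so that a new source column shows up as additional rows rather than as a schema change:

```python
from datetime import datetime, timezone

def shred_record(entity: str, key_column: str, record: dict) -> list[tuple]:
    """Decompose one wide source record into narrow key/attribute/value rows,
    in the spirit of a 6NF-style hypernormalized target (illustrative only)."""
    load_ts = datetime.now(timezone.utc)
    business_key = record[key_column]
    return [
        (entity, business_key, attribute, str(value), load_ts)
        for attribute, value in record.items()
        if attribute != key_column  # the business key travels in its own column
    ]

# A wide row as it might arrive from a source system (hypothetical columns).
source_row = {
    "customer_id": "C-1001",
    "name": "Acme Ltd",
    "segment": "Wholesale",
    "credit_limit": 50000,
}

for row in shred_record("customer", "customer_id", source_row):
    print(row)

# If the source later adds a "region" column, the target needs no ALTER TABLE:
# the new attribute simply arrives as more (entity, key, attribute, value, ts) rows.
```

Production approaches layer considerably more on top of this (history handling, grouping of attributes, and so on), but the basic move is the same: absorb change by adding rows and tables rather than by altering structures that already work.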

Several tool suites and practice communities have demonstrated the commercial viability of HNF for large projects with tight budgets and even tighter deadlines. What I particularly appreciate is how these new solutions truly enable agile data warehousing by supporting and accelerating incremental warehouse delivery patterns.

In terms of engineering advantages, HNF yields target schemas that are what I call “three-way robust”:

  1. Teams can add small increments of functionality without undertaking massive reengineering.
  2. Modifications to existing database tables involve only local impacts.
  3. As the model grows, the cost of scaling the warehouse increases less than linearly.

My fieldwork’s most remarkable discovery, however, was how the best of these hypernormalized tools truly enable the fourth tenet of the agile data warehousing manifesto: “evolving data models over incrementing application code.” There are tools that can take business models and automatically generate 90% of the integration, presentation, and semantic layers for a team. By leveraging such technology, warehouse developers can focus on the remaining 10%, where the business rules are complex and hand coding adds the greatest value. Such tools make the warehouse project, and indeed the entire organization, agile because IT and the business can now collaboratively model a small sliver of the warehouse, generate a user-tangible result, and evaluate it together. With the distraction and time lost to coding reduced to a minimum, IT can rapidly address additional requirements step by step, evolving the information users can touch, keeping the business constantly involved in the partnership, and rapidly zeroing in on the exact operational insights the current business challenge demands.
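As a rough illustration of this model-driven style, and emphatically not the output of any actual vendor product, the sketch below assumes a hypothetical business_model format, hub/sat naming, and a one-attribute-per-table layout of my own choosing. A small generator turns the declarative model into the repetitive portion of the target DDL, leaving only the genuinely complex business rules to be hand coded:

```python
# A tiny "business model" declared as data (the format is hypothetical).
business_model = {
    "customer": {"key": "customer_id", "attributes": ["name", "segment", "credit_limit"]},
    "product":  {"key": "product_id",  "attributes": ["description", "list_price"]},
}

def generate_ddl(model: dict) -> list[str]:
    """Emit the repetitive hub/attribute-table DDL implied by the model."""
    statements = []
    for entity, spec in model.items():
        key = spec["key"]
        # One narrow "hub" table per entity, holding only the business key.
        statements.append(
            f"CREATE TABLE hub_{entity} (\n"
            f"  {key} VARCHAR(64) PRIMARY KEY,\n"
            f"  load_ts TIMESTAMP NOT NULL\n);"
        )
        # One narrow table per attribute, so a new attribute means a new table
        # (a purely local change), never an ALTER of something that already works.
        for attr in spec["attributes"]:
            statements.append(
                f"CREATE TABLE sat_{entity}_{attr} (\n"
                f"  {key} VARCHAR(64) NOT NULL REFERENCES hub_{entity},\n"
                f"  {attr} VARCHAR(255),\n"
                f"  load_ts TIMESTAMP NOT NULL,\n"
                f"  PRIMARY KEY ({key}, load_ts)\n);"
            )
    return statements

for statement in generate_ddl(business_model):
    print(statement, end="\n\n")
```

In a sketch like this, adding a new attribute to the business model yields one new, narrow table; nothing that already exists needs to change, which is exactly the kind of local, less-than-linear impact described in the three points above.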

So, the days of endless Inmon/Kimball debates are over. New tools and practices have moved the pivotal question far beyond “star schema versus third normal form.” DW/BI project kickoffs should now be immersing themselves in the question: “Which hypernormalized modeling technique and supporting toolset will work best for our organization?” It is the astounding improvements in programmer productivity and customer satisfaction enabled by this new generation of solutions that demand this change in mindset.

Ralph Hughes

Ralph Hughes is a Senior Consultant with Cutter Consortium's Data Insight & Social BI practice. He also serves as Chief Systems Architect for Ceregenics, a DW/BI firm. He is also a certified ScrumMaster, a PMI project management professional, and a faculty member at the Data Warehouse Institute.

Discussion

  5 Responses to “Getting Data Integration Out of the Mud with Hypernormalized Data Designs”

  1. I don’t have a lot of experience with Data Warehousing but I have enough in meta-modeling and knowledge representation to wonder if I’m being offered a free lunch here. :) I think perhaps the issue comes down to that of the value of layering levels of abstraction on top of each other. Yes, you’re building a better and better lever, and making it more and more general, but with each step you forget more and more about what you’re levering and have to take more and more complex steps to recover it. That’s what the description “all attributes of a source schema should be shredded into something akin to key-value pairs when stored in a target system” brings to mind for me.

    My recent year of fieldwork revealed that, indeed, hypernormalized approaches yield warehouse designs that are far more robust in the face of changing requirements.

    It would be surprising if that weren’t the case, right? But perhaps I’m thinking about this in the wrong way. Could one say that order-N logic encoded as RDF triples would map conceptually to an extreme example of hyper-normalization? In my experience, such triples allow the encoding of almost any information known, but in the real world are almost impossible to infer anything useful from.

    I would love to see a real world example of the benefits of this in practice, but such real-world examples usually require one to be involved in the real-world effort in order to make any sense of them. And perhaps that in the end is a good point about the deep and inherent limitations in any attempt to systematically generalize knowledge.

    That’s why I think your point about “evolving data models over incrementing application code” is so key. Really, the thing that should be driving the construction of these systems is what you want to find out, not what you have to put in. The cynic in me notices that that is not an argument that data warehouse vendors are likely to endorse. Because after all, doesn’t that call even further into question the value of the whole notion of data-warehousing per se?

  2. Ralph,

    Good article; however, you are still working within the SEDW architecture, which, after the discussions about the death of the EDW (Gartner’s BI Summit), may only apply in certain areas. Additionally, I have found that the biggest areas of concern on a DW are usually the sourcing (especially of master data) and the data dictionary.

  3. One advantage of normalization is that it can identify many problems with the structure (and sometimes the content) of data resources. A star schema can also find some problems in the sources, but only within a deliberately narrow scope of “the facts”.

    My relatively limited DW experience says that you are at the mercy of the integrity of the source data systems.

  4. [...] Hypernormal forms to replace Inmon/Kimball data warehouses?  The normalization nut side of me likes this, but I’ll believe it when I see it. [...]

  5. “Truth is, neither of these architectural paradigms can be agile because they both result in inflexible juggernauts that defy economical impact analysis when new features are needed.”

    Nah, not true. We perform Agile Analytics on very large data warehouses with star schemas.

    We have found that denormalized star schemas do not prevent you from delivering features iteratively in sprints.

    That being said, it is still possible to over-engineer a DW data model to the point where it becomes brittle, that is, easy to break.

    Of course, you can avoid that by using a schemaless key/value store NoSQL database. But I suppose that’s another argument for another thread … Br, Mark
