
Here’s the revision of Chapter Two for Mixed Model Arts, where I discuss various definitions of data modeling and bring it into the present day. We are no longer modeling just for humans, but modeling for humans AND machines.
Chapter Three (revised) will come out either tomorrow or Wednesday. From there, expect many drops, as I’m in the middle of editing and finalizing all of the chapters in the book. A few chapters from later in the book haven’t been released yet, so expect to see those very soon as well.
Thanks,
Joe
Walk into any combat sports gym for the first time, and the coach won’t let you spar. You’ll learn the fundamentals of the craft first. My first time learning Brazilian Jiu-Jitsu involved absorbing its history and drilling the basics of ground fighting. If you’re at a boxing gym, you need to learn what striking is, throw some basic punches, and drill proper footwork and head movement. In any combat sports discipline, you need the mental model and vocabulary before you can learn effectively.
Data modeling is no different. When you skip learning the fundamentals and jump straight into building, you’ll get a beatdown. Here is what that looks like.
A Common Data Disaster
I’ve seen quite a few data disasters. Here’s one that’s embellished for anonymity and extra spice.
I’d just been hired as a consultant for a mid-sized e-commerce company that was hemorrhaging money. Their CEO pulled me aside within the first hour: “Joe, our inventory system says we have 50,000 units in stock. Our warehouse says 12,000. Our website shows customers that they can buy things that don’t exist. We had to refund $400K last month alone.”
I spent the next three days spelunking through their systems. What I found was a horror show. They had an “orders” table with 500 columns. Customer data lived in six different databases, none of which agreed on what a “customer” was. The product catalog was a single enormous spreadsheet that someone manually uploaded to the database every Friday afternoon. Date fields were stored as strings. Some prices included tax, some didn’t, and nobody could tell you which was which.
Their data modeling problem was actually a lack of data modeling. Nobody had ever sat down and asked the most basic questions: What are we tracking? How do these things relate to each other? What does each piece of information actually mean?
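To make those questions concrete, here is a minimal Python sketch of what answering them might look like for one order line. Every name in it (Price, OrderLine, tax_included, parse_order_date) is a hypothetical illustration, not the company’s actual schema. It only shows the idea of giving dates, prices, and customer references an explicit, agreed-upon meaning.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class Price:
    amount: Decimal        # exact decimal value, not a float or a string
    currency: str          # e.g. "USD"
    tax_included: bool     # states explicitly whether tax is in the amount


@dataclass(frozen=True)
class OrderLine:
    order_id: int
    customer_id: int       # one agreed-upon customer identifier (assumed)
    product_sku: str
    quantity: int
    unit_price: Price
    ordered_on: date       # a real date type, not a string


def parse_order_date(raw: str) -> date:
    """Convert a YYYY-MM-DD string to a date; raises ValueError on other input."""
    return date.fromisoformat(raw)


line = OrderLine(
    order_id=1,
    customer_id=42,
    product_sku="SKU-001",
    quantity=2,
    unit_price=Price(Decimal("19.99"), "USD", tax_included=False),
    ordered_on=parse_order_date("2024-03-15"),
)

# Line total before tax, computed with exact decimal arithmetic.
line_total = line.unit_price.amount * line.quantity
```

The types themselves matter less than the fact that someone had to decide what each field means before writing them down.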
That company spent eighteen months and a significant amount of money unwinding the mess. Most of the work wasn’t technical. It was the messy work of getting people to agree on definitions, drawing out the relationships between business concepts, and building a shared map of how the business actually worked. In other words, it was data modeling. This is a real-world version of the e-commerce platform we’ll use as our running example—and it shows exactly what happens when no one thinks across the layers we discussed in Chapter 1.
This chapter is about what data modeling is—what it means, what it includes, and what it decidedly is not. Once we get precise about the definition, we’ll be ready to confront the harder question: why does data modeling matter, and what happens when you skip it? That’s Chapter 3.
What Is Data Modeling?
If you ask a group of people this question, you’ll get as many answers as there are people in the room. Let’s start by defining data and models, and then clarify what data modeling is and is not.
It Starts With Data
“We do not, it seems, have a very clear and commonly agreed upon set of notions about data—either what they are, how they should be fed and cared for, or their relation to the design of programming languages and operating systems.” — George Mealy, Another Look at Data (1967)
Mealy wrote that in 1967. It’s still true. We’ve built a multi-trillion-dollar industry on top of something we still can’t define. I find it both amusing and a little depressing that six decades later, we still can’t agree on what “data” means. When I ask professionals with “data” in their title how they define it, they often become flustered—like asking a fish to describe water. We’re so close to data that we can barely see it. And that blindness percolates to the broader organization. I see “data-driven” companies struggle to “get value” from their data while ignoring the feeding and care of their data. Data sits around, misunderstood and underutilized, like a gym membership no one uses.
Since this book is about data modeling, we need a working definition. Rather than getting bogged down in academic jargon, let’s look at it simply: data is a collection of values that convey information—quantities, qualities, facts—represented in a structured way that can be communicated, interpreted, and processed. Is this definition perfect? Probably not, and Mealy’s frustration continues. No matter. It serves as a “good enough” working definition for this book.
Then There’s a Model
“We are not modeling reality, but the way information about reality is processed, by people.” — William Kent, Data and Reality (1978)
This is a crucial distinction. You’re not building a perfect replica of the world. You’re building a useful representation of how your business thinks about the world.
The map is not the territory—but a good map can save your life.
A model abstracts a slice of reality, giving it form and structure that can be processed and consumed. Since the real world is constantly evolving, models must be dynamic too. As reality changes, models evolve.
What a Data Model Is
In my years circling the globe and chatting with countless data professionals at conferences and private events, I have never heard a consistent definition of what exactly a data model is. The responses I receive vary wildly by location, industry, technologies used, and experience level. Before I introduce my own definition, let’s examine how some well-known experts in the field describe it.
“A data model is a wayfinding tool for both business and IT professionals, which uses a set of symbols and text to precisely explain a subset of real information to improve communication within the organization and thereby lead to a more flexible and stable application environment.” - Steve Hoberman, Data Modeling Made Simple.
“Data models are techniques for representing information, and are at the same time sufficiently structured and simplistic as to fit well into computer technology” - William Kent, Data and Reality.
“A data model tells a story, and the story is about how a group of people come together to use data to solve a business problem (or take advantage of a business opportunity). The data model becomes a record of the journey from the conception of the problem to its solution.” - Larry Burns, Data Model Storytelling.
As I said, there’s no consistency in how people define a data model, and the definitions that do exist tend to pull in different directions. Steve Hoberman calls it a wayfinding tool that improves communication. William Kent takes a pragmatic, technical view: data models are techniques “sufficiently structured and simplistic” to fit well into computer technology. Larry Burns sees something more human, a story about “how a group of people come together to use data to solve a business problem.” Kent focuses on machines, Burns on people, and Hoberman on communication between business and IT. All three are right. All three are incomplete.
Here’s mine.
A data model organizes and standardizes data in a precise, structured representation, enabling and guiding human and machine behavior, informing decision-making, and facilitating actions.
Notice the emphasis on machines. This is a deliberate departure from other definitions. As we saw in Chapter 1, AI agents and LLMs are now first-class consumers of your data models, and their ability to write correct queries, avoid hallucinations, and reason about your business depends directly on how well the data is modeled. Most data has always been modeled for machines, not humans. Think about the computer systems you interact with daily—your computer, smartphone, and every app you use. Since the dawn of computing, humans have modeled data for machines to perform automated tasks. My definition simply makes this explicit.
Go back to that e-commerce company. If someone had properly modeled “order”—with real relationships to inventory, customers, and fulfillment—a warehouse manager could see what’s actually in stock without calling three people. An AI agent could catch the divergence between inventory counts and order history before customers start buying things that don’t exist. Same model, two consumers. That’s what the definition looks like in practice.
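Here is a rough sketch of that idea in Python. The record shapes, field names, and the find_divergence function are all assumptions made up for illustration. A real inventory system would also track returns, reservations, and more. The point is only that once “order” carries explicit relationships to inventory and fulfillment, the same check can serve a warehouse manager’s report or an automated agent.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class InventoryRecord:
    sku: str
    units_received: int   # total units ever received into the warehouse
    units_on_hand: int    # what the inventory system currently claims


@dataclass(frozen=True)
class Order:
    order_id: int
    customer_id: int      # relationship to a customer
    sku: str              # relationship to a product in inventory
    quantity: int
    fulfilled: bool       # relationship to fulfillment, reduced to a flag


def expected_on_hand(record: InventoryRecord, orders: List[Order]) -> int:
    """Units that should remain: received minus everything already shipped."""
    shipped = sum(o.quantity for o in orders if o.sku == record.sku and o.fulfilled)
    return record.units_received - shipped


def find_divergence(
    records: List[InventoryRecord], orders: List[Order]
) -> Dict[str, Tuple[int, int]]:
    """Map each disagreeing SKU to (claimed on hand, expected on hand)."""
    divergent = {}
    for record in records:
        expected = expected_on_hand(record, orders)
        if record.units_on_hand != expected:
            divergent[record.sku] = (record.units_on_hand, expected)
    return divergent


records = [
    InventoryRecord("SKU-001", units_received=100, units_on_hand=90),
    InventoryRecord("SKU-002", units_received=50, units_on_hand=45),
]
orders = [
    Order(1, 42, "SKU-001", quantity=30, fulfilled=True),
    Order(2, 43, "SKU-002", quantity=5, fulfilled=True),
    Order(3, 42, "SKU-002", quantity=10, fulfilled=False),
]

divergence = find_divergence(records, orders)
print(divergence)  # {'SKU-001': (90, 70)}
```

In this toy data, SKU-001 claims 90 units on hand while the order history implies 70, which is the kind of gap that could be flagged before customers buy things that don’t exist.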
As machine learning and AI become increasingly ubiquitous, machines will become more autonomous. Data is growing to a scale where humans struggle to reason about it without machine assistance. The traditional human-centric approach to data modeling needs to evolve.
This definition also widens the range of data modeling options. Historically, data modeling has been viewed through the lens of particular use cases (usually analytics), as if that’s the only place where data is modeled. This ignores the data models that exist upstream and downstream. As we explored in Chapter 1 with Mixed Model Arts, data modeling isn’t a single approach; it’s a toolkit of techniques applied to the situation at hand. There is no one true approach. There are many approaches. Pick what works for your situation.
Data modeling isn’t static because the world isn’t static. Definitions of data modeling will likely change, including mine. Even people like Hoberman, whom I view as a pillar in data modeling, regularly review and modify their definition. As an industry, we must constantly challenge our assumptions, choose what works, set aside what doesn’t, and evolve. That’s how we grow. But it’s essential to put a stake in the ground, and for now, this is how I view data modeling.