

The Count from Sesame Street
Working on another building blocks chapter, this time on counting and aggregation. I’ve not seen this topic get much treatment from first principles. Sadly, many practitioners get exposed to it via learning GROUP BY in SQL, and the results are often…let’s say mixed. This chapter will cover the basics of counting and aggregation (shown here), then move into the Structural Principles and Workflow for aggregating data. The goal is to equip the reader to count and aggregate data, not just in the traditional tabular sense, but also across multiple forms of data (images, text, etc.).
The full chapter will be available sometime next week to paid subscribers.
Please leave a comment if you feel I’ve missed something important. And if there are grammatical or similar errors, please leave a note here.
Thanks,
Joe
How many times have we inherited poorly documented source systems (and even DWHs and data vaults) and struggled to make sense of them? Well, look no further. Ellie.ai connects to 160+ systems and creates synthetic metadata with AI, and our copilot with source navigator helps you answer questions like “what do I need to calculate profitability?”. You can even create effective models by chatting with AI!
This article is brought to you by Ellie.ai. They’ve been an awesome partner of Practical Data Modeling, so check them out!
Modeling data is more than just collecting random data and blankly staring at it. Often, we want to count and aggregate our data to understand it better. You might be thinking, “Counting and aggregation? Are we doing analysis or data science?” Maybe. But just as often, we need to calculate the total for a customer’s order or aggregate events in a stream processing system. Counting and aggregation are fundamental data modeling building blocks for these operations.
In earlier chapters, we talked about identifying things, relating them, and establishing their grain. With those pieces in place, we can finally talk about how to count things and how those counts become the basis for aggregation across any form of data: tabular, text, graph, event, image, etc. Let’s begin by (re)learning to count.
What Does It Mean to Count Something?
Counting sounds obvious, and on the surface, it’s literally what it sounds like: we’re counting things. But there’s some nuance to counting that requires answering fundamental questions about instances and identity, discrete versus continuous, boundaries, context, and existence versus quantity. Let’s count the ways (pun intended).
- We need to know precisely what the “thing” we’re counting is. Refer back to our discussion on instances and identity. Are we counting a row in a table, words or characters within a text document, an event in a stream, or pixels in an image? Two items can only be counted separately if we can tell them apart.
- Does the thing exist at all (Occurrence), and if so, how many unique things exist (Cardinality)? For instance, “Does this customer exist in our system? And if so, how many orders do they have?” In data modeling, your basis for counting is strictly defined by the boundaries of the dataset you have chosen to frame.
- Is what we’re counting discrete or continuous? Some things can be counted as whole units (orders, users, word count). Others can be measured continuously (phenomena such as temperature, pixel intensity, precipitation). A data model must specify which is which.
- Counts depend on the context, scope, and conditions of the question being asked. Back to our question, “Does this customer exist in our system? And if so, how many orders do they have?” This question only makes sense once we pin down the context: within what time range was an order placed, and under what definition of customer?
All of this ties back to defining what “one” of something means, how many such “ones” can exist, and under what conditions. Before we can aggregate something, we need to know what “one” means. Once we know what “one” is, we can talk about many. Let’s next look at how aggregation turns many small observations of “one” into a clearer view of the whole.
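To make this concrete, here’s a minimal sketch in plain Python against a small, hypothetical orders dataset (the records, IDs, and dates are invented for illustration). It shows occurrence (“does this customer exist?”), cardinality (“how many distinct orders?”), and how a context condition like a time range changes the count.

```python
from datetime import date

# Hypothetical data: each dict is one "instance" in our dataset.
orders = [
    {"order_id": "o-1", "customer_id": "c-42", "placed_on": date(2025, 1, 5)},
    {"order_id": "o-2", "customer_id": "c-42", "placed_on": date(2025, 2, 9)},
    {"order_id": "o-2", "customer_id": "c-42", "placed_on": date(2025, 2, 9)},  # duplicate row
    {"order_id": "o-3", "customer_id": "c-7",  "placed_on": date(2025, 2, 11)},
]

customer = "c-42"

# Occurrence: does this customer exist in the dataset at all?
exists = any(o["customer_id"] == customer for o in orders)

# Cardinality: how many *distinct* orders do they have?
# Counting rows would double-count the duplicate; identity (order_id) decides.
distinct_orders = {o["order_id"] for o in orders if o["customer_id"] == customer}

# Context: the same question, bounded by a time range.
feb_orders = {
    o["order_id"]
    for o in orders
    if o["customer_id"] == customer and o["placed_on"] >= date(2025, 2, 1)
}

print(exists)                # True
print(len(distinct_orders))  # 2, not 3 -- identity decides, not row count
print(len(feb_orders))       # 1
```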
Aggregations: From Many to One
We now move to aggregation, the process of combining individual “ones” into larger, meaningful wholes. Aggregation allows us to perform mathematical operations, such as counting, summing, finding the minimum or maximum, or computing measures of central tendency (mean, median, mode), on a set of values to derive a new, summary value.
Aggregates always lose information, and every aggregation is a trade-off in the level of detail we want to capture. Whenever we aggregate, we are making two tightly coupled data modeling decisions. First, we decide which data we’re grouping or organizing. Second, we’re inherently deciding what we keep and what we discard. So, aggregation is both a grouping and a filtering exercise that moves from a lower to a higher level of abstraction. For example, a handful of user events in a mobile app constitutes a session, and a collection of words constitutes a document. At a lower level, each of these is just a collection of things: user events or words.
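A quick sketch of that information loss, using made-up order values: two very different sets of numbers collapse to the same average, and the aggregate alone can’t tell them apart.

```python
from statistics import mean

a = [50, 50, 50, 50]   # steady order values
b = [5, 5, 5, 185]     # one outlier order

# Both aggregate to the same summary value...
print(mean(a), mean(b))  # 50 50

# ...but the detail that distinguished them is gone. Recovering it
# means going back to the lower-grain data we aggregated away.
```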
While aggregations are commonly associated with tabular data, we can aggregate data across all forms of data. With tabular data, we group rows by a key or another attribute. Event streams fold event sequences into sessions or windows. With text data, you might compute term frequencies or aggregate word vectors into sentence embeddings. With images and video, you can pool pixels or frames into higher-level features. Although we’re working with different forms of data, the same principle applies: identify what belongs in the data, define a grain, group the data, and aggregate it.
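As one sketch of the same principle applied to text: the instance is a word, the grain we aggregate to is the document, and a term-frequency count is the aggregation. The tokenizer here is deliberately naive.

```python
from collections import Counter

# The instance is a word; the grain we aggregate to is the document.
doc = "the cat sat on the mat and the cat slept"

# Tokenize (naively) into individual instances, then aggregate:
# group equal words together and count each group.
term_frequencies = Counter(doc.split())

print(term_frequencies.most_common(2))
# [('the', 3), ('cat', 2)]
```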
Aggregation only works when the grain of the underlying data is clearly defined. Changing the grain changes the meaning of both the grouping and the result. For example, aggregating at the event level yields different insights than aggregating at the session, document, or product level.
Grouping changes the grain; it is the mechanism by which we transform the data from one grain (the input) to a coarser grain (the aggregate). Grouping, in turn, defines how instances roll up into larger entities. This could be a category, a time bucket, a cluster, a user, a region, or any other dimension we introduce into the model. Different grouping conditions produce different summaries and different interpretations of the data.
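Here’s a minimal sketch of grouping as a grain change, using a hypothetical event stream. The input grain is one record per event; the grouping key we choose defines the coarser output grain, and a different key yields a different summary.

```python
from collections import defaultdict

# Input grain: one record per event.
events = [
    {"user": "u1", "day": "2025-03-01", "type": "click"},
    {"user": "u1", "day": "2025-03-01", "type": "view"},
    {"user": "u1", "day": "2025-03-02", "type": "click"},
    {"user": "u2", "day": "2025-03-01", "type": "view"},
]

# The grouping key *is* the new grain: one row per (user, day).
events_per_user_day = defaultdict(int)
for e in events:
    events_per_user_day[(e["user"], e["day"])] += 1

# Group by user alone and you get a different grain, and a different summary.
events_per_user = defaultdict(int)
for e in events:
    events_per_user[e["user"]] += 1

print(dict(events_per_user_day))
# {('u1', '2025-03-01'): 2, ('u1', '2025-03-02'): 1, ('u2', '2025-03-01'): 1}
print(dict(events_per_user))
# {'u1': 3, 'u2': 1}
```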
So, aggregation reduces data into a smaller, summarized form. And this new aggregated data isn’t just numbers, but also a new entity in your data model. This entity has its own attributes, identifiers, grain, and relationships. This is where data modeling becomes recursive. Data aggregates roll up into summaries, summaries become entities, and these entities might themselves be aggregated further. Many become one, and these new “ones” are now available to become part of an even larger whole.
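To see the recursion, consider a sketch with hypothetical order lines: line items roll up into order totals, and those totals, now entities with their own identifier and grain, roll up again into customer totals.

```python
from collections import defaultdict

# Lowest grain: one record per order line.
order_lines = [
    {"order_id": "o-1", "customer_id": "c-42", "amount": 10.0},
    {"order_id": "o-1", "customer_id": "c-42", "amount": 5.0},
    {"order_id": "o-2", "customer_id": "c-42", "amount": 20.0},
    {"order_id": "o-3", "customer_id": "c-7",  "amount": 8.0},
]

# First rollup: lines -> orders. The result is a new entity with its own
# identifier (order_id), grain (one row per order), and attributes (total).
order_totals = defaultdict(float)
order_customer = {}
for line in order_lines:
    order_totals[line["order_id"]] += line["amount"]
    order_customer[line["order_id"]] = line["customer_id"]

# Second rollup: orders -> customers. The aggregate itself gets aggregated.
customer_totals = defaultdict(float)
for order_id, total in order_totals.items():
    customer_totals[order_customer[order_id]] += total

print(dict(order_totals))     # {'o-1': 15.0, 'o-2': 20.0, 'o-3': 8.0}
print(dict(customer_totals))  # {'c-42': 35.0, 'c-7': 8.0}
```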
