Data Modeling Building Blocks
What Every Power BI User Needs to Know
If you’re new to Power BI, you’d be forgiven for seeing it as just a data visualization tool. That initial “click, drag, and drop” introduction to Power BI can be thrilling, but I worry it often gives short shrift to Power BI’s most exciting talents; its ability to pull data from many sources, transform them to meet your needs, and, for a grand finale, efficiently structure and connect data to keep your reporting fast and reliable.
That last part is called tabular data modeling. It’s based on core data warehousing principles introduced by Ralph Kimball (the granddaddy of Data Warehousing), and it’s awesome.
So how do you unleash the magical power of tabular data modeling to support your reports? I’m so glad you asked!
Every budding data modeler should start by learning how to use the four basic building blocks of a Power BI data model, which are:
While you can go deep into data modeling, these four components form the basis for everything else. They’ll give you a good “toe hold” when making data models or improving the ones you already have.
Let’s start with Fact Tables. In a data model, Fact tables are where you want the transactions and any data you want to sum or calculate to live. If you come from a SQL database background, you may be familiar with “Transaction tables” in relational databases, which are governed by much the same logic as fact tables.
Because fact tables often have millions of rows of data, you want them to have as few columns as possible to minimize the “weight” of the overall dataset. Since every additional column has a cost (slower processing time), you only want to include them if there’s no other way to do so. Folks often have an easier time remembering that fact tables are “long and thin”; many rows, minimal columns.
Let’s imagine you’re in a grocery store checkout line where you hear each item “beep” during checkout. Every “beep” is another item getting scanned and recorded as a new transaction row in the grocery store’s fact table.
The rows generated from buying a few items in the express lane might look something like this:
Note that you can’t just read over the above table and understand what happened with each transaction. Instead, we just have a unique ID in each of the key columns to represent each product, customer, and store, plus the date/time and the product price. This is an intentional way to minimize the amount of data required to write each transaction into the grocery store’s database.
Let’s zoom out from just the express lane so we can see what’s happening across the whole grocery store checkout area. Over the course of a busy hour, customers purchase tens of thousands of items. Every item that’s rung up in every checkout lane becomes a new row in the grocery store’s database.
If all transactions had to be written down by hand rather than by using a computer system, the cashiers would probably use shorthand and only keep track of the minimum possible information. Databases employ a similar approach by using keys as a kind of shorthand, which helps them keep up with the volume and speed of every transaction.
In the same way that an old-school cashier would know exactly what the shorthand code is for each product and may recall a great deal of information about every item, a database will be able to use each key value to recall additional details relevant to that transaction using a dimension table.
Dimension tables define what each individual key stands for, including any pertinent details that may be attached to that key. You can also think of a dimension table as providing the “who, what, when, & where” that goes with the “what happened” that’s recorded on the fact table.
Because dimension tables generally contain dramatically fewer rows than fact tables, they can have many columns holding as much descriptive data as is required. In contrast to fact tables “long and thin” description, dimension tables can be described as being “short and fat”.
Our grocery store’s Product dimension table will list each product key once and define that product key using the product name, whether the product is organic, where it’s stocked in the store, and any other information that would be valuable for analysis and reporting.
It might look something like this:
By holding all this descriptive data in a dimension table, each row of product information can be efficiently updated or augmented in just one row, without requiring any changes to what could be millions (or billions) of rows on the fact table itself. However, for a dimension table to fulfill its intended purpose, each key must be unique-it must be listed only once.
If this unique value requirement seems a little confusing, try putting yourself into the data model’s shoes. if I asked you to reference the table below and tell me what the Supplier is for Product Key 4011…
…you wouldn’t know whether to respond with “Banana Llama” or “Ewe Love Potassium” because they’re both listed for the same product key. Similarly, if you wanted to make a report comparing organic and conventional banana sales, you wouldn’t be able to do so because both are lumped together under the same code. Your data model wouldn’t know how to provide you with the right answer either-how could it?
We avoid this kind of ambiguity in a data model to ensure that you can obtain precise, reliable answers in a report. This is also why your Power BI data model will generate a warning (like the one in the image below) if you try to build a relationship between your fact and dimension tables without having unique key column values in either list.
Relationships are the connective tissue that make it possible for dimension tables to specialize in housing descriptive data so the fact tables can focus on maintaining all those millions of rows of transactions. To establish this relationship, you use the “key” for each table, which tells the data model exactly which row of information on a dimension table should be associated with a given transaction. You’ll often hear relationships described base on their direction and by whether they have 1 or many values on each side of the relationship.
In most instances, you’re looking for a 1-to-many relationship going in one direction from the dimension table into the fact table. While that wasn’t initially possible when we had two rows of data for Product Key 4011, I can meet the requirements of a 1-to-many relationship by giving organic bananas the unique key value they always deserved on the dimension table.
After I made this update, Power BI automatically recognized the relationship and built the connection between Fact_Sales and Dim_Products_Corrected for me. As you can see on the screenshot above, the relationship was established by using the Product Key columns on each with the 1 on the dimension table side and the “*” symbol, which means “many,” on the fact table side.
Let’s make this relationship a little more concrete so we can better understand what’s happening within the data model. Now that Product ID 4011 is only assigned to a single row on the dimension table, Power BI can easily understand it belongs on the “one” side of the “many to one” relationship. Many people will buy the same kind of bananas every day, resulting in this same 4011 product key repeating once for each of the many rows of data representing each banana sale on the fact table.
If we needed to generate a report on organic versus conventional banana sales using our updated fact and dimension tables, we would now be able to easily do so. We can also use the other fields on the dimension table to filter fact table information. For example, we could compare sales of all organic products against all non-organic products using the “IsOrganic” field or we could add a filter to a report so that users could view sales based on the Aisle number associated with that product.
Carrying filters from the dimension table to the fact table is the other important responsibility that relationships perform. Think of filters as ways you can slice and dice your data based on what information is available on a dimension table (e.g., “show me sales for only organic data” or “show me the count of all purchases made at our Front Street store location”).
When you create or edit a relationship in Power BI, you have the option to select “one” or “both” as your cross-filter direction. Most of the time, you want your filters to flow in one direction. Setting the filter to “both” enables you to treat both tables like a single big table, filtering from the fact table into the dimension table as well as the reverse, which is rarely necessary.
There you have it! In summary:
Fact tables should be long and thin. Their job is to hold many rows of transactional data, which is why every column should earn its keep by providing transaction-specific data or functioning as a key column that will connect to a dimension table.
Dimension tables should be short and fat. Their job is to hold as much descriptive information as necessary to support analysis and reporting related to the transactions held in the fact table.
Keys are a kind of shorthand that make it possible for your data model to quickly identify which row of highly descriptive data on the dimension table should be associated with transaction data on the fact table.
And, finally, relationships use those keys to explicitly define the connections between fact and dimension tables. While there are a few “flavors” of relationships, most of the time, you’ll want a “one to many” relationship between your dimension table and your fact table, and you’ll want your filters to flow in just one direction.
I hope that helps to demystify data modeling basics a bit as you forge new Power BI adventures! Below are a few of my favorite resources where you can learn more about this topic if you’d like to delve deeper.
LLenore Flower is the Senior Business Intelligence Analyst for Caf2Code.com, your go-to Microsoft Gold Certified Dynamics 365 development consultant agency. Read more →
Originally published at https://caf2code.com on February 22, 2022.