One of the fundamentals of data warehouse design when using the Kimball method is that we’re going to organize our relational data warehouse into a Star Schema.
Why do we do that? Why don’t we just design a data warehouse the same way we design a transactional database? We’re going to talk about that for just a couple minutes. We’re going to come back to this design, which is a collection of tables which makes a very simple data warehouse. You might even call this a data mart.
Before we look at that, let’s take a look at our more typical 3rd normal form transactional database. When you look at one of these you see the first thing that hits you is that there are a lot of tables and a lot of complexity in the way they’re joined. So we see joins that go through tables and keep going and continue on to other lookup tables that continue to still other lookup tables.
This all is excellent modeling for a transactional system. It keeps the amount of redundancy in tables to a minimum. It keeps inserts very efficient. It makes updates very efficient too. For example if the category description for a product or the model name for a product changes, I don’t need to change the product table because there’s a model table that has that information.
This is fantastic stuff for transactional systems, but to query this kind of a database is just daunting for most people. I you understand this design completely it’s great! If you don’t, you’ll spend the first couple weeks working with this database just trying to figure out what’s in it. It’s not intuitive for users that aren’t expert in its design.
It also isn’t efficient from a query perspective in terms of joins. If you want to extract a lot of information and slice and dice from this kind of database, you’ll be joining perhaps a dozen of these tables on each of our queries. That can be a performance problem. The engine is going to work really hard and the query is going to be very difficult.
To solve some of those problems and challenges, we flatten things out in the star schema design. WE try to minimize the number of jumps we have between the metric data and the dimensions we’re going to slice that data by — such as time and so on.
So you will notice in the item table that we don’t have a separate category table. We have the category description with the item. If we looked at that table, the category description would be repeated over and over. In a transactional database design this would be considered a terrible design! In OLTP it would be considered inefficient and the wrong way to design a database.
This is true if your goal is to design a transactional database optimized for inserts. But when querying data this design makes it super easy for users and honestly in most data warehouses 98% of your storage is in the fact tables. These other tables might be 2%, so the “wasted space” really doesn’t matter.
Why do we call this a star? Quite simply we call it a star because that’s what it looks like. If we look at this schema when were done with it we actually have something that looks like a star. So we give it the name “Star Schema”. It may be a silly reason to call it a star schema, but that’s the reason.
What you’re looking for is that you don’t have a lot of joins between the dimension tables. For any query we’ll have one join to get any piece of information joined to the fact table. We’ll look at a query example in a moment.
I will mention in passing that you might hear the term “Snowflake Schema”, and a snowflake is a star that develops more complexity. Let’s say I had a model table that was connected to my items table. I then connect the Model table to my Items table. Now you can see that my design doesn’t look as much like a star, but starts to look like something else. If you imagine a snowflake, it has this type o radial spiral pattern.
Some people think Snowflakes are bad, others think they’re fine. I’m a little agnostic about this. Sometimes it’s necessary, but one thing to keep in mind is that in this design the Model key needs to have a key in the Item table. We could just as easily put these attributes into the Item table directly. That design would work fine and would maintain the Star Schema shape.
There will always be a disagreement whether Snowflake Schema is good or bad. If you get carried away with snowflaking your design, you’ll end up right back at a 3rd Normal form (so be careful!).
So that’s what the Star Schema looks like. Let’s just look at why this makes queries so much easier. Here’s a typical query that we might use on a star schema, where we want sales with inner joins. Each of these joins is to a single table. We don’t need to join Item to some other table and the other table to yet another table as we would din a 3rd normal form database.
So to join any piece of information to the fact table we have only one join to a dimension table. We have a where clause here, and if we want to drill down we just keep adding to the WHERE clause.
So that’s my sales for the week, and if I want to drill down to the west I can do that. If I want category join that dimension. It’s very quick to build.
This query is very simple. Its’ easy to follow. It’s pretty easy for end-user tools to do this kind of thing where you keep adding to the where clause to drill down.
That’s how a star scheme looks, why we build it and how you use it.