In the last lesson we introduced the columnstore index, the technology it uses (xVelocity) and how we might use it.
So let’s look at a hands-on example and see how it works. in this database I have a fact table called factFlight. This table contains a record for every domestic flight in the US over the last 10 years or so.
If I look at the storage settings I can see I have about 11.2GB and around 140 million rows. This table happens to be a heap. It has no clustered index.
It has foreign keys and measures and is typical of what we’d see in a data warehouse.
One thing we might want to do with this type of date is summarize it. This query is going to summarize the flight distance mile for flights by year by carrier. We’re doing inner joins to bring in the carrier name and date dimension to make it look pretty.
The first thing we’re going to do is run this query. Essentially this will cause a brute force table scan of the table and give us output by year and carrier with the sum of distance miles. Let’s run this and wait for it to run. I’ll get back with you when it’s finished.
The query is complete. It ran in about one minute and seven seconds. In the output we have the year, carrier and miles. I didn’t order it, but I can see I have all the rows I’m looking for.
Just to keep the fight fair, we’ll run this a second time so that SQL Server can take advantage of anything that it has in RAM at this point. I’ll run it one more time and see how long it takes for the second run.
The second run is complete and it took 1 second less! So the RAM caching of data for this query didn’t do much good.
Now I’d like to compare how long would that same kind of query against the fact table take if I had a columnstore index. So let’s create one. Here’s the statement for creating that columnstore index. It’s just a create index statement with COLUMNSTORE added.
We’re going to include all the columns in the fact table. For the query I’m looking at I don’t really need all the columns, but I’m going to do this anyway so any query I run against this table will perform very well.
Our columnstore index is built. That took seven minutes and 53 seconds. Let’s look first at how big the index is. I have a query here which we can run to determine the size of our columnstore index. Essentially it’s going to query the system tables for the index type of “COLUMNSTORE” and add up how much space is used.
If I execute this query I can see that my columnstore index is about 3.5 GB. That’s on top of the roughly 11-12 GB that I already had in the table. So there’s a fair amount of disk used here. Let’s see whether that disk usage is worth it.
Let’s go back to our query against the flight table. First I’ll just run an execution plan to see if that index will be used in this query.
If I scroll down I can see there are a couple tables in this query plan. There will be a scan of the carrier table, and a scan of the date table. Those are fine. When we’re scanning the flight table we don’t see the table itself anywhere in the plan. We see a scan of the columnstore index. After that scan we see the hash match and so on to get the output.
We’re not going to hit the table; we’ll just hit the index. Let’s see how long that query now takes.
We’ve queried out our data, and that took three seconds. If I run that again…two seconds. So we’ve reduced our query time on this query from one minute and seven seconds to two seconds!
So that’s the columnstore index We’ve taken our 140 million row table, created a columnstore index, and that’s dropped this table scan query from a minute 7 seconds down to 2 seconds.