In the last lesson we saw what kind of performance we can get by storing our data in RAM using Vertipaq (xVelocity) instead of storing it on disk. The next question we have is “but how did we get all that data into RAM in the first place?”
Let’s first look at how information is normally stored in a database on disk using SQL Server. The way that works is to take rows of data, one row at a time, and start filling pages of disk storage with those rows.
We fill the page with rows until it’s full. When it’s full we create another page, and so on until the entire table is written out to disk storage. When SQL Server needs that information back (for example, it needs row #3), it retrieves page #1 to load row #3. This is done transparently to the user, but this is how the data is stored.
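The paging described above can be sketched in a few lines of Python. This is a toy model, not SQL Server's actual page format: the page size and row values are hypothetical, and real pages are fixed-size byte buffers rather than Python lists. The point is just that rows fill pages in order, so fetching one row means loading the whole page that contains it.

```python
# Toy model of row-store paging (hypothetical sizes, not SQL Server's real format).
ROWS_PER_PAGE = 4  # hypothetical: how many whole rows fit on one page

def build_pages(rows, rows_per_page=ROWS_PER_PAGE):
    """Fill pages with rows, one row at a time, starting a new page when one is full."""
    return [rows[i:i + rows_per_page] for i in range(0, len(rows), rows_per_page)]

def fetch_row(pages, row_number, rows_per_page=ROWS_PER_PAGE):
    """To read row N, retrieve the page that holds it, then pull the row off the page."""
    page_index = (row_number - 1) // rows_per_page   # which page to load
    offset = (row_number - 1) % rows_per_page        # where the row sits on that page
    return pages[page_index][offset]

rows = [("row1",), ("row2",), ("row3",), ("row4",), ("row5",)]
pages = build_pages(rows)          # two pages: rows 1-4, then row 5
print(fetch_row(pages, 3))         # row #3 comes back by loading page #1
```

Asking for row #5 would instead load the second page, which is exactly the "retrieve the page to get the row" behavior described above.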
The compression we can get is really limited to what can be compressed within a page. If there would be a benefit to compressing data across pages, there’s not much we can do about that, because of the way data is stored row by row within pages.
But if you think about it, we compress data by finding redundancies and factoring them out. If we have the same value within a column on many rows, we can compress that by not storing the value over and over. If we look at the factSales table that we were querying in the previous lesson, we can see that in all these columns we had a lot of redundancy. The StoreKey appears over and over again—the TimeKey as well. The ItemKey is less redundant, but we have many values in many columns that are stored over and over and over again.
We can’t really take advantage of the redundancy when storing rows on pages like this. We’re going to need a lot more disk storage, and because these pages take up so much space we can’t efficiently store so many of these pages in RAM.
Let’s contrast that with the way Vertipaq (xVelocity) is storing data. Instead of a row store strategy, Vertipaq (xVelocity) uses a column store strategy. What do we mean by that?
In the previous slide we saw that pages of information were stored with many rows of data. What Vertipaq (xVelocity) does is create pages somewhat like the ones we saw before, but it stores the data from a single column together on a page. So if you think about the columns that had redundant values across rows, all those values are going to sit together on the same set of pages. So we can instantly get tremendous compression, because now if an entire page, for example, is all NULL, we only need to store NULL once and then just indicate that the value is NULL for all rows on the page (or something along those lines).
If the value is a string that’s very long, we can store that string once and then make a footnote that says “by the way, all the rows on this page are the same string as the first one”. Now we can get 10X more rows on the page. We can get much better compression characteristics, and as the columns are loaded into the column store, this is what it begins to look like.
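That "footnote" trick is essentially run-length encoding: a long run of identical values on a columnar page collapses to one (value, count) pair instead of one copy per row. A minimal sketch, with made-up page contents:

```python
# Sketch of run-length encoding on a columnar page: store each run of
# identical values as [value, count] instead of repeating the value per row.

def run_length_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return runs

# Hypothetical page: six NULLs followed by four copies of a long string.
page = ["NULL"] * 6 + ["a very long promotional string"] * 4
print(run_length_encode(page))
# [['NULL', 6], ['a very long promotional string', 4]]
```

Ten row values become two run entries, which is why the long string gets stored once with a "the rest of the page is the same" note, and why so many more rows fit on each page.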
Column #1 is on a couple of pages in this diagram, and Column #2 is on a couple of pages as well. We’re taking advantage of tremendous compression opportunities to store many, many rows on each columnar page. And since each of these pages holds so many rows, when we store a page in RAM we’re actually storing the values for many more rows at once compared to the row-oriented storage strategy.
That’s fundamentally why Vertipaq (xVelocity) is able to store so much data in memory. This diagram summarizes the tables we were just querying and shows the benefit of using Vertipaq (xVelocity) for the column storage rather than the traditional row storage we use with SQL Server.
In our 250-million-row database, the fact table accounts for by far the most data storage. This is typical for a data warehouse. In this column of the diagram we have the storage size in MB for the factSales table. Uncompressed SQL storage is about 30GB. If we were to compress that with SQL page compression, perhaps we could get that down to 20GB or so.
But that would still be a substantial pile of data to store in RAM if that’s what we were trying to do, and probably not that practical. It wouldn’t take many tables to saturate our server’s RAM if we started storing that much data in RAM and querying it from RAM.
With Vertipaq (xVelocity), because of the compression levels it gets (which exceed 10:1 in this case), the data size is only about 2.7GB. This server has 32GB of RAM, so a 2.7GB memory footprint is well worth the performance improvement we get.
So that’s the reason that Vertipaq (xVelocity) works the way it does. That’s the reason we can store so much in RAM. And because we can query from RAM we get blindingly fast query performance.
Now, certainly there are other optimizations that make Vertipaq (xVelocity) fast, but fundamentally this is the biggest portion of why we get the performance we do from Vertipaq (xVelocity).