Vitko August 17, 2009 at 15:49

Create an OLAP cube. Part 1

Continuing the subject of Multidimensional cubes, OLAP and MDX and olap for a small company , traditionally, I propose to start with a simple "Hello World" cube that will analyze the processes and trends of voting on Habré.

So, let's try to create our first OLAP system.
But, before rubbing my hands to launch Business Intelligence Studio, I propose first to create a data warehouse of habr-voices, the so-called Data Warehouse .
What for? There are several reasons for this:

the very essence of the Data Warehouse is to store "cleared" data ready for analysis, so even its initial structure can be very different from the structure of our Habr-OLTP database
in HabraDW (as we call it) we will only provide the information that we need for analysis, nothing more
Data Warehouse does not have normalization requirements. On the contrary, by denormalizing some data, you can achieve a more understandable scheme for building a cube, as well as the speed of loading data into a cube

A bit of theory.

Essentially, a Data Warehouse can be:

purely virtual (for example, defined as a set of SELECTs or even calls to complex stored procedures that somehow determine the input data for the cube)
quite real, that is, to exist physically on some server (or servers)

In the latter case, you most likely want to implement ETL processes (using Integration Services or something else), but this is a reason for another, no less interesting article.

What should be Data Warehouse?

It's very simple - your Data Warehouse should have the form of the structure of stars (star model) or a snowflake (snowflake model) and consist of the facts (facts) and dimensions (dimensions).

Facts are actual records about some process that we want to analyze, for example, the process of voting on Habr, or the process of changing the price of goods on the exchange. Very often, facts contain some kind of numerical data, for example, the actual value of a vote or price.

Measurements- these are the defining attributes of facts, and usually answer all kinds of questions: when a fact occurred, over what or with what exactly, who was the object or subject, etc. Basically, measurements have a more descriptive (i.e. textual) character, for example, a username or a month’s name, since it will be much easier for an end user to perceive the results described by a text (for example, a month’s name) rather than numbers (month number in a year).

Having determined where we have the facts and where the measurements are, it is very simple to build a model of the star .

Star.

In the center we indicate our fact table, and with the rays we derive measurements.

And now a snowflake.

A snowflake is the same star , only measurements can depend on measurements of the next level, and those in turn can include more levels.

Each of these models has its own advantages and disadvantages, and the actual choice of model should be based on the requirements for cube design, data loading speed, disk space, etc.
Naturally, the final Data Warehouse is usually much more complex and consists of several stars or snowflakes that can share common dimensions.

HabraDW.

Let's move on to the actual development of our Data Warehouse.

Our goal is the analysis of voting trends on Habré, finding patterns and trends.
The main trends that we want to identify:

what time of year / month / week do they vote better / worse / more often
how to vote on Fridays and Mondays (for example)
how the presence of the words Microsoft or Karma in a post affects the voting result
average user activity, “peaks” of voting
etc.

For clarity, our first model will be absolutely simple - we will only include what relates to voting and exclude all unnecessary ones, including the time of user registration and the fact of who posted the article, as well as the time of voting (only date) and other attributes (all of these data can be included in future articles and try to analyze more complex things).

As a result, we have the following tables:

FactHabravote fact table - determines who, when, for what and how exactly voted. The value of Vote in our case will be + - 1, but the field type allows you to expand the delta with voices, for example, + - 10
DimTime time measurement - defines the attributes of time necessary for analysis (values and names)
DimUser user measurement - defines Habr users, so far only a nickname
Measurement of posts DimPost - defines posts, in our case it contains a title and boolean fields that determine whether the post contains the words Microsoft and Karma.

The final layout of our star will be like this.

And here is the original SQL script that creates and fills (so far only with random data) our repository.

Well, now everything is ready to load the data into the cube.
See you in the next article.

Tags: