Multidimensional Cubes, OLAP, and MDX

    OLAPI’ve been an inhabitant of Habr for quite some time, but I have never been able to read articles on the topic of multidimensional cubes, OLAP and MDX, although the topic is very interesting and is becoming more and more relevant every day.
    It is no secret that during that short period of development of databases, electronic accounting and online systems, a great deal of data has accumulated. Now, of interest is also a full-fledged analysis of archives, and perhaps an attempt to predict situations for similar models in the future.
    On the other hand, large companies, even for several years, months or even weeks, can accumulate such large amounts of data that even their elementary analysis requires extraordinary approaches and strict hardware requirements. These may include banking transaction processing systems, exchange agents, telephone operators, etc.
    I think everyone knows 2 different approaches to building a database design: OLTP and OLAP. The first approach (Online Transaction Processing - real-time transaction processing) is designed for efficient data collection in real time, while the second (Online Analytical Processing - real-time analytical processing) is aimed specifically at the selection and processing of data in the most efficient way.

    Let's look at the main features of modern OLAP cubes, and what tasks they solve (Analysis Services 2005/2008 are taken as a basis):
    • quick access to data
    • preaggregation
    • hierarchies
    • work with time
    • multidimensional data access language
    • KPI (Key Performance Indicators)
    • mining date
    • tiered caching
    • multilingual support

    So, let's look at the capabilities of OLAP cubes in a little more detail.

    A bit more about features


    Quick access to data
    Actually fast access to data, regardless of the size of the array, is the basis of OLAP systems. Since this is the main focus, the data warehouse is usually built on principles different from the principles of relational databases.
    Here, the time taken to sample simple data is measured in fractions of a second, and a query that exceeds several seconds is likely to require optimization.

    Preaggregation
    In addition to quick sampling of existing data, it is also possible to preaggregate the “most likely-used” values. For example, if we have daily sales records for a product, the system mayto also aggregate monthly and quarterly sales to us, which means that if we request data monthly or quarterly, the system will immediately give us the result. Why is preaggregation not always happening - because theoretically possible combinations of goods / time / etc. there can be a huge amount, which means that you need to have clear rules for which elements the aggregation will be built, and for which not. In general, the topic of accounting for these rules and the actual direct design of aggregations is quite extensive and in itself deserves a separate article.

    Hierarchies
    It is logical that, analyzing the data and building the final reports, there is a need to take into account that the months consist of days, and themselves form quarters, and the cities are in the regions, which in turn are part of regions or countries. The good news is that OLAP cubes initially view data in terms of hierarchies and relationships with other parameters of the same entity, so building and using a hierarchy in cubes is very simple.

    Work with time
    Since the analysis of data is mainly carried out in time sections, it is time in OLAP systems that is especially important, which means that simply by defining the system where we have time here, you can easily use functions like Year To Date, Month To Date ( period from the beginning of the year / month to the current date), Parallel Period (on the same day or month, but last year), etc.

    Multidimensional Data Access Language
    MDX (Multidimensional Expressions) is a query language for simple and efficient access to multidimensional data structures. And that says it all - there will be a few examples below.

    Key Performance Indicators (KPI)
    Key Performance Indicators- This is a financial and non-financial assessment system that helps the organization determine the achievement of strategic goals. Key performance indicators can be quite simply defined in OLAP systems and used in reports.

    Date Mining
    Data Mining (Data Mining) - in fact, revealing hidden patterns or relationships between variables in large data sets.
    The English term “Data Mining” does not have an unambiguous translation into Russian (data mining, data mining, information penetration, data / information extraction) therefore, in most cases it is used in the original. The most successful indirect translation is considered the term "data mining" (IAD). However, this is a separate, no less interesting topic to consider.

    Multilevel Caching
    Actually, to ensure the highest data access speed, in addition to tricky data structures and pre-aggregations, OLAP systems support multilevel caching. In addition to caching simple queries, parts of data subtracted from the data warehouse, aggregated values, and calculated values ​​are also cached. Thus, the longer you work with the OLAP cube, the faster it, in fact, starts working. There is also the concept of "cache warming up" - an operation that prepares an OLAP system for working with specific reports, queries, or all together.

    Multilingual support
    Yes Yes Yes. At a minimum, Analysis Services 2005/2008 (true, Enterprise Edition) natively supports multilingualism. It is enough to bring the translation of the string parameters of your data, and the client who specified their language will receive localized data.

    Multidimensional cubes


    So what are these multidimensional cubes?
    Imagine a 3-dimensional space in which the axes Time, Goods and Buyers.
    The point in such a space will determine the fact that one of the buyers in a certain month bought a specific product.

    Multidimensional cube

    In fact, the plane (or the set of all such points) will be a cube, and, accordingly, Time, Goods and Buyers - its dimensions.
    To imagine (and draw) a four-dimensional or more cube is a little more difficult, but the essence does not change, and most importantly, for OLAP systems it does not matter how many dimensions you work (within reasonable limits, of course).

    A bit of MDX


    So, what is the charm of MDX - most likely that it is necessary to describe not how we want to select data, but what we want.
    For example, What does it mean - I want the number of iPhones sold in June and July in Mozambique. At the same time, I describe what kind of data I want and how exactly I want to see them in the report. Nice, isn't it? But a little more complicated: In fact, first we determine the formula for calculating the "average purchase size" and try to compare who (what gender) spends more money in one visit to the Apple store. The language itself is extremely interesting for study and use, and perhaps deserves a lot of discussion.

    SELECT
    { [Measures].[Units] } ON COLUMNS,
    { [Time].[June, 2009], [Time].[July, 2009] } ON ROWS
    FROM [Sales]
    WHERE ([Product].[iPhone], [Country].[Mozambik])

    * This source code was highlighted with Source Code Highlighter.









    WITH MEMBER AverageSpend AS
    [Measures].[Amount] / [Measures].[Transaction Count]
    SELECT
    { AverageSpend } ON COLUMNS,
    { [Customer].[Sex].[Female], [Customer].[Sex].[Male] } ON ROWS
    FROM [Sales]
    WHERE ([Shop].[Apple])

    * This source code was highlighted with Source Code Highlighter.






    Conclusion


    In fact, this article covers very little even basic concepts, I would call it “appetizer” - an opportunity to interest the habr community in this topic and develop it further. As for development, there is a huge, unplowed field, and I will be happy to answer all your questions.

    PS This is my first post about OLAP and the first publication on Habré - I will be very grateful for a constructive feedback.
    Update: Transferred to SQL, transferred to OLAP as soon as they are allowed to create new blogs.

    Also popular now: