Random databases. Oracle Enterprise Data Quality: the shield and sword of the corporate data warehouse

    The human thought process is difficult to mathematize. Any business task generates a set of formal and informal documents whose information ends up in the corporate repository. Each task that drives an information process accumulates around itself a set of documents and a processing logic that are poorly formalized in the data warehouse environment. The warehouse therefore needs structures that clean the incoming information flow. Oracle Enterprise Data Quality, a product designed for cleaning "dirty" data, can help here, although its uses are not limited to that.

    1. The concept of a random database.

    A person's earliest business relationships are described by formal and informal documents: an application, a declaration, an employment contract, a placement request, a request for a resource. These documents create logical links between business processes but, as a rule, are a product of office managers' thinking and are poorly formalized.

    The task of any even moderately complex optimization is not only to understand the formal and informal rules, but often to bring disparate knowledge into a common information base.

    Definition. A random database is a set of facts, documents, handwritten notes, and formal documents that a person processes for a specific business process, but that cannot be fully processed automatically because of the strong influence of the human factor.

    Example. A secretary formally takes a call. The caller is interested in a product or service, but is not known to the CRM. Question: what must the caller say in order to be put through to a specialist?

    Or, more precisely: how far do the secretary's job instructions allow a formal dialogue about the business if the responsible specialist is not prepared for this kind of activity?

    It turns out that we again come to the definition of a random database.

    Perhaps it contains more facts than the secretary can know, but the information in it cannot be superfluous. In general, when random facts from a random database arrive at the input of a formalized system, information overload arises, and information overload can affect the performance not only of the secretary but of the whole company.

    If such data is used for automated processing, a machine reading this information arrives, through logical inference, at the state opposite to the person's: information overload. Human logic is more flexible.

    2. Application of the definition to real tasks.

    Imagine a store in which the price tags on random goods are noticeably high or low. When an inexperienced shopper leaves such a store, only the prices of the 5-7 (or even 3) most popular goods remain in their head, the ones whose prices can affect the size of the total bill. It turns out that if one could learn the list of goods whose prices buyers recall most often, the remaining prices could be varied over a relatively wide range.

    Have you ever wondered why, before Lent, meat first becomes sharply cheaper, then can sharply rise in price, and then disappears? The price of a product whose demand may fall to zero is first artificially heated up; then, passing a certain level of demand, it begins to be fixed; and after a while it is raised forcibly, since greed does not allow giving away illiquid goods at a fair price.

    An almost identical situation exists in the data market. The most useful information is almost always hidden behind secondary hypotheses about its applicability and extractability.
    It is enough to post any information that interests 5,000-7,000 people on any relatively unprotected resource, and copy-paste sites will surely appear.

    Or take the famous game with phone numbers, "Who called me?". About a thousand sites in the Runet consist of nothing but the phone numbers of various operators, just to rank a little higher in search results and to sell the domain name and advertising at a higher price.

    3. The price of the issue when working with "dirty" data.

    According to the author's research, up to 10% of the labor resources of each project are diverted to writing various data cleaning procedures. And that is without dwelling on the completely banal checks of type and length: unique identifiers, database integrity rules and business integrity rules, quantitative and qualitative unit scales, labor unit systems, and any other states, influences, and transitions whose preparation, as a rule, requires statistical and logical analysis as well as serious business analysis. Formalizing the requirements leads to the need to formalize the fact-dimension relationship, both for building warehouses and for resolving issues on the front end.
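    The "banal" type, length, and uniqueness checks mentioned above can be sketched in a few lines of Python; the field names ("customer_id", "name") and the limits are hypothetical, purely for illustration:

```python
# A minimal sketch of basic data-quality checks: uniqueness, type, and length.
# The field names and limits are hypothetical.

def validate(records):
    """Return a list of (row_index, field, problem) tuples."""
    problems = []
    seen_ids = set()
    for i, rec in enumerate(records):
        cid = rec.get("customer_id")
        # Unique-identifier check
        if cid in seen_ids:
            problems.append((i, "customer_id", "duplicate"))
        seen_ids.add(cid)
        # Type check
        if not isinstance(cid, int):
            problems.append((i, "customer_id", "not an integer"))
        # Length check
        name = rec.get("name", "")
        if not 1 <= len(name) <= 100:
            problems.append((i, "name", "bad length"))
    return problems

rows = [
    {"customer_id": 1, "name": "Ivanov"},
    {"customer_id": 1, "name": ""},          # duplicate id, empty name
    {"customer_id": "2", "name": "Petrov"},  # id stored as a string
]
print(validate(rows))
```

    In a real warehouse such checks would, of course, be expressed as EDQ processors or ETL rules rather than ad hoc code; the point is only how cheap the individual check is compared to the cost of skipping it.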

    Agree: if ETL processes occupy 70% of a warehouse's operating time, then saving 5-7% of resources through correct data cleaning on a notional warehouse of 200,000 customers is already a good bonus?

    Let us touch briefly on "dirty" data in production systems. Say you mail congratulations on a national holiday to 10,000 customers. How many of them will throw your letter, with its finest postcard, straight into the trash if you misspell the first name or surname, or fill in the salutation incorrectly? The price of your efforts can reduce any user's mood to zero!
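    A misspelled name can often be caught before the mailing goes out by fuzzy-matching it against a reference list. A minimal sketch using the standard library's difflib; the reference names and the 0.8 similarity cutoff are illustrative assumptions:

```python
# Catch misspelled customer names by fuzzy-matching against a reference list.
# The names and the 0.8 cutoff are illustrative assumptions.
import difflib

known_names = ["Ivanov", "Petrov", "Sidorov"]

def closest_match(name, cutoff=0.8):
    """Return the closest reference spelling, or None to flag for review."""
    matches = difflib.get_close_matches(name, known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(closest_match("Ivan0v"))  # -> Ivanov (a typo we can auto-correct)
print(closest_match("Smith"))   # -> None (unknown name, needs manual review)
```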

    4. Oracle Enterprise Data Quality: the shield and sword of the corporate data warehouse.

    The screenshots accompanying this section (omitted here) illustrated Oracle Enterprise Data Quality at work. So, suppose someone has spilled water onto your database or text document.

    EDQ ships with a library of standard processors: logical units that apply one or another hypothesis to the data, or search for what is required. The demonstrated operations included:

    - profiling a random database;
    - an elementary solvency check;
    - working with a postal code;
    - cleaning a mailing address;
    - cleaning user data;
    - assigning a record to a particular confidence interval;
    - inferring a user's gender from circumstantial evidence;
    - determining the city, country, and state;
    - simple key searches in a random database;
    - deduplication of user data.

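    To make the list above more concrete, here is a toy imitation of two such processors in Python: inferring gender from circumstantial evidence and deduplicating user records. This is not the EDQ API; the surname heuristic and the field names are illustrative assumptions (feminine Russian surnames typically end in "-va" / "-na"):

```python
# A toy imitation of two data-quality processors. NOT the EDQ API;
# the surname heuristic and field names are illustrative assumptions.

def infer_gender(surname):
    """Guess gender from the surname ending - circumstantial evidence only."""
    s = surname.strip().lower()
    return "F" if s.endswith(("va", "na")) else "M"

def deduplicate(records):
    """Keep the first record for each normalized (name, phone) key."""
    seen, unique = set(), []
    for rec in records:
        key = (rec["name"].strip().lower(), rec["phone"].replace("-", ""))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

users = [
    {"name": "Ivanova", "phone": "495-123-4567"},
    {"name": " ivanova ", "phone": "4951234567"},  # same person, dirty form
    {"name": "Petrov", "phone": "495-765-4321"},
]
print([infer_gender(u["name"]) for u in users])  # -> ['F', 'F', 'M']
print(len(deduplicate(users)))                   # -> 2
```

    Real processors chain such steps into a pipeline, each one applying its own hypothesis to the record stream.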

    5. Amusing observations from the results of working with Oracle EDQ.

    One way to compare the contributions of writers and poets to literature is to compare their poetic and literary dictionaries. Below are several dictionaries compiled in spare time while testing ready-made solutions in Oracle EDQ, Python, and Java. We would be grateful if philologists posted their own results in the comments.
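    A frequency dictionary of the kind shown below can be built in a few lines of Python, one of the languages the tests used; the tokenization here (lowercasing plus a `\w+` split) is a simplification:

```python
# A minimal sketch of building an author's frequency dictionary.
# The tokenization is a simplification of what real tests would do.
from collections import Counter
import re

def frequency_dictionary(text, top=5):
    """Split the text into lowercase words and count occurrences."""
    words = re.findall(r"\w+", text.lower())
    return Counter(words).most_common(top)

sample = "and he said that he was not there and not here"
print(frequency_dictionary(sample, top=3))  # -> [('and', 2), ('he', 2), ('not', 2)]
```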

Fragments of the authors' frequency dictionaries (No., word, frequency of occurrence; the words are English renderings of the underlying Russian function words):

| No. | Leo Tolstoy, "War and Peace" | | I. Brodsky, "Urania" | | I. Brodsky, complete works | | N. Nekrasov, complete works | |
|----|------|-------|------|-------|------|-------|------|-------|
| | word | freq. | word | freq. | word | freq. | word | freq. |
| 1 | and | 10351 | at | 1037 | at | 5745 | and | 3420 |
| 3 | at | 5185 | and | 647 | and | 4500 | in | 2108 |
| 4 | not | 4292 | not | 391 | not | 3022 | not | 1726 |
| 5 | what | 3845 | at | 341 | at | 2239 | I | 1040 |
| 6 | he | 3730 | like | 329 | like | 1758 | from | 883 |
| 7 | on | 3305 | from | 237 | from | 1674 | at | 854 |
| 8 | with | 3030 | what | 168 | what | 1531 | like | 763 |
| 9 | as | 2097 | to | 148 | and | 1200 | what | 693 |
| 10 | I | 1896 | from | 147 | I | 1040 | he | 644 |
| 11 | him | 1882 | out of | 104 | to | 922 | you | 475 |
| 12 | to | 1771 | I | 90 | from | 810 | but | 472 |
| 13 | then | 1600 | where | 88 | all | 748 | a | 449 |
| 14 | she | 1564 | than | 88 | to | 744 | so | 383 |
| 15 | but | 1234 | for | 76 | you | 721 | to | 367 |
| 16 | this | 1208 | of | 74 | in | 713 | all | 344 |
| 17 | said | 1135 | but | 72 | for | 687 | for | 313 |
| 18 | was | 1125 | not | 70 | out of | 635 | I | 309 |
| 19 | so | 1032 | would | 69 | but | 617 | yes | 294 |
| 20 | the prince | 1012 | then | 67 | he | 592 | its | 275 |
| 21 | behind | 985 | you | 67 | but | 584 | then | 232 |
| 22 | but | 962 | about | 66 | then | 540 | was | 229 |
| 23 | his | 918 | but | 63 | about | 538 | to | 224 |
| 24 | everything | 908 | there are | 61 | it's | 524 | no | 223 |
| 25 | by | 895 | I | 61 | I | 489 | neither | 222 |
| 26 | her | 885 | | | a | 463 | about | 213 |
| 27 | of | 845 | | | where | 449 | their | 212 |
| 28 | | | | | than | 443 | out of | 209 |
| 29 | | | | | a | 428 | from | 207 |
| 30 | | | | | same | 422 | we | 206 |

    Conclusion: in terms of the frequency of individual words, the statistics of the Russian language have not changed much over the past hundred years; the poets' words are simply more "melodious". Incidentally, Daria Dontsova's frequency dictionary of her complete works largely coincides with Leo Tolstoy's.

    6. Several formal calculations as a conclusion.

    About 60 thousand men named Ivanov Ivan Ivanovich live in our country. Suppose, hypothetically, that an average database stores 100 tables with 10 key fields each, and that each key can take 60 thousand values; then the total number of unique key states inside the database is about 60 million. Even if only two keys get mixed up in one table, they can generate up to 20 erroneous unique states in that table, and across the whole database the erroneous unique states can run into the thousands. Agree that spending 10% of development time and 5-7% of ETL execution time to catch such trifles is an impermissible luxury?
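    The back-of-the-envelope arithmetic above can be checked directly; the figures are the hypothetical ones from the text:

```python
# The back-of-the-envelope arithmetic from the paragraph above, spelled out.
tables = 100             # tables in an average database (hypothetical)
keys_per_table = 10      # key fields per table (hypothetical)
values_per_key = 60_000  # e.g. ~60 thousand full namesakes "Ivanov Ivan Ivanovich"

total_key_states = tables * keys_per_table * values_per_key
print(total_key_states)  # -> 60000000, i.e. about 60 million unique key states
```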

    UPD1. If you are tired of dragging a separate control system along for every more or less important reference directory in your work, MDM (Master Data Management) systems will come to your aid. Naturally, we also bring such systems to market, including a version built on free software.

    UPD2. A question often asked at conferences is: "How do we build a cheaper data quality management system?" Please consider this article a small introduction to that topic, with some simplification of EDQ functionality. And yes: you can take the ODI + EDQ bundle and do it very well, but that is a subject for a further article.
