Improving Data Quality with Oracle Enterprise Data Quality

    In most Siebel implementation projects you have to deal with data quality issues one way or another. Here Oracle offers an interesting solution: Enterprise Data Quality (EDQ), which can be integrated with Siebel (this is what attracted us in the first place). In this article I will briefly describe the product and its architecture, and show how to build a simple process that improves data quality.

    Product Brief


    EDQ is a product for controlling data quality. It can analyze data from a variety of sources, such as:

    • databases (Oracle, PostgreSQL, DB2, MySQL, etc.),
    • text files,
    • XML files,
    • MS Office files,
    • system files, and more.

    EDQ lets you analyze data and find inaccuracies, gaps, and errors in it. You can correct and transform the data to improve its quality, and you can embed data quality rules into ETL tools, preventing inconsistent information from appearing in the first place. Big advantages of EDQ are the intuitive interface and the ease of changing and extending validation and transformation rules. An important EDQ feature is Dashboards, which let business users track data quality trends, as in the screenshot below:


    EDQ is a Java web application that runs in a Java Servlet engine, uses Java Web Start for its graphical user interface, and stores its data in a relational database.

    EDQ provides a number of client applications that allow you to manage the system:


    The Enterprise Data Quality Launchpad page lists the client applications, for example Director, which is responsible for designing and running data processing. Clicking its button prompts us to save a file:


    Opening the file takes us to the main design interface:


    Data snapshots, processed information, and the necessary metadata are stored on the EDQ server; client machines store only user settings. EDQ uses a repository that is stored in two database schemas: the configuration schema and the results schema.

    EDQ configuration data is stored in the configuration schema, while data snapshots (intermediate and final results of process execution) are stored in the results schema.

    What tasks does EDQ solve?


    Examples of tasks solved using EDQ:

    • deduplication and consolidation of data,
    • standardizing data to a single form (the same country name spelled in different ways, etc.),
    • identifying data entered in the wrong field,
    • address parsing,
    • splitting fields in which several attributes are encoded together.

    Connections to different data sources are used to create so-called “Staged Data”: snapshots of the data for subsequent analysis.

    EDQ offers many processors with which you can transform the original data snapshot.
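    To give an idea of what such a transformation does, here is a minimal sketch, outside of EDQ, of standardizing country names to a single spelling. The class, mapping, and sample values are invented for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// A minimal sketch (not EDQ code) of the kind of transformation a processor
// performs: standardizing different spellings of the same country to one form.
// The mapping values here are invented for illustration.
public class CountryStandardizer {

    private static final Map<String, String> CANONICAL = Map.of(
            "usa", "United States",
            "u.s.a.", "United States",
            "united states of america", "United States",
            "russia", "Russian Federation",
            "rf", "Russian Federation"
    );

    // Returns the canonical spelling, or the original value if it is unknown.
    static String standardize(String raw) {
        if (raw == null) {
            return null;
        }
        String key = raw.trim().toLowerCase();
        return CANONICAL.getOrDefault(key, raw.trim());
    }

    public static void main(String[] args) {
        List<String> input = List.of("USA", "U.S.A.", "Russia", "France");
        List<String> output = input.stream()
                .map(CountryStandardizer::standardize)
                .collect(Collectors.toList());
        System.out.println(output); // [United States, United States, Russian Federation, France]
    }
}
```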


    A common data problem is the presence of duplicates, and, as mentioned above, EDQ helps solve it. To find duplicates you can use the Duplicate Check processor; in its settings you select the field by which the records will be grouped:


    The processor lets you pass data cleared of duplicates on for further analysis:
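    Conceptually, the check boils down to grouping records by the chosen key and flagging groups that contain more than one record. Below is a rough sketch of that idea in plain Java, not EDQ code; the Contact record and sample data are made up.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// A rough illustration (outside EDQ) of what a duplicate check does:
// records are grouped by a chosen key field, and groups with more than one
// record are reported as duplicates.
public class DuplicateCheck {

    record Contact(String email, String name) {}

    public static void main(String[] args) {
        List<Contact> contacts = List.of(
                new Contact("ann@example.com", "Ann"),
                new Contact("bob@example.com", "Bob"),
                new Contact("ann@example.com", "Anna")   // duplicate key
        );

        // Group by the key field (email) and keep only groups with duplicates.
        Map<String, List<Contact>> duplicates = contacts.stream()
                .collect(Collectors.groupingBy(Contact::email))
                .entrySet().stream()
                .filter(e -> e.getValue().size() > 1)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));

        System.out.println("Duplicate keys: " + duplicates.keySet());
        // Records whose key occurs only once are the data "cleared of duplicates".
    }
}
```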


    Simple deduplication process


    You can use the Group and Merge processor to merge duplicate rows. Let's look at how it works.

    After placing the processor in the workspace and connecting it to the processor that reads the source data, open the “Group and Merge” settings.

    This processor has three sub-processors:

    • Inputs
    • Group
    • Merge

    Inputs - lets you select the fields of the source data set you want to work with. Move the desired columns using the buttons:


    From the selected attributes we will then choose the field or fields to group by.

    Group - here you select the fields by which the source data will be grouped:


    And the last sub-processor is Merge, which allows you to configure rules for merging each attribute:


    By default the merge rule is set to “Most Common Value”, but you can replace it if necessary with a more suitable rule from the list:
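    For readers without EDQ at hand, here is a minimal sketch of the Group and Merge idea in plain Java: rows are grouped by a key, and each attribute of the merged record takes the most common value within its group. The Person record and sample data are invented.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// A minimal sketch (not EDQ code) of Group and Merge with the
// "Most Common Value" rule applied to each attribute.
public class GroupAndMerge {

    record Person(String email, String city) {}

    // Picks the value that occurs most often within the group.
    static String mostCommon(List<Person> group, Function<Person, String> attr) {
        return group.stream()
                .map(attr)
                .collect(Collectors.groupingBy(v -> v, Collectors.counting()))
                .entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }

    public static void main(String[] args) {
        List<Person> rows = List.of(
                new Person("ann@example.com", "Moscow"),
                new Person("ann@example.com", "Moscow"),
                new Person("ann@example.com", "Msk"),     // minority spelling loses
                new Person("bob@example.com", "Kazan")
        );

        // Group by the key field, then merge each group into one record.
        List<Person> merged = rows.stream()
                .collect(Collectors.groupingBy(Person::email))
                .values().stream()
                .map(g -> new Person(g.get(0).email(), mostCommon(g, Person::city)))
                .collect(Collectors.toList());

        merged.forEach(System.out::println);
        // ann@example.com -> Moscow (most common value), bob@example.com -> Kazan
    }
}
```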


    After setting up and starting the processor, we see the result of its work:


    The original data set contained a different number of rows:


    These transformations are performed on the data snapshot inside EDQ and do not affect the data in the database. To get the cleaned data into the database, a few more steps are needed. From the output of the Group and Merge processor we need to create a new data snapshot, which we will then load into the database. Select the “Writer” processor:


    And click the New Staged Data button. You are then prompted to create a new snapshot based on the deduplicated data. The resulting screen should look like this:


    You can also use an existing snapshot, which will be overwritten when the process runs.

    After writing the deduplicated data, you need to create a new export that will write our changes to the database:


    In the dialog above you select the prepared data snapshot and the table it will be loaded into. By default the export deletes the existing rows and inserts the new ones, but this behavior can be configured. To do that, create a new Job.
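    On the database side, the default export mode is roughly equivalent to the following sketch: delete the current contents of the target table and insert the cleaned snapshot. The JDBC URL, credentials, and the CONTACTS table here are assumptions for illustration, not EDQ internals.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;
import java.util.Map;

// A rough sketch of what the default export mode amounts to on the database
// side: existing rows are deleted and the cleaned snapshot is inserted.
// The connection URL, credentials and CONTACTS table are assumptions.
public class DeleteAndInsertExport {

    public static void main(String[] args) throws SQLException {
        List<Map<String, String>> snapshot = List.of(
                Map.of("EMAIL", "ann@example.com", "CITY", "Moscow"),
                Map.of("EMAIL", "bob@example.com", "CITY", "Kazan")
        );

        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@//dbhost:1521/ORCL", "edq_user", "secret")) {
            conn.setAutoCommit(false);

            // Step 1: remove the existing contents of the target table.
            try (Statement st = conn.createStatement()) {
                st.executeUpdate("DELETE FROM CONTACTS");
            }

            // Step 2: insert the deduplicated snapshot.
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO CONTACTS (EMAIL, CITY) VALUES (?, ?)")) {
                for (Map<String, String> row : snapshot) {
                    ps.setString(1, row.get("EMAIL"));
                    ps.setString(2, row.get("CITY"));
                    ps.addBatch();
                }
                ps.executeBatch();
            }

            conn.commit();
        }
    }
}
```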

    Drag the created process to the workspace:


    And then the data export we created:


    Then you can configure the export. Double-clicking it opens the menu:


    After choosing the required value, we run the job and check the result in the database.

    Conclusion


    EDQ gives the user a wide range of tools, but there are, as they say, some nuances. One of them is the lack of pre-configured templates for data processing, even though most data quality problems are typical across many projects. That said, some ready-made processes can be found and integrated into your project. In fact, this is exactly what we plan to do: prepare a package of processes that can be reused on different projects.

    The phonetic algorithms included in EDQ work only with the Latin script, but you can always use transliteration to prepare your data for these algorithms.
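    As an illustration of that workaround, here is a minimal sketch that transliterates Cyrillic input to a Latin approximation and then applies Soundex. It assumes Apache Commons Codec is on the classpath; the transliteration table is deliberately tiny and simplified.

```java
import org.apache.commons.codec.language.Soundex;

import java.util.Map;

// A minimal sketch of the transliteration workaround: Cyrillic input is
// converted to a Latin approximation before a phonetic algorithm is applied.
// Assumes Apache Commons Codec on the classpath; the transliteration table
// below is deliberately tiny and simplified.
public class TransliterateThenSoundex {

    private static final Map<Character, String> CYR_TO_LAT = Map.ofEntries(
            Map.entry('и', "i"), Map.entry('в', "v"), Map.entry('а', "a"),
            Map.entry('н', "n"), Map.entry('о', "o"), Map.entry('п', "p"),
            Map.entry('е', "e"), Map.entry('т', "t"), Map.entry('р', "r")
    );

    // Replaces each known Cyrillic letter with its Latin approximation.
    static String transliterate(String cyrillic) {
        StringBuilder latin = new StringBuilder();
        for (char c : cyrillic.toLowerCase().toCharArray()) {
            latin.append(CYR_TO_LAT.getOrDefault(c, String.valueOf(c)));
        }
        return latin.toString();
    }

    public static void main(String[] args) {
        Soundex soundex = new Soundex();
        // Two spelling variants of the same surname produce the same phonetic
        // code once transliterated to Latin.
        System.out.println(soundex.encode(transliterate("Иванов"))); // I151
        System.out.println(soundex.encode(transliterate("Ивонов"))); // I151
    }
}
```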

    Another somewhat inconvenient feature we noticed is the not very transparent process of uploading processing results to the database. If a database table is used as the source, it would be more convenient if the result could immediately replace the original data with the cleaned version.

    As mentioned at the beginning, most projects that use data stores “suffer” one way or another from the quality of the information kept in them. EDQ can help not only with the deduplication discussed here, but with many other problems as well. A big plus of the product is the variety of ready-to-use processors, and dashboards with a simple interface help business users take an active part in data quality control.

    In the following articles we will look at an example of standardizing addresses against FIAS, building dashboards, and integrating with Oracle Siebel. Stay tuned.

    Anton Akinshin, developer at the Jet Infosystems business systems implementation center. Feel free to ask questions in the comments.
