Extract data or knowledge?


    I was curious how the topic of Data Mining is covered on the hub, and I found only one article devoted to it. I would like to make a small contribution to the development of this topic.

    Historically, the term Data Mining has been translated in several ways:
    • data extraction
    • knowledge extraction, data mining

    In terms of implementation methods, the first option belongs to applied work and the second to mathematics and science; as a rule, the two overlap little. In terms of possible applications, there are a great many. It so happened that I have worked with both, in scientific work at the university and in my job and freelance projects. Let's look at each in more detail.

    Data extraction

    Data extraction is the process of finding and collecting information, as well as storing it (or converting it) in different formats. In simple terms, data extraction programs are called parsers, grabbers, spiders, crawlers, and so on. Such programs make life easier for everyone, because they let you systematize data (data, not knowledge!). They can collect the addresses of companies in your industry, gather links from the forums you need, parse entire catalogs, and serve as an excellent tool for compiling databases.
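
    To make the idea of a parser concrete, here is a minimal sketch in Python that pulls every link out of a page using only the standard library. The sample HTML, the example.com addresses, and the names LinkCollector and extract_links are assumptions made for illustration; a real grabber would also download the pages and handle errors.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URLs of all <a href="..."> tags in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page address.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the list of links found in `html`, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page content; a real grabber would download it first.
SAMPLE_HTML = """
<ul>
  <li><a href="/banks/alpha">Alpha Bank</a></li>
  <li><a href="https://example.org/banks/beta">Beta Bank</a></li>
  <li><a name="no-link">Not a link</a></li>
</ul>
"""

if __name__ == "__main__":
    for link in extract_links(SAMPLE_HTML, "https://example.com/catalog"):
        print(link)
```

    The result is still only data: a flat list of addresses that can be saved to a file or a database.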

    Having done this for a long time, I can say that data mining in this sense has a great many applications. As a rule, the data is taken from open sources without violating anyone's intellectual property rights. For example:

    • compiling a list of banks of a country
    • compiling a database of schools
    • compiling a list of sites on a specific topic

    Basically, the result is a “list”, “catalog”, or “database” of whatever you need at the moment.

    In the following publications I will talk about real examples in more detail.

    Knowledge extraction

    The essence of “knowledge extraction”: we have huge amounts of data, and we need to turn them into knowledge. A real-life example: we have a lot of data on Forex currency quotes (“a lot” here means several gigabytes of text per day). The text files are data, but the statement “the fall of stock A leads to the fall of stock B” is already knowledge derived from that data. Needless to say, convenient tools for obtaining this kind of knowledge would help many a manager make decisions.
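
    As a rough illustration of the step from data to knowledge, the sketch below estimates how often a fall of A coincides with a fall of B within the same interval. The quotes are made up and the names falls and rule_confidence are hypothetical. Note that it measures co-occurrence only, which is a weaker claim than “leads to”.

```python
def falls(prices):
    """Return a list of booleans: True where the price dropped versus the previous quote."""
    return [later < earlier for earlier, later in zip(prices, prices[1:])]


def rule_confidence(prices_a, prices_b):
    """Share of A's falls that coincide with a fall of B in the same interval.

    Returns None when A never falls, because the rule cannot be evaluated.
    """
    a_falls = falls(prices_a)
    b_falls = falls(prices_b)
    a_count = sum(a_falls)
    if a_count == 0:
        return None
    both = sum(1 for a, b in zip(a_falls, b_falls) if a and b)
    return both / a_count


# Made-up quotes, for illustration only.
QUOTES_A = [10.0, 9.8, 9.9, 9.5, 9.4, 9.6]
QUOTES_B = [20.0, 19.7, 19.9, 19.6, 19.8, 19.9]

if __name__ == "__main__":
    confidence = rule_confidence(QUOTES_A, QUOTES_B)
    print(f"A falls -> B falls, confidence: {confidence:.2f}")
```

    The raw quotes are the data; the number attached to the rule is a first, very crude piece of knowledge that still needs to be checked on far more data before anyone acts on it.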

    Main categories of Data Mining:
    • data clustering (separation of objects into similar groups)
    • data classification (assignment of objects to predefined groups)
    • neural networks, genetic algorithms (universal optimizers)
    • association rules (rules of the form "if ... then ...")
    • decision trees
    • time series analysis

    I would also include regression, multivariate, and other kinds of analysis here, since they can be used to solve similar problems. Each of these categories has its own mathematical and algorithmic apparatus and lets you solve a certain range of problems.
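
    The difference between the first two categories can be shown with a small sketch. Clustering is done here with a plain k-means on single numbers, and classification with a nearest-centroid rule; both are just one simple choice among many, and the data, the starting centroids, and the function names are assumptions made for illustration.

```python
def nearest_index(value, centroids):
    """Index of the centroid closest to `value`."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))


def kmeans_1d(values, centroids, iterations=10):
    """Clustering: plain k-means on numbers, returns the final centroids.

    `centroids` holds the initial guesses; their count fixes the number of groups.
    """
    centroids = list(centroids)
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for value in values:
            groups[nearest_index(value, centroids)].append(value)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            sum(group) / len(group) if group else centroid
            for group, centroid in zip(groups, centroids)
        ]
    return centroids


def classify(value, centroids):
    """Classification: assign a new value to one of the already known groups."""
    return nearest_index(value, centroids)


# Made-up measurements that visibly form two groups.
DATA = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
CENTROIDS = kmeans_1d(DATA, centroids=[0.0, 6.0])

if __name__ == "__main__":
    print("group centers:", CENTROIDS)
    print("4.0 belongs to group", classify(4.0, CENTROIDS))
```

    Clustering discovers the groups without being told what they are, while classification only places a new object into groups that are already defined.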

    What do we have at the moment?

    To be honest, not very much, but still:

    The rest are snippets of data, examples, and code scattered across the network.

    Data Mining Source Code

    As a .NET developer, I needed examples of algorithms implemented for this platform, but in 90% of cases they were written either in C++ (mainly for Linux) or in Java. The lack of examples in C# (or VB.NET) made me write everything myself.

    Most of all I wanted to systematize what I had and what I managed to find on the Internet. This is how an open source project on CodePlex called Data Mining Source Code appeared, together with the “Data Mining Source Code Blog” as a short companion to it. The project contains sources in C#, VB.NET, Java, and JavaScript, although most of them are in C#. It also has a companion project, Numerical Methods on C#, which implements a large number of numerical methods.

    The projects are non-commercial. I simply enjoyed the work (and needed it for my university studies), which is why I publish them openly. The projects are still alive: students who need programming experience are working on them, so if you have source code lying around or want to learn the algorithms and methods, you are welcome to join and send in your ideas.

    Finally, I would like to ask how interesting this topic is to you, and which of the subjects above you would like to read about in more detail.
