sashaeve June 2, 2010 at 01:00

Data mining: what's inside

Information levels

It is hardly news that not all information is equally useful. Sometimes explaining a concept takes a lot of text, and sometimes a simple diagram is enough to explain a complex issue. Mathematical formulas, drawings, notation conventions, program code and so on were invented to reduce the redundancy of information. Beyond the information itself, its presentation also matters: stock quotes are shown more clearly as a chart, and mathematical formulas describe Newton's laws in a more compact form.

As information technology developed, along with systems for collecting and storing data (databases, data warehouses and, more recently, cloud storage), the problem of analyzing large volumes of data arose: an analyst or manager cannot process such volumes manually and make a decision. The analyst therefore needs the source information in a more compact form that the human brain can handle in an acceptable amount of time.

There are several levels of information:

initial data (raw data, historical data, or simply data) - raw arrays of data obtained by observing some dynamic system or object and reflecting its state at specific points in time (for example, stock price data for the past year);
information - processed data that has some informational value for the user; raw data presented in a more compact form (for example, search results);
knowledge - carries some know-how and reveals hidden relationships between objects that are not publicly available (otherwise it would be just information); data with high entropy (a measure of uncertainty).

Consider an example. Suppose we have data on foreign exchange transactions in the Forex market over some period of time. This data can be stored as text, in XML, in a database or in binary form, and by itself it carries no useful meaning. The analyst then loads this data into, for example, Excel and plots a chart of the changes, thus obtaining information. Next, he loads the data (fully or partially processed in Excel) into, for example, Microsoft SQL Server and, with the help of Analysis Services, obtains the knowledge that it is better to sell tomorrow. After that, the analyst can use the knowledge already gained for new assessments, which creates feedback in the information process.
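The three levels in the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rates are made-up numbers, `summarize` and `trend_signal` are hypothetical helper names, and a simple comparison of averages stands in for whatever model a real tool such as Analysis Services would build.

```python
from statistics import mean

# Level 1 (raw data): hypothetical daily closing rates for one currency pair.
raw_rates = [1.2210, 1.2235, 1.2228, 1.2260, 1.2291, 1.2275, 1.2304]

def summarize(rates):
    """Level 2 (information): a compact description of the raw series."""
    return {
        "min": min(rates),
        "max": max(rates),
        "mean": round(mean(rates), 4),
        "change": round(rates[-1] - rates[0], 4),
    }

def trend_signal(rates, window=3):
    """Level 3 (a toy stand-in for knowledge): compare the average of the
    last `window` values with the average of all earlier values."""
    recent = mean(rates[-window:])
    earlier = mean(rates[:-window])
    return "rising" if recent > earlier else "falling or flat"

info = summarize(raw_rates)
signal = trend_signal(raw_rates)
print(info)
print(signal)
```

The summary plays the role of the Excel chart, and the signal plays the role of the conclusion the analyst feeds back into later assessments.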

There are no clear boundaries between the levels, but this classification will help us avoid confusion over terminology later on.

Data mining

Historically, the term Data Mining has been translated (and understood) in several ways:

data extraction, data collection (the term Information Retrieval, or IR, is also used);
knowledge extraction, data analysis (Knowledge Discovery in Databases, or KDD; Business Intelligence).

IR works with the first two levels of information, while KDD works with the third. In terms of implementation, the first meaning belongs to the applied field, where the main goal is the data itself, and the second to mathematics and analytics, where the aim is to obtain new knowledge from a large amount of existing data. Most often, data extraction (collection) is a preparatory step for knowledge extraction (analysis).

I will take the liberty of introducing another term for the first meaning, Data Extracting, which I will use from here on.

Tasks solved by Data Mining:

Classification - assigning an input vector (object, event, observation) to one of a set of previously known classes.
Clustering - dividing the set of input vectors into groups (clusters) according to their degree of "similarity" to each other.
Description reduction - used for data visualization, for simplifying calculation and interpretation, and for compressing the volume of information collected and stored.
Association - searching for recurring patterns, for example, stable associations between items in a shopping cart.
Prediction - finding the future state of an object based on its previous states (historical data).
Deviation analysis - detecting atypical behavior; for example, detecting atypical network activity can reveal malware.
Data visualization.
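To make one of these tasks concrete, here is a minimal sketch of the association task using the shopping cart example. The baskets are invented data, and `frequent_pairs` with its `min_support` threshold is a hypothetical helper, not a reference implementation of any particular algorithm: it simply counts how often two items appear together.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets (illustrative data only).
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
]

def frequent_pairs(transactions, min_support=0.5):
    """Return item pairs that co-occur in at least `min_support`
    (a fraction between 0 and 1) of the transactions."""
    counts = Counter()
    for items in transactions:
        # Sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    total = len(transactions)
    return {
        pair: n / total
        for pair, n in counts.items()
        if n / total >= min_support
    }

pairs = frequent_pairs(baskets)
for pair, support in sorted(pairs.items()):
    print(pair, round(support, 2))
```

Counting every pair is fine for a handful of baskets; on real transaction volumes the number of candidate pairs grows quickly, which is why dedicated association-rule algorithms prune candidates instead of enumerating them all.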

Information retrieval

Information retrieval is used to obtain structured data or a smaller representative sample. According to our classification, information retrieval operates with data of the first level, and as a result provides information of the second level.

The simplest example of information retrieval is a search engine, which uses certain algorithms to return part of the information from a complete set of documents. In addition, any system that works with text data, meta-information or databases uses information retrieval tools in one way or another. Such tools include indexing, filtering and sorting methods, parsers, and so on.
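Indexing and filtering, two of the tools just mentioned, can be illustrated with a tiny inverted index. This is a sketch under assumptions: the documents are invented, `build_index` and `search` are hypothetical names, and tokenization is a plain whitespace split, far simpler than what a real search engine does.

```python
from collections import defaultdict

# Hypothetical document collection keyed by document id.
documents = {
    1: "stock quotes for the past year",
    2: "currency quotes on the Forex market",
    3: "clustering of input vectors",
}

def build_index(docs):
    """Map each lowercase word to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return sorted ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    # Copy the first posting set so the index itself is never modified.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return sorted(result)

index = build_index(documents)
print(search(index, "quotes"))
print(search(index, "forex quotes"))
```

The index is built once from the raw documents (first level), and each query then returns a smaller, relevant subset (second level), which matches the classification above.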

Text mining

Other names: text data mining, text analysis; a closely related concept is concept mining.

Text mining can work with both raw and partially processed data, but unlike information retrieval, it analyzes text using mathematical methods, which makes it possible to obtain results containing elements of knowledge.

The tasks that text mining solves include finding patterns in data, obtaining structured information, building object hierarchies, classifying and clustering data, determining the subject or area of knowledge, automatic summarization of documents, automatic content filtering, identifying semantic relationships, and others.

Text mining problems are solved using statistical methods, interpolation, approximation and extrapolation methods, fuzzy methods, and content analysis methods.
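As one example of a statistical method, the sketch below weights terms with TF-IDF to guess what each document is about. TF-IDF is my choice for illustration and is not named in the text above; the corpus is invented, and `tf_idf` and `top_terms` are hypothetical helper names.

```python
import math
from collections import Counter

# Hypothetical mini-corpus (illustrative data only).
corpus = {
    "finance": "stock price data stock market price",
    "web": "html page markup html links page",
    "mining": "data mining knowledge data patterns",
}

def tf_idf(docs):
    """Return {doc_name: {term: weight}}, where weight is the relative term
    frequency times the log of the inverse document frequency."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    doc_freq = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))  # count each term once per document
    n_docs = len(docs)
    weights = {}
    for name, words in tokenized.items():
        tf = Counter(words)
        weights[name] = {
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        }
    return weights

def top_terms(weights, name, k=2):
    """Return the k highest-weighted terms, ties broken alphabetically."""
    ranked = sorted(weights[name].items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]

weights = tf_idf(corpus)
print(top_terms(weights, "finance"))
print(top_terms(weights, "web"))
```

A word such as "data", which appears in two of the three documents, gets a lower weight than words unique to one document, so the top terms lean toward what distinguishes each text.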

Web mining

Finally, we come to web mining - a set of approaches and techniques for extracting data from web resources.
Since web sources are, as a rule, not plain text, the approach to data extraction is different in this case. First of all, keep in mind that information on the web is stored in a special markup language, HTML (there are other formats as well, such as RSS, Atom and SOAP, but more on those later). Web pages can carry additional meta-information as well as information about the structure (semantics) of the document. Each web document also sits within a certain domain, and search engine optimization (SEO) rules can be applied to it.
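A minimal sketch of what extracting data from HTML markup can look like, using only the `html.parser` module from the Python standard library. The page content and the `PageInfoParser` class are made up for illustration; real pages are messier and usually call for more robust handling.

```python
from html.parser import HTMLParser

class PageInfoParser(HTMLParser):
    """Collect the page title, <meta name=... content=...> pairs and link targets."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title> is kept.
        if self._in_title:
            self.title += data

# Hypothetical page used only for illustration.
sample_html = """
<html>
  <head>
    <title>Data mining: what's inside</title>
    <meta name="keywords" content="data mining, web mining">
  </head>
  <body>
    <p>See the <a href="/articles/next">next article</a>.</p>
  </body>
</html>
"""

parser = PageInfoParser()
parser.feed(sample_html)
print(parser.title)
print(parser.meta)
print(parser.links)
```

The title and meta tags are the meta-information mentioned above, while the collected links reflect part of the document's structure.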

This is the first article in a series on data mining / extracting / web mining. Suggestions and constructive criticism are welcome.
