
Introduction to Machine Learning
1.1 Introduction
Thanks to machine learning, the programmer does not have to write instructions that anticipate every possible situation and spell out every solution. Instead, the computer (or a separate program) is given an algorithm that finds solutions on its own by drawing on statistical data, deriving patterns from it, and making forecasts based on those patterns.
Machine learning based on data analysis dates back to the 1950s, when the first checkers-playing programs were developed. The general principle has not changed since then, but thanks to the explosive growth of computing power, the patterns and forecasts that computers produce have become far more sophisticated, and the range of problems solved with machine learning has expanded.
To start the machine learning process, you first load a dataset (a certain amount of input data) into the computer, on which the algorithm will learn to process requests. For example, these may be photographs of dogs and cats that already carry tags indicating which animal each one shows. After training, the program will be able to recognize dogs and cats in new, untagged images. The learning process continues even after forecasts have been issued: the more data the program analyzes, the more accurately it recognizes the desired images.
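As a rough illustration of this train-then-predict loop, here is a minimal Python sketch, assuming scikit-learn is available and substituting random feature vectors and labels for real cat and dog photographs:

# Minimal sketch of the train-then-predict loop described above.
# Assumptions: scikit-learn is available; real photographs are replaced
# by random feature vectors so the example stays self-contained.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))           # stand-in for image features
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in labels: 0 = cat, 1 = dog

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)   # "learning" phase
print(model.predict(X_test[:5]))                     # labels for new, untagged data
print(model.score(X_test, y_test))                   # accuracy tends to improve with more data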
Thanks to machine learning, computers learn to recognize not only faces in photographs and drawings, but also landscapes, objects, text and numbers. When it comes to text, machine learning is indispensable: grammar checking is now built into every text editor and even into phones, and it accounts not only for spelling but also for context, shades of meaning and other subtle linguistic aspects. There is already software capable of writing news articles (on economics or, say, sports) without human intervention.
1.2 Types of Machine Learning Tasks
All tasks solved using ML fall into one of the following categories.
1) The regression task is to produce a forecast from a sample of objects with various attributes. The output is a real number (2, 35, 76.454, etc.): for example, the price of an apartment, the value of a security in six months, a store's expected income for the next month, or the quality of a wine in a blind tasting.
2) The classification task is to obtain a categorical answer from a set of features. There is a finite number of possible answers (often in a "yes" or "no" format): is there a cat in the photograph, is the image a human face, does the patient have cancer?
3) The clustering task is to distribute data into groups: dividing all customers of a mobile operator by solvency level, or assigning astronomical objects to one category or another (planet, star, black hole, etc.).
4) The dimensionality-reduction task is to reduce a large number of features to a smaller one (usually 2-3) for convenient visualization or, for example, data compression.
5) The anomaly-detection task is to separate anomalies from standard cases. At first glance it resembles classification, but there is one significant difference: anomalies are rare, and the training examples from which a model could learn to identify them are either vanishingly few or absent altogether, so classification methods do not work here. A practical example is detecting fraudulent bank-card transactions; a minimal sketch follows this list.
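For the anomaly-detection case, here is a minimal sketch, assuming scikit-learn's IsolationForest and synthetic "transaction" points in place of real bank-card data:

# Anomaly-detection sketch: IsolationForest flags rare points without labels.
# Assumptions: scikit-learn is available; the "transactions" are synthetic 2-D points.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))   # ordinary card activity
outliers = rng.uniform(low=6.0, high=8.0, size=(5, 2))   # a handful of fraud-like points
X = np.vstack([normal, outliers])

detector = IsolationForest(contamination=0.01, random_state=0).fit(X)
flags = detector.predict(X)          # +1 = normal, -1 = anomaly
print(np.where(flags == -1)[0])      # indices of the suspected anomalies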
1.3 Basic types of machine learning
The bulk of tasks solved with machine learning fall into two different types: learning with a teacher (supervised learning) or without one (unsupervised learning). The "teacher" is not necessarily the programmer standing over the computer and controlling its every action; in machine-learning terms, the teacher is the human involvement in preparing the data. In both types of learning, the machine is given initial data that it must analyze to find patterns. The difference is that in supervised learning the machine is also given ready answers against which its hypotheses can be confirmed or refuted. This difference is easiest to understand with examples.
Machine learning with a teacher
Suppose we have information about ten thousand Moscow apartments: area, floor, district, presence or absence of parking at the building, distance from the metro, apartment price, and so on. We need to create a model that predicts the market value of an apartment from its parameters. This is an ideal example of supervised learning: we have the initial data (the apartments and their properties, which are called features) and a ready answer for each apartment, namely its price. The program has to solve a regression problem.
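A minimal sketch of this kind of regression, assuming scikit-learn and a handful of invented apartment records (the numbers are illustrative, not real listings):

# Regression sketch for the apartment example: predict price from features.
# Assumptions: scikit-learn is available; the data below are invented
# (columns: area m^2, floor, metro distance km, parking 0/1).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[54, 3, 1.2, 1],
              [38, 9, 0.4, 0],
              [75, 5, 2.5, 1],
              [41, 1, 3.0, 0],
              [63, 7, 0.8, 1]])
prices = np.array([9.5, 8.2, 12.1, 6.4, 11.0])     # millions of rubles, invented

model = LinearRegression().fit(X, prices)          # the "teacher" is the known prices
new_flat = np.array([[50, 4, 1.0, 1]])
print(model.predict(new_flat))                     # estimated market value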
Other practical examples: confirming or ruling out cancer in a patient given all of their medical indicators, or determining whether an incoming message is spam by analyzing its text. These are classification tasks.
Machine learning without a teacher
Unsupervised learning, where the system is given no ready-made "right answers", is even more interesting. For example, we have the height and weight of a certain number of people, and these data need to be divided into three groups, for each of which shirts of a suitable size will be sewn. This is a clustering task: all the data must be split into 3 clusters (although, as a rule, there is no single strictly correct split).
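A minimal clustering sketch for the shirt-size example, assuming scikit-learn's KMeans and randomly generated heights and weights:

# Clustering sketch: split people into 3 groups by height and weight.
# Assumptions: scikit-learn is available; the measurements are synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
height = rng.normal(172, 8, size=100)              # cm
weight = rng.normal(70, 10, size=100)              # kg
X = np.column_stack([height, weight])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])         # group (shirt size) assigned to each person
print(kmeans.cluster_centers_)     # "typical" height/weight of each group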
In another situation, each object in the sample has hundreds of different features, and the main difficulty is displaying such a sample graphically. The number of features is therefore reduced to two or three, making it possible to visualize the objects on a plane or in 3D. This is the dimensionality-reduction task.
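A minimal dimensionality-reduction sketch, assuming scikit-learn's PCA and a random matrix standing in for a sample with hundreds of features:

# Dimensionality-reduction sketch: project many features down to 2
# so the sample can be plotted. Assumptions: scikit-learn; random data.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 200))          # 300 objects, 200 features each

X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)                        # (300, 2): ready for a scatter plot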
1.4 Basic algorithms of machine learning models
1. Decision tree
This is a decision-support method based on a tree-shaped graph: a model of decisions that takes into account their potential consequences (including the probability of an event occurring), efficiency, and resource consumption.
For business processes, such a tree consists of a minimal number of questions requiring a definite "yes" or "no" answer. By answering all of these questions in turn, we arrive at the right choice. The methodological advantages of a decision tree are that it structures and systematizes the problem, and the final decision is made on the basis of logical conclusions.
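A minimal decision-tree sketch, assuming scikit-learn; the tiny yes/no "approve a deal" dataset is invented purely for illustration:

# Decision-tree sketch: a chain of yes/no questions leading to a decision.
# Assumptions: scikit-learn is available; the tiny dataset is invented.
from sklearn.tree import DecisionTreeClassifier, export_text

# features: [budget_ok, deadline_ok, team_available]  (1 = yes, 0 = no)
X = [[1, 1, 1], [1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 0]]
y = [1, 0, 0, 1, 0]                       # 1 = go ahead, 0 = decline

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["budget_ok", "deadline_ok", "team_available"]))
print(tree.predict([[1, 1, 1]]))          # decision for a new case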
2. Naive Bayesian classification
Naive Bayes classifiers belong to the family of simple probabilistic classifiers and are based on Bayes' theorem, applied here under the assumption that the features are independent (the so-called strict, or naive, assumption). In practice they are used in the following areas of machine learning (a sketch follows the list below):
- detection of spam emails;
- automatic linking of news articles to thematic sections;
- identification of the emotional coloring of a text;
- recognition of faces and other patterns in images.
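A minimal naive Bayes sketch for spam detection, assuming scikit-learn and a toy hand-made corpus:

# Naive Bayes sketch: word counts as (assumed independent) features.
# Assumptions: scikit-learn is available; the tiny corpus below is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = ["win money now", "cheap pills win prize",
          "meeting at noon", "project report attached"]
labels = [1, 1, 0, 0]                     # 1 = spam, 0 = ham

model = make_pipeline(CountVectorizer(), MultinomialNB()).fit(emails, labels)
print(model.predict(["win a cheap prize", "see report before the meeting"]))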
3. Least squares method
Anyone who has studied even a little statistics is familiar with the concept of linear regression, and least squares is one of the ways to implement it. Typically, linear regression solves the problem of fitting a straight line through a set of points. With the least-squares method this is done as follows: draw a straight line, measure the vertical distance from it to each point (connecting the points to the line with vertical segments), square these distances and add them up. The line for which this sum of squared distances is smallest is the desired one (it is the natural fit when the points deviate from the true values with a normally distributed error).
In machine learning, a linear function is often used to fit the data, and the least-squares method serves to minimize the error by providing an error metric.
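A minimal least-squares sketch, assuming only NumPy; a noisy straight line is generated and then recovered by minimizing the sum of squared vertical distances:

# Least-squares sketch: fit the line that minimizes the sum of squared
# vertical distances to the points. Assumptions: NumPy only; noisy synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=x.size)   # true line plus noise

slope, intercept = np.polyfit(x, y, deg=1)    # ordinary least squares for a line
residuals = y - (slope * x + intercept)
print(slope, intercept)                       # close to 2.0 and 1.0
print(np.sum(residuals ** 2))                 # the minimized error metric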
4. Logistic regression
Logistic regression is a way of determining the relationship between variables, one of which is a categorical dependent variable and the others are independent variables. It uses the logistic function (the cumulative logistic distribution). The practical value of logistic regression is that it is a powerful statistical method for predicting events from one or more independent variables. It is in demand in situations such as the following (a sketch follows the list):
- credit scoring;
- measuring the success of ongoing advertising campaigns;
- profit forecast for a certain product;
- assessment of the probability of an earthquake on a specific date.
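A minimal logistic-regression sketch, assuming scikit-learn; the "advertising campaign" data are invented for illustration:

# Logistic-regression sketch: probability of a binary event from independent
# variables. Assumptions: scikit-learn; the campaign data are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

# features: [ad budget (k$), audience size (k people)]
X = np.array([[5, 10], [20, 40], [3, 8], [35, 60], [12, 25], [1, 2]])
y = np.array([0, 1, 0, 1, 1, 0])                 # 1 = campaign succeeded

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[15, 30]])[:, 1])     # estimated success probability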
5. The support vector machine (SVM)
This is a whole family of algorithms for solving classification and regression problems. Given that each object in N-dimensional space belongs to one of two classes, the support vector machine constructs a hyperplane of dimension (N - 1) that separates the objects into the two groups. On paper this can be pictured as points of two different types that can be separated by a straight line. Beyond simply separating the points, the method chooses the hyperplane so that it lies as far as possible from the nearest point of each group.
SVM and its modifications help solve such complex machine learning tasks as DNA splicing, determining a person’s gender from photographs, and displaying advertising banners on websites.
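A minimal SVM sketch, assuming scikit-learn; two small linearly separable groups of points stand in for real data:

# SVM sketch: a linear support vector classifier looks for the separating
# hyperplane with the largest margin. Assumptions: scikit-learn; toy 2-D data.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2],         # class 0
              [5, 5], [6, 5], [5, 6]])        # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)
print(clf.support_vectors_)                   # the points closest to the boundary
print(clf.predict([[3, 3], [6, 6]]))          # side of the hyperplane for new points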
6. The method of ensembles
It is based on machine-learning algorithms that generate many classifiers and then classify new data by averaging their outputs or taking a vote. Initially, the ensemble method was a special case of Bayesian averaging, but it has since grown more complex and acquired additional techniques:
- boosting converts weak models into a strong one by building an ensemble of classifiers sequentially, with each new classifier focusing on the errors of the previous ones;
- bagging (bootstrap aggregating) trains the base classifiers in parallel on random subsamples of the data and then aggregates their results;
- error-correcting output coding.
The ensemble method is a more powerful tool than stand-alone forecasting models (a comparison sketch follows this list) because:
- it minimizes the influence of random chance by averaging out the errors of the individual base classifiers;
- it reduces variance, since several different models built on different hypotheses are more likely to arrive at the correct result than any one of them taken separately;
- it avoids being confined to the original hypothesis set: if the best combined hypothesis lies outside the set of base hypotheses, then forming the combined hypothesis effectively expands that set to include it.
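A minimal comparison sketch, assuming scikit-learn: a single decision tree versus a bagging ensemble (random forest) and a boosting ensemble (gradient boosting) on synthetic data:

# Ensemble sketch: bagging and boosting compared with a single tree.
# Assumptions: scikit-learn is available; the data are synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for name, model in [("single tree", DecisionTreeClassifier(random_state=0)),
                    ("bagging (random forest)", RandomForestClassifier(random_state=0)),
                    ("boosting (gradient boosting)", GradientBoostingClassifier(random_state=0))]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")             # ensembles usually score higher and vary less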
7. Clustering Algorithms
Clustering is the distribution of a set of objects into categories so that each category, called a cluster, contains the most similar elements.
You can cluster objects using different algorithms. The most commonly used are:
- centroid-based (using the center of gravity of each cluster, as in k-means);
- connectivity-based (hierarchical);
- based on dimensionality reduction;
- density-based (spatial clustering);
- probabilistic;
- based on machine learning, including neural networks.
Clustering algorithms are used in biology (studying the interactions of genes in genomes of up to several thousand elements), sociology (processing the results of sociological surveys with Ward's method, which yields clusters of minimal variance and roughly equal size) and information technology.
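A minimal density-based clustering sketch, assuming scikit-learn's DBSCAN and its synthetic two-moons dataset:

# Density-based clustering sketch: DBSCAN groups points that lie in dense
# regions and marks sparse points as noise. Assumptions: scikit-learn.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)   # two curved clusters
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(set(labels))                     # cluster ids; -1 would mean "noise"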
8. Principal Component Analysis (PCA)
Principal component analysis, or PCA, is a statistical orthogonal-transformation procedure that converts observations of possibly interrelated variables into a set of principal components: values that are not linearly correlated.
In practice, PCA is used for visualization and for most procedures that compress, simplify or reduce data in order to ease the learning process. However, PCA is poorly suited to situations where the initial data are weakly structured (that is, all components have similarly high variance), so its applicability depends on how well the subject area has been studied and described.
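A minimal sketch of this applicability check, assuming scikit-learn: the explained-variance ratio shows whether a few components capture most of the variation:

# PCA applicability sketch: a few dominant components suggest PCA will help.
# Assumptions: scikit-learn; the correlated data are synthetic.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 3))
X = base @ rng.normal(size=(3, 50)) + 0.1 * rng.normal(size=(300, 50))  # 50 correlated features

pca = PCA().fit(X)
print(np.round(pca.explained_variance_ratio_[:5], 3))  # a few components dominate here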
9. Singular value decomposition (SVD)
In linear algebra, the singular value decomposition, or SVD, is a factorization of a rectangular matrix of complex or real numbers. A matrix M of size m × n can be decomposed as M = UΣV*, where U and V are unitary matrices, V* is the conjugate transpose of V, and Σ is a diagonal matrix of non-negative singular values.
One special case of the singular value decomposition is principal component analysis. The very first computer-vision technologies were built on SVD and PCA and worked as follows: first, faces (or other patterns to be found) were represented as a sum of basis components, then their dimensionality was reduced, and finally they were compared with images from the sample. Modern SVD algorithms in machine learning are, of course, far more complex and sophisticated than their predecessors, but their essence has not changed.
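A minimal SVD sketch, assuming only NumPy; it verifies the factorization and shows the kind of low-rank approximation used by early recognition pipelines:

# SVD sketch: decompose a rectangular matrix as M = U @ diag(s) @ Vt and
# verify the reconstruction. Assumptions: NumPy only; M is random.
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(M, M_rebuilt))        # True: the factorization is exact

# Keeping only the largest singular values gives a low-rank approximation,
# which is one way to compress images before comparing them.
k = 2
M_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.linalg.norm(M - M_approx))     # approximation error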
10. Independent Component Analysis (ICA)
This is a statistical method that reveals hidden factors influencing random variables, signals and so on. ICA defines a generative model for multi-factor data: the observed variables are mixtures of hidden variables, with no information about the mixing rules. These hidden variables are the independent components of the sample and are assumed to be non-Gaussian signals.
Unlike the related principal component analysis, independent component analysis is more effective, especially in cases where classical approaches are powerless. It uncovers the hidden causes of phenomena and has therefore found wide application in fields ranging from astronomy and medicine to speech recognition, automated testing and the analysis of financial indicators.
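A minimal ICA sketch, assuming scikit-learn's FastICA; two synthetic non-Gaussian signals are mixed and then recovered:

# ICA sketch: recover two independent source signals from their mixtures.
# Assumptions: scikit-learn and NumPy; the signals are synthetic.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
sources = np.column_stack([np.sin(2 * t),                  # non-Gaussian source 1
                           np.sign(np.cos(3 * t))])        # non-Gaussian source 2
mixing = np.array([[1.0, 0.5], [0.4, 1.2]])                # unknown in practice
observed = sources @ mixing.T

recovered = FastICA(n_components=2, random_state=0).fit_transform(observed)
print(recovered.shape)          # (2000, 2): estimates of the hidden sources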
1.5 Real-life examples
Example 1. Diagnosis of diseases
Here the patients are the objects, and the features are their symptoms, medical history, test results and treatments already administered (in effect, the entire case history, formalized and broken down into separate criteria). Some features, such as gender or the presence or absence of a headache, cough or rash, are binary. The assessment of the condition's severity (extremely severe, moderate, etc.) is an ordinal feature, and many others are quantitative: drug dosage, blood hemoglobin level, blood pressure and pulse, age, weight. Having collected this information about the patient's condition, you can load it into a computer and, using a machine-learning program, solve the following problems (a feature-encoding sketch follows the list):
- conduct differential diagnosis (determining the type of disease);
- choose the optimal treatment strategy;
- predict the development of the disease, its duration and outcome;
- calculate the risk of possible complications;
- identify syndromes - sets of symptoms associated with a given disease or disorder.
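A minimal feature-encoding sketch for such a record; the field names and the ordinal scale below are invented for illustration:

# Feature-encoding sketch: binary, ordinal and quantitative signs packed
# into one numeric vector. Assumptions: all names and codes are invented.
patient = {
    "sex": "female",              # binary
    "headache": True,             # binary
    "severity": "moderate",       # ordinal: mild < moderate < severe
    "hemoglobin_g_l": 128.0,      # quantitative
    "age_years": 54,              # quantitative
}

severity_scale = {"mild": 0, "moderate": 1, "severe": 2}
features = [
    1.0 if patient["sex"] == "female" else 0.0,
    1.0 if patient["headache"] else 0.0,
    float(severity_scale[patient["severity"]]),
    patient["hemoglobin_g_l"],
    float(patient["age_years"]),
]
print(features)    # a row ready for any of the algorithms above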
No doctor is able to process the entire array of information for each patient instantly, summarize a large number of other similar medical records and immediately give a clear result. Therefore, machine learning is becoming an indispensable tool for doctors.
Example 2. Searching for mineral deposits.
The features here are the data obtained by geological exploration: the presence of particular rocks in the area (a binary feature) and their physical and chemical properties (which break down into a number of quantitative and qualitative features).
The training sample contains two kinds of precedents: areas where mineral deposits are definitely present, and areas with similar characteristics where no deposits were found. The search for rare minerals has its own specifics: in many cases the number of features greatly exceeds the number of objects, and traditional statistical methods are poorly suited to such situations. Machine learning therefore focuses on detecting patterns in the data that have already been collected, identifying small, highly informative subsets of features that best answer the research question: is the mineral in question present in the given area or not? One can draw an analogy with medicine: deposits, too, reveal their own "syndromes".
Example 3. Assessing the reliability and solvency of loan applicants.
All banks that issue loans face this task daily. The need to automate the process became clear as far back as the 1960s and 1970s, when the credit-card boom began in the United States and other countries.
Those requesting a loan from a bank are the objects, but the features differ depending on whether the applicant is an individual or a legal entity. The description of a private individual applying for a loan is formed from the questionnaire they fill out, supplemented with other information about the potential client that the bank obtains through its own channels. Some of these are binary features (gender, availability of a phone number), others are ordinal (education, position); most are quantitative (loan amount, total debt to other banks, age, number of family members, income, length of employment) or nominal (name, name of the employer, profession, address).
For machine learning, a sample is drawn up of borrowers whose credit histories are known. All borrowers are divided into classes; in the simplest case there are two, "good" borrowers and "bad" ones, and a positive lending decision is made only in favor of the "good" ones.
A more sophisticated approach, called credit scoring, awards each borrower conditional points for each attribute, and the lending decision depends on the total number of points scored. During training, a credit-scoring system first assigns a certain number of points to each characteristic and then determines the conditions for issuing the loan (term, interest rate and other parameters recorded in the loan agreement). There is also another learning approach, based on precedents.
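A minimal scorecard sketch, assuming scikit-learn; the borrower data and the point scaling are invented for illustration:

# Credit-scoring sketch: logistic-regression coefficients turned into points
# per attribute, as a simple scorecard. Assumptions: all data are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

# features: [income (k$/year), existing debt (k$), years employed]
X = np.array([[60, 5, 8], [25, 20, 1], [90, 2, 12], [30, 15, 2],
              [45, 8, 5], [20, 25, 1], [70, 3, 10], [35, 18, 3]])
y = np.array([1, 0, 1, 0, 1, 0, 1, 0])        # 1 = "good" borrower, repaid on time

model = LogisticRegression(max_iter=1000).fit(X, y)
points_per_unit = 20 * model.coef_[0]          # arbitrary scaling into "points"
applicant = np.array([50, 10, 4])
score = float(points_per_unit @ applicant + 20 * model.intercept_[0])
print(score, model.predict_proba([applicant])[:, 1])  # score and repayment probability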
P.S. In the following articles, we will examine the algorithms for building machine-learning models in more detail, including the mathematics and their implementation in Python.