Open webinar "Naive Bayes Classifier"


    As part of our Data Scientist course, we held an open lesson on the topic "Naive Bayes Classifier". The lesson was taught by course instructor Maxim Kretov, a leading researcher at the Laboratory of Neural Networks and Deep Learning (MIPT). Below you will find the video and a summary.

    Thank you in advance.


    Imagine that you have a thousand real estate properties (houses). As a rule, each of them can be described by a specific set of features, for example:

    • the area of the house;
    • the amount of time elapsed since the last repair;
    • distance from the nearest public transport stop.

    Thus, each house can be represented as a vector x of dimension 3. For example, x = (150; 5; 600), where 150 is the area of the house in square meters, 5 is the number of years since the last repair, and 600 is the distance to the nearest stop in meters. The price at which the house can be sold on the market is denoted by y.

    As a result, we have a set of vectors, and each object has a corresponding target variable y. The price is exactly the quantity that machine learning lets you learn to predict.
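
    The representation above can be sketched in a few lines of Python with NumPy. This is only an illustration: the first row is the house from the text, while the other rows and all prices are invented values, and the names X and y are the usual convention rather than anything shown in the webinar.

```python
import numpy as np

# Each row is one house:
# (area in square meters, years since the last repair, distance to the stop in meters).
# Only the first row comes from the text; the rest is made up for illustration.
X = np.array([
    [150.0, 5.0, 600.0],
    [80.0, 12.0, 250.0],
    [210.0, 1.0, 1500.0],
])

# Hypothetical sale prices (arbitrary units), one per house.
y = np.array([12.5, 6.1, 15.0])

n_objects, n_features = X.shape
print(n_objects, n_features)
```

    With a thousand houses, X would simply have a thousand rows, and a model would be trained to map each row to its price.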

    Basic classification of machine learning methods

    The example above is quite typical and belongs to supervised learning (there is a target variable). If there is no target variable, we speak of unsupervised learning. These are the two main and most common types of machine learning. Supervised learning tasks, in turn, fall into two groups:

    1. Classification. The target variable is one of C classes, i.e., each object is assigned a class label (cottage, garden house, outbuilding, etc.).
    2. Regression. The target variable is a real number.

    What problems does machine learning solve?

    Today, machine learning methods are used to solve the following tasks:

    1. Syntactic tasks:

    • part-of-speech and morphological tagging;
    • splitting the words in a text into morphemes (prefix, suffix, etc.);
    • finding names and titles in a text (named entity recognition);
    • resolving the meaning of a word in a given context (a typical example is the Russian word «замок», which can mean either "castle" or "lock").

    2. Text understanding tasks that have a "teacher" (supervised):

    • machine translation;
    • conversational models (chat bots).

    3. Other tasks (image description, speech recognition, etc.).

    The difficulties of working with text

    From a machine learning point of view, working with text always involves certain difficulties. To see this, it is enough to consider two sentences:

    • Mom washed the frame, and now it shines;
    • Mom washed the frame, and now she is tired.

    In the Russian original, both sentences use the same pronoun, which can refer either to Mom or to the frame. A classifier without common sense finds "the frame shines" and "the frame is tired" equally plausible, since in both sentences the word "frame" is syntactically closer to the pronoun.

    Practical task

    After this general overview of machine learning, the teacher moved on to the practical task of the webinar: classifying emails as spam or legitimate.

    First, an example showed how to convert input text into a vector of numbers. To do this:

    • a vocabulary of size K was fixed;
    • each word in the text was represented as a vector of length K with a single 1 at that word's position in the vocabulary: (0, 0, 0, ... 0, 1, 0, ... 0).

    This approach is called one-hot encoding, and the words in this context are called tokens.
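
    A minimal sketch of one-hot encoding in Python may help here. The five-word vocabulary and the function name one_hot are hypothetical and chosen only for illustration; the webinar's actual vocabulary is not given in this summary.

```python
def one_hot(token, vocabulary):
    """Return a list of length K with a single 1 at the token's position."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    vector[index[token]] = 1
    return vector


# Hypothetical vocabulary, K = 5.
vocabulary = ["buy", "cheap", "hello", "meeting", "now"]

encoded = one_hot("hello", vocabulary)
print(encoded)  # [0, 0, 1, 0, 0]
```

    A token that is missing from the vocabulary raises a KeyError in this sketch; a real pipeline would need an explicit rule for unknown words.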

    At this stage of data processing, a vocabulary was built and word counts were computed for each text. As a result, each text was mapped to a vector of fixed length. A simpler approach, a boolean mask that records only whether each word is present, was also considered.
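
    The two text representations mentioned here, word counts and a boolean mask, can be sketched as follows. The vocabulary, the sample text, the whitespace tokenization, and the function names are assumptions made for this example, not details from the webinar.

```python
from collections import Counter


def count_vector(text, vocabulary):
    """Count how many times each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def boolean_mask(text, vocabulary):
    """Record only whether each vocabulary word occurs at least once."""
    return [1 if c > 0 else 0 for c in count_vector(text, vocabulary)]


# Hypothetical vocabulary and email text.
vocabulary = ["buy", "cheap", "hello", "meeting", "now"]
text = "Buy now buy cheap"

counts = count_vector(text, vocabulary)
mask = boolean_mask(text, vocabulary)
print(counts)  # [2, 1, 0, 0, 1]
print(mask)    # [1, 1, 0, 0, 1]
```

    Both vectors have the same length as the vocabulary regardless of how long the text is, which is what makes them usable as classifier input.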

    Introducing the Bayes classifier

    The naive Bayes classifier is based on applying Bayes' theorem with strong (naive) independence assumptions. Its advantage is that it needs only a small amount of training data to estimate the parameters required for classification.

    Applied to the task of classifying emails, the main idea was as follows:

    • all words in the text are treated as independent of each other;
    • if some words occur in spam more often than in legitimate emails, these words are treated as signs that the email is spam.
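
    In formula form, the idea can be written roughly as below. The notation is an assumption made for this summary (c is a class such as spam, and w_1, ..., w_n are the words of the email); the exact formulas shown in the webinar may differ.

```latex
% Bayes' theorem, followed by the naive assumption that words are
% conditionally independent given the class c:
\[
P(c \mid w_1, \dots, w_n)
  = \frac{P(c)\, P(w_1, \dots, w_n \mid c)}{P(w_1, \dots, w_n)}
  \;\propto\; P(c) \prod_{i=1}^{n} P(w_i \mid c)
\]

% The predicted class maximizes this product; logarithms are used
% in practice to avoid numerical underflow:
\[
\hat{c} = \arg\max_{c} \Bigl[ \log P(c) + \sum_{i=1}^{n} \log P(w_i \mid c) \Bigr]
\]
```

    The denominator does not depend on the class, so it can be dropped when comparing classes.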

    Using Bayes' theorem, the corresponding formulas for several variables were written out, and the details of computing with the additional assumptions were discussed. The parameter estimation was shown in pseudocode, followed by a detailed toy example in which the prior probabilities and the class membership probabilities for a new object x were calculated. The final stage of the practical work was building and training the model and measuring its quality.
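
    The steps described above can be sketched as a small multinomial naive Bayes classifier in Python. This is not the code from the webinar: the four training emails are invented, the function names are hypothetical, and add-one (Laplace) smoothing with the parameter alpha is an assumption introduced here so that unseen words do not produce zero probabilities.

```python
import math
from collections import Counter, defaultdict


def train(texts, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods from labeled texts."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        word_counts[label].update(text.lower().split())

    vocabulary = sorted({w for c in word_counts.values() for w in c})
    log_prior = {c: math.log(n / len(labels)) for c, n in class_counts.items()}

    log_likelihood = {}
    for c in class_counts:
        # Assumed Laplace smoothing: add alpha to every word count.
        total = sum(word_counts[c].values()) + alpha * len(vocabulary)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocabulary
        }
    return log_prior, log_likelihood


def predict(text, log_prior, log_likelihood):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c in log_prior:
        score = log_prior[c]
        for w in text.lower().split():
            if w in log_likelihood[c]:  # words outside the vocabulary are ignored
                score += log_likelihood[c][w]
        scores[c] = score
    return max(scores, key=scores.get)


# Invented toy data, for illustration only.
texts = [
    "buy cheap pills now",
    "cheap offer buy now",
    "meeting at noon",
    "lunch meeting tomorrow",
]
labels = ["spam", "spam", "ham", "ham"]

log_prior, log_likelihood = train(texts, labels)
prediction = predict("buy cheap now", log_prior, log_likelihood)
print(prediction)  # spam
```

    Quality should be measured on emails that were not used for training, for example as the share of correctly classified emails in a held-out set.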


    As always, we welcome your questions and comments here, or you can ask the teacher directly by attending the open day.
