What Professor Ng Didn't Teach Us

    As the discussions on Habr show, dozens of Habr users took the ml-class.org course from Stanford University taught by the charming Professor Andrew Ng. I also enjoyed listening to this course. Unfortunately, a very interesting topic announced in the syllabus was dropped from the lectures: combining supervised and unsupervised learning. As it turns out, Professor Ng has published an excellent course on exactly this topic, Unsupervised Feature Learning and Deep Learning. I offer a brief summary of that course, without rigorous exposition or an abundance of formulas. The original has all of that.

    My own insets are set in italics; they are not part of the original text, but I could not resist including my own comments and thoughts. I apologize to the author for shamelessly using illustrations from the original. I also apologize for translating some terms from English rather literally (for example, "sparse autoencoder"). We Stanford folks don't know Russian terminology very well :)

    Sparse Autoencoder


    The most commonly used feedforward neural networks are designed for supervised learning and are used, for example, for classification. An example of the topology of such a network is shown in the figure below.



    Such a network is usually trained with error backpropagation so as to minimize the mean squared error of the network's output over the training set. The training set therefore consists of pairs of feature vectors (inputs) and target vectors (labels): {(x, y)}.

    Now imagine that we have no labeled data, just a set of feature vectors {x}. An autoencoder is an unsupervised learning algorithm that uses a neural network and backpropagation to make the network's response to an input feature vector equal to that vector itself, i.e. y = x. An example autoencoder:



    The autoencoder tries to learn the function h(x) = x. In other words, it looks for an approximation of the identity function, so that the network's response is approximately equal to the input features. For the solution of this problem to be non-trivial, special constraints are imposed on the network topology:

    • The number of neurons in the hidden layer must be less than the dimension of the input data (as in the figure), or
    • The activation of neurons in the hidden layer must be sparse.

    The first constraint forces the network to compress the data as it passes the input signal through to the output. For example, if the input vector is the set of brightness levels of a 10x10-pixel image (100 features in total), and the number of hidden-layer neurons is 50, the network is forced to learn to compress the image. Indeed, the requirement h(x) = x means that, from the activation levels of the fifty hidden neurons, the output layer must reconstruct the 100 pixels of the original image. Such compression is possible if the data has hidden dependencies, correlated features, or some kind of structure in general. In this form, the autoencoder works much like principal component analysis (PCA), in the sense that it reduces the dimensionality of the input data.
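    As an illustration of this setup, here is a minimal sketch (my own, not from the course) of such a compression autoencoder with 100 inputs, 50 hidden neurons, and 100 outputs, trained by backpropagation to minimize the mean squared reconstruction error. The random training data and all variable names are assumptions made for the example; in practice the input would be real 10x10 image patches.

# Minimal sketch of a 100 -> 50 -> 100 autoencoder trained with plain
# full-batch gradient descent on the mean squared reconstruction error.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical data: 1000 random "10x10 patches" flattened to 100-dimensional vectors.
X = rng.random((1000, 100))

n_in, n_hidden = 100, 50
W1 = rng.normal(0, 0.1, (n_in, n_hidden));  b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.1, (n_hidden, n_in));  b2 = np.zeros(n_in)

lr = 0.5
for epoch in range(200):
    # Forward pass: encode to 50 activations, then decode back to 100 values.
    A = sigmoid(X @ W1 + b1)          # hidden activations
    Y = sigmoid(A @ W2 + b2)          # reconstruction, should approximate X
    err = Y - X                       # we want h(x) = x, so the "label" is x itself

    # Backpropagation of the squared reconstruction error.
    dZ2 = err * Y * (1 - Y)
    dZ1 = (dZ2 @ W2.T) * A * (1 - A)
    W2 -= lr * A.T @ dZ2 / len(X);  b2 -= lr * dZ2.mean(axis=0)
    W1 -= lr * X.T @ dZ1 / len(X);  b1 -= lr * dZ1.mean(axis=0)

print("reconstruction MSE:", np.mean(err ** 2))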

    The second constraint, the requirement that hidden-layer activations be sparse, lets you obtain non-trivial results even when the number of hidden neurons exceeds the dimensionality of the input data. Informally, a neuron is considered active when the value of its transfer function is close to 1; if a sigmoid transfer function is used, an inactive neuron's value should be close to 0 (for the hyperbolic tangent, close to -1). Sparse activation means that the number of inactive neurons in the hidden layer significantly exceeds the number of active ones.

    If we compute the value p as the average activation of the hidden-layer neurons over the training set, we can add an extra penalty term to the objective function used when training the network by gradient descent with backpropagation. The formulas are in the original lectures; the meaning of the penalty is similar to regularization when fitting regression coefficients: the error function grows sharply if p differs from a predefined sparsity parameter. For example, we may require that the average activation over the training set be 0.05.
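    The concrete form of the penalty used in the UFLDL notes is, as far as I can tell, the Kullback-Leibler divergence between the desired average activation and the one observed on the training set. A small sketch of it follows; the weighting coefficient beta, the function name, and the variable names are assumptions made for the example.

# Sketch of the KL-divergence sparsity penalty and of the extra term it
# contributes to the hidden layer's backpropagated error signal.
import numpy as np

def sparsity_penalty(hidden_activations, rho=0.05, beta=3.0):
    """hidden_activations: array of shape (n_samples, n_hidden) with sigmoid outputs."""
    rho_hat = hidden_activations.mean(axis=0)   # average activation of each hidden neuron
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    penalty = beta * kl.sum()                   # added to the reconstruction error
    # Term added to each hidden neuron's error signal (before multiplying by
    # the derivative of the activation function) during backpropagation.
    delta_extra = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat))
    return penalty, delta_extra

# Example use with the hidden activations A from the sketch above:
# penalty, delta_extra = sparsity_penalty(A, rho=0.05)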

    The requirement of sparse activation of hidden-layer neurons has a striking biological analogy. Jeff Hawkins, the author of an original theory of how the brain is structured, notes the fundamental importance of inhibitory connections between neurons (see the Russian-language text on Hierarchical Temporal Memory (HTM) and its cortical learning algorithms). In the brain, there are many "horizontal connections" between neurons located in the same layer. Although neurons in the cerebral cortex are very densely interconnected, numerous inhibitory neurons guarantee that only a small percentage of all neurons are active at any given moment. In other words, information in the brain is always represented by only a small number of the available neurons. This, apparently, is what allows the brain to generalize, for example, to perceive a car from any angle as still a car.

    Visualization of hidden layer functions


    Having trained the autoencoder on an unlabeled dataset, we can try to visualize the functions it has approximated. The visualization is especially clear for the example above of training the encoder on 10x10-pixel images. We ask: "What combination of inputs x causes the maximum activation of hidden neuron number i?" In other words, what features of the input data is each hidden neuron looking for?

    A non-trivial answer to this question is given in the lecture; here we restrict ourselves to the illustration obtained by visualizing the functions of a network with a hidden layer of 100 neurons.
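    In short, under a unit-norm constraint on the input, the image that maximally excites hidden neuron i turns out to be proportional to that neuron's weight vector. Below is a sketch of how such pictures could be produced from the weight matrix W1 of the earlier example; the helper name and the plotting snippet are my own assumptions.

# Sketch: the input image that maximally activates hidden neuron i is the
# neuron's normalized weight vector, reshaped back into a 10x10 patch.
import numpy as np
import matplotlib.pyplot as plt

def max_activation_image(W1, i, shape=(10, 10)):
    """W1: weight matrix of shape (n_inputs, n_hidden), as in the earlier sketch."""
    w = W1[:, i]
    x = w / np.linalg.norm(w)       # x_j = W_ij / sqrt(sum_j W_ij^2)
    return x.reshape(shape)

# Plot a grid of the images that excite each hidden neuron, as in the illustration:
# fig, axes = plt.subplots(5, 10, figsize=(10, 5))
# for i, ax in enumerate(axes.flat):
#     ax.imshow(max_activation_image(W1, i), cmap="gray")
#     ax.axis("off")
# plt.show()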



    Each square fragment is an input image x that maximally activates one of the hidden neurons. Since the network was trained on examples of natural images (for example, patches of nature photos), the hidden-layer neurons learned, on their own, edge detectors at various orientations!

    In my opinion, this is a very impressive result. Simply by observing a large number of diverse images, the neural network independently built a structure similar to the biological structures in the brains of humans and animals. As the following illustration from the excellent book "From Neuron to Brain" by J. Nicholls et al. shows, this is exactly how the lower visual areas of the brain are organized:



    The figure shows the responses of complex cells in the cat's striate cortex to visual stimuli. The cell responds best (with the most output spikes) to a vertical edge (see the first fragment); the response to a horizontal edge is practically absent (see the third fragment). A complex cell of the striate cortex roughly corresponds to a trained neuron in the hidden layer of our artificial neural network.

    The whole set of hidden-layer neurons learned to detect edges (boundaries between areas of different brightness) at different orientations, just as in a biological brain. The following illustration from "From Neuron to Brain" schematically shows the orientation axes of the receptive fields of neurons at increasing depth in the cat's cortex. Experiments of this kind helped establish that, in cats and monkeys, cells with similar properties are arranged in columns running at certain angles to the surface of the cortex. Individual neurons in a column are activated by visual stimulation of the corresponding part of the animal's visual field with a black stripe on a white background, rotated at an angle specific to each neuron in the column.



    Self-Taught Learning


    The most effective way to obtain a reliable machine learning system is to give the learning algorithm as much data as possible. Experience with large-scale problems shows that a qualitative leap occurs when the training set grows beyond roughly 1-10 million examples. One can try to obtain more labeled data for supervised training, but this is not always possible (or cost-effective). Therefore, using unlabeled data for self-taught learning of neural networks looks promising.

    Unlabeled data carries less information for training than labeled data, but the amount of data available for unsupervised learning is far greater. For example, in image recognition tasks an unlimited number of digital photographs is available on the Internet, and only a negligible fraction of them are labeled.

    In self-taught learning, we feed our neural network a large amount of unlabeled data, from which the network learns to extract useful features. These features can then be used to train specific classifiers on relatively small labeled training sets.

    Let us depict the first stage of self-taught learning as a neural network with 6 inputs and three hidden neurons. The outputs a of these neurons are the generalized features extracted from the unlabeled data by the sparse autoencoder algorithm.

    Now we can train the output layer of the neural network, or use logistic regression, a support vector machine, or a softmax classifier, to learn classification based on the extracted features a. These conventional algorithms take the labeled training examples xm as input. Two variants of the network topology are possible (a code sketch follows the list):

    • Only the features a are fed to the input of the conventional classifier (for example, the output layer of the neural network);
    • both the features a and the raw input features xm are fed to the input of the conventional classifier.
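    Here is a minimal sketch of this second stage, assuming the autoencoder weights W1, b1 from the earlier sketch have already been trained on unlabeled data. The classifier is scikit-learn's logistic regression (one of the options mentioned above); X_labeled, y_labeled and the other names are assumptions made for the example.

# Sketch of self-taught learning: extract features with the trained encoder,
# then fit a conventional classifier on a small labeled sample.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def extract_features(X, W1, b1):
    """Hidden-layer activations a = f(W1 x + b1) serve as the learned features."""
    return sigmoid(X @ W1 + b1)

# X_labeled: small labeled sample xm, y_labeled: its labels (both assumed given).
# Variant 1: the classifier sees only the learned features a.
# A = extract_features(X_labeled, W1, b1)
# clf = LogisticRegression(max_iter=1000).fit(A, y_labeled)

# Variant 2: the classifier sees both a and the raw features xm.
# A_plus_x = np.hstack([extract_features(X_labeled, W1, b1), X_labeled])
# clf = LogisticRegression(max_iter=1000).fit(A_plus_x, y_labeled)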

    The later lectures in Ng's course go on to discuss multilayer networks and the use of autoencoders to train them.
