When to Avoid Deep Learning

Original author: Pablo Cordero
I understand it is strange to start a blog with a negative, but over the past few days a wave of discussion has come up that correlates well with some topics I have been thinking about lately. It all started with a post by Jeff Leek on the Simply Stats blog cautioning against using deep learning on small sample sizes. He argues that with a small sample size (which is common in biology), linear models with a small number of parameters perform better than neural networks, even ones with a minimum of layers and hidden units.

He goes on to show that a very simple linear predictor using the ten most informative features outperforms a simple neural network at classifying zeros and ones in the MNIST dataset when only about 80 samples are used. This prompted Andrew Beam to write a rebuttal in which a properly trained neural network managed to beat the simple linear model, even on a very small number of samples.
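To make the scale of that experiment concrete, here is a minimal sketch of such a comparison in Python. It uses scikit-learn's bundled digits dataset as a stand-in for MNIST, and the 80-sample training split and top-ten feature selection are illustrative assumptions, not the exact setups from either post:

```python
# Sketch: compare a tiny linear model against a small neural net on ~80 samples.
# Assumptions: scikit-learn's digits dataset stands in for MNIST 0/1 classification;
# the sample size and feature-selection step are illustrative, not the original setup.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

digits = load_digits()
mask = np.isin(digits.target, [0, 1])          # keep only zeros and ones
X, y = digits.data[mask], digits.target[mask]

# Small training set (~80 samples), everything else held out for testing.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=80, random_state=0, stratify=y)

# Linear model on the ten most informative pixels.
linear = make_pipeline(SelectKBest(f_classif, k=10),
                       LogisticRegression(max_iter=1000))
linear.fit(X_tr, y_tr)

# Small fully connected network on all pixels.
net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
net.fit(X_tr, y_tr)

print("linear model accuracy:   ", linear.score(X_te, y_te))
print("small neural net accuracy:", net.score(X_te, y_te))
```

Which model wins depends heavily on the split and on how carefully the network is tuned, which is exactly the point of the debate.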

These debates are taking place against the backdrop of more and more researchers in biomedical informatics applying deep learning to all sorts of tasks. Is the excitement justified, or are linear models all we need? As always, there is no single answer. In this article I want to look at machine learning use cases where deep neural networks make no sense at all, and also to discuss common misconceptions that, in my opinion, prevent truly effective use of deep learning, especially by beginners.

Debunking deep learning misconceptions


First, let's talk about some misconceptions. I believe they are held by the majority of specialists who are not particularly knowledgeable about deep learning, and in fact they are half-truths. There are two very common ones and one slightly more technical one; we will dwell on them in more detail. This is in some ways a continuation of the excellent "Misconceptions" section in Andrew Beam's article.

Deep learning can really work on small sample sizes


Deep learning became famous for efficiently processing large amounts of input data (recall that the first Google Brain project involved feeding a huge number of YouTube videos into a network), and since then it has constantly been described as complex algorithms running on large amounts of data. Unfortunately, this pairing of big data and deep learning has somehow led people to the opposite conclusion: the myth that deep learning cannot be used with small samples.

If you have only a few samples, training a neural network with a high parameter-to-sample ratio may at first glance seem like a straight road to overfitting. However, considering only the sample size and dimensionality of a particular problem, whether supervised or unsupervised, is something like modeling the data in a vacuum, without context. You also need to consider that in such cases you may have related data sources, convincing prior knowledge that a domain expert can provide, or data structured in a very specific way (for example, as a graph or an image). In all of these cases there is a good chance deep learning will be useful: for example, you can encode useful representations of larger, related datasets and reuse them in your task.

In the extreme case, you may have several neural networks that jointly learn a representation and an effective way to reuse it on small sample sets. This is called one-shot learning, and it has been successfully applied in various domains with high-dimensional data, including machine vision and drug discovery.
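As a rough illustration of reusing a representation learned elsewhere, here is a sketch of ordinary transfer learning in PyTorch (not the one-shot architecture from the paper below): an ImageNet-pretrained ResNet-18 is frozen and only a small classification head is trained on a tiny dataset. The synthetic data and the two-class setup are placeholders.

```python
# Sketch: reuse a representation learned on a large dataset for a tiny one.
# Assumptions: torchvision's ImageNet-pretrained ResNet-18 as the source model;
# the 40 random "images" below stand in for the few dozen real samples you have.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

model = models.resnet18(pretrained=True)        # representation learned on ImageNet

# Freeze the pretrained feature extractor; only the new head will be trained.
for param in model.parameters():
    param.requires_grad = False

n_classes = 2                                   # e.g. two phenotypes of interest
model.fc = nn.Linear(model.fc.in_features, n_classes)

# Tiny synthetic dataset standing in for the handful of labeled samples.
images = torch.randn(40, 3, 224, 224)
labels = torch.randint(0, n_classes, (40,))
small_loader = DataLoader(TensorDataset(images, labels), batch_size=8)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):
    for batch_images, batch_labels in small_loader:
        optimizer.zero_grad()
        loss = criterion(model(batch_images), batch_labels)
        loss.backward()
        optimizer.step()
```

The point is that almost all of the parameters were fitted on someone else's large dataset; the small sample only has to pin down a linear head.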


One-shot learning networks in drug discovery. Illustration from Altae-Tran et al., ACS Cent. Sci., 2017

Deep learning is not a universal solution to all problems


The second misconception you often hear is sheer hype. Many novice practitioners expect deep networks to give them a fantastic boost in performance simply because that is what has happened in other fields. Others are impressed by the stunning successes of deep learning in modeling and manipulating images, sound and language (the three data types closest to human experience) and plunge headlong into the field by trying to train the latest fashionable GAN architecture. This hype manifests itself in different ways.

Deep learning has become an undeniable force in machine learning and an important tool in the arsenal of anyone who builds data models. Its popularity has led to the creation of essential frameworks such as TensorFlow and PyTorch, which are incredibly useful even outside of deep learning. Its transformation from underdog into superstar has inspired researchers to revisit other methods previously considered obscure, such as evolutionary strategies and reinforcement learning. But it is by no means a panacea. Quite apart from no-free-lunch considerations, deep learning models can have important nuances, require careful handling, and sometimes demand very expensive searches over hyperparameters, settings and tests (more on this later in the article). Besides, there are many cases where deep learning simply does not make sense and simpler models work much better.

Deep learning is more than .fit()


There is another aspect of deep learning models that, in my observation, is misunderstood when viewed from the perspective of other areas of machine learning. Most textbooks and introductory materials on deep learning describe these models as hierarchically connected layers of nodes, where the first layer receives the input signal and the last layer produces the output, and you can train them using some form of stochastic gradient descent. Sometimes it is briefly mentioned how stochastic gradient descent works and what backpropagation of error is. But the lion's share of the explanation is devoted to the rich variety of neural network types (convolutional, recurrent, and so on). The optimization methods themselves receive little attention, which is unfortunate, because optimization is a large part of what makes deep learning work in practice (see this post by Ferenc Huszár and the scientific article mentioned there), and knowing how to tune the optimizer's parameters and how to split the data to use it effectively is extremely important for achieving good convergence in a reasonable time.

Why stochastic gradients matter so much is still not fully understood, but experts here and there put forward various conjectures. One of my favorites is the interpretation of these methods as performing a kind of Bayesian inference. In essence, every time you run some form of numerical optimization, you are computing a Bayesian posterior under certain assumptions. After all, there is a whole field called probabilistic numerics that literally grew out of this interpretation.

Stochastic gradient descent (SGD) is no different: recent work suggests that the procedure is actually a Markov chain which, under certain assumptions, has a stationary distribution that can be viewed as a kind of variational approximation to the posterior. So if you stop your SGD and take the final parameters, you are effectively drawing a sample from this approximate distribution.

I found this idea illuminating. It explains a lot, because the optimizer's parameters (in this case, the learning rate) suddenly make much more sense. For example: if you increase the learning rate of stochastic gradient descent, the Markov chain becomes unstable until it finds a wide local minimum covering samples from a large region; thus, you increase the variance of the procedure. If, on the other hand, you decrease the learning rate, the Markov chain slowly settles into a narrower local minimum until it converges in a very tight region; thus, you increase the bias towards a particular region.
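A toy NumPy sketch of this view: late SGD iterates on a noisy one-parameter regression problem are collected as if they were samples from an approximate posterior, and the learning rate (and, as noted just below, the batch size) controls how wide the explored region is. The model and the numbers are made up purely for illustration.

```python
# Sketch: treat late SGD iterates as samples from an approximate posterior.
# Toy one-parameter linear regression with noisy minibatch gradients; illustrative
# only, not the setup from the cited papers.
import numpy as np

rng = np.random.default_rng(0)
true_w = 2.0
x = rng.normal(size=500)
y = true_w * x + rng.normal(scale=0.5, size=500)

def sgd_trajectory(lr, batch_size, n_steps=5000, burn_in=2500):
    w = 0.0
    samples = []
    for step in range(n_steps):
        idx = rng.integers(0, len(x), size=batch_size)
        # Noisy gradient of the squared error on a minibatch.
        grad = -2 * np.mean(x[idx] * (y[idx] - w * x[idx]))
        w -= lr * grad
        if step >= burn_in:
            samples.append(w)     # late iterates ~ approximate posterior samples
    return np.array(samples)

# A larger learning rate (or a smaller batch) makes the chain wander more widely.
for lr in (0.01, 0.3):
    s = sgd_trajectory(lr, batch_size=5)
    print(f"lr={lr}: mean={s.mean():.3f}, std={s.std():.3f}")
```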

Another SGD parameter, the batch size, also controls what kind of region the algorithm converges to: wider regions for smaller batches and sharper regions for larger batches.


SGD prefers a wide or a narrow local minimum depending on the learning rate and batch size.

This complexity means that the optimizers of deep neural networks move to the foreground: they are the core of the model, as important as the layer architecture. Many other machine learning models have no such component. Linear models (even regularized ones, such as LASSO) and SVMs are convex optimization problems that do not have such subtleties and have a single solution. This is why people coming from other areas and/or using tools like scikit-learn are puzzled when they cannot find a very simple API with a .fit() method (although there are tools like skflow that try to squeeze simple neural networks into a .fit() signature, this approach seems a little misguided to me, because the whole point of deep learning is its flexibility).
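The contrast is easy to see side by side. In the sketch below, a convex scikit-learn model is fitted with a single opaque call, while even a minimal PyTorch network exposes its optimizer, learning rate and training loop as part of the model design. The data is synthetic and the architecture arbitrary.

```python
# Sketch of the contrast: a convex model fits with one call, while a deep model
# exposes its optimizer, schedule and loop. Data and sizes are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
import torch
import torch.nn as nn

X = np.random.randn(200, 20).astype("float32")
y = (X[:, 0] + X[:, 1] > 0).astype("int64")

# Convex model: a single call, a single optimum, nothing to decide about the optimization.
LogisticRegression().fit(X, y)

# Deep model: the optimizer, its learning rate and the loop itself are part of the design.
net = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()
Xt, yt = torch.from_numpy(X), torch.from_numpy(y)

for epoch in range(50):
    opt.zero_grad()
    loss = loss_fn(net(Xt), yt)
    loss.backward()
    opt.step()
```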

When you should not use deep learning


So, when is deep learning not the best way to solve a problem? From my point of view, here are the main scenarios where deep learning is more likely to be an obstacle.

Low-budget or low-commitment problems


Neural networks are very flexible models, with a multitude of architectures and node types, optimizers and regularization strategies. Depending on the application, your model may need convolutional layers (how wide? with what pooling?) or a recurrent structure (with or without gates?). It may be really deep (Hourglass, Siamese, or one of many other architectures) or have just a few hidden layers (with how many units?). It may use rectified linear units or other activation functions. It may or may not switch off some neurons during training via dropout (in which layers? what fraction of neurons?), and the weights should probably be regularized (L1, L2, or something more exotic?). This is not a complete list; there are many other types of nodes, connections and even loss functions, and all of them can be tried.

Exploring such a large number of possible hyperparameters and architectures takes a lot of time, even when training a single instance of a large network. Google recently boasted that its AutoML pipeline can automatically select the optimal architecture, which is very impressive, but it requires more than 800 GPUs running around the clock for several weeks, and that is out of reach for almost everyone. The bottom line is that training deep networks is expensive, both in computation and in debugging time. Such costs make no sense for everyday prediction problems, so the ROI of a neural network on such problems may turn out to be too small, even when tuning small networks. And even if you have a large budget and an important task, there is no reason not to try alternative methods first as a baseline.
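In practice, "trying alternative methods first" can be as cheap as a few lines of scikit-learn. Here is a sketch of the kind of baseline pass I have in mind, run on placeholder synthetic data; the two models are just common defaults, not a prescription.

```python
# Sketch: cheap baselines to run before committing to a deep architecture search.
# Assumes a generic tabular problem; the synthetic data is only a placeholder.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=30, random_state=0)

baselines = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "gradient boosting":   GradientBoostingClassifier(),
}
for name, model in baselines.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")

# Only if these plateau well below what the problem requires is the cost of a
# deep-network search likely to pay off.
```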

Interpreting and explaining model parameters and features to a general audience


Deep neural networks are also notorious for being "black boxes" with high predictive power but low interpretability. Although many tools have been created recently, such as saliency maps and activation differences, they work well in some domains but are not applicable to every application. Mostly these tools are useful when you want to make sure the network is not fooling you by memorizing the dataset or exploiting particular spurious features. But it is still difficult to interpret the contribution of each feature to the network's overall decision.

Under these conditions nothing beats linear models, since the learned coefficients relate directly to the outcome. This is especially important when such interpretations have to be explained to a general audience that will make important decisions based on them. For example, doctors need to integrate all kinds of different data to arrive at a diagnosis. The simpler and more explicit the relationship between a variable and the outcome, the better the physician can account for that variable without underestimating or overestimating its value. Moreover, there are cases where the interpretability of the model matters more than its accuracy (where deep learning is usually unrivaled). For instance, legislators may want to know what effect certain demographic variables have on, say, mortality, and they may be more interested in a direct approximation of that relationship than in predictive accuracy. In both cases deep learning loses to simpler, more transparent models.
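A small sketch of what that directness looks like: a regularized logistic regression on made-up "clinical" features, where each learned coefficient can be read off next to its variable name. The feature names, data and effect sizes are entirely illustrative.

```python
# Sketch: with a (regularized) linear model, each learned coefficient maps directly
# to a named input variable. Feature names and data are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

feature_names = ["age", "blood_pressure", "cholesterol", "smoker"]
rng = np.random.default_rng(0)
X = rng.normal(size=(500, len(feature_names)))
y = (0.8 * X[:, 0] + 1.5 * X[:, 3] + rng.normal(size=500) > 0).astype(int)

model = make_pipeline(StandardScaler(),
                      LogisticRegression(penalty="l1", solver="liblinear"))
model.fit(X, y)

coefs = model.named_steps["logisticregression"].coef_.ravel()
for name, coef in sorted(zip(feature_names, coefs), key=lambda t: -abs(t[1])):
    print(f"{name:15s} {coef:+.2f}")   # sign and magnitude are directly readable
```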

Determining causal relationships


An extreme case of model interpretability is when we try to build a mechanistic model, that is, a model that actually captures the phenomenon behind the data. Good examples include trying to predict how two molecules (drugs, proteins, nucleic acids, and so on) will interact in a particular cellular environment, or hypothesizing how a certain marketing strategy will affect sales. According to experts in this field, nothing really competes here with the good old Bayesian methods; they are the best (though not perfect) way of representing and reasoning about causal relationships. Vicarious recently published a good paper showing why such a more principled approach generalizes better than deep learning in video-game tasks.

Training on "unstructured" signs


This is perhaps a debatable point. One area where I have found deep learning to excel is in finding useful representations of data for a particular task. A very good illustration is word embeddings. Natural language has a rich and complex structure that can be approximated with context-aware neural networks: each word is represented by a vector encoding the contexts in which it is most frequently used. Using word embeddings learned on a large corpus, natural-language processing can sometimes perform noticeably better on a specific task with a different corpus. However, this model may be completely useless if the corpus has no structure at all.

Say you are trying to classify objects based on unstructured lists of keywords. Since the keywords are not used in any particular structure (such as sentences), word embeddings are unlikely to help much. Here the data really is a bag of words, and that representation is probably quite enough to do the job. One could counter that word embeddings are relatively cheap to compute from pre-trained models and may capture keyword similarity better. But I would still prefer to start with a bag of words and see whether it already gives good predictions. After all, each dimension of a bag of words is easier to interpret than the corresponding word-embedding dimension.
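A bag-of-words baseline of this kind is only a few lines. The sketch below uses scikit-learn's CountVectorizer on invented keyword lists and labels, purely to show how little machinery the approach needs.

```python
# Sketch: a bag-of-words baseline for classifying items by unordered keyword lists.
# The keyword lists and labels are made up for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "kinase inhibitor oncology receptor",
    "antibody oncology tumor receptor",
    "solvent polymer coating adhesive",
    "polymer resin coating industrial",
]
labels = ["pharma", "pharma", "materials", "materials"]

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(docs, labels)

print(model.predict(["receptor inhibitor tumor"]))   # -> likely "pharma"
# Each dimension is literally a keyword count, so the model is easy to inspect.
```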

The future is deep


The field of deep learning is booming, well funded, and moving staggeringly fast. By the time you read a paper published at a conference, two or three improved iterations of its model have likely already appeared, and the original can already be considered obsolete. This adds an important caveat to all the arguments I have made above: in the near future, deep learning may well turn out to be super-useful in every one of these scenarios. Tools for interpreting deep learning models for images and individual sentences are getting better. Recent software such as Edward combines Bayesian modeling with deep neural network frameworks, allowing quantification of the uncertainty in neural network parameters and simple Bayesian inference via probabilistic programming and automated variational inference. In the longer term, one can expect a reduced modeling vocabulary that exposes the salient properties a neural network can have, and thus shrinks the space of things that need to be tried. So keep an eye on your arXiv feed; this article of mine may be out of date in a month or two.


Edward combines probabilistic programming with TensorFlow, allowing the creation of models that simultaneously use deep learning and Bayesian methods. Illustration: Tran et al. ICLR 2017
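For a flavor of what this looks like, here is a minimal Bayesian linear regression sketch in the style of Edward's own introductory example (Edward 1.x on top of TensorFlow 1.x); the API shown is the one from around the time of this article and may have changed since.

```python
# Sketch: Bayesian linear regression in the style of Edward 1.x, combining
# TensorFlow with automated variational inference. API details reflect the
# Edward 1.x releases of this period and may differ in later versions.
import numpy as np
import tensorflow as tf
import edward as ed
from edward.models import Normal

N, D = 100, 5
X_train = np.random.randn(N, D).astype(np.float32)
y_train = (X_train @ np.ones(D) + 0.1 * np.random.randn(N)).astype(np.float32)

X = tf.placeholder(tf.float32, [N, D])
w = Normal(loc=tf.zeros(D), scale=tf.ones(D))          # prior over weights
b = Normal(loc=tf.zeros(1), scale=tf.ones(1))
y = Normal(loc=ed.dot(X, w) + b, scale=tf.ones(N))     # likelihood

# Variational approximation to the posterior over w and b.
qw = Normal(loc=tf.Variable(tf.zeros(D)),
            scale=tf.nn.softplus(tf.Variable(tf.zeros(D))))
qb = Normal(loc=tf.Variable(tf.zeros(1)),
            scale=tf.nn.softplus(tf.Variable(tf.zeros(1))))

inference = ed.KLqp({w: qw, b: qb}, data={X: X_train, y: y_train})
inference.run(n_iter=500)   # stochastic variational inference under the hood
```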
