Attention for dummies and its implementation in Keras

About articles on artificial intelligence in Russian

Although the Attention mechanism is well described in English-language literature, I have yet to see a decent description of it in Russian. There are many articles on artificial intelligence (AI) in our language, but the ones I found cover only the simplest AI models, such as convolutional and generative networks. On the latest cutting-edge developments in AI, there is very little in Russian.

The lack of Russian-language articles on the latest developments became a problem for me when I was getting into the topic and studying the current state of AI. I know English well and read AI articles in English. However, when a new concept or principle appears, understanding it in a foreign language is slow and painful. Even if you know English, getting to the bottom of a complex subject in a non-native language takes much more time and effort. After reading a description, you ask yourself: what percentage did I actually understand? Had there been an article in Russian, I would have understood 100% after the first reading. That is what happened with generative networks, for which there is an excellent series of articles: after reading it, everything became clear. But in the world of neural networks there are many approaches that are described only in English, and I had to spend days figuring them out.

I am going to write articles in my native language from time to time, bringing this knowledge into our language space. As you know, the best way to understand a topic is to explain it to someone. So who, if not me, should start a series of articles on the most modern, complex, and advanced AI architectures? By the end of the article I will understand the approach 100% myself, and it will be useful to anyone who reads it and improves their understanding (by the way, I love Gesser, but Blanche de Bruxelles even more).

When you study a subject, there are four levels of understanding:

  1. you understand the principle and the inputs and outputs of the algorithm / layer
  2. you understand how the outputs are assembled and, in general terms, how it works
  3. you understand all of the above, as well as the internals of each network layer (for example, in the VAE model you understand the principle and also the essence of the reparameterization trick)
  4. you understand everything, including every layer, you also understand why it all learns, and you are able to select hyperparameters for your own task rather than copy-paste ready-made solutions.

For new architectures, the transition from level 1 to level 4 is often difficult: the authors emphasize what is closest to them and describe various important details only superficially (did they understand them themselves?). Or your brain lacks some of the necessary constructs, so even after reading the description, it is not deciphered and does not turn into skills. This happens if, in your student years, you slept through the very calculus lecture, after a night party, in which the necessary mathematical apparatus was taught. This is exactly where we need articles in our native language that reveal the nuances and subtleties of each operation.

Attention concept and application

The levels of understanding are laid out above. To take Attention apart, let's start at level one. Before describing the inputs and outputs, let's look at the essence: the basic ideas, understandable even to a child, on which the concept rests. In this article we will use the English term Attention, because that is also the name under which the layer appears in Keras code (it is not implemented in Keras directly and requires an additional module, but more on that below). To read further you need a working knowledge of Python and the Keras library, because source code will be shown.

Attention is meant in the everyday sense of the word. The term describes the essence of the approach accurately: if you are a motorist and a photo shows a traffic police general, you intuitively attach importance to him regardless of the photo's context. You will probably take a closer look at the general. You strain your eyes and study his shoulder straps: exactly how many stars does he have? If the general turns out not to be very high-ranking, you ignore him. Otherwise, you treat him as a key factor in your decisions. This is how our brain works. In Russian culture, generations have trained us to pay attention to high ranks, so our brain automatically gives such objects high priority.

Attention is a way to tell the network what it should pay more attention to, that is, to report the probability of a particular outcome depending on the state of the neurons and the input data. An Attention layer implemented in Keras identifies by itself, from the training set, the factors that reduce the network's error when attended to. The important factors are identified through backpropagation, just as is done for convolutional networks.

In training, Attention shows its probabilistic nature. The mechanism itself forms a matrix of importance weights. If we did not train Attention, we could set the importance empirically (the general is more important than the ensign). But when we train a network on data, importance becomes a function of the probability of a particular outcome, depending on the data received at the network's input. For example, if we met a general while living in Tsarist Russia, the probability of getting the gauntlet would be high. One could establish this through several personal encounters, collecting statistics. After that, our brain would assign the appropriate weight to meeting such a person and put markers on shoulder straps and stripes. Note that the marker is not a probability: meeting a general today entails completely different consequences than it did then, and besides, the weight may be greater than one. But a weight can be reduced to a probability by normalizing it.
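The last point, turning arbitrary weights into probabilities, can be illustrated with a small NumPy sketch. It assumes softmax as the normalization (the same function the Keras examples later in this article rely on); the function name `normalize_weights` and the three scores are hypothetical and chosen only for illustration.

```python
import numpy as np

def normalize_weights(weights):
    """Turn arbitrary importance weights into probabilities with softmax."""
    weights = np.asarray(weights, dtype=float)
    # Subtracting the maximum does not change the result but avoids overflow in exp
    exps = np.exp(weights - weights.max())
    return exps / exps.sum()

# Hypothetical importance scores: general, colonel, ensign
raw_weights = [3.0, 1.5, 0.2]
probs = normalize_weights(raw_weights)
print(probs, probs.sum())
```

The raw weights may exceed one, but after normalization every value lies between 0 and 1, the values sum to one, and their order is preserved.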

The probabilistic nature of the Attention mechanism in training shows up in machine translation tasks. For example, let us tell the network that when translating from Russian into English, the Russian word for "love" is translated as Love in 90% of cases, as Sex in 9% of cases, and as something else in 1% of cases. The network immediately has many options marked and trains better. When translating, we tell the network: "when translating the word for love, pay special attention to the English word Love, and also check whether it could be Sex."

Attention is applied to text as well as to sound and time series. Recurrent neural networks (RNN, LSTM, GRU) are widely used for text processing. Attention can either complement them or replace them, moving the network to simpler and faster architectures.

One of the best-known uses of Attention is to abandon the recurrent network and move to a fully connected model. Recurrent networks have a number of shortcomings: they are hard to train efficiently on the GPU, and they overfit quickly. With the Attention mechanism we can build a network that learns sequences on the basis of a fully connected network, train it on the GPU, and use dropout.

Attention is also widely used to improve the performance of recurrent networks, for example, in translation from one language to another. With the encoder/decoder approach, which is common in modern AI (for example, in variational autoencoders), adding an Attention layer between the encoder and the decoder noticeably improves the network's results.
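To make the encoder/decoder case more concrete, here is a minimal NumPy sketch of what an attention step between the two can compute. The article does not specify a scoring function, so dot-product scoring is an assumption here, and the names `attend`, `encoder_outputs`, and `decoder_state` are hypothetical. Treat it as an illustration of the idea rather than the behavior of any particular library.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attend(encoder_outputs, decoder_state):
    """Dot-product attention (assumed scoring) for a single decoder step."""
    scores = encoder_outputs @ decoder_state   # one score per source timestep
    weights = softmax(scores)                  # normalized attention weights
    context = weights @ encoder_outputs        # weighted sum of encoder outputs
    return context, weights

rng = np.random.default_rng(0)
encoder_outputs = rng.normal(size=(5, 4))  # 5 source timesteps, hidden size 4
decoder_state = rng.normal(size=4)         # current decoder hidden state
context, weights = attend(encoder_outputs, decoder_state)
print(weights.round(3), context.shape)
```

The decoder then receives the context vector in addition to its own state, so at every output step it can look back at the most relevant source timesteps.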

In this article I do not present specific network architectures that use Attention; that will be the subject of separate work. Listing all the possible uses of Attention deserves an article of its own.

Implementing Attention in Keras Out of the Box

Once you know what kind of approach this is, it is very useful to learn the basic principle. But full understanding often comes only from looking at a technical implementation. You see the data flows and the operations that make up the function, and it becomes clear what exactly is computed. But first you need to get it running and write an Attention "hello world".

At the time of writing, Attention is not implemented in Keras itself. But there are third-party implementations, such as attention-keras, which can be installed from GitHub. With it, your code becomes extremely simple:

from attention_keras.layers.attention import AttentionLayer
attn_layer = AttentionLayer(name='attention_layer')
attn_out, attn_states = attn_layer([encoder_outputs, decoder_outputs])

This implementation supports visualization of the Attention weights. After training Attention, you can obtain a matrix showing what the network considers especially important, of the following kind (picture from the attention-keras library page on GitHub).

Basically, you don't need anything else: include this code in your network as one of the layers and enjoy training it. Any network, any algorithm is first designed at the conceptual level (like a database, by the way), after which the design is refined into logical and physical representations before implementation. Such a design method has not yet been developed for neural networks (oh yes, that will be the topic of my next article). You don't understand how convolutional layers work inside? The principle is described, and you use them.

Low-level implementation of Attention in Keras

To finally understand the topic, below we analyze in detail how Attention is implemented under the hood. The concept is good, but how exactly does it work, and why is the result exactly as stated?

The simplest implementation of the Attention mechanism in Keras takes only 3 lines:

from keras.layers import Input, Dense, Multiply
inputs = Input(shape=(input_dims,))
attention_probs = Dense(input_dims, activation='softmax', name='attention_probs')(inputs)
# Multiply is the Keras 2 equivalent of the Keras 1 call merge(..., mode='mul')
attention_mul = Multiply(name='attention_mul')([inputs, attention_probs])

In this case, the Input layer is declared in the first line, then comes a fully connected layer with the softmax activation function with the number of neurons equal to the number of elements in the first layer. The third layer multiplies the result of the fully connected layer by the input data element by element.
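What these three layers compute can be traced by hand in NumPy. This is only a sketch: `W` and `b` are random stand-ins for the kernel and bias that the Dense layer would learn, and `input_dims = 4` is an arbitrary choice for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

input_dims = 4
rng = np.random.default_rng(0)
W = rng.normal(size=(input_dims, input_dims))  # stand-in for the trained Dense kernel
b = np.zeros(input_dims)                       # stand-in for the Dense bias

inputs = np.array([[1.0, 2.0, 3.0, 4.0]])      # a batch with one sample
attention_probs = softmax(inputs @ W + b)      # Dense with softmax activation
attention_mul = inputs * attention_probs       # element-wise multiplication
print(attention_probs.sum(axis=-1))
print(attention_mul)
```

Each input element is scaled by a value between 0 and 1, and these values sum to one across the sample, so the layer redistributes a fixed "budget" of attention among the input features.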

Below is a complete Attention class that implements a slightly more complex self-attention mechanism and can be used as a full-fledged layer in a model; the class inherits from the Keras Layer class.

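# Attention layer and the Keras imports it relies on
from keras import backend as K
from keras import initializers, regularizers, constraints
from keras.layers import Layer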
class Attention(Layer):
      def __init__(self, step_dim,
                   W_regularizer=None, b_regularizer=None,
                   W_constraint=None, b_constraint=None,
                   bias=True, **kwargs):
          self.supports_masking = True
          self.init = initializers.get('glorot_uniform')
          self.W_regularizer = regularizers.get(W_regularizer)
          self.b_regularizer = regularizers.get(b_regularizer)
          self.W_constraint = constraints.get(W_constraint)
          self.b_constraint = constraints.get(b_constraint)
          self.bias = bias
          self.step_dim = step_dim
          self.features_dim = 0
          super(Attention, self).__init__(**kwargs)
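      def build(self, input_shape):
          # Expects input of shape (batch, steps, features)
          assert len(input_shape) == 3
          # One trainable weight per feature
          self.W = self.add_weight(shape=(input_shape[-1],),
                                   initializer=self.init,
                                   name='{}_W'.format(self.name),
                                   regularizer=self.W_regularizer,
                                   constraint=self.W_constraint)
          self.features_dim = input_shape[-1]
          if self.bias:
              # One trainable bias per timestep (zero initializer is an assumption)
              self.b = self.add_weight(shape=(input_shape[1],),
                                       initializer='zeros',
                                       name='{}_b'.format(self.name),
                                       regularizer=self.b_regularizer,
                                       constraint=self.b_constraint)
          else:
              self.b = None
          self.built = True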
      def compute_mask(self, input, input_mask=None):
          return None
      def call(self, x, mask=None):
          features_dim = self.features_dim
          step_dim = self.step_dim
          # Score per timestep: dot product of each feature vector with W
          eij = K.reshape(K.dot(K.reshape(x, (-1, features_dim)),
                                K.reshape(self.W, (features_dim, 1))), (-1, step_dim))
          if self.bias:
              eij += self.b
          eij = K.tanh(eij)
          a = K.exp(eij)
          if mask is not None:
              a *= K.cast(mask, K.floatx())
          a /= K.cast(K.sum(a, axis=1, keepdims=True) + K.epsilon(), K.floatx())
          a = K.expand_dims(a)
          weighted_input = x * a 
          return K.sum(weighted_input, axis=1)
      def compute_output_shape(self, input_shape):
          return input_shape[0],  self.features_dim

Here we see roughly the same thing that was implemented above through a fully connected Keras layer, only done with more involved logic at a lower level. The build function creates a parametric weight vector (self.W), which is then multiplied as a dot product with the input vectors. The hard-wired logic in this variant is a bit more complicated: to the input multiplied by self.W the layer applies a shift (if the bias parameter is enabled), a hyperbolic tangent, exponentiation, a mask (if specified), and normalization, and then the input is weighted by the result. I have no description of the logic behind this example; I am reconstructing the operations by reading the code. By the way, please write in the comments if you recognize some high-level mathematical function in this logic.

The class has a bias parameter, that is, an offset. If it is enabled, then after the multiplication by self.W the layer's parameter vector self.b is added to the result, which makes it possible not only to determine the weights for our attention function but also to shift the attention level by a number. A real-life example: we are afraid of ghosts but have never met one. So we apply a correction of -100 points to the fear. That is, only when the fear goes more than 100 points off the scale will we decide to protect ourselves against ghosts, call a ghostbusting agency, buy scaring devices, and so on.
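The chain of operations described above (dot product with the weight vector, optional bias, hyperbolic tangent, exponentiation, mask, normalization, weighted sum) can be traced in plain NumPy. This is a sketch for following the shapes, not a replacement for the Keras layer: the function name `attention_forward` is hypothetical, `eps` stands in for K.epsilon(), and the random `W` and zero `b` stand in for trained parameters.

```python
import numpy as np

def attention_forward(x, W, b=None, mask=None, eps=1e-7):
    """NumPy trace of the operations in the Attention layer's call method.

    x: (batch, steps, features), W: (features,),
    b: (steps,) or None, mask: (batch, steps) of 0/1 or None.
    """
    eij = x @ W                       # dot product per timestep -> (batch, steps)
    if b is not None:
        eij = eij + b                 # shift the score of each timestep
    eij = np.tanh(eij)                # squash scores into (-1, 1)
    a = np.exp(eij)                   # positive, unnormalized weights
    if mask is not None:
        a = a * mask                  # zero out masked (padded) timesteps
    a = a / (a.sum(axis=1, keepdims=True) + eps)  # normalize over timesteps
    out = (x * a[..., None]).sum(axis=1)          # weighted sum -> (batch, features)
    return out, a

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 3))        # 2 sequences, 5 timesteps, 3 features
W = rng.normal(size=3)
b = np.zeros(5)
mask = np.array([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]], dtype=float)
out, a = attention_forward(x, W, b, mask)
print(out.shape, a.sum(axis=1))
```

The output has one vector per sequence: the timestep dimension is collapsed into a weighted average, with masked timesteps contributing nothing.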


The Attention mechanism has variations. The simplest option, implemented in the class above, is called Self-Attention. Self-attention is a mechanism for processing sequential data that takes into account the context of each timestep. It is most often used for textual information. A ready-made self-attention implementation is available by importing the keras-self-attention library. There are other variations of Attention as well: studying English-language materials, I counted more than five.

To write even this relatively short article, I studied more than 10 English-language articles. Of course, I could not fit everything from all of them into 5 pages; I only distilled the material to create a "guide for dummies". To cover all the nuances of the Attention mechanism, you would need a book of 150-200 pages. I really hope I managed to convey the basic essence of this mechanism, so that those who are just starting out in machine learning understand how it all works.


  1. Attention mechanism in Neural Networks with Keras
  2. Attention in Deep Networks with Keras
  3. Attention-based Sequence-to-Sequence in Keras
  4. Text Classification using Attention Mechanism in Keras
  5. Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction
  6. How to implement the Attention Layer in Keras?
  7. Attention? Attention!
  8. Neural Machine Translation with Attention
