Hessian-Free optimization with TensorFlow

    Good afternoon! I want to talk about the optimization method known as Hessian-Free, or Truncated Newton, and about its implementation using the deep learning library TensorFlow. It takes advantage of second-order optimization while avoiding the need to compute the full matrix of second derivatives. This article describes the HF algorithm itself, as well as its use for training a feedforward network on the MNIST and XOR datasets.




    A little bit about optimization methods


    Training a neural network amounts to minimizing a loss function with respect to its weights, of which there can be very many. Therefore, there are many optimization methods for solving this problem.

    Gradient Descent


    Gradient descent is the simplest method for iteratively finding the minimum of a differentiable function $f$ (in the case of neural networks, this is the cost function). Given parameters $x$ (the network weights) and differentiating the function with respect to them, we obtain the vector of partial derivatives, or gradient vector:

    $\nabla f(x)=\langle\frac{\delta f}{\delta x_1},\frac{\delta f}{\delta x_2},...,\frac{\delta f}{\delta x_n}\rangle$

    The gradient always points in the direction of the steepest growth of the function. If we move in the opposite direction (i.e. along $-\nabla f(x)$), then over time we will arrive at a minimum, which is what we need. The simplest gradient descent algorithm (a short code sketch follows the list):

    1. Initialization: choose random parameters $x_0$
    2. Compute the gradient: $\nabla f(x_i)$
    3. Change the parameters in the direction of the negative gradient: $x_{i+1}=x_i-\alpha\nabla f(x_i)$, where $\alpha$ is the learning rate parameter
    4. Repeat the previous steps until the gradient is close enough to zero
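
    As an illustration, here is a minimal NumPy sketch of this loop (the test function, step size and tolerance are arbitrary choices made just for this example):

      import numpy as np

      def gradient_descent(grad_f, x0, alpha=0.1, tol=1e-6, max_iters=10000):
          """ Plain gradient descent: repeat x <- x - alpha * grad f(x). """
          x = np.asarray(x0, dtype=float)
          for _ in range(max_iters):
              g = grad_f(x)
              # stop once the gradient is close enough to zero
              if np.linalg.norm(g) < tol:
                  break
              x = x - alpha * g
          return x

      # Example: minimize f(x, y) = (x - 3)^2 + 2 * (y + 1)^2
      grad_f = lambda v: np.array([2 * (v[0] - 3), 4 * (v[1] + 1)])
      print(gradient_descent(grad_f, x0=[0.0, 0.0]))  # approaches [3, -1]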

    Gradient descent is a simple and well-proven optimization method, but it has a downside: it is a first-order method, meaning that only the first derivative of the cost function is used. This imposes a limitation: it implicitly assumes that the cost function locally looks like a plane and ignores its curvature.

    Newton's Method


    But what if we use the information that the second derivatives of the cost function give us? The best-known optimization method using second derivatives is Newton's method. The main idea of this method is to minimize a quadratic approximation of the cost function. What does this mean? Let's figure it out.

    Take the one-dimensional case. Suppose we have a function $f:\mathbb{R}\to\mathbb{R}$. To find a minimum point we look for a zero of its derivative, since $f'(x)=0$ at a minimum of $f(x)$. We approximate $f$ with a second-order Taylor expansion:

    $f(x+\delta)\approx f(x)+f'(x)\delta+\frac{1}{2}\delta f''(x)\delta$

    We want to find the $\delta$ for which $f(x+\delta)$ is minimal. To do this, we take the derivative with respect to $\delta$ and set it equal to zero:

    $\frac{d}{d\delta}\bigg(f(x)+f'(x)\delta+\frac{1}{2}\delta f''(x)\delta\bigg)=f'(x)+f''(x)\delta=0\implies\delta=-\frac{f'(x)}{f''(x)}$

    If $f$ is a quadratic function, this $\delta$ gives the exact minimum. To find the minimum iteratively, we take an initial $x_0$ and update it according to the rule:

    $x_{n+1}=x_n-\frac{f'(x_n)}{f''(x_n)}=x_n-(f''(x_n))^{-1}f'(x_n)$

    Over time, the solution will come to a minimum.
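
    Here is a minimal Python sketch of this one-dimensional iteration (the test function and starting point are arbitrary choices made for illustration):

      def newton_1d(df, d2f, x0, tol=1e-10, max_iters=100):
          """ One-dimensional Newton's method: x <- x - f'(x) / f''(x). """
          x = x0
          for _ in range(max_iters):
              step = df(x) / d2f(x)
              x -= step
              if abs(step) < tol:
                  break
          return x

      # Example: f(x) = x^4 - 3x^2 + 2, so f'(x) = 4x^3 - 6x and f''(x) = 12x^2 - 6
      df = lambda x: 4 * x**3 - 6 * x
      d2f = lambda x: 12 * x**2 - 6
      print(newton_1d(df, d2f, x0=2.0))  # converges to sqrt(1.5) ~ 1.2247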

    Now consider the multidimensional case. Suppose we have a function $f:\mathbb{R}^n\to\mathbb{R}$; then:

    $f'(x)\to\nabla f(x)$

    $f''(x)\to H(f)(x)$

    where $H(f)(x)$ is the Hessian, the matrix of second derivatives. The parameter update rule then becomes:

    $x_{n+1}=x_n-(H(f)(x_n))^{-1}\nabla f(x_n)$


    Problems with the Newton Method


    As we can see, Newton's method is a second-order method and generally works better than plain gradient descent: instead of taking a small step toward a local minimum at each iteration, it moves directly to the minimum of the quadratic model, which is the true minimum if the function $f$ is quadratic and the second-order Taylor expansion is a good approximation of it.

    But this method has one big drawback. To optimize the cost function, you must compute the Hessian matrix $H$. Let $\mathbf{x}$ be the vector of parameters; then:

    $H(f)(\mathbf{x})=\begin{pmatrix}\frac{\delta^2 f}{\delta x_1\delta x_1}&\frac{\delta^2 f}{\delta x_1\delta x_2}&\cdots&\frac{\delta^2 f}{\delta x_1\delta x_n}\\\frac{\delta^2 f}{\delta x_2\delta x_1}&\frac{\delta^2 f}{\delta x_2\delta x_2}&\cdots&\frac{\delta^2 f}{\delta x_2\delta x_n}\\\vdots&\vdots&\ddots&\vdots\\\frac{\delta^2 f}{\delta x_n\delta x_1}&\frac{\delta^2 f}{\delta x_n\delta x_2}&\cdots&\frac{\delta^2 f}{\delta x_n\delta x_n}\end{pmatrix}$

    As you can see, the Hessian is an $n\times n$ matrix of second derivatives, so just computing it takes $O(n^2)$ operations, which becomes very costly for networks with a large number of parameters. In addition, to solve the optimization problem with Newton's method we need the inverse Hessian $H^{-1}(f)(x)$, which requires the Hessian to be positive definite.

    A positive definite matrix.
    An $n\times n$ matrix $\mathbf A$ is called non-negative (positive semi-) definite if it satisfies the condition $\mathbf v^T\mathbf A\mathbf v\geq0\;\;\;\forall\;\;\;\mathbf v\in\mathbb{R^n}$. If the inequality is strict for all nonzero $\mathbf v$, the matrix is called positive definite. An important property of positive definite matrices is their nonsingularity, i.e. the existence of an inverse matrix $\mathbf A^{-1}$.



    Hessian-Free Optimization


    The main idea of HF optimization is that we take Newton's method as the basis but use a more suitable way of minimizing the quadratic function. First, let us introduce the basic notation that will be needed later.
    Let $\theta=(\mathbf W, \mathbf b)$ be the network parameters, where $\mathbf W$ is the weight matrix and $\mathbf b$ the bias vector; we denote the network output by $f(x,\theta)$, where $x$ is the input vector, and the loss function by $L(t, f(x,\theta))$, where $t$ is the target value. We define the function to be minimized as the average loss over all training examples in the batch $S$:

    $h(\theta)=\frac{1}{|S|}\sum_{(x,y)\in S}L(t,f(x,\theta))$

    Next, following Newton's method, we define the quadratic function obtained from the second-order Taylor expansion:

    $h(\theta+\delta)\approx M(\delta)=h(\theta)+\nabla h(\theta)^T\delta+\frac{1}{2}\delta^TH\delta\qquad\qquad(1)$

    Taking the derivative of this expression with respect to $\delta$ and setting it to zero, we get:

    $\nabla M(\delta)=\nabla h(\theta)+H\delta=0\qquad\qquad(2)$

    To find $\delta$ we will use the conjugate gradient method.

    Conjugate gradient method


    The conjugate gradient method (CG) is an iterative method for solving systems of linear equations of the form $\mathbf A\mathbf x=\mathbf b$.

    Brief CG algorithm:
    Input: $\mathbf b$, $\mathbf A$, $\mathbf x_0$, $i=0$ - the CG step counter
    Initialization:
    1. $\mathbf r_0=\mathbf b-\mathbf A\mathbf x_0$ - residual vector
    2. $\mathbf d_0=\mathbf r_0$ - search direction vector

    Repeat until the stopping condition is satisfied (a NumPy sketch of this loop is given after the list):
    1. $\alpha_i=\frac{\mathbf r_i^T\mathbf r_i}{\mathbf d_i^T\mathbf A\mathbf d_i}$
    2. $\mathbf x_{i+1}=\mathbf x_i+\alpha_i\mathbf d_i$
    3. $\mathbf r_{i+1}=\mathbf r_i-\alpha_i\mathbf A\mathbf d_i$
    4. $\beta_i=\frac{\mathbf r_{i+1}^T\mathbf r_{i+1}}{\mathbf r_i^T\mathbf r_i}$
    5. $\mathbf d_{i+1}=\mathbf r_{i+1}+\beta_i\mathbf d_i$
    6. $i=i+1$
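
    Here is a minimal NumPy sketch of this loop for a symmetric positive definite matrix (the stopping tolerance and the small test system are illustrative choices). Note that the loop touches $\mathbf A$ only through the products $\mathbf A\mathbf d$, which is exactly what makes the Hessian-free approach possible:

      import numpy as np

      def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iters=None):
          """ Solves A x = b for symmetric positive definite A with the CG iteration above. """
          n = b.shape[0]
          x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
          r = b - A @ x            # residual vector
          d = r.copy()             # search direction vector
          for _ in range(max_iters or n):
              rr = r @ r
              if rr < tol:         # stopping condition on the residual norm
                  break
              Ad = A @ d           # the only access to A is through this product
              alpha = rr / (d @ Ad)
              x = x + alpha * d
              r = r - alpha * Ad
              beta = (r @ r) / rr
              d = r + beta * d
          return x

      # Example: a small SPD system
      A = np.array([[4.0, 1.0], [1.0, 3.0]])
      b = np.array([1.0, 2.0])
      print(conjugate_gradient(A, b))  # ~ [0.0909, 0.6364]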

    Now, using the conjugate gradient method, we can solve equation (2) and find the $\delta$ that minimizes (1). In our case: $\mathbf A\equiv H,\ \mathbf b\equiv-\nabla h(\theta),\ \mathbf x\equiv\delta$.
    Stopping the CG algorithm. The conjugate gradient method can be stopped based on various criteria. We will stop based on the relative progress in minimizing the quadratic function $M$:

    $s_i=\frac{M(\delta_i)-M(\delta_{i-w})}{M(\delta_i)-M(0)}\qquad\qquad(3)$

    where $w$ is the size of the "window" over which progress is measured, $w=\max(10,i/10)$. The stopping condition is $s_i<0.0001$.
    Here you can already see the main feature of HF optimization: we never need the Hessian itself, only the result of its product with a vector.

    Hessian multiplication by vector


    As mentioned earlier, the charm of this method is that we do not need to compute the Hessian explicitly; we only need the product of the matrix of second derivatives with a vector. The product $H(\theta)v$ can be viewed as the directional derivative of the gradient $\nabla h(\theta)$ in the direction $v$:

    $H(\theta)v=\lim_{\epsilon\to0}\frac{\nabla h(\theta+\epsilon v)-\nabla h(\theta)}{\epsilon}$

    Using this formula in practice, however, can cause numerical problems related to computing with a sufficiently small $\epsilon$. Fortunately, there is a way to compute the exact matrix-vector product. We introduce the differential operator $R_vx$, which denotes the derivative of a quantity $x$ that depends on $\theta$, in the direction $v$:

    $R_vx=\lim_{\epsilon\to0}\frac{x(\theta+\epsilon v)-x(\theta)}{\epsilon}=\frac{\delta x}{\delta\theta}v$

    This shows that to compute the product of the Hessian with a vector, it suffices to compute:

    $Hv=R(\nabla h(\theta))\qquad\qquad(4)$
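
    As a quick sanity check, here is a small NumPy sketch on a toy quadratic function (an assumption made only for this illustration) comparing the finite-difference approximation of $Hv$ with the exact product; in the TensorFlow implementation below, the exact product (4) is obtained with automatic differentiation instead:

      import numpy as np

      np.random.seed(0)
      n = 5
      A = np.random.randn(n, n)
      H = A @ A.T + n * np.eye(n)      # Hessian of the toy quadratic h(x) = 0.5 x^T H x

      grad_h = lambda x: H @ x         # gradient of the toy quadratic
      x = np.random.randn(n)
      v = np.random.randn(n)

      eps = 1e-6
      Hv_approx = (grad_h(x + eps * v) - grad_h(x)) / eps  # finite-difference formula
      Hv_exact = H @ v                                     # exact Hessian-vector product
      print(np.allclose(Hv_approx, Hv_exact, atol=1e-4))   # True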



    Some improvements to HF optimization


    1. The generalized Gauss-Newton matrix.
    Indefiniteness of the Hessian matrix is a problem when optimizing non-convex functions: it can leave the quadratic function $M$ without a lower bound and, as a result, make it impossible to find its minimum. This problem can be addressed in several ways, for example by introducing a trust region that restricts the optimization, or by damping based on a penalty that adds a positive semi-definite component to the curvature matrix $H$ and makes it positive definite.

    Based on practical results, the best way to deal with this problem is to use the Gauss-Newton matrix $G$ instead of the Hessian:

    $G=\frac{1}{|S|}\sum_{(x,y)\in S}J^TH_LJ\qquad\qquad(5)$

    where $J$ is the Jacobian of the network output with respect to the parameters and $H_L$ is the matrix of second derivatives of the loss function $L(t, f(x,\theta))$ with respect to the network output.
    To compute the product of the matrix $G$ with a vector $v$, $Gv=J^TH_LJv$, we first compute the product of the Jacobian with the vector:

    $Jv=R_v(f(x,\theta))$

    then we multiply the vector $Jv$ by the matrix $H_L$, and finally multiply the matrix $J^T$ by $H_LJv$.

    2. Damping.
    The standard Newton method may optimize strongly nonlinear objective functions poorly. The reason is that in the early stages of optimization the update step can be very large and aggressive, since the starting point is far from the minimum. To address this, damping is used: the quadratic function $M$ or the minimization constraints are modified so that the new $\delta$ stays within limits where $M$ remains a good approximation of $h$.
    Tikhonov regularization, or Tikhonov damping (not to be confused with the term "regularization" as commonly used in machine learning), is the best-known damping method. It adds a quadratic penalty to the function $M$:

    $\hat{M}(\delta)\equiv M(\delta)+\frac{\lambda}{2}\delta^T\delta=h(\theta)+\nabla h(\theta)^T\delta+\frac{1}{2}\delta^T\hat{H}\delta$

    where $\hat{H}=H+\lambda I$ and $\lambda \geq0$ is the damping parameter. The product $\hat{H}v$ is then computed as:

    $\hat{H}v=(H+\lambda I)v=Hv+\lambda v\qquad\qquad(6)$


    3. The Levenberg-Marquardt heuristic.
    Tikhonov damping relies on dynamic tuning of the parameter $\lambda$. We will adjust $\lambda$ according to the Levenberg-Marquardt rule, which is often used in the context of the LM method (an optimization method that is an alternative to Newton's method). To use the LM heuristic, we need to compute the so-called reduction ratio:

    $\rho=\frac{h(\theta_k+\delta_k)-h(\theta_k)}{M_k(\delta_k)-M_k(0)}=\frac{h(\theta_k+\delta_k)-h(\theta_k)}{\nabla h(\theta_k)^T\delta_k+\frac{1}{2}\delta_k^TH\delta_k}\qquad\qquad(7)$

    where $k$ is the step number of the HF algorithm and $\delta_k$ is the result of the CG minimization.
    According to the Levenberg-Marquardt heuristic, the update rule for $\lambda$ is:

    $\begin{cases}if\ \rho>3/4\ then\ \lambda\gets(2/3)\lambda\\if\ \rho<1/4\ then\ \lambda\gets(3/2)\lambda\end{cases}\qquad\qquad(8)$


    4. Preconditioning the conjugate gradient algorithm.
    In the context of HF optimization, we use some invertible transformation matrix $C$ to change variables in $M$ so that $\delta=C^{-1}\gamma$, and we minimize over $\gamma$ instead of $\delta$. Using this inside the CG algorithm requires computing the transformed residual vector $y_i=P^{-1}r_i$, where $P=C^TC$.

    Brief PCG (preconditioned conjugate gradient) algorithm:
    Input: $\mathbf b$, $\mathbf A$, $\mathbf x_0$, $P$, $i=0$ - the CG step counter
    Initialization:
    1. $\mathbf r_0 = \mathbf b - \mathbf A\mathbf x_0$ - residual vector
    2. $\mathbf y_0$ - solution of the equation $Py_0=r_0$
    3. $\mathbf d_0 = \mathbf y_0$ - search direction vector

    Repeat until the stopping condition is satisfied:
    1. $\alpha_i=\frac{\mathbf r_i^T\mathbf y_i}{\mathbf d_i^T\mathbf A\mathbf d_i}$
    2. $\mathbf x_{i+1}=\mathbf x_i + \alpha_i\mathbf d_i$
    3. $\mathbf r_{i+1}=\mathbf r_i - \alpha_i\mathbf A\mathbf d_i$
    4. $\mathbf y_{i+1}$ - solution of the equation $Py_{i+1}=r_{i+1}$
    5. $\beta_i=\frac{\mathbf r_{i+1}^T\mathbf y_{i+1}}{\mathbf r_i^T\mathbf y_i}$
    6. $\mathbf d_{i+1}=\mathbf y_{i+1} + \beta_i\mathbf d_i$
    7. $i = i+1$

    Choosing the matrix $P$ is a rather nontrivial task. In practice, using a diagonal matrix (instead of a full-rank one) gives quite good results. One option for the matrix $P$ is the empirical Fisher diagonal:

    $P=\mathrm{diag}(\bar{F})=\frac{1}{|S|}\sum_{(x,y)\in S}\big(\nabla L(t,f(x,\theta))\big)^2\qquad\qquad(9)$


    5. Initialization of the CG algorithm.
    It is good practice to initialize $\delta_0$ for the conjugate gradient algorithm with the value $\delta_k$ found at the previous step of the HF algorithm, possibly scaled by some decay constant: $\delta_0=\epsilon\delta_k, \ \epsilon \in (0, 1)$. Note that the index $k$ refers to the step number of the HF algorithm, while the index 0 in $\delta_0$ refers to the initial step of the CG algorithm.

    Complete Hessian-Free optimization algorithm (a high-level Python sketch follows the list):
    Input: $\theta$, $\lambda$ - damping parameter, $k$ - iteration counter of the algorithm
    Initialization:
    1. $\delta_0=0$

    Main HF optimization loop:
    1. Compute the matrix $P$
    2. Find $\delta_k$ by solving the minimization problem with CG or PCG: $\delta_k=CG(-\nabla h(\theta_k),H,\epsilon\delta_{k-1},P)$
    3. Update the damping parameter $\lambda$ using the Levenberg-Marquardt heuristic
    4. $\theta_{k+1}=\theta_k+\alpha\delta_k$, where $\alpha$ is the learning rate parameter
    5. $k = k+1$
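
    For orientation, here is a high-level Python sketch of this outer loop. The helper names (compute_loss, compute_gradient, run_cg, quadratic_model) are hypothetical placeholders, not part of the implementation described below:

      def hf_step(theta, delta_prev, lam, compute_loss, compute_gradient,
                  run_cg, quadratic_model, alpha=1.0, eps=0.95):
          """ One outer iteration of HF optimization (sketch, see formulas (1)-(8)). """
          grad = compute_gradient(theta)
          # Minimize the damped quadratic model M with CG/PCG,
          # warm-started from the previous solution (improvement 5).
          delta = run_cg(b=-grad, theta=theta, lam=lam, delta0=eps * delta_prev)
          # Reduction ratio (7); quadratic_model is assumed to return M(delta) - M(0).
          rho = (compute_loss(theta + delta) - compute_loss(theta)) \
              / quadratic_model(theta, delta)
          # Levenberg-Marquardt heuristic (8) for the damping parameter.
          if rho > 0.75:
              lam *= 2.0 / 3.0
          elif rho < 0.25:
              lam *= 3.0 / 2.0
          theta = theta + alpha * delta
          return theta, delta, lam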

    Thus, the Hessian-Free optimization method allows us to find the minimum of a high-dimensional function without computing the Hessian matrix directly.


    Implementing HF Optimization in TensorFlow


    The theory is all well and good, but let's try to implement this optimization method in practice and see what comes of it. I wrote the HF algorithm in Python using the TensorFlow deep learning library. Then, as a check, I trained a feedforward network with several layers on the XOR and MNIST datasets, using the HF method for optimization.

    Implementation (building the TensorFlow computation graph) of the conjugate gradient method.

      def __conjugate_gradient(self, gradients):
        """ Performs conjugate gradient method to minimze quadratic equation 
        and find best delta of network parameters.
        gradients: list of Tensorflow tensor objects
            Network gradients.
        return: Tensorflow tensor object
            Update operation for delta.
        return: Tensorflow tensor object
            Residual norm, used to prevent numerical errors. 
        return: Tensorflow tensor object
            Delta loss. """
        with tf.name_scope('conjugate_gradient'):
          cg_update_ops = []
          prec = None
          # computing the matrix P according to formula (9)
          if self.use_prec:
            if self.prec_loss is None:
              graph = tf.get_default_graph()
              lop = self.loss.op.node_def
              self.prec_loss = graph.get_tensor_by_name(lop.input[0] + ':0')
            batch_size = None
            if self.batch_size is None:
              self.prec_loss = tf.unstack(self.prec_loss)
              batch_size = self.prec_loss.get_shape()[0]
            else:
              self.prec_loss = [tf.gather(self.prec_loss, i)
                for i in range(self.batch_size)]
              batch_size = len(self.prec_loss)
            prec = [[g**2 for g in tf.gradients(tf.gather(self.prec_loss, i),
              self.W)] for i in range(batch_size)]
            prec = [(sum(tensor) + self.damping)**(-0.75)
              for tensor in np.transpose(np.array(prec))]
          # main conjugate gradient algorithm
          Ax = None
          if self.use_gnm:
            Ax = self.__Gv(self.cg_delta)
          else:
            Ax = self.__Hv(gradients, self.cg_delta)
          b = [-grad for grad in gradients]
          bAx = [b - Ax for b, Ax  in zip(b, Ax)]
          condition = tf.equal(self.cg_step, 0)
          r = [tf.cond(condition, lambda: tf.assign(r,  bax),
            lambda: r) for r, bax  in zip(self.residuals, bAx)]
          d = None
          if self.use_prec:
            d = [tf.cond(condition, lambda: tf.assign(d, p * r), 
              lambda: d) for  p, d, r in zip(prec, self.directions, r)]
          else:
            d = [tf.cond(condition, lambda: tf.assign(d, r), 
              lambda: d) for  d, r in zip(self.directions, r)]
          Ad = None
          if self.use_gnm:
            Ad = self.__Gv(d)
          else:
            Ad = self.__Hv(gradients, d)
          residual_norm = tf.reduce_sum([tf.reduce_sum(r**2) for r in r])
          alpha = tf.reduce_sum([tf.reduce_sum(d * ad) for d, ad in zip(d, Ad)])
          alpha = residual_norm / alpha
          if self.use_prec:
            beta = tf.reduce_sum([tf.reduce_sum(p * (r - alpha * ad)**2) 
              for r, ad, p in zip(r, Ad, prec)])
          else:
            beta = tf.reduce_sum([tf.reduce_sum((r - alpha * ad)**2) for r, ad 
              in zip(r, Ad)])
          self.beta = beta
          beta = beta / residual_norm
          for i, delta  in reversed(list(enumerate(self.cg_delta))):
            update_delta = tf.assign(delta, delta + alpha * d[i],
              name='update_delta')
            update_residual = tf.assign(self.residuals[i], r[i] - alpha * Ad[i],
              name='update_residual')
            p = 1.0
            if self.use_prec:
              p = prec[i]
            update_direction = tf.assign(self.directions[i],
              p * (r[i] - alpha * Ad[i]) + beta * d[i], name='update_direction')
            cg_update_ops.append(update_delta)
            cg_update_ops.append(update_residual)
            cg_update_ops.append(update_direction)
          with tf.control_dependencies(cg_update_ops):
            cg_update_ops.append(tf.assign_add(self.cg_step, 1))
          cg_op = tf.group(*cg_update_ops)
        dl = tf.reduce_sum([tf.reduce_sum(0.5*(delta*ax) + grad*delta)
          for delta, grad, ax in zip(self.cg_delta, gradients, Ax)])
        return cg_op, residual_norm, dl
    

    The code for computing the matrix $P$ used for preconditioning is given below. Since TensorFlow sums the gradients over the whole batch of training examples, I had to do a bit of trickery to get the gradient for each example separately, which affected the numerical stability of the solution. So preconditioning should be used, as they say, at your own risk.

     prec = [[g**2 for g in tf.gradients(tf.gather(self.prec_loss, i),
       self.W)] for i in range(batch_size)]
    

    Computing the product of the Hessian with a vector (4). Tikhonov damping (6) is used here.

      def __Hv(self, grads, vec):
        """ Computes Hessian vector product.
        grads: list of Tensorflow tensor objects
            Network gradients.
        vec: list of Tensorflow tensor objects
            Vector that is multiplied by the Hessian.
        return: list of Tensorflow tensor objects
            Result of multiplying Hessian by vec. """
        grad_v = [tf.reduce_sum(g * v) for g, v in zip(grads, vec)]
        Hv = tf.gradients(grad_v, self.W, stop_gradients=vec)
        Hv = [hv + self.damp_pl * v for hv, v in zip(Hv, vec)]
        return Hv
    

    When I wanted to use the generalized Gauss-Newton matrix (5), I ran into a small problem: TensorFlow cannot compute the Jacobian-vector product out of the box, unlike the Theano deep learning framework (Theano has an Rop function specifically designed for this). I had to implement an analogous operation in TensorFlow.

      def __Rop(self, f, x, vec):
        """ Computes Jacobian vector product.
        f: Tensorflow tensor object
            Objective function.
        x: list of Tensorflow tensor objects
            Parameters with respect to which computes Jacobian matrix.
        vec: list of Tensorflow tensor objects
            Vector that is multiplied by the Jacobian.
        return: list of Tensorflow tensor objects
            Result of multiplying Jacobian (df/dx) by vec. """
        r = None
        if self.batch_size is None:
          try:
            r = [tf.reduce_sum([tf.reduce_sum(v * tf.gradients(f, x)[i])
                 for i, v in enumerate(vec)])
                 for f in tf.unstack(f)]
          except ValueError:
            assert False, clr.FAIL + clr.BOLD + 'Batch size is None, but used '\
              'dynamic shape for network input, set proper batch_size in '\
              'HFOptimizer initialization' + clr.ENDC
        else:
          r = [tf.reduce_sum([tf.reduce_sum(v * tf.gradients(tf.gather(f, i), x)[j]) 
               for j, v in enumerate(vec)])
               for i in range(self.batch_size)]
        assert r is not None, clr.FAIL + clr.BOLD +\
          'Something went wrong in Rop computation' + clr.ENDC
        return r
    

    And then implement the product of the generalized Gauss-Newton matrix with a vector.

      def __Gv(self, vec):
        """ Computes the product G by vec = JHJv (G is the Gauss-Newton matrix).
        vec: list of Tensorflow tensor objects
            Vector that is multiplied by the Gauss-Newton matrix.
        return: list of Tensorflow tensor objects
            Result of multiplying Gauss-Newton matrix by vec. """
        Jv = self.__Rop(self.output, self.W, vec)
        Jv = tf.reshape(tf.stack(Jv), [-1, 1])
        HJv = tf.gradients(tf.matmul(tf.transpose(tf.gradients(self.loss,
          self.output)[0]), Jv), self.output, stop_gradients=Jv)[0]
        JHJv = tf.gradients(tf.matmul(tf.transpose(HJv), self.output), self.W,
          stop_gradients=HJv)
        JHJv = [gv + self.damp_pl * v for gv, v in zip(JHJv, vec)]
        return JHJv
    

    The main training function is shown below. First, the quadratic function is minimized using CG/PCG, then the network weights are updated. The damping parameter is also adjusted using the Levenberg-Marquardt heuristic.

      def minimize(self, feed_dict, debug_print=False):
        """ Performs main training operations.
        feed_dict: dictionary
            Input training batch.
        debug_print: bool
            If True prints CG iteration number. """
        self.sess.run(tf.assign(self.cg_step, 0))
        feed_dict.update({self.damp_pl:self.damping})
        if self.adjust_damping:
          loss_before_cg = self.sess.run(self.loss, feed_dict)
        dl_track = [self.sess.run(self.ops['dl'], feed_dict)]
        self.sess.run(self.ops['set_delta_0'])
        for i in range(self.cg_max_iters):
          if debug_print:
            d_info = clr.OKGREEN + '\r[CG iteration: {}]'.format(i) + clr.ENDC
            sys.stdout.write(d_info)
            sys.stdout.flush()
          k = max(self.gap, i // self.gap)
          rn = self.sess.run(self.ops['res_norm'], feed_dict)
          # early stopping to prevent numerical errors
          if rn < self.cg_num_err:
            break
          self.sess.run(self.ops['cg_update'], feed_dict)
          dl_track.append(self.sess.run(self.ops['dl'], feed_dict))
          # early stopping of the algorithm based on formula (3)
          if i > k:
            stop = (dl_track[i+1] - dl_track[i+1-k]) / dl_track[i+1]
            if not np.isnan(stop) and stop < 1e-4:
              break
        if debug_print:
          sys.stdout.write('\n')
          sys.stdout.flush()
        if self.adjust_damping:
          feed_dict.update({self.damp_pl:0})
          dl = self.sess.run(self.ops['dl'], feed_dict)
          feed_dict.update({self.damp_pl:self.damping})
        self.sess.run(self.ops['train'], feed_dict)
        if self.adjust_damping:
          loss_after_cg = self.sess.run(self.loss, feed_dict)
          # reduction ratio (7)
          reduction_ratio = (loss_after_cg - loss_before_cg) / dl
          # Levenberg-Marquardt heuristic (8)
          if reduction_ratio < 0.25 and self.damping > self.damp_num_err:
            self.damping *= 1.5
          elif reduction_ratio > 0.75 and self.damping > self.damp_num_err:
            self.damping /= 1.5
    

    Testing HF Optimization


    We will now test the HF optimizer, first on a simple example with the XOR dataset and then on a more complex one with the MNIST dataset. To see the training results and visualize some information, we will use TensorBoard. I would also note that the resulting TensorFlow computation graph turned out to be rather complex.


    TensorFlow Computing Graph.

    Network architecture and training on the XOR dataset.
    Let's create a simple network with 2 input neurons, 2 hidden and 1 output. We will use the sigmoid as the activation function and log-loss as the loss function.

      # defining the loss function
      """ Log-loss cost function """
      loss = tf.reduce_mean(( (y * tf.log(out)) + 
        ((1 - y) * tf.log(1.0 - out)) ) * -1, name='log_loss')
      # XOR dataset
      XOR_X = [[0,0],[0,1],[1,0],[1,1]]
      XOR_Y = [[0],[1],[1],[0]]
      # creating the optimizer
      sess = tf.Session()
      hf_optimizer = HFOptimizer(sess, loss, y_out, 
        dtype=tf.float64, use_gauss_newton_matrix=True)
      init = tf.initialize_all_variables()
      sess.run(init)
      # training loop
      max_epoches = 100
      print('Begin Training')
      for i in range(max_epoches):
        feed_dict = {x: XOR_X, y: XOR_Y}
        hf_optimizer.minimize(feed_dict=feed_dict)
        if i % 10 == 0:
          print('Epoch:', i, 'cost:', sess.run(loss, feed_dict=feed_dict))
          print('Hypothesis ', sess.run(out, feed_dict=feed_dict))
    

    Now let's compare the training results of HF optimization with the Hessian matrix, HF optimization with the Gauss-Newton matrix, and plain gradient descent with a learning rate of 0.01. The number of iterations is 100.


    Loss for gradient descent (red line). Loss for HF optimization with the Hessian matrix (orange line). Loss for HF optimization with the Gauss-Newton matrix (blue line).

    It is clear that HF optimization converges fastest with the Gauss-Newton matrix, while for gradient descent 100 iterations turned out to be far too few. For the gradient descent loss to become comparable with HF optimization, it took about 100,000 iterations.


    Loss for gradient descent, 100,000 iterations.

    Network architecture and training on the MNIST dataset.
    To solve the handwritten digit recognition problem, we will create a network with 784 input neurons, 300 hidden and 10 output. As the loss function we will use cross-entropy. The mini-batch size during training is 50.

      with tf.name_scope('loss'):
        one_hot = tf.one_hot(t, n_outputs, dtype=tf.float64)
        xentropy = tf.nn.softmax_cross_entropy_with_logits(labels=one_hot, logits=y_out)
        loss = tf.reduce_mean(xentropy, name="loss")
      with tf.name_scope("eval"):
        correct = tf.nn.in_top_k(tf.cast(y_out, tf.float32), t, 1)
        accuracy = tf.reduce_mean(tf.cast(correct, tf.float64))
      n_epochs = 10
      batch_size = 50
      with tf.Session() as sess:
        """ Initializing hessian free optimizer """
        hf_optimizer = HFOptimizer(sess, loss, y_out, dtype=tf.float64, batch_size=batch_size, 
          use_gauss_newton_matrix=True)
        init = tf.global_variables_initializer()
        init.run()
        # main training loop
        for epoch in range(n_epochs):
          n_batches = mnist.train.num_examples // batch_size
          for iteration in range(n_batches):
            x_batch, t_batch = mnist.train.next_batch(batch_size)
            hf_optimizer.minimize({x: x_batch, t: t_batch})
            if iteration%10==0:
              print('Batch:', iteration, '/', n_batches)
              acc_train = accuracy.eval(feed_dict={x: x_batch, t: t_batch})
              acc_test = accuracy.eval(feed_dict={x: mnist.test.images,
                            t: mnist.test.labels})
              print('Loss:', sess.run(loss, {x: x_batch, t: t_batch}))
              print('Target', t_batch[0])
              print('Out:', sess.run(y_out_sm, {x: x_batch, t: t_batch})[0])
              print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)
          acc_train = accuracy.eval(feed_dict={x: x_batch, t: t_batch})
          acc_test = accuracy.eval(feed_dict={x: mnist.test.images,
                            t: mnist.test.labels})
          print(epoch, "Train accuracy:", acc_train, "Test accuracy:", acc_test)
    

    As in the XOR case, we compare the training results of HF optimization with the Hessian matrix, HF optimization with the Gauss-Newton matrix, and plain gradient descent with a learning rate of 0.01. The number of iterations is 200, i.e. with a mini-batch size of 50, 200 iterations is less than a full epoch (not all examples from the training set are used). I did this to test everything faster, but even so the general trend is visible.


    The figure on the left shows the accuracy on the test set; the figure on the right, the accuracy on the training set. Accuracy for gradient descent (red line). Accuracy for HF optimization with the Hessian matrix (orange line). Accuracy for HF optimization with the Gauss-Newton matrix (blue line).


    Loss for gradient descent (red line). Loss for HF optimization with the Hessian matrix (orange line). Loss for HF optimization with the Gauss-Newton matrix (blue line).

    As can be seen from the figures above, HF optimization with the Hessian matrix does not behave very stably, but in the end it still converges when trained for several epochs. The best result is shown by HF optimization with the Gauss-Newton matrix.


    One full epoch of training. The figure on the left shows the accuracy on the test set; the figure on the right, the accuracy on the training set. Accuracy for gradient descent (turquoise line). Accuracy for HF optimization with the Gauss-Newton matrix (pink line).


    One full epoch of training. Loss for gradient descent (turquoise line). Loss for HF optimization with the Gauss-Newton matrix (pink line).

    When using the conjugate gradient method with preconditioning (PCG), the computations slowed down significantly and converged no faster than with plain CG.


    Loss for HF optimization using PCG algorithm.

    From all these graphs you can see that the best result was achieved by HF optimization using the Gauss-Newton matrix and the standard conjugate gradient method.
    The full code can be viewed on GitHub.


    Summary


    As a result, an implementation of the HF algorithm was created in Python using the TensorFlow library. During development I ran into some problems implementing the main features of the algorithm, namely support for the Gauss-Newton matrix and preconditioning. This is because TensorFlow is not as flexible a library as one would like and is not really designed for research. For experimental purposes Theano is still the better choice, since it gives more freedom, but I had decided from the start to do all of this with TensorFlow. The program was tested, and it showed that the HF algorithm with the Gauss-Newton matrix gives the best results.


    Sources


    In this article, the theoretical aspects of Hessian-Free optimization are described rather briefly, just enough to grasp the essence of the algorithms. If a more detailed treatment is needed, below are the sources from which I took the basic theory and on the basis of which the Python implementation of the HF method was made.

    1) Training Deep and Recurrent Networks with Hessian-Free Optimization (James Martens and Ilya Sutskever, University of Toronto) - a complete description of HF optimization.
    2) Deep learning via Hessian-free optimization (James Martens, University of Toronto) - an article with the results of applying HF optimization.
    3) Fast Exact Multiplication by the Hessian (Barak A. Pearlmutter, Siemens Corporate Research) - a detailed description of multiplying the Hessian matrix by a vector.
    4) An Introduction to the Conjugate Gradient Method Without the Agonizing Pain (Jonathan Richard Shewchuk, Carnegie Mellon University) - a detailed description of the conjugate gradient method.
