SciPy Optimization


SciPy (pronounced "Sigh Pie") is a mathematical package built on the NumPy Python extension. With SciPy, an interactive Python session turns into a comprehensive data-processing and system-prototyping environment comparable to MATLAB, IDL, Octave, R-Lab and SciLab. In this article I want to briefly show how to apply some well-known optimization algorithms from the scipy.optimize package. More detailed and up-to-date help on any function can always be obtained with the help() command or with Shift + Tab.


Introduction


To spare myself and the readers from searching and reading primary sources, the links describing the methods will mostly point to Wikipedia. As a rule, this information is sufficient to understand the methods in general terms and the conditions of their use. To grasp the essence of the mathematical methods, follow the links to more authoritative publications, which can be found at the end of each article or in your favorite search engine.


So, the scipy.optimize module includes the implementation of the following procedures:


  1. Constrained and unconstrained minimization of scalar functions of several variables (minimize) using various algorithms (Nelder-Mead simplex, BFGS, Newton conjugate gradient, COBYLA and SLSQP)
  2. Global optimization (for example: basinhopping, differential_evolution)
  3. Least-squares minimization (least_squares) and curve-fitting algorithms based on nonlinear least squares (curve_fit)
  4. Minimization of scalar functions of one variable (minimize_scalar) and root finding (root_scalar)
  5. Multidimensional solvers for systems of equations (root) using various algorithms (hybrid Powell, Levenberg-Marquardt or large-scale methods such as Newton-Krylov).

In this article we will consider only the first item from this entire list.


Unconstrained minimization of scalar functions of several variables


The minimize function from the scipy.optimize package provides a common interface for solving constrained and unconstrained minimization problems for scalar functions of several variables. To demonstrate how it works, we need a suitable function of several variables that we will minimize in different ways.


For these purposes, the Rosenbrock function of N variables is a perfect fit; it has the form:


$f\left(\mathbf{x}\right)=\sum_{i=1}^{N-1}\left[100\left(x_{i+1}-x_{i}^{2}\right)^{2}+\left(1-x_{i}\right)^{2}\right]$


Although the Rosenbrock function and its Jacobian and Hessian matrices (the first and second derivatives, respectively) are already defined in the scipy.optimize package, we will define it ourselves.


import numpy as np

def rosen(x):
    """The Rosenbrock function"""
    return np.sum(100.0*(x[1:]-x[:-1]**2.0)**2.0 + (1-x[:-1])**2.0, axis=0)
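For reference, SciPy itself ships these routines as scipy.optimize.rosen, rosen_der and rosen_hess; assuming a reasonably recent SciPy version, the built-in implementation can be compared with the hand-written one at an arbitrary test point:

from scipy.optimize import rosen as sp_rosen
# Compare the built-in implementation with the hand-written one at a test point
x_test = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
print(np.allclose(sp_rosen(x_test), rosen(x_test)))  # expected: True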

For clarity, let us plot the values of the Rosenbrock function of two variables in 3D.


Drawing code
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.ticker import LinearLocator, FormatStrFormatter
# Set up the 3D plot
fig = plt.figure(figsize=[15, 10])
ax = fig.add_subplot(projection='3d')
# Set the viewing angle
ax.view_init(45, 30)
# Create the data for the plot
X = np.arange(-2, 2, 0.1)
Y = np.arange(-1, 3, 0.1)
X, Y = np.meshgrid(X, Y)
Z = rosen(np.array([X, Y]))
# Draw the surface
surf = ax.plot_surface(X, Y, Z, cmap=cm.coolwarm)
plt.show()


Knowing in advance that the minimum is 0 when $x_i = 1$, let us look at examples of how to determine the minimum value of the Rosenbrock function using various scipy.optimize procedures.
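As a quick sanity check (this snippet is not part of the original text), the function defined above can be evaluated at the known minimum:

# The Rosenbrock function is zero at x = (1, ..., 1)
print(rosen(np.ones(5)))  # expected: 0.0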


Nelder-Mead Simplex Method (Nelder-Mead)


Suppose there is a starting point x0 in five-dimensional space. Let us find the minimum point of the Rosenbrock function closest to it using the Nelder-Mead simplex algorithm (the algorithm is specified as the value of the method parameter):


from scipy.optimize import minimize
x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
res = minimize(rosen, x0, method='nelder-mead',
    options={'xatol': 1e-8, 'disp': True})
print(res.x)

Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 339
         Function evaluations: 571
[1. 1. 1. 1. 1.]

The simplex method is the simplest way to minimize a well-defined and fairly smooth function. It does not require computing the derivatives of the function; it is enough to specify only its values. The Nelder-Mead method is a good choice for simple minimization problems. However, since it does not use gradient estimates, it may take longer to find the minimum.


Powell Method


Another optimization algorithm in which only function values are computed is Powell's method. To use it, set method='powell' in the minimize function.


x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])
res = minimize(rosen, x0, method='powell',
    options={'xtol': 1e-8, 'disp': True})
print(res.x)

Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 19
         Function evaluations: 1622
[1. 1. 1. 1. 1.]

The Broyden-Fletcher-Goldfarb-Shanno algorithm (BFGS)


To obtain faster convergence to the solution, the BFGS procedure uses the gradient of the objective function. The gradient can be specified as a function or computed using first-order differences. In either case, the BFGS method usually requires fewer function calls than the simplex method.
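For illustration (this variant is not in the original article), the jac argument can simply be omitted, in which case minimize estimates the gradient with finite differences at the cost of extra function evaluations:

# BFGS with a numerically approximated gradient (no jac supplied)
res = minimize(rosen, x0, method='BFGS', options={'disp': True})
print(res.x)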


Find the derivative of the Rosenbrock function in analytical form:


$\frac{\partial f}{\partial x_j} = \sum\limits_{i=1}^{N} 200\left(x_i - x_{i-1}^2\right)\left(\delta_{i,j} - 2x_{i-1}\delta_{i-1,j}\right) - 2\left(1 - x_{i-1}\right)\delta_{i-1,j} =$


$= 200\left(x_j - x_{j-1}^2\right) - 400x_j\left(x_{j+1} - x_j^2\right) - 2\left(1 - x_j\right)$


This expression is valid for derivatives of all variables except the first and last, which are defined as:


$\frac{\partial f}{\partial x_0} = -400x_0\left(x_1 - x_0^2\right) - 2\left(1 - x_0\right),$


$\frac{\partial f}{\partial x_{N-1}} = 200\left(x_{N-1} - x_{N-2}^2\right).$


Let's look at the Python function that computes this gradient:


def rosen_der(x):
    xm = x[1:-1]
    xm_m1 = x[:-2]
    xm_p1 = x[2:]
    der = np.zeros_like(x)
    der[1:-1] = 200*(xm - xm_m1**2) - 400*(xm_p1 - xm**2)*xm - 2*(1 - xm)
    der[0] = -400*x[0]*(x[1] - x[0]**2) - 2*(1 - x[0])
    der[-1] = 200*(x[-1] - x[-2]**2)
    return der
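As an optional check (added here, not in the original article), the analytical gradient can be compared against a finite-difference approximation with scipy.optimize.check_grad; the returned norm of the difference should be close to zero:

from scipy.optimize import check_grad
# check_grad returns the 2-norm of the difference between rosen_der and a
# finite-difference approximation of the gradient at the given point
print(check_grad(rosen, rosen_der, np.array([1.3, 0.7, 0.8, 1.9, 1.2])))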

The gradient-calculation function is specified as the value of the jac parameter of the minimize function, as shown below.


res = minimize(rosen, x0, method='BFGS', jac=rosen_der, options={'disp': True})
print(res.x)

Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 25
         Function evaluations: 30
         Gradient evaluations: 30
[1.00000004 1.0000001  1.00000021 1.00000044 1.00000092]

Newton Conjugate Gradient Algorithm (Newton-CG)


The Newton conjugate gradient algorithm is a modification of Newton's method.
Newton's method is based on approximating a function in a local neighborhood by a second-degree polynomial:


$f\left(\mathbf{x}\right)\approx f\left(\mathbf{x}_{0}\right)+\nabla f\left(\mathbf{x}_{0}\right)\cdot\left(\mathbf{x}-\mathbf{x}_{0}\right)+\frac{1}{2}\left(\mathbf{x}-\mathbf{x}_{0}\right)^{T}\mathbf{H}\left(\mathbf{x}_{0}\right)\left(\mathbf{x}-\mathbf{x}_{0}\right)$


where $\mathbf{H}\left(\mathbf{x}_{0}\right)$ is the matrix of second derivatives (the Hessian matrix).
If the Hessian is positive definite, the local minimum of this function can be found by setting the gradient of the quadratic form to zero. The result is the expression:


$\mathbf{x}_{\textrm{opt}} = \mathbf{x}_{0} - \mathbf{H}^{-1}\nabla f$


The inverse Hessian is computed using the conjugate gradient method. An example of using this method to minimize the Rosenbrock function is given below. To use the Newton-CG method, you must define a function that computes the Hessian.
The Hessian of the Rosenbrock function in analytical form is:


$H_{i,j} = \frac{\partial^2 f}{\partial x_i \partial x_j} = 200\left(\delta_{i,j} - 2x_{i-1}\delta_{i-1,j}\right) - 400x_i\left(\delta_{i+1,j} - 2x_i\delta_{i,j}\right) - 400\delta_{i,j}\left(x_{i+1} - x_i^2\right) + 2\delta_{i,j} =$


$= \left(202 + 1200x_i^2 - 400x_{i+1}\right)\delta_{i,j} - 400x_i\delta_{i+1,j} - 400x_{i-1}\delta_{i-1,j}$


where $i, j \in \left[1, N-2\right]$, with $i, j \in \left[0, N-1\right]$ defining the $N \times N$ matrix.


The remaining nonzero elements of the matrix are equal:


$\frac{\partial^2 f}{\partial x_0^2} = 1200x_0^2 - 400x_1 + 2$


$\frac{\partial^2 f}{\partial x_0 \partial x_1} = \frac{\partial^2 f}{\partial x_1 \partial x_0} = -400x_0$


$\frac{\partial^2 f}{\partial x_{N-1} \partial x_{N-2}} = \frac{\partial^2 f}{\partial x_{N-2} \partial x_{N-1}} = -400x_{N-2}$


$\frac{\partial^2 f}{\partial x_{N-1}^2} = 200$


For example, in five-dimensional space (N = 5), the Hessian matrix of the Rosenbrock function has a banded form:


$\tiny \mathbf{H}=\begin{bmatrix} 1200x_{0}^{2}-400x_{1}+2 & -400x_{0} & 0 & 0 & 0 \\ -400x_{0} & 202+1200x_{1}^{2}-400x_{2} & -400x_{1} & 0 & 0 \\ 0 & -400x_{1} & 202+1200x_{2}^{2}-400x_{3} & -400x_{2} & 0 \\ 0 & 0 & -400x_{2} & 202+1200x_{3}^{2}-400x_{4} & -400x_{3} \\ 0 & 0 & 0 & -400x_{3} & 200 \end{bmatrix}$


The code that calculates this Hessian along with the code to minimize the Rosenbrock function using the conjugate gradient method (Newton):


def rosen_hess(x):
    x = np.asarray(x)
    H = np.diag(-400*x[:-1], 1) - np.diag(400*x[:-1], -1)
    diagonal = np.zeros_like(x)
    diagonal[0] = 1200*x[0]**2 - 400*x[1] + 2
    diagonal[-1] = 200
    diagonal[1:-1] = 202 + 1200*x[1:-1]**2 - 400*x[2:]
    H = H + np.diag(diagonal)
    return H

res = minimize(rosen, x0, method='Newton-CG',
               jac=rosen_der, hess=rosen_hess,
               options={'xtol': 1e-8, 'disp': True})
print(res.x)

Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 24
         Function evaluations: 33
         Gradient evaluations: 56
         Hessian evaluations: 24
[1.         1.         1.         0.99999999 0.99999999]

An example with the definition of the function of the product of the Hessian and an arbitrary vector


In real-world problems, computing and storing the entire Hessian matrix may require considerable time and memory. In this case there is actually no need to specify the Hessian matrix itself, because the minimization procedure needs only a vector equal to the product of the Hessian with another arbitrary vector. Thus, from a computational point of view, it is much preferable to define directly a function that returns the result of the product of the Hessian with an arbitrary vector.


Consider the hessp function, which takes the minimization vector as its first argument and an arbitrary vector as its second argument (along with the other arguments of the function being minimized). In our case, computing the product of the Hessian of the Rosenbrock function with an arbitrary vector is not very difficult. If p is an arbitrary vector, then the product $H(x) \cdot p$ has the form:


$\mathbf{H}\left(\mathbf{x}\right)\mathbf{p}=\begin{bmatrix} \left(1200x_{0}^{2}-400x_{1}+2\right)p_{0}-400x_{0}p_{1} \\ \vdots \\ -400x_{i-1}p_{i-1}+\left(202+1200x_{i}^{2}-400x_{i+1}\right)p_{i}-400x_{i}p_{i+1} \\ \vdots \\ -400x_{N-2}p_{N-2}+200p_{N-1} \end{bmatrix}.$


The function that calculates the product of the Hessian and an arbitrary vector is passed as the value of the hessp argument of the minimize function:


def rosen_hess_p(x, p):
    x = np.asarray(x)
    Hp = np.zeros_like(x)
    Hp[0] = (1200*x[0]**2 - 400*x[1] + 2)*p[0] - 400*x[0]*p[1]
    Hp[1:-1] = (-400*x[:-2]*p[:-2]
                + (202 + 1200*x[1:-1]**2 - 400*x[2:])*p[1:-1]
                - 400*x[1:-1]*p[2:])
    Hp[-1] = -400*x[-2]*p[-2] + 200*p[-1]
    return Hp
res = minimize(rosen, x0, method='Newton-CG',
               jac=rosen_der, hessp=rosen_hess_p,
               options={'xtol': 1e-8, 'disp': True})

Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 24
         Function evaluations: 33
         Gradient evaluations: 56
         Hessian evaluations: 66
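A quick consistency check (added here, not in the original): the product returned by rosen_hess_p should match multiplying the full Hessian from rosen_hess by the same vector:

# Compare the explicit Hessian-vector product with the full Hessian times p
p = np.random.rand(5)
print(np.allclose(rosen_hess(x0) @ p, rosen_hess_p(x0, p)))  # expected: True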

Trust-Region Newton Conjugate Gradient Algorithm (trust-ncg)


An ill-conditioned Hessian matrix and wrong search directions can make the Newton conjugate gradient algorithm ineffective. In such cases, preference is given to the trust-region method of Newton conjugate gradients (trust-ncg).


An example with the definition of the Hessian matrix:


res = minimize(rosen, x0, method='trust-ncg',
               jac=rosen_der, hess=rosen_hess,
               options={'gtol': 1e-8, 'disp': True})
print(res.x)

Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 20
         Function evaluations: 21
         Gradient evaluations: 20
         Hessian evaluations: 19
[1. 1. 1. 1. 1.]

An example with the function for the product of the Hessian and an arbitrary vector:


res = minimize(rosen, x0, method='trust-ncg', 
                jac=rosen_der, hessp=rosen_hess_p, 
                options={'gtol': 1e-8, 'disp': True})
print(res.x)

Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 20
         Function evaluations: 21
         Gradient evaluations: 20
         Hessian evaluations: 0
[1. 1. 1. 1. 1.]

Krylov-type methods (trust-krylov)


Like the trust-ncg method, Krylov-type methods are well suited to solving large-scale problems, since they use only matrix-vector products. Their essence is to solve the problem in a trust region bounded by a truncated Krylov subspace. For indefinite problems it is better to use this method, since it usually requires fewer nonlinear iterations at the cost of slightly more matrix-vector products per subproblem compared to the trust-ncg method. In addition, the quadratic subproblem is solved more accurately than with the trust-ncg method.
An example with the definition of the Hessian matrix:


res = minimize(rosen, x0, method='trust-krylov',
               jac=rosen_der, hess=rosen_hess,
               options={'gtol': 1e-8, 'disp': True})
print(res.x)

Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 19
         Function evaluations: 20
         Gradient evaluations: 20
         Hessian evaluations: 18
[1. 1. 1. 1. 1.]

An example with the function for the product of the Hessian and an arbitrary vector:


res = minimize(rosen, x0, method='trust-krylov',
               jac=rosen_der, hessp=rosen_hess_p,
               options={'gtol': 1e-8, 'disp': True})
print(res.x)

Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 19
         Function evaluations: 20
         Gradient evaluations: 20
         Hessian evaluations: 0
[1. 1. 1. 1. 1.]

Trust-Region Nearly Exact Algorithm (trust-exact)


All of these methods (Newton-CG, trust-ncg and trust-krylov) are well suited to solving large-scale problems (with thousands of variables). The reason is that the underlying conjugate gradient algorithm implies an approximate inversion of the Hessian: the solution is found iteratively, without an explicit factorization of the Hessian. Since only a function for the product of the Hessian and an arbitrary vector needs to be defined, this approach is especially good for working with sparse (banded) matrices. This results in low memory costs and significant time savings.
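To illustrate the large-scale claim, here is a sketch (not from the original article; the dimension and starting point are chosen arbitrarily) that minimizes a 1000-variable Rosenbrock function using only the gradient and Hessian-vector products:

# Minimal sketch: a 1000-dimensional Rosenbrock problem solved with trust-ncg;
# no full Hessian is ever formed, only Hessian-vector products are evaluated
n = 1000
x0_big = np.full(n, 1.2)
res_big = minimize(rosen, x0_big, method='trust-ncg',
                   jac=rosen_der, hessp=rosen_hess_p,
                   options={'gtol': 1e-8})
print(res_big.success, res_big.nit)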


For medium-sized problems, the costs of storing and factorizing the Hessian are not critical. This means that a solution can be obtained in fewer iterations by solving the trust-region subproblems almost exactly. To do this, certain nonlinear equations are solved iteratively for each quadratic subproblem. Such a solution usually requires 3 or 4 Cholesky factorizations of the Hessian matrix. As a result, the method converges in fewer iterations and requires fewer evaluations of the objective function than the other trust-region methods implemented. This algorithm requires the full Hessian matrix and does not support the use of a Hessian-vector product function.


An example with minimizing the Rosenbrock function:


res = minimize(rosen, x0, method='trust-exact',
               jac=rosen_der, hess=rosen_hess,
               options={'gtol': 1e-8, 'disp': True})
res.x

Optimization terminated successfully.
         Current function value: 0.000000
         Iterations: 13
         Function evaluations: 14
         Gradient evaluations: 13
         Hessian evaluations: 14
array([1., 1., 1., 1., 1.])

This is, perhaps, a good place to stop. In the next article I will try to cover the most interesting aspects of constrained minimization, applying minimization to approximation problems, minimizing a function of one variable, custom minimizers, and finding the roots of a system of equations with the scipy.optimize package.


Source: https://docs.scipy.org/doc/scipy/reference/

