An introduction to robust optimization [... and a small shopping list that I forgot ...]

How many people should we hire for a new fulfillment center, what exactly should it stock, and where should a specific product go? The larger the business becomes, the higher the uncertainty and the more expensive a mistake. Taming chaos and choosing the optimal solution is one of the tasks of a data science team. And since data analysis rests on mathematics, let's begin there.

In this post we will consider optimization problems with uncertainty in the data and their approximation by deterministic convex problems. This is one of the main tricks of robust optimization, a technique for coping with optimization problems that are overly sensitive to changes in the input data.

The question of sensitivity matters a lot. For problems whose solution quality depends only weakly on perturbations of the data, it is simpler to use ordinary stochastic optimization. In highly sensitive problems, however, that approach gives poor results. There are many such problems in finance, supply chain management, engineering design, and other areas.

And yes, this is one of those posts whose complexity grows exponentially as you read (sorry in advance) ...

What does it mean to “solve” an optimization problem?

Let's start with a brief reminder.

The optimization task in general form looks like this:

$\min_{x \in R^n} f(x)\\ s.t.\\ x \in X$

Here $f(x)$ is called the objective function and $X$ the feasible set.

A solution of the optimization problem is a point $x^* \in X$ such that:

$f(x) - f(x^*) \geq 0, \quad \forall x \in X$

This is the standard notion of a solution for an optimization problem without uncertainty.

What is an optimization problem with uncertainty?

It is time to ask where the function $f(x)$ and the constraints $X$ come from.

It is very useful to distinguish between

  • the structural logic of the problem (in other words, which functions are used),
  • technical constraints (independent of human choices or of the data),
  • parameters that are estimated from the data.

For example, suppose someone from the business side came to us with a linear programming problem:

$\min_{x \in R^2} 2.16 x_1 + 3.7 x_2\\ s.t.\\ 0.973 x_1 + 2.619 x_2 \leq 3.32\\ x_1 \geq 0, x_2 \geq 0$

You are seeing this problem for the first time. So is that someone (well, maybe not, but people in blue jackets are all so abstract!). You do not know the meaning of the variables. Yet even now you can say with high confidence that:

  1. The problem is most likely linear because someone decided so. Linearity is a structure chosen by a human.
  2. The constraints $x_1 \geq 0, x_2 \geq 0$ are technical. They come from the "physics" of the problem rather than from the data (for example, sales cannot be negative).
  3. The specific values $\{0.973, 2.619, 3.32\}$ in the constraint $0.973 x_1 + 2.619 x_2 \leq 3.32$ were estimated from data. That is, first someone said that the variable $x_1$ is linked to the variable $x_2$, then that the relationship is linear, and finally the coefficients of the linking equation were estimated from the data. The same holds for the coefficients $\{2.16, 3.7\}$ in the objective function.

When we talk about problems with uncertainty, we target precisely the uncertainty in the parameters estimated from data. We do not touch the technical constraints or the initial choice of the problem's structure.

Back to our story. We have a linear problem whose coefficients were estimated somehow by someone. If we were right about the nature of those coefficients, then in fact we were asked to solve the problem for a single scenario (one specific instance of the problem).

Sometimes this is enough for us, and we just solve it.

Sometimes, however, solving the problem for a single scenario is a bad idea (for example, when the solution is very sensitive to variation in the data).
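To make the sensitivity point concrete, here is a minimal pure-Python sketch (a toy LP of my own, not the one above): the optimum of an LP sits at a vertex of the feasible set, so a 1% change in the estimated objective coefficients can move the solution to a distant vertex.

```python
# Toy illustration of LP sensitivity: maximize c^T x over the triangle
# {x >= 0, x1 + x2 <= 1}. An LP optimum is attained at a vertex, so we
# can simply enumerate the vertices.

vertices = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

def lp_argmax(c):
    """Return the vertex maximizing c[0]*x1 + c[1]*x2."""
    return max(vertices, key=lambda v: c[0] * v[0] + c[1] * v[1])

# A 1% perturbation of the estimated objective coefficients...
x_a = lp_argmax((1.00, 1.01))  # optimum at (0.0, 1.0)
x_b = lp_argmax((1.01, 1.00))  # optimum at (1.0, 0.0)

# ...moves the optimum between two distant vertices of the feasible set.
print(x_a, x_b)
```

The data barely changed, but the decision flipped completely; that is exactly the regime where a single-scenario solution is untrustworthy.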

What to do in this case, and how to model uncertainty in the data?

First, note that uncertainty in the data can always be moved from the objective function into the constraints, or vice versa. How to do this is shown under the cut.

Transfer of uncertainty from the objective function to the constraints or vice versa
It is often more convenient to move all the uncertainty into one part of the problem: the objective function or the constraints.

Moving uncertainty from the objective function into the constraints

For any optimization problem

$\min_{x \in R^n} f_0(x, w)\\ s.t.\\ f_i(x, \theta^i) \leq 0, \quad 1\leq i \leq k\\ h_i(x, \beta^i) = 0, \quad 1\leq i\leq m\\ x \in X$

one can construct an equivalent problem with no uncertainty in the objective:

$\min_{x \in R^n, t \in R} t\\ s.t.\\ f_0(x, w) \leq t\\ f_i(x, \theta^i) \leq 0, \quad 1\leq i \leq k\\ h_i(x, \beta^i) = 0, \quad 1\leq i\leq m\\ x \in X$

A solution $(x^*, t^*)$ of the equivalent problem contains a solution $x^*$ of the original one.

Moving uncertainty from the constraints into the objective function

Formally, for any constrained optimization problem

$\min_{x \in R^n} f(x)\\ s.t.\\ x \in X$

one can construct an equivalent unconstrained problem

$\min_{x \in R^n} f(x) + I_X(x) $

using the indicator function

$I_X(x) = \begin{cases} 0, \quad x\in X\\ +\infty, \quad x\notin X \end{cases}$

Clearly, no algorithm can digest such a function, but it does not have to. The next logical step is to approximate the indicator function with something more palatable. What exactly depends on the situation (more on that a bit later). This is how, for example, interior-point methods (a special case of penalty function methods) and many others are constructed.
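A minimal sketch of that approximation step (my own toy problem and numbers, purely for illustration): replace the indicator of $X = \{x \geq 2\}$ with a logarithmic barrier and watch the minimizer approach the true constrained optimum as the barrier weight shrinks.

```python
import math

# Toy problem: min (x - 1)^2  s.t.  x >= 2  (true optimum: x = 2).
# Approximate the indicator I_X with the log-barrier -mu * log(x - 2).

def barrier_objective(x, mu):
    return (x - 1.0) ** 2 - mu * math.log(x - 2.0)

def minimize_1d(f, lo, hi, iters=200):
    """Ternary search; f must be unimodal on [lo, hi] (true here: convex)."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

for mu in (1.0, 0.1, 0.001):
    x_mu = minimize_1d(lambda x: barrier_objective(x, mu), 2.0 + 1e-9, 10.0)
    print(mu, x_mu)  # x_mu approaches the constrained optimum x = 2 as mu -> 0
```

Driving the barrier weight to zero along a sequence of such smooth problems is the basic mechanic behind interior-point methods.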

Stochastic, online, and robust optimization, and a shopping list

We may have many scenarios of uncertainty, as well as options for what to do with it. We illustrate several standard approaches with a simple example.

I do not know how things stand with you, dear reader, but I am (happily) married and periodically go grocery shopping. With a shopping list, of course (it grants invulnerability to impulse purchases). And sometimes not just to the nearest store but to the proverbial Auchan, where everything is cheaper but which is farther away.

Let us model this situation: we have arrived at Auchan, shopping list in hand.

Attention, the first question: how do we model this?

Input data: information about the products to purchase and the required quantity.

For convenience, we can think of the list as some non-negative integer vector $y \in Z_+^n$.

As the decision variables, take, accordingly, a non-negative integer vector $x \in Z_+^n$: which products, and how many of each, we end up buying (our decision).

The rest is easy: take some objective function $f(x, y)$ that measures how badly we erred in our choice of products.

Depending on the context, the type of function may vary, but there are some basic requirements for it:

  • The function $f(x, y)$ should attain its minimum at $x^* = \arg\min_{x \in R^n} f(x, y) = y$ (that is, at the optimum we buy exactly what the list says)
  • The function $f(x, y)$ should be convex in $x$ (and preferably smooth) so that the minimum can be computed efficiently.

Thus we arrive at the problem:

$\min_{x \in R^n} f(x, y)$
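For example (an illustrative choice of mine, not the only one possible), $f(x, y) = \|x - y\|_2^2$ satisfies both requirements:

```python
# A candidate objective satisfying both requirements above:
# f(x, y) = ||x - y||^2 is convex and smooth in x, with argmin x* = y.

def f(x, y):
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

y = (2, 0, 3)  # the shopping list: e.g. 2 loaves, 0 eggs, 3 cartons of milk
candidates = [(2, 0, 3), (1, 0, 3), (2, 1, 3), (0, 0, 0)]

best = min(candidates, key=lambda x: f(x, y))
print(best)  # (2, 0, 3): buying exactly what the list says is optimal
```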

And now imagine that the shopping list stayed at home ...

So, with one small remark, we have landed in the world of problems with uncertainty.

So what do we do if, in the problem $\min_{x \in R^n} f(x, y)$, the vector $y$ is unknown?

The answer, again, depends on the context.

Stochastic optimization

Stochastic optimization (usually) assumes that:

  • the uncertainty in the data is genuinely stochastic, and the probability distribution of the non-deterministic parameters is fully known;
  • the constraints involving uncertainty are soft.

In our example, if we modeled it using stochastic optimization, we would say

  • Okay, I don't know what was written on the list, but I have been shopping with lists for 8 years now, and I know the distribution of the vector $y$ quite well.
  • Even if I choose $x$ badly, on returning home I will learn the true $y$, and if things are really dire I will walk down to the local Pyaterochka and buy the missing items there, albeit at a higher price.
  • So I will now choose the $x$ that minimizes some aggregate of the original objective and the possible "penalties" for an error.

This leads us to the following problem:

$\min_{x \in R^n} E_y[f(x, y) + \psi(y, z)]\\ s.t.\\ x + z \geq y$

Note that in this problem decisions are de facto made twice: first the main purchase decision at Auchan, via $x$, and then the "error correction" via $z$.
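In a scalar sketch (all prices and probabilities below are made up for illustration), this two-stage structure can be solved by direct enumeration once the distribution of $y$ is known:

```python
# Scalar sketch of the two-stage problem: buy x units at Auchan (price p),
# then top up the shortfall z = max(y - x, 0) at Pyaterochka (price q > p).
# The distribution of demand y is assumed fully known.

p, q = 1.0, 2.0
dist = {1: 0.2, 2: 0.5, 3: 0.3}  # P(y = k), made-up numbers

def expected_cost(x):
    return p * x + q * sum(prob * max(y - x, 0) for y, prob in dist.items())

# The first-stage decision x is a small integer, so just enumerate.
best_x = min(range(0, 5), key=expected_cost)
print(best_x, expected_cost(best_x))  # 2 2.6
```

Buying two units up front and risking a single expensive top-up beats both under-buying and over-buying in expectation; this is the classic newsvendor trade-off.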

The main problems of this approach are:

  • Often there is simply no information about the distribution of the parameters.
  • The constraints may be hard (in high-stakes problems: death, ruin, nuclear or zombie apocalypse, etc.).
  • It is not always possible to "correct mistakes" (the decision is made once), or, conversely, decisions are made very often (then many nested integrals appear, and they are very hard to compute).

Online Optimization

Online optimization is a framework that studies sequential decision making. One of the standard modeling approaches in this framework is multi-armed bandits, which have been written about on Habr more than once.

In the context of our toy example, we would:

  • we would never have had (or used) a shopping list,
  • at home we would be praised or scolded for the products we bought (and could only guess at the desired set),
  • and the goal would be to learn, as quickly as possible, to buy products like her imaginary ex-prince, or like the best of her mom's friend's sons.

Robust optimization

Robust optimization is a logical extension of the idea of a minimax solution.

Ideally, we should make a decision right now that will remain valid no matter the circumstances. The people who designed pots, irons, and refrigerators in the USSR worked in the paradigm of robust optimization: the product had to keep working even after 20 years of use as the main tool for exterminating the mutants that appeared after a nuclear war (and it had to survive that, too).

Moreover, we want to feed the problem to an ordinary solver, and solvers do not understand constraints of the form "for any realization of a random variable" (unless there are only finitely many realizations).

In the shopping-list problem, the decision must be made here and now and remain feasible under any circumstances:

$\min_{x \in R^n, t \in R} t\\ s.t.\\ f(x, y) \leq t \quad \forall y\\ x \geq y \quad \forall y$

Clearly, even in this toy example, if nothing is assumed about $y$, no meaningful solution exists.
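Conversely, if we at least know a box containing every realization of $y$, the semi-infinite constraint collapses to a single vector inequality. A sketch with made-up bounds:

```python
import random

# If all we know is a box y_lo <= y <= y_hi (componentwise), then the
# semi-infinite constraint "x >= y for all y" reduces to x >= y_hi:
# the worst case for each component is its upper bound.

y_lo = (1, 0, 2)  # made-up lower bounds on the unknown list
y_hi = (3, 1, 4)  # made-up upper bounds

x_robust = y_hi  # the smallest x that is feasible for every realization of y

# Sanity check against random realizations drawn from the box
for _ in range(1000):
    y = tuple(random.randint(lo, hi) for lo, hi in zip(y_lo, y_hi))
    assert all(xi >= yi for xi, yi in zip(x_robust, y))
print(x_robust)
```

The robust answer is conservative by design: we buy the componentwise maximum, and the quality of the decision depends entirely on how tight the uncertainty set is.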

So how to work with such tasks?

Constructing a robust version of a problem, with LP as the example

Consider a linear optimization problem with uncertainty:

$\min_{x \in R^n} c^Tx + d\\ s.t.\\ Ax \leq b$

The parameters $\begin{pmatrix} c^T, d\\ A, b \end{pmatrix}$ were estimated from data and carry uncertainty.

Assumption 1: the set of values (realizations) of $\begin{pmatrix} c^T, d\\ A, b \end{pmatrix}$ can be parameterized, i.e., there exist $\begin{pmatrix} c_0^T, d_0\\ A_0, b_0 \end{pmatrix}, \begin{pmatrix} c_1^T, d_1\\ A_1, b_1 \end{pmatrix}, \dots, \begin{pmatrix} c_k^T, d_k\\ A_k, b_k \end{pmatrix}$ such that any realization of the data $\begin{pmatrix} c^T, d\\ A, b \end{pmatrix}$ lies in the set:

$\begin{pmatrix} c^T, d\\ A, b \end{pmatrix} \in U = \left \{\begin{pmatrix} c_0^T, d_0\\ A_0, b_0 \end{pmatrix} +\sum_{i=1}^k\zeta_i \begin{pmatrix} c_i^T, d_i\\ A_i, b_i \end{pmatrix} | \quad \zeta \in Q \subset R^k \right \}$

Here $\begin{pmatrix} c^T_0, d_0\\ A_0, b_0 \end{pmatrix}$ is called the "nominal" data, and the $\begin{pmatrix} c^T_i, d_i\\ A_i, b_i \end{pmatrix} \quad (1\leq i \leq k)$ are called "shifts."

Mini example
Let me explain their meaning with a model example from finance: the problem of choosing an optimal securities portfolio. Suppose you want to invest. The exchange available to you currently quotes $n$ stocks, and you need to decide how to distribute (invest) your capital among these securities so as to maximize your return subject to a risk constraint. One of the first models for this problem (the Markowitz model) proposed the following:

  1. Collect historical data on each security's returns: $r_i^t = \frac{S_i^t - S_i^{t-1}}{S_i^{t-1}}$, where $S_i^t$ is the price of asset $i$ at time $t$.
  2. Compute the empirical mean returns $\hat{r}_i = \frac{1}{T}\sum_{t=1}^T r_i^t$ and the empirical covariance matrix of the returns $\Sigma = \|cov(r_i, r_j)\|_{i,j}$
  3. Solve the optimization problem

    $\max_{x\in R^n_+} x^T\hat{r}\\ s.t.\\ \frac{1}{2} x^T\Sigma x \leq \sigma\\ \sum_{i=1}^nx_i \leq 1$

The solution of this problem is the optimal allocation (the shares) of capital among the securities.

In effect, we maximize the expected return, or rather look for the optimal portfolio under a single scenario: the case where the realization of the random (!) return coincides with its empirical mean.

In the context of the parameterization of $r$, it is $\hat{r}$ that plays the role of the "nominal" data.
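Steps 1 and 2 of this recipe can be sketched in a few lines of Python (the price series are made up, and the covariance estimate is omitted for brevity):

```python
# Steps 1-2 of the Markowitz recipe on made-up price series:
# compute returns r_i^t = (S_i^t - S_i^{t-1}) / S_i^{t-1} and their
# empirical means (the "nominal" data, hat{r}).

prices = {
    "A": [100.0, 110.0, 99.0],
    "B": [50.0, 50.0, 55.0],
}

def returns(series):
    return [(s1 - s0) / s0 for s0, s1 in zip(series, series[1:])]

r = {name: returns(s) for name, s in prices.items()}
r_hat = {name: sum(rs) / len(rs) for name, rs in r.items()}

print(r["A"])      # [0.1, -0.1]
print(r_hat["A"])  # 0.0
```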

We already know that all the uncertainty in a problem can be moved into the constraints. Let us do that.

We get the task

$\min_{x\in R^n, t\in R}t\\ s.t.\\ c^Tx + d \leq t, \quad \forall \begin{pmatrix} c^T, d \end{pmatrix} \in U\\ Ax\leq b, \quad \forall \begin{pmatrix} A, b \end{pmatrix} \in U\\ $

The robust version of the problem

Now it’s time for one of the coolest tricks in robust optimization — how to move from an infinite number of constraints to a finite set of good constraints.

To begin, consider a simple example when

$Q = \{ \zeta \in R^k| \|\zeta\|_2 \leq 1 \}$

All the constraints in the system

$c^Tx + d \leq t, \quad \forall \begin{pmatrix} c^T, d \end{pmatrix} \in U\\ Ax\leq b, \quad \forall \begin{pmatrix} A, b \end{pmatrix} \in U\\$

are of the same type: each is just a linear inequality. Learn to handle one, and you can handle them all.

Therefore, we consider one constraint of the inequality type:

$a^Tx \leq b\quad \forall (a,b) \in U = \{ (a_0, b_0)+\sum_{i=1}^k \zeta_i \cdot (a_i, b_i)| \quad \zeta \in Q \}\\ (a_0+ \sum_{i=1}^k \zeta_i a_i)^Tx \leq b_0 + \sum_{i=1}^k \zeta_i b_i \quad \forall \zeta \in Q\\ \sum_{i=1}^k \zeta_i \cdot (a_i^T x - b_i) \leq b_0 - a_0^Tx \quad \forall \zeta \in Q\\ \max_{\zeta \in Q} \sum_{i=1}^k \zeta_i \cdot (a_i^T x - b_i) \leq b_0 - a_0^Tx$

I will explain what happened.

First, we moved every term involving uncertainty to the left-hand side of the inequality (the uncertainty is carried by $\zeta$).
Then we took the worst case (each $x$ has its own).
As a result, we obtained the constraint:

$g(x) = \max_{\zeta \in Q} f(x, \zeta) \leq b_0 - a_0^Tx$


The next step is to write out the function $g(x)$ explicitly. To do this, it suffices to solve the optimization problem over $\zeta$ and substitute the optimal $\zeta^*$:

$\max_{\|\zeta\|_2 \leq 1} \sum_{i=1}^k \zeta_i (a_i^Tx- b_i) = \sqrt{\sum_{i=1}^k (a_i^Tx - b_i)^2}$

which leads to inequality:

$\sqrt{\sum_{i=1}^k (a_i^Tx-b_i)^2} +a_0^Tx \leq b_0$

Note that the resulting inequality is convex, and any $x$ satisfying it also satisfies the original $a^Tx \leq b$ for every realization $(a,b) \in U$.

The constraint $\sqrt{\sum_{i=1}^k (a_i^Tx-b_i)^2} +a_0^Tx \leq b_0$ is called the robust version of the constraint $a^Tx \leq b \quad \forall (a,b) \in U$.
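A quick numeric sanity check (pure Python, made-up numbers) that the worst case over the unit ball is indeed the Euclidean norm, attained at $\zeta^* = c/\|c\|_2$:

```python
import math, random

# Check: max over ||zeta||_2 <= 1 of sum_i zeta_i * c_i equals ||c||_2.
# Here c_i plays the role of a_i^T x - b_i for a fixed x.

c = [0.5, -1.2, 2.0]  # made-up values of a_i^T x - b_i
norm_c = math.sqrt(sum(ci ** 2 for ci in c))

best = 0.0
for _ in range(20000):
    # random direction: sample in the cube, rescale points outside the ball
    z = [random.uniform(-1, 1) for _ in c]
    nz = math.sqrt(sum(zi ** 2 for zi in z))
    if nz > 1:
        z = [zi / nz for zi in z]
    val = sum(zi * ci for zi, ci in zip(z, c))
    best = max(best, val)

assert best <= norm_c + 1e-9          # sampling never exceeds the closed form
zeta_star = [ci / norm_c for ci in c]
attained = sum(zi * ci for zi, ci in zip(zeta_star, c))
assert abs(attained - norm_c) < 1e-9  # and zeta* attains it exactly
print(norm_c)
```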

This is one of the main workhorses in robust optimization — approximation of probabilistic constraints by a finite set of convex constraints.

What to do with more complex (non-linear) constraints?

Construction of robust versions of constraints using conic duality

Many standard nonlinear constraints can be represented in conic form (that is, in the form $Ax + b \in K$, where $K$ is some closed convex cone):

  • Non-negativity: $x \geq 0 \quad \leftrightarrow \quad x \in R^n_+$
  • Norm constraints: $\|x\|_p \leq \rho \quad \leftrightarrow \quad \begin{pmatrix} x\\ \rho \end{pmatrix} \in K_p^n =\left \{ (x,t) \in R^n\times R_+ | \quad \|x\|_p \leq t \right \}$
  • Positive semidefiniteness of a matrix: $x_1F_1 + \dots + x_nF_n + G \succeq 0$

Let's return to the robust restrictions.

Suppose the optimization problem over $\zeta$ can be reduced to the conic form

$\max_{\zeta} \sum_{i=1}^k \zeta_i (a_i^Tx - b_i)\\ s.t.\\ C\zeta + d \in K$

We construct the dual of this problem.

Some time ago I published a post on conic duality precisely so that less attention would have to be paid to the technique itself here.

$\min_{\lambda} \lambda^Td\\ s.t.\\ C^T \lambda +\begin{pmatrix} a_1^Tx - b_1\\ \dots\\ a_k^Tx - b_k \end{pmatrix} = 0_k\\ \lambda \in K^*$

Now only a small step remains: the weak duality theorem.

$\max_{[\zeta: \quad C\zeta+d \in K]} \sum_{i=1}^k \zeta_i (a_i^Tx-b_i) \leq \min_{\lambda \in G} \lambda^Td,\\ \text{where } G = \left \{ \lambda| \quad C^T \lambda +\begin{pmatrix} a_1^Tx - b_1\\ \dots\\ a_k^Tx - b_k \end{pmatrix} = 0_k; \quad \lambda \in K^* \right \}$

Consequently, as a robust approximation of the original constraint $a^Tx \leq b, \quad \forall (a,b) \in U$, we can use the constraint

$\lambda^Td \leq b_0 - a_0^Tx\\ \lambda \in G = \left \{ \lambda | \quad C^T \lambda +\begin{pmatrix} a_1^Tx - b_1\\ \dots\\ a_k^Tx - b_k \end{pmatrix} = 0_k; \quad \lambda \in K^* \right \}$

where $\lambda$ is a decision variable just like $x$.

So we built a robust constraint for the original inequality.
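As a sanity check (my own, not part of the original derivation), let us specialize this to the unit ball $Q = \{ \zeta \in R^k| \|\zeta\|_2 \leq 1 \}$ from the previous section. In conic form, $\zeta \in Q$ exactly when $\begin{pmatrix} \zeta\\ 1 \end{pmatrix} \in K_2^k$, i.e., $C = \begin{pmatrix} I_k\\ 0 \end{pmatrix}$, $d = (0, \dots, 0, 1)^T$, $K = K_2^k$. The second-order cone is self-dual, so writing $\lambda = (\mu, \lambda_0)$ with $\|\mu\|_2 \leq \lambda_0$, the equality constraint in $G$ gives $\mu_i = -(a_i^Tx - b_i)$, and the dual objective satisfies

$\lambda^Td = \lambda_0 \geq \|\mu\|_2 = \sqrt{\sum_{i=1}^k (a_i^Tx - b_i)^2}$

Minimizing over $\lambda_0$ and substituting into $\lambda^Td \leq b_0 - a_0^Tx$, we recover exactly the robust constraint $\sqrt{\sum_{i=1}^k (a_i^Tx-b_i)^2} + a_0^Tx \leq b_0$ obtained earlier by direct maximization.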


We have considered the technique of approximation of bad (stochastic) constraints by a set of good convex ones. This can be useful, for example, if:

  • you do not want to write algorithms yourself, but the solver you use cannot handle probabilistic constraints;
  • you have a problem with stochastic parameters whose optimum is very sensitive to fluctuations in the data;
  • and, of course, problems with uncertainty where all constraints are hard (the cost of an error is too high).
