# Smart Face Control: Machine Learning Algorithms for Effectively Caching Data on SSDs

This article was presented at the SECR2017 conference, where she received the Bertrand Meyer Award for Best Research Report.

In this article, Svetlana Lazareva, the head of the Reydiks research laboratory, talks about the new algorithm for filling the parallel cache in the storage system, which is based on the machine learning algorithm.

Original article and presentation .

## Hybrid systems

The main requirements placed by customers for data storage systems are the high performance of I / O operations provided by solid-state drives (SSDs) and the low cost of the operation that disk drives (HDDs) guarantee. The reason for the differences between hard drives and solid state drives lies in their internal device operation. To access a specific data block, an HDD needs to move its head and spin magnetic disks, which in the case of many random requests leads to high costs, while an SSD does not have mechanical parts and can serve both sequential and random requests at the same high speed.

But the high performance of solid state drives has its flaws. Firstly, it has a limited lifespan. Secondly, this is the price, which per unit of capacity is much higher than that of traditional HDDs. For this reason, data storage systems built exclusively from SSDs have received limited distribution and are used for especially loaded systems in which data access time is expensive.

Hybrid systems are a compromise between cheap but slow storage on HDD and high-performance but expensive storage on SSD. In such systems, solid state drives play the role of an additional cache (in addition to RAM) or tearing. In this article, we have given arguments for choosing a particular technology.

Using solid-state drives as cache devices in combination with a limited number of rewrite cycles can significantly accelerate their wear due to the logic of traditional caching algorithms leading to heavy cache use. Thus, hybrid storage systems require special caching algorithms that take into account the features of the cache device used - SSD.

The simplest implementation of the idea of a hybrid array is the second-level cache (see Figure 1), when all requests after RAM go to the SSD. The basis of this approach is the property of time locality - the probability of requesting the same data is higher, the shorter the time of the last call. Thus, using the SSD cache as a second level cache, we seem to increase the amount of RAM. But if there is enough RAM to process temporary locality, then this method is not effective. Another approach is to use the SSD cache as parallel (see Figure 2). Those. some requests go to RAM, some go directly to the SSD cache. This is a more complex solution, aimed at the fact that different caches process different locales. And one of the most important aspects of the implementation of the parallel level cache is the choice of the technique of filling the SSD cache. This article describes a new parallel cache filling algorithm - face control based on a machine learning algorithm.

The basis of the proposed solution is the use of data from query histories to data storage systems recorded over long periods of time. The decision is based on two hypotheses:

- Query histories describe a typical load on storage at least for a limited amount of time.
- The load of the storage system stores some kind of internal structure, which determines which blocks and with what frequency are requested.

While the first hypothesis is almost beyond doubt and remains an assumption, the second requires experimental verification.

*Fig. 1. The second level cache*

*Fig. 2. Parallel cache*

## Solution Architecture

The architecture of the proposed solution is presented in Fig. 3. All requests coming from customers first go through the analyzer. Its function is to detect multiple consecutive requests to the system.

*Fig. 3. Solution architecture*

Depending on the type, they are processed differently:

**Sequential recording**. Requests of this type do not cause problems and are processed quickly, so the data they access does not fall into the SSD cache.**Sequential reading**. Requests of this type are processed quickly due to read-ahead algorithms, usually such requests are unique (i.e., re-sequential reading of the same data is unlikely), so such data does not need to be cached in the SSD.**Random record**. Such requests cause noticeable overhead both during end-to-end recording and when it falls into RAM, as data when ejected from the cache will still require writing to hard drives. Therefore, they immediately, bypassing the RAM, are sent to the SSD write cache, where they are stored in a log-structured manner and are superseded as soon as possible for more efficient further recording to the hard disk array.**Random reading**. Data is cached in RAM or in SSD depending on the face control algorithm.

In this system, only random requests affect SSD wear. Random write requests are not optimizable. Thus, the only way to extend the life of an SSD is to optimize the processing of random read requests. If you have a way to assess the potential effect of caching specific data, you can get rid of the need to cache "unnecessary" data.

Works that offer solutions to extend the life of SSDs in hybrid storage systems in general can be divided into two groups. The first group considers the possibility of improving the efficiency of caching based on solid-state drives by using the internal logic of work and technical features [5, 1]. Such algorithms are highly dependent on the specifics of specific devices and therefore will not be further covered.

The work of the second group focuses on creating new caching algorithms. Everyone has a common idea: rarely used data should not go to the cache. This, firstly, reduces the number of entries compared to traditional algorithms and, secondly, increases the number of cache hits due to less clogging of the cache with less-used blocks.

**LARC - Lazy Adaptive Replacement Cache**[4]. The algorithm uses simple logic: if a block was requested at least twice for a certain period of time, then it will probably be needed in the future and gets into the SSD cache. Tracking blocks of candidates is carried out using an additional LRU-queue in RAM. The advantages of this algorithm include relatively low resource requirements.**WEC - Write-Efficient Caching Algorithm**[6]. Standard caching algorithms prevent frequently used blocks from being continuously cached. This algorithm is designed to isolate such blocks and ensure that they are in the cache for the duration of active access to them. This work also uses an additional cache queue, but with a more complex logic for determining candidate blocks.**Elastic Queue**[2]. Indicates a drawback of all previously proposed caching algorithms: lack of adaptability to various loads. Considers the work with the priority queue as a common basis, and the differences in the logic of assigning priorities and offers a generalized improvement of the queue from this point of view.

At the root of the algorithms discussed above lies the natural method in computer science - the development of algorithms, often of a heuristic nature, based on an analysis of the existing conditions and assumptions. This work offers a fundamentally different approach based on the fact that the workload of the data storage system is self-similar. Those. By analyzing past queries, we can predict future behavior.

## Machine Learning Algorithms

This problem can be approached from two sides: using traditional methods of machine learning or teaching representation (representational learning). The former first require data processing to extract relevant features from them, which are then used to train models. The methods of the second approach can directly use the initial data that have undergone minimal preprocessing. But, as a rule, these methods are of great complexity.

An example of teaching representation can be deep neural networks (see [3] for more details). You can try decision trees as a simple base model. Signs on which training will be carried out can be:

- Average stack time (mean stack time)
- Read / Write Ratio
- The ratio of the number of random and sequential queries that are calculated on fixed segments of the query history.

Promising methods of teaching presentation. The available data are time series, which can be considered when choosing suitable algorithms. The only group of algorithms that use information about a sequential data structure in calculations are recurrent neural networks [3]. Therefore, models of this type are the main candidates for solving the task.

### Data

Data is the history of storage queries. The following information is known about each request:

- Request arrival time;
- The logical address of the first block (logical block address - lba);
- Request Size (len);
- Type of request - write or read;
- The type of request is sequential or random.

The experiments and data analysis were carried out on real loads received from different clients. A file with the history of requests from clients will be called a log in the future.

To analyze the effectiveness of a particular caching algorithm, you need to enter a metric. Standard metrics, such as the number of cache hits and the average data access time, are not suitable, because they don’t take into account the cache usage. Therefore, we will use Write Efficiency [6,2] as a metric for caching efficiency:

$$

Query history will be divided into blocks of sequential queries. The block size is denoted as blocksize. For a single request, the metric is calculated as follows: in the history, the nearest windowsize requests are viewed and the number of requests to the same data is calculated as in the original one. For a block from blocksize queries, the average value of the metric per block is calculated.

Then the blocks are divided into two classes: “good” and “bad” (see Fig. 4). “Bad” (red) means that it is not recommended to send a request block to the SSD cache (it is sent to RAM), “good” (green), respectively, that this request block is sent to the SSD cache. Those blocks for which the coefficient WE is greater than the parameter m will be considered “good”. The remaining blocks will be considered "bad."

*Fig. 4. We break logs into blocks*

Computing a metric for large windowsize values can take a considerable amount of time. Computational constraints are placed on how far one can reuse a block reuse. Two questions arise:

- How to choose the windowsize option?
- Is the reuse of blocks within windowsize a representative value for judging the reuse of these blocks in general?

The following experiment was conducted to answer them:

- A random sample of queries was taken;
- The reuse of these blocks within a significantly larger windowsizemax is calculated;
- The Pearson correlation coefficient between the reuse of blocks within windowsizemax and windowsize is calculated, where windowsize are different test values.

In fig. 5 shows the change in the correlation coefficient in the sample over time. Table 1 presents the value of the correlation coefficient for the entire sample. The result is significant (P-value close to zero). It can be interpreted as follows: blocks that are in demand in the coming windowsize requests are also likely in demand in the coming windowsizemax requests.

*Fig. 5. Dynamics of the correlation coefficient between windowsize and windowsizemax*

Comment to Fig. 5. From top to bottom, windowsize: 100000, 10000, 30000, 3000. Sample size: 1,000,000 requests. X axis: running window number of 1000 requests; y axis: correlation coefficient value for a running window.

Comment to Fig. 5. From top to bottom, windowsize: 100000, 10000, 30000, 3000. Sample size: 1,000,000 requests. X axis: running window number of 1000 requests; y axis: correlation coefficient value for a running window.

*Table 1. Pearson correlation coefficient for various windowsize.*

windowsize | 3000 | 10,000 | 30000 | 100,000 |
---|---|---|---|---|

1,000,000 | 0.39 | 0.61 | 0.82 | 0.87 |

Thus, to calculate the metric, the windowsize value equal to 100000 is taken.

For constructing probabilistic models, the data set is standard [3, p. 120] was divided into three parts:

- Training sample: 80%;
- Test sample: 10% (for final testing of the model);
- Validation sample: 10% (to select parameters and track model progress during training, regardless of the training sample).

It should be noted that in the case of deep learning models, sampling cannot be made random, since in this case the temporal connections between requests will be destroyed.

### Deep learning models

Building a deep learning model requires many decisions. It is not possible to provide a detailed explanation of each of them in this paper. Some of them were selected according to the recommendations in the literature [3] and original articles. In them you can find detailed motivation and explanation.

Common parameters:

- The objective function is cross entropy (a standard choice for the classification problem) [3+, p. 75];
- Optimization algorithm - Adam [7];
- Parameter initialization method - Glorot [8];
- The regularization method is Dropout [9] with parameter 0.5.

Tested parameters:

- The number of recurrent layers of the neural network: 2, 3, 4;
- The number of output neurons in each lstm unit: 256, 512;
- Number of time steps of a recurrent neural network: 50,100;
- Block size (number of requests in a block): 100, 300, 1000.

#### Basic Deep Learning Method - LSTM RNN

Models of varying complexity were tested. For all, the result is negative. Learning dynamics are presented in Fig. 6.

*Figure 6. LSTM RNN. Change in the value of the loss function during training.*

Commentary on Fig. 6. Red - training sample, green - test. Graph for a medium complexity model with three recurrent layers.

Commentary on Fig. 6. Red - training sample, green - test. Graph for a medium complexity model with three recurrent layers.

#### Pre-training and LSTM RNN

Pre-training on the task of modeling the flow of requests (forecasting the next request) was tested. The root-mean-square error between the predicted value of lba and the present was used as the objective function. The result is negative: the learning process on the side task does not converge. Learning dynamics are presented in Fig. 7.

*Figure 7. Pre-training on the prediction problem lba of the next query Change in the mean-square error during training.*

Commentary on fig. 7. Red - training sample, green - test. Graph for a model with two recurrent layers.

Commentary on fig. 7. Red - training sample, green - test. Graph for a model with two recurrent layers.

#### HRNN

Models were tested for different sets of parameters, with two levels: the first for each block, the second for many blocks. The result of the experiment is negative. Learning dynamics are presented in Fig. 8.

*Figure 8. HRNN. Change in the value of the loss function during training.*

Commentary on fig. 8. Red - training sample, green - test. Graph for a medium complexity model with two recurrent layers on the second level.

Commentary on fig. 8. Red - training sample, green - test. Graph for a medium complexity model with two recurrent layers on the second level.

As you can see, neural networks did not give good results. And we decided to try the traditional methods of machine learning.

### Traditional machine learning methods. Attributes and signs.

#### Frequency characteristics and derivatives thereof

We introduce the following notation:

- count_lba - the number of uses of lba within the block;
- count_diff_lba - frequency of differences between lba of the current request and the previous one;
- count_len - frequency of occurrence of requests to blocks of a given length len;
- count_diff_len - frequency of occurrence of differences between len of the current and previous requests.

Further, for each sign, statistics are calculated - the moments: mean, standard deviation, asymmetry coefficient and excess coefficient.

#### Difference characteristics and derivatives thereof

The difference between the lba of the current and i-previous block (let's call it diff_lba_i) can serve as the basis for a new feature. Each such diff_lba_i difference is distributed symmetrically and has an average of zero. Therefore, the differences between the distributions of diff_lba_i for different i can be described using a single number - the standard deviation of this quantity. Thus, for each block, a set of n values std_diff_lba_i is obtained, where i = 1..10 (let's call this set lba_interdiffs). The hypothesis is that for good query blocks these values will differ less, in other words, they will be more concentrated than for bad blocks. The concentration of values can be characterized using the mean and standard deviation. The differences between them can be seen in Fig. 9.10 which shows

Similarly, characteristics are calculated for the lengths of the requested data.

*Fig. 9. The differences between classes for average and standard deviations of feature count_lba*

*Fig. 10. Differences for classes between mean and standard deviations of the attribute lba_interdiffs*

#### Feature Analysis

To analyze the differences between classes with respect to features, boxplots were constructed. For example, in fig. 9. the differences of 1 and 2 moments of the sign count_lba are displayed, and in fig. Figure 10 shows the differences 1 and 2 of the lba_interdiffs attribute between classes.

Span diagrams show that not a single sign shows enough significant differences between classes to determine only on its basis. At the same time, the totality of weak non-correlating features can distinguish between classes of blocks.

It is important to note that when calculating the characteristics, it is necessary to take a large value of the blocksize parameter, since on small query blocks, many signs turn to zero.

#### Ensembles of decision trees

An example of traditional machine learning methods are ensembles of decision trees. One of its effective implementations is the XGBoost library [10], which has shown good results in many data analysis competitions.

The dynamics of the learning process can be seen in Fig. 11. On the test sample, a classification error of about 3.2% is achieved. In fig. 12 you can see the forecast of the model in the test sample.

*Fig. 11. XGBoost. Changing the accuracy of classification throughout training. Red - training sample, green - test.*

*Figure 12. XGBoost. Model prediction on a random test sample*

Комментарий к рис. 12. Цветом обозначен прогноз модели: красный — плохо кэшируемый блок, зеленый — хорошо кэшируемый. Степень насыщенности цвета — это вероятность, выведенная моделью. Ось x: номер блока в тестовой выборке, ось y: исходное значение метрики. Пунктирная линия — барьер метрики, по которому блоки были разделены на два класса для обучения модели. Ошибка классификации 3.2%.

Комментарий к рис. 12. Цветом обозначен прогноз модели: красный — плохо кэшируемый блок, зеленый — хорошо кэшируемый. Степень насыщенности цвета — это вероятность, выведенная моделью. Ось x: номер блока в тестовой выборке, ось y: исходное значение метрики. Пунктирная линия — барьер метрики, по которому блоки были разделены на два класса для обучения модели. Ошибка классификации 3.2%.

#### Ошибки 1 рода

In the caching problem, it is important to avoid mistakes of the first kind, that is, false positive forecasts. In other words, it's better to skip a block suitable for caching than write

unclaimed data to the cache . To balance between the probability of issuing false positive and false negative forecasts, you can change the probability barrier, after which the forecast will be considered positive. The dependence of these quantities is shown in Fig. 13.

*Fig. 13. The dependence of the probability of issuing false positive and false negative forecasts depending on the probability barrier, after which the forecast is considered positive.*

Commentary on fig. 13. X axis: probability barrier, y axis: probability of error. The dashed line is the probability of a false positive error; the solid line is the probability of a false negative error.

Commentary on fig. 13. X axis: probability barrier, y axis: probability of error. The dashed line is the probability of a false positive error; the solid line is the probability of a false negative error.

## Final Algorithm - Face Control

We describe the final algorithm that describes the technique of filling the SSD cache. It consists of three parts: the algorithm, which forms the training set from the client’s logs, the algorithm, which forms the prediction model (algorithm 2), and the final online algorithm, which classifies incoming request blocks as “good” and “bad”.

### Algorithm 1. Formation of the training set

- We divide the available selection into blocksize blocks.
- For each block, we consider the signs from paragraphs 3.3.1 and 3.3.2.
- For each block, we consider the coefficient WE.
- For each block, using the WE coefficient, we determine the block class: good or bad.
- We make pairs (signs - a class). The attributes of one block are assigned the class of the next block.

### Algorithm 2. Teacher

**Input:**list (class attributes) from Algorithm 1.

**Main loop:**XGBoost algorithm.

**Output:**predictor.

### Algorithm. Face control

**Input:**block of requests of length blocksize, m.

**The main cycle: the**predictor from the teacher algorithm.

**Output:**If the block is determined to be good, then the next block of requests goes to the SSD. If the block is determined to be bad, then the next block of requests goes to RAM. If it was not possible to determine, then the requests from the next block go to the SSD if their frequency is greater than one (the LRU queue method from the LARC algorithm is used).

## Comparison of face-control algorithm and popular SSD cache implementations

The face control algorithm was tested in a mathematical model, where it was compared with the well-known SSD cache implementations. The results are presented in table 2. In the RAM, the LRU crowding-out algorithm is everywhere. In the SSD cache, the LRU, LRFU, LARC preemptive techniques. Hits is the number of hits in millions. Face control was implemented together with the LRFU displacement technique, it showed the best results together with face control.

As we see, when implementing face control, the number of entries on the SSD is significantly reduced, thereby increasing the service life of the solid-state drive and according to the WE metric, the algorithm with face control is the most effective.

*Table 2. Comparison of various SSD cache implementations*

LRU | Lrfu | Larc | Face Control + LRFU | |
---|---|---|---|---|

RAM hits | 37 | 37 | 37 | 36.4 |

RRC hits | 8.9 | 8 | 8.2 | 8 |

HDD hits | 123.4 | 124.3 | 124.1 | 124.9 |

SSD writes | 10.8 | 12.1 | 3.1 | 1.8 |

WE | 0.8 | 0.66 | 2.64 | 4.44 |

## Conclusion

Hybrid storage systems are very popular in the market due to the optimal price of 1 gigabyte. Higher performance is achieved by using solid-state drives as a buffer.

But SSDs have a limited number of records — rewrites; accordingly, using standard techniques to implement a buffer is not economically profitable, see Table 2, since SSDs wear out quickly and require replacement.

The proposed algorithm increases the life of an SSD by 6 times compared to the simplest LRU technique.

Face control analyzes incoming requests using machine learning methods and does not let “unnecessary” data into the SSD cache, thereby improving the caching efficiency metric several times.

In the future, it is planned to optimize the blocksize parameter and improve the extrusion technique, as well as try other combinations of the extrusion technique in the SSD cache and RAM.

## List of references

- Caching less for better performance: balancing cache size and update cost of flash memory cache in hybrid storage systems. / Yongseok Oh, Jongmoo Choi, Donghee Lee, Sam H Noh // FAST. –– Vol. 12. –– 2012.
- Elastic Queue: A Universal SSD Lifetime Extension Plug-in for Cache Replacement Algorithms / Yushi Liang, Yunpeng Chai, Ning Bao et al. // Proceedings of the 9th ACM International on Systems and Storage Conference / ACM. –– 2016. –– P. 5.
- Goodfellow Ian, Bengio Yoshua, Courville Aaron. Deep learning –– 2016. –– Book in preparation for MIT Press. URL
- Improving flash-based disk cache with lazy adaptive replacement /Sai Huang, Qingsong Wei, Dan Feng et al. // ACM Transactions on Storage (TOS). –– 2016. –– Vol. 12, no. 2. –– P. 8.
- Kgil Taeho, Roberts David, Mudge Trevor. Improving NAND flash based disk caches // Computer Architecture, 2008. ISCA’08. 35th International Symposium on / IEEE. –– 2008. –– P. 327–338.
- WEC: Improving Durability of SSD Cache Drives by Caching WriteEfficient Data / Yunpeng Chai, Zhihui Du, Xiao Qin, David A Bader //IEEE Transactions on Computers. –– 2015. –– Vol. 64, no. 11. –– P. 3304– 3316.
- Kingma Diederik, Ba Jimmy. Adam: A method for stochastic optimization // arXiv preprint arXiv:1412.6980. –– 2014.
- Glorot Xavier, Bengio Yoshua. Understanding the difficulty of training deep feedforward neural networks. // Aistats. –– Vol. 9. –– 2010. –– P. 249–256.
- Graves Alex. Supervised Sequence Labelling with Recurrent Neural Networks. –– Springer, 2012. –– P. 5–13.
- Chen Tianqi, Guestrin Carlos. XGBoost: A Scalable Tree Boosting System // CoRR. –– 2016. –– Vol. abs / 1603.02754. –– URL