Data Obfuscation for Performance Tests
ClickHouse users know that its main advantage is the high speed of analytical query processing. But how can we back up such claims? They have to be supported by trustworthy performance tests. That is what we will talk about today.
We started running such tests in 2013, long before the product became available as open source. Then, as now, we were most interested in the speed of the Yandex.Metrica data service. We had already been storing data in ClickHouse since January 2009: part of it had been written to the database since 2012, and part had been converted from OLAPServer and Metrage,
the data structures previously used in Yandex.Metrica. For the tests, we took the first available subset of one billion pageviews. We did not yet have real queries from Metrica, so we made up the queries that interested us most ourselves (all kinds of filtering, aggregation, and sorting).
ClickHouse was tested against comparable systems, for example Vertica and MonetDB. To keep things honest, the testing was done by an employee who had not worked on ClickHouse before, and particular cases in the code were not specially optimized until the results were in. We obtained a data set for functional tests in a similar way.
After ClickHouse was open-sourced in 2016, more questions arose about the tests.
Disadvantages of tests on proprietary data
Performance tests:
- They cannot be reproduced externally, because running them requires closed data that cannot be published. Some functional tests are unavailable to external users for the same reason.
- They do not evolve. The set of tests needs to be expanded substantially, so that changes in the speed of individual parts of the system can be checked in isolation.
- They do not run per commit or for pull requests, so external developers cannot check their code for performance regressions.
These problems could be solved by throwing out the old tests and making new ones based on open data, such as flights in the USA, taxi rides in New York, or ready-made benchmarks like TPC-H, TPC-DS, and the Star Schema Benchmark. The inconvenience is that this data is nothing like Yandex.Metrica data, and we would like to keep our test queries.
It is important to use real data
Performance should be tested only on real data from production. Let's look at a few examples.
Example 1
Suppose you populate a database with uniformly distributed pseudorandom numbers. Such data cannot be compressed, yet compression is an essential property of analytical DBMSs. Choosing the right compression algorithm and the right way to integrate it into the system is a non-trivial task with no single correct answer, because compression is a trade-off between compression and decompression speed and the achievable compression ratio. Systems that cannot compress data are guaranteed to lose. But if we use uniformly distributed pseudorandom numbers in tests, this factor is ignored, and all the other results get distorted as well.
Conclusion: test data should have a realistic compression ratio.
I talked about the optimization of data compression algorithms in ClickHouse in a previous post.
Example 2
Suppose we are interested in the speed of this SQL query:
SELECT RegionID, uniq(UserID) AS visitors
FROM test.hits
GROUP BY RegionID
ORDER BY visitors DESC
LIMIT 10
This is a typical query for Yandex.Metrica. What matters for its speed?
- How GROUP BY is performed.
- Which data structure is used to compute the uniq aggregate function.
- How many different RegionIDs there are and how much RAM each state of the uniq function requires.
But it is also very important that the amount of data is distributed unevenly across regions. (It is probably distributed according to a power law; I plotted it on a log-log scale, but I am not sure.) If so, it is important that the states of the uniq aggregate function for keys with few values use very little memory: when there are many different aggregation keys, it comes down to mere bytes per key. How do we get generated data that has all these properties? Of course, it is better to use real data.
Many DBMSs implement the HyperLogLog data structure for approximate COUNT(DISTINCT), but they all work comparatively poorly because this data structure uses a fixed amount of memory regardless of the set size. ClickHouse has a function that uses a combination of three different data structures, depending on the size of the set.
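As a sketch of the general idea (not ClickHouse's actual implementation of uniq), here is an adaptive distinct counter that stays tiny for small sets and switches to a heavier structure as the set grows; the third, fixed-memory HyperLogLog-like stage is only mentioned in a comment:
#include <cstdint>
#include <unordered_set>
#include <vector>

// Sketch of an adaptive distinct counter: exact and small for small sets,
// switching representations as the set grows. A real implementation would
// add a third, fixed-memory stage (a HyperLogLog-like sketch) for huge sets.
class AdaptiveDistinctCounter
{
    static constexpr size_t small_limit = 16;
    std::vector<uint64_t> small;         // stage 1: flat array, a few bytes per key
    std::unordered_set<uint64_t> large;  // stage 2: hash set
    bool is_small = true;

public:
    void insert(uint64_t hash_of_value)
    {
        if (is_small)
        {
            for (uint64_t x : small)
                if (x == hash_of_value)
                    return;
            small.push_back(hash_of_value);
            if (small.size() > small_limit)
            {
                large.insert(small.begin(), small.end());
                small.clear();
                is_small = false;
            }
        }
        else
            large.insert(hash_of_value);
    }

    size_t size() const { return is_small ? small.size() : large.size(); }
};
With realistic data most aggregation keys stay in the first stage, which is exactly why the distribution of values matters for the benchmark.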
Conclusion: test data should reflect the distribution properties of the values - the cardinality (the number of distinct values in a column) and the mutual cardinality of several different columns.
Example 3
Now suppose we are testing the performance not of the ClickHouse analytical DBMS, but of something simpler, for example hash tables. For hash tables, choosing the right hash function is very important. It matters slightly less for std::unordered_map, because it is a chaining hash table and uses a prime number as the array size. The standard library implementations in gcc and clang use a trivial hash function as the default for numeric types. But std::unordered_map is not the best choice when we want maximum speed. With open-addressing hash tables, the correct choice of hash function becomes a decisive factor, and we cannot use the trivial hash function.
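As a toy illustration (not ClickHouse code), here is a minimal open-addressing table with linear probing and a pluggable hash function. With the trivial identity hash and keys that share their low bits, every key lands in the same bucket and the probe count explodes; a mixing hash such as the murmur finalizer shown later in this post keeps it small:
#include <cstdint>
#include <cstdio>
#include <vector>

// Toy open-addressing hash table with linear probing; counts probes on insert.
struct ToyTable
{
    std::vector<uint64_t> slots;   // 0 means "empty" in this toy example
    uint64_t (*hash)(uint64_t);
    size_t probes = 0;

    ToyTable(size_t size_log2, uint64_t (*h)(uint64_t))
        : slots(size_t(1) << size_log2, 0), hash(h) {}

    void insert(uint64_t key)
    {
        size_t mask = slots.size() - 1;
        size_t pos = hash(key) & mask;
        while (slots[pos] != 0 && slots[pos] != key)
        {
            pos = (pos + 1) & mask;   // linear probing
            ++probes;
        }
        slots[pos] = key;
    }
};

uint64_t trivial_hash(uint64_t x) { return x; }

int main()
{
    // Keys that differ only in their high bits: with the trivial hash they all
    // map to the same bucket of a power-of-two sized table.
    ToyTable t(16, trivial_hash);
    for (uint64_t i = 1; i <= 10000; ++i)
        t.insert(i << 20);
    std::printf("probes with trivial hash: %zu\n", t.probes);
    return 0;
}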
It is easy to find hash table performance tests on random data that ignore the hash functions used. It is also easy to find hash function tests that focus on computation speed and some quality criteria, but in isolation from the data structures that use them. The point is that hash tables and HyperLogLog require different hash function quality criteria.
Read more about this in the talk “How Hash Tables in ClickHouse Are Arranged”. It is slightly outdated, since it does not cover Swiss tables.
Task
We want data for performance testing with the same structure as Yandex.Metrica data, preserving all the properties that matter for benchmarks, but without any traces of real site visitors left in it. In other words, the data should be anonymized while preserving:
- compression ratios
- cardinality (number of different values),
- mutual cardinalities of several different columns,
- properties of the probability distributions that can model the data (for example, if we believe that regions follow a power law, then the exponent - the distribution parameter - should be approximately the same for the artificial data as for the real data).
What does it take for the data to have a similar compression ratio? For example, if LZ4 is used, substrings in the binary data should repeat at roughly the same distances, and the repetitions should have roughly the same lengths. For ZSTD, matching byte entropy is added on top of that.
The stretch goal: make a tool available to people outside the company that anyone can use to anonymize their own data set for publication, so that we can debug and test performance on other people's data that resembles production data. It would also be nice if the generated data were interesting to look at.
This is an informal statement of the problem, and nobody was going to give a formal one.
Attempts to solve
The importance of this task for us should not be exaggerated. In fact, it was never in the plans, and nobody even intended to work on it. I simply did not give up hope that something would occur to me on its own, and that one day I would be in a good mood and have plenty of other work that could be put off until later.
Explicit Probabilistic Models
The first idea is to pick, for each column of the table, a family of probability distributions that models it, then fit the parameters from the data statistics and generate new data from the fitted distribution. A pseudo-random number generator with a predefined seed can be used to get a reproducible result.
For text fields, Markov chains can be used: a well-understood model that allows an efficient implementation.
Some tricks are required, though:
- We want to preserve the continuity of time series, which means that for some data types we need to model not the value itself but the difference between neighboring values.
- To simulate the conditional cardinality of columns (for example, that there are usually few IP addresses per visitor identifier), the dependencies between columns have to be written down explicitly (for example, to generate an IP address, a hash of the visitor identifier is used, with some other pseudo-random data mixed in).
- It is unclear how to express the dependency that the same visitor often visits URLs with matching domains at around the same time.
All of this can be expressed as a program in which all distributions and dependencies are hard-coded - a so-called “C++ script”. Markov models are still computed from aggregated statistics, with smoothing and coarsening by noise. I started writing such a script, but after explicitly writing out the model for ten columns it became unbearably boring - and the hits table in Yandex.Metrica had more than 100 columns back in 2012.
// Hard-coded empirical distributions for the event time components:
EventTime.day(std::discrete_distribution<>({
    0, 0, 13, 30, 0, 14, 42, 5, 6, 31, 17, 0, 0, 0, 0, 23, 10, ...})(random));
EventTime.hour(std::discrete_distribution<>({
    13, 7, 4, 3, 2, 3, 4, 6, 10, 16, 20, 23, 24, 23, 18, 19, 19, ...})(random));
EventTime.minute(std::uniform_int_distribution(0, 59)(random));
EventTime.second(std::uniform_int_distribution(0, 59)(random));

// UserID follows a power law and is correlated with the event time:
UInt64 UserID = hash(4, powerLaw(5000, 1.1));
UserID = UserID / 10000000000ULL * 10000000000ULL
    + static_cast<UInt64>(EventTime) + UserID % 1000000;

// A generator seeded from a power-law value, to model dependent columns:
random_with_seed.seed(powerLaw(5000, 1.1));
auto get_random_with_seed = [&]{ return random_with_seed(); };
This attempt at the task was a failure. If I had approached it more diligently, the script surely would have been written.
Advantages:
- Conceptual simplicity.
Disadvantages:
- Complexity of implementation.
- The implemented solution is suitable for only one kind of data.
And we would like a more general solution, one that could be applied not only to Yandex.Metrica data but to obfuscating any other data as well.
However, improvements are possible. Instead of selecting models manually, we could implement a catalog of models and choose the best among them (best fit plus some kind of regularization). Perhaps Markov models could be used for all field types, not only text. Dependencies between columns could also be discovered automatically, by computing the relative entropies (relative amount of information) between columns, or more simply the relative cardinalities (something like “how many different values of A there are on average for a fixed value of B”) for every pair of columns. This would make it clear, for example, that URLDomain depends entirely on the URL, and not vice versa.
But I abandoned this idea too, because there are too many things to take into account and it would take too long to write.
Neural networks
I have already said how important this task was for us: nobody even thought of taking a stab at implementing it. Fortunately, my colleague Ivan Puzyrevsky was teaching at the HSE at the time while also developing the YT core. He asked whether I had any interesting problems that could be offered to students as thesis topics. I offered him this one, and he assured me it was suitable. So I gave the task to a capable person “off the street” - Sharif Anvardinov (an NDA was signed for working with the data).
I told him about all my ideas, but most importantly I explained that the problem could be solved in any way whatsoever, and that a perfectly good option would be to use approaches I do not understand at all: for example, generating a text dump of the data with an LSTM. This looked promising thanks to the article The Unreasonable Effectiveness of Recurrent Neural Networks, which had caught my eye at the time.
The first peculiarity of the task is that we need to generate structured data, not just text. It was not obvious whether a recurrent neural network could generate data with the desired structure. There are two ways to deal with this. The first is to use separate models for the structure and for the “filler”, with the neural network only generating values; this option was postponed and then never done. The second is to simply generate a TSV dump as text. Practice showed that some of the generated lines will not match the structure, but those lines can be discarded at load time.
The second peculiarity is that a recurrent neural network generates a sequence, so dependencies in the data can only follow the order of that sequence. But in our data the order of the columns may be the reverse of the dependencies between them. We did nothing about this.
By summer, the first working Python script that generated data appeared. At first glance, the data quality looked decent. However, some difficulties emerged:
- The model is about a gigabyte in size, while we were trying to model data on the order of a few gigabytes (to start with). The fact that the model is so large raises a concern: could the real data it was trained on be extracted from it? Probably not. But I do not understand machine learning and neural networks, and I have not read this person's Python code, so how can I be sure? Articles were appearing at the time about how to compress neural networks without loss of quality, but that was never implemented. On the one hand, this does not look like a problem, since we could decline to publish the model and publish only the generated data. On the other hand, in case of overfitting, the generated data may contain parts of the source data.
- On a single machine with a CPU, the data generation rate is roughly 100 rows per second, and we needed to generate at least a billion rows. A quick calculation showed this could not be done before the thesis defense. Using other hardware did not make sense, because the goal was to make the data generation tool available for anyone to use.
Sharif tried to study the quality of the data by comparing statistics. Among other things, he computed the frequency of different characters in the source data and in the generated data. The result was stunning: the most frequent characters were Ð and Ñ (UTF-8 text was being treated as individual bytes, and 0xD0 and 0xD1 are the lead bytes of Cyrillic characters).
Do not worry: he defended his thesis successfully, and we safely forgot about this work.
Compressed Data Mutation
Suppose the problem statement were reduced to a single point: we need to generate data whose compression ratio is exactly the same as that of the source data, and which decompresses at exactly the same speed. How? By editing the bytes of the compressed data directly! Then the size of the compressed data does not change, but the data itself does, and everything works quickly. When this idea appeared, I immediately wanted to implement it, even though it solves a different problem than the original one. That is how it always goes.
How do we edit a compressed file directly? Suppose we only care about LZ4. LZ4-compressed data consists of two kinds of instructions:
- Literals: copy the next N bytes as is.
- Matches (with a minimum repeat length of 4): repeat N bytes that occurred in the file at a distance of M.
Source data:
Hello world Hello
Compressed data (schematically):
literals 12 "Hello world " match 5 12
In the compressed file we leave the matches as they are and change the byte values only in the literals. After decompression we get a file in which all repeated sequences of length at least 4 are still repeated, at the same distances, but consist of a different set of bytes (figuratively speaking, not a single byte of the modified file is taken from the original file).
But how do we change the bytes? This is not obvious, because besides the column types the data also has its own internal, implicit structure that we would like to preserve. For example, text is often stored in UTF-8, and we want the generated data to be valid UTF-8 too. I made a simple heuristic that satisfies several conditions (a sketch follows the list):
- null bytes and ASCII control characters are kept as is,
- some punctuation characters are kept as is,
- ASCII bytes are mapped to ASCII bytes, and for the rest the most significant bit is preserved (or you could write an explicit set of if statements for the different UTF-8 sequence lengths); within a byte class, the new value is chosen uniformly at random,
- fragments like https:// are kept intact, otherwise everything looks too silly.
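Here is a sketch of this mutation for a raw LZ4 block (not the framed format), assuming well-formed input: tokens, lengths, and match offsets are left untouched, and only the literal bytes are rewritten according to the byte-class rules above. The byte classes and the set of preserved punctuation are simplified compared to what a real tool would need.
#include <cstdint>
#include <cstddef>
#include <random>

// Replace one literal byte with a byte of the same class: NUL, ASCII control
// characters and a few punctuation characters are kept, other ASCII maps to
// random printable ASCII, and for the rest only the most significant bit is kept.
static uint8_t mutate_byte(uint8_t c, std::mt19937 & rng)
{
    if (c < 32 || c == '"' || c == '\'' || c == '/' || c == ':' || c == '.' || c == '&' || c == '?' || c == '=')
        return c;
    if (c < 128)
        return 32 + rng() % 95;   // printable ASCII to printable ASCII
    return 128 + rng() % 128;     // keep the most significant bit
}

// Walk an LZ4 block and mutate only the literal bytes, leaving tokens, lengths
// and match offsets intact, so the compressed size and the repetition structure
// are preserved. No bounds checking beyond the block size.
void mutate_lz4_block(uint8_t * data, size_t size, uint32_t seed)
{
    std::mt19937 rng(seed);
    size_t pos = 0;
    while (pos < size)
    {
        uint8_t token = data[pos++];
        size_t literal_len = token >> 4;
        if (literal_len == 15)                       // extended literal length
        {
            uint8_t b;
            do { b = data[pos++]; literal_len += b; } while (b == 255);
        }
        for (size_t i = 0; i < literal_len; ++i)     // mutate literals in place
            data[pos + i] = mutate_byte(data[pos + i], rng);
        pos += literal_len;
        if (pos >= size)                             // last sequence: literals only
            break;
        pos += 2;                                    // skip 2-byte match offset
        if ((token & 0x0F) == 15)                    // skip extended match length
        {
            uint8_t b;
            do { b = data[pos++]; } while (b == 255);
        }
    }
}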
A peculiarity of this approach is that the source data itself acts as the model for the data, which means the model cannot be published. It also allows generating only as much data as the original, no more. For comparison, the previous approaches allowed building a model and then generating an unlimited amount of data from it.
An example for URLs:
http://ljc.she/kdoqdqwpgafe/klwlpm&qw=962788775I0E7bs7OXeAyAx
http://ljc.she/kdoqdqwdffhant.am/wcpoyodjit/cbytjgeoocvdtclac
http://ljc.she/kdoqdqwpgafe/klwlpm&qw=962788775I0E7bs7OXe
http://ljc.she/kdoqdqwdffhant.am/wcpoyodjit/cbytjgeoocvdtclac
http://ljc.she/kdoqdqwdbknvj.s/hmqhpsavon.yf#aortxqdvjja
http://ljc.she/kdoqdqw-bknvj.s/hmqhpsavon.yf#aortxqdvjja
http://ljc.she/kdoqdqwpdtu-Unu-Rjanjna-bbcohu_qxht
http://ljc.she/kdoqdqw-bknvj.s/hmqhpsavon.yf#aortxqdvjja
http://ljc.she/kdoqdqwpdtu-Unu-Rjanjna-bbcohu_qxht
http://ljc.she/kdoqdqw-bknvj.s/hmqhpsavon.yf#aortxqdvjja
http://ljc.she/kdoqdqwpdtu-Unu-Rjanjna-bbcohu-702130
The result pleased me: it was interesting to look at the data. But something was still wrong. The URLs kept their structure, but in some places it was too easy to guess yandex or avito, so I added a heuristic that occasionally swaps some bytes in some places.
Other concerns remained. For example, sensitive information may be stored in a FixedString column in binary form and, for some reason, consist of ASCII control characters and punctuation, which I had decided to keep as is. And data types are not taken into account at all.
Another problem: if a single column stores data of the form “length, value” (this is how String columns are stored), how do we keep the length correct after the mutation? When I tried to fix this, the task immediately became uninteresting.
Random permutations
Still, the problem was not solved. We had tried several approaches, and things had only gotten worse. All that remained was to do nothing and read random pages on the internet, since the mood was already spoiled. Fortunately, one of those pages contained a breakdown of the algorithm for rendering the death scene of the protagonist in the game Wolfenstein 3D.
It is a beautiful animation: the screen fills up with blood. The article explains that it is actually a pseudo-random permutation. A random permutation is a randomly chosen one-to-one transformation of a set: a mapping of the set onto itself in which the results for different arguments never repeat. In other words, it is a way to iterate over all elements of a set in random order. That is exactly the process shown in the picture: each pixel to paint is chosen randomly from all pixels, without repetition. If we simply chose a random pixel at every step, it would take a long time to paint over the last one.
The game uses a very simple pseudo-random permutation algorithm: an LFSR (linear feedback shift register). Like pseudo-random number generators, random permutations, or rather their key-parameterized families, can be cryptographically strong, and that is exactly what we need to transform the data. There are non-obvious details, though. For example, cryptographically strong encryption of N bytes to N bytes with a fixed key and initialization vector would seem usable as a pseudo-random permutation of the set of N-byte strings: such a transformation is indeed one-to-one and looks random. But if we apply the same transformation to all of our data, the result may be susceptible to analysis, because the same initialization vector and key are reused many times. This is similar to the Electronic Codebook (ECB) mode of operation of block ciphers.
How can we obtain a pseudo-random permutation? We can take simple one-to-one transformations and assemble from them a fairly complex function that looks random. Here are some of my favorite one-to-one transformations:
- multiplication by an odd number (e.g. a large prime) in two's complement arithmetic,
- xor-shift:
x ^= x >> N
- CRC-N, where N is the number of bits in the argument.
For example, the finalizer of murmurhash is assembled from three multiplications and two xor-shift operations. This operation is a pseudo-random permutation. Note, however, that hash functions, even ones mapping N bits to N bits, are not required to be one-to-one.
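For reference, here is that finalizer (the 64-bit finalizer of MurmurHash3). Each step is invertible - multiplication by an odd constant and xor with a right shift are both one-to-one on 64-bit integers - so the whole function is a permutation of UInt64 values:
#include <cstdint>

// 64-bit finalizer of MurmurHash3: a pseudo-random permutation of uint64_t.
// Each step (xor-shift, multiplication by an odd constant) is invertible,
// therefore the composition is one-to-one.
uint64_t murmur_finalizer(uint64_t x)
{
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    x *= 0xc4ceb9fe1a85ec53ULL;
    x ^= x >> 33;
    return x;
}
By itself it is not keyed, so it hides nothing; it only illustrates that such compositions are permutations.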
Here is another interesting example from elementary number theory, from Jeff Preshing's website.
How can pseudo-random permutations help with our task? We can use them to transform all numeric fields, which preserves all cardinalities and mutual cardinalities of all combinations of fields. That is, COUNT(DISTINCT) returns the same value as before the transformation, under any GROUP BY.
It is worth noting that preserving all cardinalities somewhat contradicts the goal of anonymization. Suppose we know that the source site-visit data contains a user who visited sites from 10 different countries, and we want to find that user in the transformed data. The transformed data also shows that this user visited sites from 10 different countries, which narrows the search. Even if we do find out what the user was transformed into, it will not be very useful, because all the other data was transformed as well - we will not be able to tell, for example, which sites the user visited. But such rules can be chained: for example, we may know a priori that the most popular site in our data is Yandex, with Google in second place, and just by preserving the ranks we can tell which transformed site identifiers actually mean Yandex and Google. There is nothing surprising here - remember that the problem statement is informal, and we are looking for a balance between anonymizing the data (hiding information) and preserving its properties (disclosing information). For a more reliable approach to data anonymization, read the article on that topic.
In addition, I want to preserve not only the cardinalities of values but also their approximate orders of magnitude. For example, if the source data contains numbers under 10, the transformed data should also contain small numbers. How can this be achieved?
For example, we can split the set of possible values into size classes and permute values within each class separately, so that the size classes are preserved. The easiest definition of a size class is the nearest power of two, or equivalently the position of the highest bit of the number. The numbers 0 and 1 always stay as they are. The numbers 2 and 3 stay as they are with probability 1/2 and are swapped with each other with probability 1/2. The set of numbers 1024..2047 is rearranged according to one of 1024! (factorial) possible permutations, and so on. For signed numbers, the sign can be kept as is.
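A minimal sketch of this size-class bookkeeping, assuming we are content with a rather weak permutation inside each class: the class is given by the position of the highest bit, and within a class of 2^b values the offset is multiplied by an odd key modulo 2^b (multiplication by an odd number, one of the transformations listed above, is a bijection modulo a power of two). The key name and signature are illustrative, not taken from the actual tool:
#include <cstdint>
#include <bit>

// Obfuscate x while preserving its size class [2^b, 2^(b+1)).
// Within the class, the offset is permuted by multiplication by an odd key
// modulo 2^b, which is a bijection. 0 and 1 are kept as is.
// Note: for the two-element class {2, 3} this particular permutation is always
// the identity; a stronger keyed permutation (e.g. a small Feistel network,
// described below) avoids that.
uint64_t transform_keep_size_class(uint64_t x, uint64_t odd_key)
{
    if (x < 2)
        return x;
    unsigned b = 63 - std::countl_zero(x);   // position of the highest bit
    uint64_t base = uint64_t(1) << b;        // lower bound of the class
    uint64_t mask = base - 1;                // the class contains 2^b values
    uint64_t offset = x - base;
    uint64_t permuted = (offset * (odd_key | 1)) & mask;
    return base + permuted;
}
A real implementation would use a stronger keyed permutation within each class, such as the Feistel network described next.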
It is also not obvious whether we even need a one-to-one function. We could probably just use a cryptographically strong hash function: the transformation would not be one-to-one, but the cardinalities would not differ by much.
However, we do need a cryptographically strong keyed random permutation, so that after choosing a key it is hard to recover the original data from the permuted data without knowing that key.
But there is a problem: not only do I understand nothing about neural networks and machine learning, I am also an amateur in cryptography. Only courage remains. At that point I just kept reading random pages on the internet. Fabien Sanglard's page was linked from Hacker News, and the comments there contained a discussion with a link to a blog article by the Redis developer Salvatore Sanfilippo. It said that there is a wonderful general method for obtaining random permutations called the Feistel network.
A Feistel network consists of rounds. Each round is a remarkable transformation that lets you obtain a one-to-one function from an arbitrary function. Let's see how it works (a code sketch follows the diagram):
- The bits of the argument are split into two halves:
arg: xxxxyyyy
arg_l: xxxx
arg_r: yyyy
- The left half is replaced by the right half, and the new right half is the xor of the old left half with a function of the old right half:
res: yyyyzzzz
res_l = yyyy = arg_r
res_r = zzzz = arg_l ^ F(arg_r)
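Here is a minimal sketch of a Feistel network over 32-bit values with 16-bit halves. The round function is a hypothetical stand-in (any function of the right shape works), not the one used by clickhouse-obfuscator; the inverse simply applies the same rounds in reverse order:
#include <cstdint>

// A hypothetical keyed round function: it does not need to be invertible.
static uint32_t round_f(uint32_t r, uint32_t key, int round)
{
    uint64_t x = (uint64_t(r) ^ key) + 0x9e3779b97f4a7c15ULL * (round + 1);
    x ^= x >> 29;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 32;
    return static_cast<uint32_t>(x) & 0xFFFF;
}

// Feistel network on 32-bit values split into two 16-bit halves.
uint32_t feistel_encrypt(uint32_t v, uint32_t key, int rounds = 4)
{
    uint32_t l = v >> 16, r = v & 0xFFFF;
    for (int i = 0; i < rounds; ++i)
    {
        uint32_t new_l = r;
        uint32_t new_r = l ^ round_f(r, key, i);
        l = new_l;
        r = new_r;
    }
    return (l << 16) | r;
}

// Inverse transformation: the same rounds applied in reverse order.
uint32_t feistel_decrypt(uint32_t v, uint32_t key, int rounds = 4)
{
    uint32_t l = v >> 16, r = v & 0xFFFF;
    for (int i = rounds - 1; i >= 0; --i)
    {
        uint32_t new_r = l;
        uint32_t new_l = r ^ round_f(l, key, i);
        l = new_l;
        r = new_r;
    }
    return (l << 16) | r;
}
For any key and any round function, feistel_decrypt(feistel_encrypt(x, key), key) == x.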
There is also a result stating that if we take a cryptographically strong pseudo-random function as F and apply the Feistel round at least 4 times, we get a cryptographically strong pseudo-random permutation.
It looks like a miracle: we take a function that produces random garbage from data, plug it into a Feistel network, and now we have a function that produces random garbage from data but is fully invertible!
The Feistel network is at the heart of several data encryption algorithms. What we are going to do is essentially a kind of encryption, only a very bad one, for two reasons:
- we encrypt individual values independently and identically, similarly to the Electronic Codebook mode of operation;
- we preserve information about the order of magnitude of the value (the nearest power of two) and its sign, which means some values do not change at all.
This way we can obfuscate numeric fields while preserving the properties we need. For example, the LZ4 compression ratio should stay approximately the same, because values that repeat in the source data will also repeat in the transformed data, at the same distances from each other.
Markov models
Text models are used for data compression, predictive input, speech recognition, and also for generating random strings. A text model is a probability distribution over all possible strings. Imagine, for example, the probability distribution of the texts of all books that humanity could ever write. To generate a string, we simply sample a random variable from this distribution and return the resulting string (a random book that humanity could write). But how do we learn the probability distribution of all possible strings?
First, storing it explicitly would require too much information: there are, for example, 256^10 possible strings of 10 bytes, and writing out a table with the probability of each one would take far too much memory. Second, we do not have enough statistics to reliably estimate such a distribution.
Therefore, a text model uses a probability distribution obtained from coarser statistics. For example, we can compute the probability with which each letter occurs in the text, and then generate strings by choosing each next letter with that probability. This primitive model works, but the strings come out very unnatural.
To improve the model slightly, we can use not only the probability of each individual letter, but also the conditional probability of a letter given the N previous letters, where N is a fixed constant. For example, with N = 5 we estimate the probability of seeing the letter “e” after the letters “compr”. Such a text model is called an order-N Markov model:
P(мама | мам) = 0.9
P(мамб | мам) = 0.05
P(мамв | мам) = 0.01
...
You can see Markov models at work on the Yandex.Referat website (formerly vesna.yandex.ru). Unlike LSTM neural networks, they only hold a small context of fixed length N, so they generate funny but incoherent text. Markov models are also used in primitive spam generation methods: the generated texts are easy to distinguish from real ones by counting statistics that go beyond the model. Note one advantage, though: Markov models work orders of magnitude faster than neural networks, and that is exactly what we need.
Example for Title:
Fashionable Pyshki - Informer.Ru - national accident in express fixicapriza, traffic jams in Novosting. Trial online nails, but not a pyruvod, electroshock frames at a discount or - Yandex.Afisha Mail.Ru - fashionable dating - online store STRETBOLKU in online cars. Odessa) - AUTO.ria.ua Bazar Car Magnetomotor movies online free and video
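To make the idea concrete, here is a minimal sketch of an order-N byte-level Markov model: a plain map from an N-byte context to counts of the next byte, with sampling proportional to the counts. The real tool combines models of orders 0 through N with smoothing, as described next.
#include <cstdint>
#include <map>
#include <random>
#include <string>
#include <vector>

// Order-N character-level Markov model: counts of the next byte for each
// context of N previous bytes, then sampling proportionally to the counts.
struct MarkovModel
{
    size_t order = 5;
    std::map<std::string, std::map<char, uint64_t>> counts;

    void train(const std::string & text)
    {
        for (size_t i = order; i < text.size(); ++i)
            ++counts[text.substr(i - order, order)][text[i]];
    }

    // Generates up to `limit` bytes; `context` must be `order` bytes long.
    std::string generate(std::string context, size_t limit, std::mt19937_64 & rng) const
    {
        std::string result = context;
        while (result.size() < limit)
        {
            auto it = counts.find(context);
            if (it == counts.end())
                break;                       // no statistics for this context
            uint64_t total = 0;
            for (const auto & [ch, cnt] : it->second)
                total += cnt;
            uint64_t ticket = rng() % total; // pick the next byte proportionally to its count
            char next = 0;
            for (const auto & [ch, cnt] : it->second)
            {
                if (ticket < cnt) { next = ch; break; }
                ticket -= cnt;
            }
            result += next;
            context = result.substr(result.size() - order);
        }
        return result;
    }
};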
So we can compute statistics over the source data, build a Markov model, and generate new data from it. One nuance to note right away is that the model needs to be coarsened (smoothed) so as not to reveal information about rare combinations in the source data, but this is not a problem. I use a combination of models of orders 0 to N, where a model of order N - 1 is used whenever the statistics are insufficient for the model of order N.
But we still want to preserve the cardinality of the data. For example, if the source data contains 123,456 unique URL values, the result should contain roughly the same number. To achieve this, we sample the random variables with a deterministically initialized random number generator. The easiest way is to apply a hash function to the original value: we then get a pseudo-random result that is uniquely determined by the original value.
Another requirement: suppose the source data contains many different URLs that start with the same prefix but continue differently. For example:
https://www.yandex.ru/images/cats/?id=xxxxxx
We want the resulting URLs to also begin with a common prefix, but a different one. For example:
http://ftp.google.kz/cgi-bin/index.phtml?item=xxxxxx
To achieve this, as the random number generator for picking the next character with the Markov model, we take a hash not of the entire source string, but of a sliding window of 8 bytes at the current position.
It turns out to be just what we need. Here is an illustration for a URL:
https://www.yandex.ru/images/cats/?id=12345
                      ^^^^^^^^
distribution: [aaaa][b][cc][dddd][e][ff][ggggg][h]...
hash("images/c") % total_count: ^
http://ftp.google.kz/cg...
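A minimal sketch of this deterministic choice, assuming a simple stand-in hash (FNV-1a mixed with a seed; the real tool uses its own keyed hash): the "random" ticket for the next output byte depends only on the last 8 bytes of the source string, so identical source prefixes yield identical generated prefixes, and the overall cardinality is roughly preserved.
#include <cstdint>
#include <cstddef>
#include <string>

// Stand-in hash for illustration (FNV-1a mixed with a seed).
uint64_t window_hash(const char * data, size_t size, uint64_t seed)
{
    uint64_t h = 0xcbf29ce484222325ULL ^ seed;
    for (size_t i = 0; i < size; ++i)
    {
        h ^= static_cast<unsigned char>(data[i]);
        h *= 0x100000001b3ULL;
    }
    return h;
}

// The pseudo-random "ticket" used to sample the next output character from the
// Markov model depends only on a sliding window of the last 8 bytes of the
// source string at the current position.
uint64_t ticket_for_position(const std::string & source, size_t pos, uint64_t seed)
{
    const size_t window = 8;
    size_t begin = pos < window ? 0 : pos - window;
    return window_hash(source.data() + begin, pos - begin, seed);
}
This ticket would take the place of rng() % total in the Markov sketch above. And here is an example of generated page titles: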
Progradar-children pregnant with departure or Dachna Nevestika and Moscow Region | Colds. - Posters put in Accessory Progradar-children coast - Yandex.Money: Pay magazine five bikes on Lore - dfghf - ya.ru - real estate in Moscow) by 473,682 advertisements - Sale of Comprom Progradar children free! in a big assot ”in Moscow - Embroidery - Omsk Pay in Most Sandals growth Progradar children free! in a big assot "in Moscow Progradar-children Berdyansk Mods. Recipe c photo gallery and covered up the loud football of Moskovye Mail.Ru - Search Progradar-Children Berezhnova sale See the best price, community 2010 | The design laboratory is BASIC. Dosta Progradar children free! in a large assotrud Price: 820 0000 km., Tagande apartments in St. Pete Progradar-children cherished for months - DoramaTv.ru - Dresses all over the world. Online sale of cars used and at a discount Progradar-children Beremhovoy room. in 2013, used cars in Stavropavlins and strollers -> Magnitaz 80 cell.RU Progradar-children are protected - Official Forums Kalinin (Ukraine. Author Baksler Kudryavtseva delivery, job offers, hotel sales Progradar-children pregnancy detailed 5, 69, communication W * stable - Yandex.Maps, home, what price is the World of Tanks game forum Progradar-children of Berets, Fatherland and in pink page 2 of the office Progradar-children Take care of the prograd in China, Verta Bar, weather Manika. Entries in Smolensk
Result
After four attempted approaches, the problem had worn me down enough that it was time to pick an option, package it as a convenient tool, and declare it the solution. I chose the approach based on random permutations and Markov models parameterized by a key. It is implemented as the clickhouse obfuscator program, which is very easy to use: you feed it a table dump in any supported format (for example, CSV or JSONEachRow) and specify, as command-line parameters, the table structure (column names and types) and the secret key (an arbitrary string, which you can forget right after use). The output is the same number of rows of obfuscated data.
The program is installed together with clickhouse-client, has no dependencies, and works on almost any Linux. It can be applied to dumps of any database, not just ClickHouse: for example, it can be used to generate test data from MySQL or PostgreSQL databases, or to create development databases that resemble production.
clickhouse obfuscator \
    --seed "$(head -c16 /dev/urandom | base64)" \
    --input-format TSV --output-format TSV \
    --structure 'CounterID UInt32, URLDomain String, URL String, SearchPhrase String, Title String' \
    < table.tsv > result.tsv
The full list of options can be viewed with:
clickhouse obfuscator --help
Not everything is so simple, though, because data transformed by this program is almost completely reversible. Can the inverse transformation be performed without knowing the key? If the transformation were done by a proper cryptographic algorithm, the complexity of such an operation would be the same as the complexity of a brute-force search. But although the transformation uses some cryptographic primitives, it uses them in a non-cryptographic way, and the data remains susceptible to certain methods of analysis. To avoid problems, these limitations are listed in the program's documentation (available via --help).
In the end, we converted the data set we needed for functional tests, and a director at Yandex approved its publication.
clickhouse-datasets.s3.yandex.net/hits/tsv/hits_v1.tsv.xz
clickhouse-datasets.s3.yandex.net/visits/tsv/visits_v1.tsv.xz
Developers outside Yandex use this data for real performance tests when optimizing algorithms inside ClickHouse. Third-party users can provide us with their data in an obfuscated form so that we can make ClickHouse even faster for them.