Taking MySQL out of the production environment

    As soon as your information system goes into production, you need at least two copies of its database. The first one, the backup, is created at regular intervals with standard utilities and is a consistent dump. It exists to restore the system after a failure (disaster recovery).
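    For MySQL the consistent dump is usually taken with mysqldump. For InnoDB tables, its --single-transaction mode boils down to reading everything inside a single snapshot, roughly like this (table names here are hypothetical):

        -- Roughly what mysqldump --single-transaction relies on for InnoDB:
        -- all reads happen inside one snapshot, so the dump stays consistent
        -- even while the application keeps writing.
        SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
        START TRANSACTION WITH CONSISTENT SNAPSHOT;
        SELECT * FROM prod_db.accounts;  -- each table is read from the same snapshot
        SELECT * FROM prod_db.payments;
        COMMIT;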

    Here we will look at creating the second copy, the one needed to keep working on the project. The article is aimed at developers who are just setting out on the thorny path of quality management. It is useless to those who already know that the “second copy” is in fact not quite the second, and not quite a copy.

    Divide and rule


    Industry practice recommends three to four independent environments for an information system. Besides the production environment mentioned above, these are development, test and staging.

    Since deploying and maintaining each environment is a direct cost, how many of them a project uses depends on its specifics. The higher the cost of an error in production, the more environments are included in the development process, and vice versa. The last two can be replaced by a single test environment if the process has no separate acceptance testing or demonstrations. Likewise, all three can collapse into one development environment if the process involves neither regression testing nor continuous integration. In the limit, none of them may exist at all: think of the static business-card site of an obscure company, updated over FTP.

    Besides experiments with functionality and troubleshooting, independent environments let you scale the project team. When a programmer programs and an administrator administers, that is good; it is called segregation of duties. It is recommended to lock the developer into the development environment and the tester into the test environment. Such an organizational and technical approach is believed to reduce operational risk, which makes the owners of information systems very happy.

    So what about the second copy? The easiest and most obvious way to get one is to copy the backup, and then raise whatever additional environment you need from it.

    You can’t just take it


    and copy the database from production into the development or test environment.

    The fact is that today the key risk to information systems is the leakage of non-public data. For example, according to Verizon's 2013 Data Breach Investigations Report, covering some 47,000 incidents, 69% of data breaches were discovered by third parties. That is, a leak gets noticed by outsiders, potentially including your own customers. And more than half of insider cases involve former employees who used their forgotten but still active accounts, or backdoors.

    Consequently, direct access to production should be reserved for those who work with it directly, such as the database administrator, and certainly not handed to every programmer in the company “because debugging is easier that way”. Yes, a non-disclosure agreement (NDA) in the contract is a good preventive measure. But it is only an organizational one, and therefore insufficient: organizational measures should be backed by technical controls.
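    In MySQL the most basic technical control of this kind is account privileges. A minimal sketch, with hypothetical account and schema names:

        -- The developer account gets full rights on the development copy only.
        -- With no grants on prod_db, the server will not even show that schema
        -- to the account, let alone allow queries against it.
        CREATE USER 'developer'@'%' IDENTIFIED BY '<password here>';
        GRANT ALL PRIVILEGES ON dev_db.* TO 'developer'@'%';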

    Data is getting more expensive to maintain


    More and more protection requirements are imposed on data, from voluntary compliance (for example, the ISO 27k family of standards) to certification by a local regulator (in the EU, a national data protection office). In the end it all comes down to minimizing the damage a company suffers from a leak of non-public data. Notably, protecting data about people has already overshadowed protecting trade secrets. And cloud storage, located who knows where and maintained by who knows whom, only adds to the acuteness of the problem.

    The standard practice for protecting data outside production is sanitization, also known as anonymization. The data to be protected is grouped by type: for example, bank account number, date of birth, last name, encrypted password. Then one of the anonymization techniques is applied to each group: full or partial masking, wiping, shuffling, substitution, encryption or hashing. We will not cover the techniques here; that is a topic for a separate article. As a bonus, if you ever get to write anonymization of bank accounts, have a look at Sample Account Data: it lists both valid and invalid bank details by country.
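    To make this concrete, here is what a few of these techniques can look like in plain MySQL, on a hypothetical clients table:

        -- Partial masking: keep only the last four digits of the account number.
        UPDATE clients SET account = CONCAT('************', RIGHT(account, 4));
        -- Wiping: replace every date of birth with a fixed neutral value.
        UPDATE clients SET birth_date = '1970-01-01';
        -- Hashing: last names stay distinct but can no longer be read back.
        UPDATE clients SET last_name = SHA2(last_name, 256);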

    There is more and more data


    The manual way to deploy a test database is:
    • copying a dump of the production database to the test server;
    • restoring the database from it;
    • running the anonymization script.

    To pass the “clean” database further on, to the development environment or outside the company, a backup copy of it is made.

    The advantages of this approach are obvious: it is fast at small volumes, and the resulting relational data is consistent. But there are also disadvantages, which show up in full force at volumes of tens of gigabytes and beyond:
    • an intermediate dump file has to be transferred over the network;
    • a full copy is usually redundant for development or testing purposes;
    • until anonymization completes, the data can leak both from the dump file and from the database;
    • the anonymization script is a separate moving part: if it fails, the data stays as it was.


    The solution


    While preparing for a Java exam, I wrote the org.crystalcopy utility for MySQL, free of the listed shortcomings, and of the listed advantages as well.
    The utility supports tables, records, triggers, stored procedures, indexes and foreign keys. It is suitable for:
    • partial copying of a database into another database or a file;
    • full copying of a database into another database or a file;
    • copying the database schema alone into another database or a file;
    • anonymization in SQL without transferring data over the network (see the sketch below).
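    I will not reproduce the utility's internals here, but the idea behind the last point is simple: the masking expressions are executed by the server itself, so raw values never cross the network and only the clean copy is ever exported. A sketch in plain SQL, with hypothetical names; the utility's actual mechanism may differ:

        -- Anonymization on the server side: the SELECT with masking expressions
        -- runs inside MySQL, so unprotected values never leave the server.
        CREATE TABLE clean_db.clients AS
        SELECT id,
               CONCAT('************', RIGHT(account, 4)) AS account,
               SHA2(last_name, 256)                      AS last_name
        FROM prod_db.clients;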

    With partial copying, the relational links between records are preserved, but the consistency of the database as a whole is not guaranteed. A full copy from one database to another runs at about 2 GB per hour on my machine. The utility is launched from the command line and can be integrated into a continuous integration build. It is available on the Internet under a non-commercial license.
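    To show what “links preserved, but overall consistency not guaranteed” means in practice, here is the general shape of such a partial copy in plain SQL (hypothetical schema; the utility's actual queries may differ):

        -- Take a slice of orders, then pull exactly the customers those orders
        -- reference, so every foreign key in the sample still resolves.
        -- Records outside the slice are simply absent, which is why the
        -- database as a whole is no longer consistent.
        CREATE TABLE sample_db.orders AS
        SELECT * FROM prod_db.orders ORDER BY id DESC LIMIT 10000;
        CREATE TABLE sample_db.customers AS
        SELECT * FROM prod_db.customers
        WHERE id IN (SELECT customer_id FROM sample_db.orders);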

    The status is beta, so I will be happy to receive your feedback!
