
Methods for optimizing application performance when working with RDBs
These methods work everywhere: in MySQL, in Oracle, even in a self-written database. The smarter the database, the harder it tries to optimize itself, but it is better to help it.
1. Divide and conquer, or simply cluster the database. Data of the same type can be split into clusters: separate tables, where each table holds the records that satisfy some simple rule, for example, the table with index I contains the records with ID % N == I, where N is the number of clusters. This is a very simple and effective way to partition data that does not need to be read sequentially. For example, split all the words into blocks of 100-200 million records, where block I holds only the words with ID % N == I. In a large system, such as a social network, you can partition all the data by the user it belongs to: put all photos into N tables, so that the information about a photo lives in the table K = USER_ID % N. A sketch of this mapping follows below.
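As an illustration, here is a minimal C++ sketch of that mapping; the cluster count of 64 and the photos_ table-name prefix are my own illustrative choices, not values from the article:

    #include <cstdint>
    #include <string>

    constexpr uint64_t kNumClusters = 64;  // N: the number of clusters (illustrative)

    // Table index I for a record: I = ID % N.
    uint64_t ClusterIndex(uint64_t id) {
        return id % kNumClusters;
    }

    // All photos of one user always land in the same table K = USER_ID % N.
    std::string PhotoTableName(uint64_t user_id) {
        return "photos_" + std::to_string(ClusterIndex(user_id));
    }

Because the cluster is a pure function of the ID, any process can compute the right table on its own, with no central lookup.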
2. Work with the disk sensibly. Always write (insert) sequentially, cache and buffer writes, and try to read straight through from beginning to end. The speedup from writing correctly can be simply fantastic, many orders of magnitude, just because you use writes properly, knowing how your (or the vendor's) disk-write algorithm works. Data can almost always be sorted before writing, whether it sits in memory or in separate files with pieces of text: you can always build an index, or a simple array sorted by record ID, and then read and write the records in the same order as in the index. As another option, you can always come up with a more suitable storage structure. For example, when you need to insert a chunk of one table into another, it is better to do it sequentially, from smaller IDs to larger ones, with the indexing mechanism disabled, turning it back on after the insertion.
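A minimal sketch of the sort-before-write idea, assuming fixed-size records; the Record layout here is illustrative:

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // One record of fixed size, so the file is a plain sequential array.
    struct Record {
        uint64_t id;
        char payload[56];
    };

    // Accumulate records in memory; when the buffer is full, sort by ID and
    // flush them in a single sequential write instead of many random ones.
    void FlushSorted(std::vector<Record>& buffer, std::FILE* out) {
        std::sort(buffer.begin(), buffer.end(),
                  [](const Record& a, const Record& b) { return a.id < b.id; });
        std::fwrite(buffer.data(), sizeof(Record), buffer.size(), out);
        buffer.clear();
    }

For the bulk-insert case mentioned above, MySQL, for example, lets you suspend maintenance of non-unique indexes with ALTER TABLE ... DISABLE KEYS before the load and ENABLE KEYS after it.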
3. Do not keep frequently needed data on disk: load it into memory. These days you can easily fit a gigabyte in memory, or even two. If everything does not fit, break the data into pieces and process them one piece at a time. No memcached or its analogues will help with such a task: only you know in what order the data will be processed, so you can write a solution ten times faster than the standard utilities. Many "old" data structures are well suited to this kind of work (a sketch of the hashing variant follows after this list):
- hashing (all data is divided into parts according to some simple rule, for example CRC % N)
- AVL and B-trees (I highly recommend taking paper, a pen, and your favorite C/C++, and implementing them yourself from the definitions alone, without reading the algorithm; working it out yourself develops ingenuity and raises your overall level)
- a heap with ordered extraction (a priority queue)
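A minimal sketch of the first item, hash partitioning by CRC % N; the bit-by-bit CRC32 (polynomial 0xEDB88320) is deliberately naive, and a real system would use a table-driven or library version:

    #include <cstdint>
    #include <string>
    #include <vector>

    // Plain bit-by-bit CRC32, reflected form, polynomial 0xEDB88320.
    uint32_t Crc32(const std::string& data) {
        uint32_t crc = 0xFFFFFFFFu;
        for (unsigned char c : data) {
            crc ^= c;
            for (int i = 0; i < 8; ++i)
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
        return ~crc;
    }

    // Scatter records into N in-memory buckets by CRC % N; within one
    // bucket you then work sequentially, as item 2 advises.
    std::vector<std::vector<std::string>> Partition(
            const std::vector<std::string>& words, size_t n) {
        std::vector<std::vector<std::string>> buckets(n);
        for (const auto& w : words)
            buckets[Crc32(w) % n].push_back(w);
        return buckets;
    }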
4. Use pointers and linked structures. An example from social networks: among 100 million photos it is often hard to find the ones a particular person is tagged in, but it is easy to keep a list of such photos in that person's own record. Similarly, it is impossible to hold in memory every page link known to the search engine, but the URL of a site's root fits easily. By computing a CRC of the full link, or a root ID obtained in any other way, you can quickly find the place in the file where all known links of that site are stored and check whether this one already exists.
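A minimal sketch of such a reverse pointer plus a root-keyed lookup; every name here (User, tagged_photo_ids, root_to_file_offset) is my own illustration, not the author's schema:

    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Instead of scanning all photos for a tagged person, keep the photo IDs
    // right in the user's record: a direct pointer from person to photos.
    struct User {
        uint64_t id;
        std::vector<uint64_t> tagged_photo_ids;
    };

    // Likewise for links: only site roots are held in memory, each mapped to
    // the offset of that site's block in a big file of known links.
    struct LinkIndex {
        std::unordered_map<std::string, uint64_t> root_to_file_offset;

        // Returns true and the block offset if the root is known; the caller
        // then scans only that one block of the file for the full link.
        bool FindSiteBlock(const std::string& url_root, uint64_t& offset) const {
            auto it = root_to_file_offset.find(url_root);
            if (it == root_to_file_offset.end()) return false;
            offset = it->second;
            return true;
        }
    };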
I will try to add any new nuances to this post, with the exception of very simple or trivial rules. I will not describe how to set up indexes in MySQL: there are hundreds of manuals for that, and many ways to check whether your solution really works. If you still do not know such simple things, then I am not sure the information in this post was useful to you...
And one more thing: I hold the opinion that in any task there is something to optimize, and if it is critical, do the optimization and, at the same time, educate yourself on the topic.
The full table of contents and the list of my articles on the search engine will be kept up to date here: http://habrahabr.ru/blogs/search_engines/123671/