A little bit about improving database performance: practical tips



    (Photo: Ozzy Delaney, CC)

    At 1cloud, we write a lot about our experience as a virtual infrastructure provider and the details of organizing our internal processes. Today we decided to talk a bit about database optimization.

    Many DBMSs can not only store and manage data but also execute code on the server, for example in stored procedures and triggers. However, a single data modification operation can fire several triggers and stored procedures, which in turn can “wake up” a few more. A typical example is cascading deletion in SQL databases, when deleting one row in a table changes many other related records.

    Obviously, you should use this extended functionality carefully so as not to overload the server, because all of it can affect the performance of the client applications using the database.

    Take a look at the chart below. It shows the results of application load testing: as the number of users working with the database (blue line) gradually increases to 50, the number of requests the system can handle (orange) quickly reaches its maximum and stops growing, while the response time (yellow) gradually increases.



    When working with large databases, even the slightest change can have a serious impact on performance, for better or worse. In medium and large organizations, a database administrator is usually responsible for tuning, but these tasks often fall on the shoulders of developers. Therefore, below are some practical tips that will help improve the performance of SQL databases.

    Use indexes

    Indexing is an efficient way to tune a database that is often neglected during development. The index speeds up queries by providing quick access to rows of data in a table, similar to how an index in a book helps you quickly find the information you want.

    For example, if you create an index on the primary key and then look up a row by its primary key value, the SQL server first finds the index entry and then uses it to locate the row quickly. Without an index, a full scan of all rows in the table is performed, which is a waste of resources.
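
    The effect is easy to observe with a query planner. Below is a minimal sketch using Python's built-in sqlite3 module (the table, column, and index names are illustrative, not from the article's examples); EXPLAIN QUERY PLAN shows the planner switching from a full scan to an index search once the index exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (ID INTEGER PRIMARY KEY, Name TEXT, City TEXT)")
conn.executemany("INSERT INTO Customer (Name, City) VALUES (?, ?)",
                 [(f"name{i}", f"city{i % 10}") for i in range(1000)])

plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Customer WHERE City = 'city3'"
).fetchone()
print(plan[-1])  # without an index, the planner reports a full scan

conn.execute("CREATE INDEX idx_customer_city ON Customer (City)")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Customer WHERE City = 'city3'"
).fetchone()
print(plan[-1])  # now the planner searches using idx_customer_city
```

    The same experiment scales up: the more rows the table has, the bigger the gap between the scan and the index search.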

    However, it is worth noting that if your tables are “bombarded” with INSERT, UPDATE, and DELETE statements, you need to be careful with indexing: it can degrade performance, since every affected index must be updated after each of these operations.

    Moreover, when you need to add a large number of rows to a table at once (say, more than a million), DBAs often drop the indexes to speed up the insertion process and recreate them after the load completes. Indexing is an extensive and interesting topic that deserves a deeper study than this brief overview.
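
    The drop-load-recreate pattern can be sketched as follows, again with Python's sqlite3 module (table and index names are hypothetical; in SQL Server you would script the same steps in T-SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Log (ID INTEGER, Message TEXT)")
conn.execute("CREATE INDEX idx_log_message ON Log (Message)")

rows = [(i, f"event {i}") for i in range(100_000)]

# 1. Drop the index so it is not updated on every single insert.
conn.execute("DROP INDEX idx_log_message")

# 2. Load the whole batch inside one transaction.
with conn:
    conn.executemany("INSERT INTO Log (ID, Message) VALUES (?, ?)", rows)

# 3. Rebuild the index once, over the finished data set.
conn.execute("CREATE INDEX idx_log_message ON Log (Message)")

count = conn.execute("SELECT COUNT(*) FROM Log").fetchone()[0]
```

    Rebuilding the index once over the final data is cheaper than maintaining it incrementally across a hundred thousand inserts.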

    Do not use loops with a large number of iterations

    Imagine a situation when 1000 queries sequentially arrive at your database:

    for (int i = 0; i < 1000; i++)
    {
        // One separate command and one server round trip per row
        SqlCommand cmd = new SqlCommand("INSERT INTO TBL (A,B,C) VALUES...", connection);
        cmd.ExecuteNonQuery();
    }
    

    Such loops are not recommended. The example above can be rewritten as a single INSERT or UPDATE with multiple rows:

    INSERT INTO TableName (A, B, C) VALUES (1, 2, 3), (4, 5, 6), (7, 8, 9);

    UPDATE TableName SET A = CASE B
            WHEN 1 THEN 'NEW VALUE'
            WHEN 2 THEN 'NEW VALUE 2'
            WHEN 3 THEN 'NEW VALUE 3'
        END
    WHERE B IN (1, 2, 3);
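
    The same idea applies on the client side: instead of sending 1000 separate commands as in the loop above, batch the rows through one prepared statement. A sketch using Python's sqlite3 module (the table mirrors the hypothetical TBL from the C# example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TBL (A INTEGER, B INTEGER, C INTEGER)")

rows = [(i, i * 2, i * 3) for i in range(1000)]

# One prepared statement executed over the whole batch in a single
# transaction, instead of 1000 separate commands.
with conn:
    conn.executemany("INSERT INTO TBL (A, B, C) VALUES (?, ?, ?)", rows)

inserted = conn.execute("SELECT COUNT(*) FROM TBL").fetchone()[0]
```
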
    

    Make sure your UPDATE does not overwrite rows that already hold the target value. Such a simple optimization can speed up the execution of the SQL query by reducing the number of updated rows from thousands to hundreds. An example of this check:

    UPDATE TableName
    SET A = @VALUE
    WHERE
          B = 'YOUR CONDITION'
          AND A <> @VALUE -- skip rows that already hold the new value
    

    Avoid correlated subqueries

    A correlated subquery is a subquery that uses values from the parent query. It is executed row by row, once for each row returned by the outer (parent) query, which slows the database down. Here is a simple example of a correlated subquery:

    SELECT c.Name, 
           c.City,
           (SELECT CompanyName FROM Company WHERE ID = c.CompanyID) AS CompanyName 
    FROM Customer c
    

    The problem here is that the inner query (SELECT CompanyName ...) is executed for each row that the outer query (SELECT c.Name ...) returns. To improve performance, you can rewrite the subquery as a JOIN:

    SELECT c.Name, 
           c.City, 
           co.CompanyName 
    FROM Customer c 
        LEFT JOIN Company co
            ON c.CompanyID = co.CompanyID
    

    Try not to use SELECT *

    Try not to use SELECT *! Instead, list each required column individually. It sounds simple, but many developers stumble here. Imagine a table with hundreds of columns and millions of rows: if your application needs only a few columns, querying the entire table is a big waste of resources.

    For example, which is better: SELECT * FROM Employees or SELECT FirstName, City, Country FROM Employees?

    If you really need all the columns, list each one explicitly. This helps avoid errors and extra maintenance in the future. For example, if you use INSERT ... SELECT ... and a new column appears in the source table, errors may occur even if that column is not needed in the destination table:

    INSERT INTO Employees SELECT * FROM OldEmployees
    Msg 213, Level 16, State 1, Line 1
    Insert Error: Column name or number of supplied values does not match table definition.
    

    To avoid such errors, list each column explicitly:

    INSERT INTO Employees (FirstName, City, Country)
    SELECT Name, CityName, CountryName
    FROM OldEmployees
    

    However, it is worth noting that there are situations in which the use of SELECT * is acceptable. An example is temporary tables.

    Use temporary tables wisely

    Temporary tables often complicate the structure of a query, so it is better to avoid them when a single simple query will do.

    But if you are writing a stored procedure that performs actions on data that cannot be expressed in a single query, use temporary tables as “intermediaries” on the way to the final result.

    Let's say you need to make a selection with conditions from a large table. To increase database performance, you can transfer that data to a temporary table and then join against it. The temporary table will be smaller than the original one, so the join will be faster.

    It’s not always clear how temporary tables differ from subqueries, so here is an example. Imagine a customer table with millions of records from which you need to select by region. One option is to use SELECT INTO to populate a temporary table and then join against it:

    SELECT * INTO #Temp FROM Customer WHERE RegionID = 5
    SELECT r.RegionName, t.Name FROM Region r JOIN #Temp t ON t.RegionID = r.RegionID
    

    But instead of temporary tables, you can use a subquery:

    SELECT r.RegionName, t.Name FROM Region r 
    JOIN (SELECT * FROM Customer WHERE RegionID = 5) AS t 
    ON t.RegionID = r.RegionID
    

    In the previous section, we noted that a subquery should list only the columns we need, so:

    SELECT r.RegionName, t.Name FROM Region r 
    JOIN (SELECT Name, RegionID FROM Customer WHERE RegionID = 5) AS t 
    ON t.RegionID = r.RegionID
    

    Each of the three examples returns the same result, but temporary tables give you the option of using indexes. For a more complete understanding of how temporary tables and subqueries work, see the discussion of the topic on Stack Overflow.
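
    For instance, here is a sketch of indexing a temporary table before the join, using Python's sqlite3 module (SQLite's temp tables stand in for SQL Server's #Temp; all names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Region (RegionID INTEGER, RegionName TEXT)")
conn.execute("CREATE TABLE Customer (Name TEXT, RegionID INTEGER)")
conn.execute("INSERT INTO Region VALUES (5, 'North')")
conn.executemany("INSERT INTO Customer VALUES (?, ?)",
                 [(f"c{i}", i % 10) for i in range(1000)])

# Materialize the filtered rows into a temp table, then index it --
# something an inline subquery does not allow.
conn.execute("CREATE TEMP TABLE TempCustomer AS "
             "SELECT Name, RegionID FROM Customer WHERE RegionID = 5")
conn.execute("CREATE INDEX temp.idx_temp_region ON TempCustomer (RegionID)")

result = conn.execute(
    "SELECT r.RegionName, t.Name "
    "FROM Region r JOIN TempCustomer t ON t.RegionID = r.RegionID"
).fetchall()
```
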

    When you are done with a temporary table, it is better to drop it and free the tempdb resources rather than wait for automatic deletion (which happens when your connection to the database server closes):

    DROP TABLE #temp

    Use EXISTS()

    If you want to check whether a record exists, use the EXISTS() operator instead of COUNT(). Whereas COUNT() scans the whole table, EXISTS() stops as soon as it finds the first match. This approach improves both performance and code readability:

    IF (SELECT COUNT(1) FROM EMPLOYEES WHERE FIRSTNAME LIKE '%JOHN%') > 0
        PRINT 'YES'
    

    or

    IF EXISTS(SELECT FIRSTNAME FROM EMPLOYEES WHERE FIRSTNAME LIKE '%JOHN%')
        PRINT 'YES'
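
    The same check can be reproduced in miniature with Python's sqlite3 module (the table and data are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMPLOYEES (FIRSTNAME TEXT)")
conn.executemany("INSERT INTO EMPLOYEES (FIRSTNAME) VALUES (?)",
                 [("JOHN",), ("MARY",), ("JOHNNY",)])

# SELECT EXISTS(...) returns 1 as soon as the first matching row is
# found, without scanning the rest of the table.
found = conn.execute(
    "SELECT EXISTS(SELECT 1 FROM EMPLOYEES WHERE FIRSTNAME LIKE '%JOHN%')"
).fetchone()[0]
print("YES" if found else "NO")
```
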
    

    Instead of a conclusion


    Application users love it when they don’t have to stare at a loading indicator and everything works quickly and smoothly. Applying the techniques described in this material will improve database performance, which has a direct positive effect on user experience.

    I would like to summarize and repeat the key points described in the article:

    1. Use indexes to speed up searching and sorting.
    2. Do not insert data in loops with many iterations - use a single INSERT or UPDATE with multiple rows instead.
    3. Avoid correlated subqueries.
    4. Limit the number of columns in the SELECT statement - specify only the columns you need.
    5. Use temporary tables only as “intermediaries” when joining large tables.
    6. To check whether a record exists, use the EXISTS() operator, which stops as soon as it finds the first match.

    If you are interested in the topic of database performance, there is a discussion on Stack Exchange that collects a large number of useful resources; it is worth a look. You can also read our material on how large companies around the world work with data.
