The problem of periodically long-running queries in MS SQL Server

Foreword


There is an information system that I administer. The system consists of the following components:

1. MS SQL Server database
2. Server application
3. Client applications

This information system is installed at several sites. At each site it is in active use around the clock by 2 to 20 users simultaneously, so routine maintenance cannot be done all at once. In other words, index defragmentation has to be spread over the whole day rather than defragmenting all the fragmented indexes in one go (a sketch of such "one index at a time" maintenance follows below). The same applies to the other maintenance work.
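
For illustration, here is a minimal sketch of such spread-out maintenance - not my actual job; the 30% threshold and the choice of REORGANIZE are just examples. Each run picks the single most fragmented index in the database and processes only it, leaving the rest for later runs:

-- Pick the single most fragmented index in the current database.
DECLARE @sql nvarchar(max);

SELECT TOP (1)
       @sql = N'ALTER INDEX ' + QUOTENAME(i.name)
            + N' ON ' + QUOTENAME(SCHEMA_NAME(o.schema_id)) + N'.' + QUOTENAME(o.name)
            + N' REORGANIZE;'
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS s
JOIN sys.indexes AS i ON i.object_id = s.object_id AND i.index_id = s.index_id
JOIN sys.objects AS o ON o.object_id = s.object_id
WHERE s.avg_fragmentation_in_percent > 30  -- example threshold
  AND i.name IS NOT NULL                   -- skip heaps
ORDER BY s.avg_fragmentation_in_percent DESC;

-- Reorganize just that one index; the next run will pick the next one.
IF @sql IS NOT NULL
    EXEC sys.sp_executesql @sql;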

Auto-update statistics is enabled in the properties of the database itself; statistics are also refreshed whenever an index is defragmented.
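
For reference, that database property corresponds to the following statement (YourDb is a placeholder name):

-- Let the optimizer refresh stale statistics automatically.
ALTER DATABASE [YourDb] SET AUTO_UPDATE_STATISTICS ON;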

Problem


About a year ago, I encountered the following problem:

Periodically, all queries would take a long time to execute, and the slowdowns occurred at random moments. This happened at every site on random days. Moreover, when I started analyzing (with SQL Server Profiler) how often the slowdowns occur, it turned out that they happen every day, at a random time. Users simply do not always pay attention to them: each one is perceived as a one-off random delay, after which the system is fast again.

Path to salvation


The long-running queries themselves were reviewed. But the most interesting thing was that at random times all queries ran slowly - even the simplest ones, such as fetching the last record from a table of a few thousand rows.

Next, the following work was carried out:

  1. The MS SQL Server and Windows Server logs were analyzed - no cause of the slowdowns was found
  2. The indexes were analyzed (fragmentation, etc.) - missing ones were added and unused ones removed
  3. The queries were analyzed - some of them were improved
  4. The jobs in SQL Server Agent were analyzed - they could not be linked to the slowdowns
  5. The tasks in the Task Scheduler were analyzed - they could not be linked to the slowdowns either
  6. Profiler showed the effect of the slowdowns, but not their cause
  7. Locks were checked - no deadlocks or long-lived locks were detected at all (see the sketch after this list)
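
The check from step 7 looked roughly like this - a sketch using the standard DMVs, not the exact script:

-- Sessions that are currently blocked and who is blocking them;
-- during the slowdowns this consistently came back empty.
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time AS wait_time_ms,
       t.text      AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0;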

As a result, more than three months were spent unsuccessfully searching for the cause of the periodic slowdowns. However, one interesting fact did emerge: for all queries it was the Elapsed (wall-clock) time that grew, not the Worker (CPU) time, which suggested that something might be wrong with the disks. I checked them as well - everything was fine.
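
That asymmetry can be seen in the DMVs; here is a sketch (the times in sys.dm_exec_query_stats are reported in microseconds, converted to milliseconds below):

-- For cached queries, compare wall-clock time (elapsed) with CPU time
-- (worker); a large gap means the query spent its time waiting.
SELECT TOP (20)
       qs.execution_count,
       qs.total_elapsed_time / 1000 AS total_elapsed_ms,
       qs.total_worker_time  / 1000 AS total_cpu_ms,
       (qs.total_elapsed_time - qs.total_worker_time) / 1000 AS total_wait_ms,
       t.text AS sql_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS t
ORDER BY (qs.total_elapsed_time - qs.total_worker_time) DESC;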

Solution


Surprisingly, we managed to establish by chance that a query that is slow in the application is fast in SSMS itself. The following article then helped solve the problem (at the very least, it later suggested the idea).

I will quote the following paragraph from this article:

In practice, the most important SET option is ARITHABORT, because the default value for this option differs between applications and SQL Server Management Studio. This explains why you can observe a slow query in your application and then get good speed by running it in SSMS: the application uses a plan that was built for a set of values different from the current, correct ones, whereas if you run the query in SSMS, the cache most likely does not yet contain an execution plan for ARITHABORT ON, so SQL Server builds a plan for your current values.

The difference in execution came down to the SET ARITHABORT option. For all queries executed in SSMS this option is on, while for queries coming from outside (from the applications) it is off. And it could not be turned on for the applications even with a plain statement:

SET ARITHABORT ON;
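
The reason the same query can have two different plans is that the session SET options are part of the plan cache key. A sketch of how to see this (the set_options attribute encodes ARITHABORT among other flags, so an SSMS plan and an application plan show up as separate cache entries):

-- List cached plans together with the SET options they were compiled under.
SELECT cp.plan_handle,
       pa.value AS set_options,
       t.text   AS sql_text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_plan_attributes(cp.plan_handle) AS pa
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS t
WHERE pa.attribute = 'set_options';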

After that, a crazy idea followed: at the moment of a freeze, clear the procedure cache.
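
The simplest, instance-wide form of that idea is a single command (the per-database variant I eventually scheduled is shown further below):

-- Flush the entire plan cache for the whole instance; every query
-- gets a freshly compiled plan on its next execution.
DBCC FREEPROCCACHE;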

For subsequent manual verification, before running the query in SSMS, you first need to execute:

SET ARITHABORT OFF;

Then the query is executed exactly as if it had come from the application. Whenever a query was taking a long time, I cleared the procedure cache, and that always cured it: before clearing the cache the query could run for 20-30 seconds, and afterwards it took about 0 seconds.
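
Instead of flushing everything, a single plan can also be evicted by its plan_handle - a sketch, where the LIKE filter and the handle are placeholders:

-- Find the plan handle of the problematic query by a text fragment.
SELECT cp.plan_handle, t.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS t
WHERE t.text LIKE N'%distinctive query fragment%';  -- placeholder filter

-- Then pass that handle to DBCC FREEPROCCACHE to evict only that plan:
-- DBCC FREEPROCCACHE (0x06000500A1B2C3...);  -- paste the handle found above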

After that, another experiment was set up: clearing the entire procedure cache of the database every hour via a SQL Server Agent job:

-- Clear the procedure cache for the given database ID
DECLARE @db_id int = DB_ID();  -- the database the job runs in
DBCC FLUSHPROCINDB(@db_id);
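
On SQL Server 2016 and later the same hourly job could use the documented per-database equivalent instead of the undocumented DBCC FLUSHPROCINDB:

-- Documented per-database cache flush (SQL Server 2016+); run in the target database.
ALTER DATABASE SCOPED CONFIGURATION CLEAR PROCEDURE_CACHE;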

After that, all queries simply started executing very quickly (under 0.05 seconds); there were only isolated outliers of 5-10 seconds, which users did not notice as freezes. Moreover, updating statistics did not improve the results, so I removed the statistics update job.

After several months of research, it was established that the isolated hangs occur either when the file cache on the server grows until less than 1 GB of RAM remains free, or when the MS SQL Server service consumes all of the RAM allocated to it (according to Task Manager). The second case, however, happened only twice during the entire study.
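
The memory condition itself can be watched from SQL Server; a sketch of the checks (the 1 GB mark above is what these numbers were compared against):

-- How much physical RAM is left on the server and how Windows rates it.
SELECT total_physical_memory_kb / 1024     AS total_mb,
       available_physical_memory_kb / 1024 AS available_mb,
       system_memory_state_desc
FROM sys.dm_os_sys_memory;

-- How much RAM the SQL Server process itself is currently using.
SELECT physical_memory_in_use_kb / 1024 AS sql_server_in_use_mb
FROM sys.dm_os_process_memory;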

The point is that practically everything gets written to the cache, but the cache is not always released in time. The cache problem was solved with the EmptyStandbyList.exe utility.

This utility is set up in the Task Scheduler to run once every hour. In the more than six months since this work was done, there have been no query slowdowns at any of the sites.

The only thing that remains unclear is the rare case when a single query hangs for 5-10 seconds, about once a month, on a random day and at a random time. There were only four such cases in six months, and at just two sites. At those moments the MS SQL Server service briefly consumes all of the RAM allocated to it.

I tried the steps described in the article, but that solution did not help.

In principle, I could have stopped digging, since users do not notice the hang and everything works; but if anyone has any thoughts on this, please share them - I would be grateful.

I wrote this article to help those who run into similar problems, because I did not find a comprehensive answer on the Internet, and a lot of time was spent studying the problem and solving it.

Sources


» One Two Three Four Five Six Seven Eight
