Pitfalls When Using Linked Server

    An interesting project involving task-queue processing came to our company. It had previously been developed by another team. We needed to investigate the problems that arise under heavy load on the queue and, accordingly, fix what we found.

    In short, the project consists of several databases and applications located on different servers. A “task” in this project is a stored procedure or a .NET application, so each task must be executed against a specific database on a specific server.


    All data related to the queue is stored on a dedicated server. The servers where tasks are executed store only metadata, i.e. the procedures, functions, and service data that relate to that server. Accordingly, task data is retrieved by queries over a linked server.


    Why is that?

    1. Convenience. At any time we can point the system at data that now lives on server B.
    2. It was already implemented this way before we took over.

    The following are the two most popular classic ways to process a queue:

    1. Send a notification to the task handler about the presence of a task.
    2. Poll the queue for tasks.

    Initially, the project implemented the second option. To minimize task-processing latency, our application polls the queue every 100–500 ms.

    Actually, there is nothing wrong with this, except for one thing: with this implementation the table is locked over and over again. I will note in advance that the polling query uses row-level locking and reads only unlocked rows.
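    The original query is not shown in the article; a minimal sketch of such a dequeue query (table and column names here are hypothetical) might look like this:

    ```sql
    -- Hypothetical dequeue: claim one pending task, skipping rows locked by other workers
    DECLARE @TaskId int;

    SELECT TOP (1) @TaskId = Id
    FROM dbo.TaskQueue WITH (ROWLOCK, UPDLOCK, READPAST)  -- row locks; READPAST reads only unlocked rows
    WHERE Status = 0   -- 0 = pending (assumed status encoding)
    ORDER BY Id;
    ```

    The READPAST hint is what lets concurrent pollers skip rows already claimed by another worker instead of blocking on them.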


    So, back to the problem. During the analysis I noticed the Batch Requests/sec counter in Activity Monitor. With a small number of tasks in the queue (about 50), its value spiked past 1000, and CPU load rose sharply.

    First thought: move to the first option (sending a notification to the task handler). This approach was implemented using Service Broker and SignalR:

    • Service Broker was used to send notifications about the appearance of a task;
    • SignalR was used to send notifications to task handlers.
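    The article does not include the notification code; as a rough illustration (all object names are hypothetical), a minimal Service Broker setup for such task notifications might look like this:

    ```sql
    -- One-time setup: message type, contract, queue, and service for task notifications
    CREATE MESSAGE TYPE TaskNotification VALIDATION = NONE;
    CREATE CONTRACT TaskContract (TaskNotification SENT BY INITIATOR);
    CREATE QUEUE TaskNotifyQueue;
    CREATE SERVICE TaskNotifyService ON QUEUE TaskNotifyQueue (TaskContract);
    GO

    -- Fired when a task appears (e.g. from a trigger on the queue table)
    DECLARE @h uniqueidentifier;
    BEGIN DIALOG CONVERSATION @h
        FROM SERVICE TaskNotifyService
        TO SERVICE 'TaskNotifyService'
        ON CONTRACT TaskContract
        WITH ENCRYPTION = OFF;
    SEND ON CONVERSATION @h MESSAGE TYPE TaskNotification (N'task ready');
    ```

    A reader process then waits on the queue with `WAITFOR (RECEIVE ...)` and relays the event to the handlers over SignalR.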

    Why SignalR?

    This tool was already used in the project, and the deadlines were tight, so I did not bring in something similar, for example NServiceBus.

    My surprise knew no bounds when this solution did not help. Yes, there was a performance gain, but it did not solve the problem completely. For debugging, a stress test was written that adds more than 500 tasks to the queue.

    Creating this stress test made it possible to find the “root of evil.”

    Analyzing the list of active queries and the performance reports under heavy load revealed some “very interesting queries,” each consisting of a single command:

    fetch api_cursor0000000000000003

    Further analysis revealed that these are queries coming from the linked server. The question immediately arose: does a query like select * from RemoteServer.RemoteDatabase.dbo.RemoteTable where FieldId = Value really generate a request (fetch api_cursor0000000000000003) on RemoteServer? It turns out it does, even when the linked server is another MS SQL Server.

    For a more illustrative example, let's create the table “Test” (the table creation code is available in the appendix to the article) on server “A”, and execute the following query on server “B”:

    select * from dev2.test_db.dbo.test

    where dev2 is our server “A”.
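    The actual DDL is in the appendix; a minimal stand-in consistent with the queries that follow (any column beyond ID is an assumption) might be:

    ```sql
    -- Hypothetical stand-in for the article's Test table on server "A"
    CREATE TABLE dbo.test (
        ID int IDENTITY(1,1) CONSTRAINT PK_test PRIMARY KEY,
        Value nvarchar(100) NOT NULL
    );
    ```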

    On the first execution of this query, we get a log like this in the profiler on server A:

    Part of the log on server A

    The full log is available here.

    And now let's execute a query filtered by ID:

    select * from dev2.test_db.dbo.test where ID = 3

    Profiler log for the second request

    The full log is available here.

    As you can see in the screenshot, the query plan was added to the cache. If you run this query a second time, things look a little better:

    Profiler log after repeated execution

    The full log is available here.

    As we can see, the data is already taken from the cache.

    When the condition changes, we get the same pattern again: the first execution for a given Id behaves like a first run. The bottom line is that with a large number of distinct queries, the cache is not enough, and SQL Server starts issuing a flood of requests against the table, which causes slowdowns. You may ask: “What about indexes?” There are indexes, but even queries filtered on the primary key caused this problem.

    What does Google say about this? A lot, but little of it useful:

    • That queries must be executed by a user belonging to one of the following roles — sysadmin, db_owner, or db_ddladmin — so that remote statistics can be used;
    • That the linked server is configured incorrectly.

    More sensible answers were found in only 3 articles:

    As far as I understand, you cannot configure a linked server so that pull technology is always used to fetch data from it. It all depends on where the query is processed.

    Time was running out, and the only solution that could save us was to rewrite some of the queries in dynamic SQL, that is, to execute the queries on the server where the data is stored.

    There are several ways to work with data on LinkedServer:

    1. Specify the remote server directly as the data source in the query. This implementation has several disadvantages:
      • poor performance;
      • a large amount of data is pulled across the network.

      select * from RemoteServer.RemoteDatabase.dbo.RemoteTable where Id = @Id

    2. Use OPENQUERY. Not suitable for a number of reasons: both the server name and the query text must be string literals, so neither can be passed as a variable.

      select * from OPENQUERY(RemoteServer, 'select * from RemoteDatabase.dbo.RemoteTable')

      The links provide sample logs for the following queries. The queries are executed on server “B”, and the logs are taken from server “A”:

      select * from OPENQUERY(dev2, 'select * from test_db.dbo.test') where id = 26

      select * from OPENQUERY(dev2, 'select * from test_db.dbo.test where ID = 26')
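      Note the difference between the two queries above: in the first, the filter sits outside OPENQUERY, so the remote server returns the whole table and filtering happens locally; in the second, the filter is inside the pass-through text. And since OPENQUERY accepts only literals, parameterizing the second form requires wrapping it in dynamic SQL, roughly like this (a sketch):

      ```sql
      -- Sketch: building an OPENQUERY call with a runtime parameter via dynamic SQL
      DECLARE @Id int = 26;
      DECLARE @sql nvarchar(500) =
          N'select * from OPENQUERY(dev2, ''select * from test_db.dbo.test where ID = '
          + CAST(@Id AS nvarchar(12)) + N''')';
      EXEC (@sql);
      ```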

    3. Run the query on the remote server with EXEC ... AT. Similar to OPENQUERY:

      • you cannot specify the server name as a parameter, since the name is fixed when the procedure is compiled;
      • there are also the problems described in “Dynamic T-SQL and how it can be useful”.

      exec ('select * from RemoteDatabase.dbo.RemoteTable') at RemoteServer

      The links provide sample logs for the following queries:

      exec ('select * from test_db.dbo.test') at dev2

      exec ('select * from test_db.dbo.test where Id = 30') at dev2
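      Unlike OPENQUERY, EXEC ... AT does accept parameters through ? placeholders, which avoids string concatenation for the values (though still not for the server name):

      ```sql
      -- Sketch: parameterized pass-through query; the ? is bound to @Id on the remote side
      DECLARE @Id int = 30;
      EXEC ('select * from test_db.dbo.test where Id = ?', @Id) AT dev2;
      ```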

    4. Execute the query on the remote server by calling its sp_executesql directly.

      DECLARE @C_SP_CMD nvarchar(200) = QUOTENAME(@RemoteServer) + N'.' + QUOTENAME(@RemoteDatabase) + N'.sys.sp_executesql'
      DECLARE @C_SQL_CMD nvarchar(4000) = N'select * from dbo.RemoteTable'
      EXEC @C_SP_CMD @C_SQL_CMD

      Examples of query execution logs using sp_executesql are available via the links:

    The fourth method was used to solve the problem.
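    Putting the fragments above together, a complete call might look like the sketch below. It assumes RPC is enabled for the linked server (the “rpc out” option); both the query text and the parameter are shipped to the remote server, so the whole statement executes at the data source:

    ```sql
    -- Sketch: parameterized remote execution via the remote server's own sp_executesql
    DECLARE @RemoteServer sysname = N'dev2',
            @RemoteDatabase sysname = N'test_db',
            @Id int = 30;

    -- Build the three-part name of the remote sys.sp_executesql
    DECLARE @C_SP_CMD nvarchar(300) =
        QUOTENAME(@RemoteServer) + N'.' + QUOTENAME(@RemoteDatabase) + N'.sys.sp_executesql';
    DECLARE @C_SQL_CMD nvarchar(4000) = N'select * from dbo.test where ID = @Id';

    -- Execution, plan caching, and filtering all happen on dev2
    EXEC @C_SP_CMD @C_SQL_CMD, N'@Id int', @Id = @Id;
    ```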

    Below are graphs of incoming and outgoing traffic on the server hosting the main queue database, before and after switching to sp_executesql. The database size is 200–300 MB.

    Incoming and outgoing traffic over a few days on the server, before using sp_executesql

    Incoming and outgoing traffic after switching to sp_executesql

    The outgoing peaks are backups being copied to NFS.

    The conclusion suggests itself: out of the box, the Microsoft provider for an MS SQL linked server does not execute such queries on the data-source server itself. Therefore, colleagues, let's execute them at the data source ourselves, in order to solve at least part of the performance issues.

    Files for the article.
