Entity Framework and Performance
While working on a web portal project, I was looking for ways to improve performance and came across a short article about the micro-ORM Dapper, written by the authors of StackOverflow.com. Their project was originally written in Linq2Sql, and by now all of the performance-critical places have been rewritten using that solution.
The disadvantage of Dapper, as of the other similar solutions I have seen, is that they do little to simplify development: by and large they provide only materialization, hiding the direct work with ADO.NET, while the SQL queries still have to be written by hand.
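To illustrate the point, here is a rough Dapper-style sketch; the table, column, and variable names are purely illustrative and not taken from the project. Materializing rows into objects is a single Query<T> call, but the SQL itself is still hand-written.

// requires: using Dapper; using System.Data.SqlClient;
using (var connection = new SqlConnection(connectionString))
{
    // Dapper maps the result set to policy objects, but the query text is ours to maintain
    var policies = connection.Query<policy>(
        "select policy_id, client_id, active_indicator from policy " +
        "where client_id = @clientId and active_indicator = 1",
        new { clientId = 42 });
}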
The Linq2Entities syntax, by contrast, encourages cleaner code, allowing both testing and reuse. In addition, when the database changes, the compiler will report errors, immediately after the context is updated, in every place where a removed or renamed field is used, and a changed relationship between tables will show up wherever the corresponding navigation properties are used.
But this article is not about how much EF speeds up development, nor about whether it is good practice to have some queries written in LINQ and others directly in SQL. Here I will describe a solution that combines EF entities and Linq2Entities queries on the one hand with the “pure performance” of ADO.NET on the other. But first, a little background. Everyone who has worked on such projects has, I believe, run into the fact that row-by-row calls are very slow. And many have probably tried to optimize this by writing one huge query and squeezing everything they could into it. It works, but it looks frightening: the method is huge, hard to maintain, and impossible to test. The first step of the solution I tried is to materialize all the necessary entities, each with its own separate query.
Let me explain with an example. Suppose you need to display a list of insurance policies; the initial query looks something like this:
int clientId = 42;
var policies = context.Set<policy>().Where(x => x.active_indicator).Where(x => x.client_id == clientId);
Next, to display the necessary information, we need the dependent, or as they are also called, "child" entities.
var coverages = policies.SelectMany(x => x.coverages);
var premiums = coverages.Select(x => x.premium).Where(x => x.premium_type == SomeIntConstant);
Entities connected through navigation properties can also be loaded using Include, but that has its own difficulties; it turned out to be simpler (and faster, more on that later) to do it as in the example above.
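For comparison, a rough sketch of the Include-based alternative, using the same hypothetical entity and property names as in the example above; the whole object graph then comes back in one joined query, which is exactly what becomes hard to tune later:

var policiesWithChildren = context.Set<policy>()
    .Include("coverages.premium")   // eager-load coverages and their premiums in a single query
    .Where(x => x.active_indicator)
    .Where(x => x.client_id == clientId);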
By itself this rework did not give much of a performance gain relative to the initial all-in-one query, but it simplified the code, allowed it to be broken into smaller methods, and made it more readable and familiar-looking.
The performance gain came at the next step. When I launched SQL Server Profiler, I found that two of the 30 queries took 10-15 times longer than the rest. The first of those queries looked like this:
var tasks = workflows.SelectMany(x => x.task)
    .Where(x => types.Contains(x.task_type))
    .GroupBy(x => new { x.workflow_id, x.task_type })
    .Select(x => x.OrderByDescending(y => y.task_id).FirstOrDefault());
As it turned out, EF generates a very poor query for this, and simply by moving the GroupBy from the last position to the first I brought the speed of these queries in line with the rest, getting roughly a 30-35% reduction in the total execution time:
var tasks = context.Set<task>()
    .GroupBy(x => new { x.workflow_id, x.task_type })
    .Select(x => x.OrderByDescending(y => y.task_id).FirstOrDefault())
    .Join(workflows, task => task.workflow_id, workflow => workflow.workflow_id, (task, workflow) => task)
    .Where(x => types.Contains(x.task_type));
Just in case, I will note that the Join in this query is equivalent to the SelectMany in the previous one.
Finding and eliminating such a flaw in the bowels of one huge query is problematic, verging on impossible. Nor is it possible when the child entities are loaded through Include.
Returning to the beginning of the article, to micro-ORMs: I want to say right away that this approach may not be justified in every scenario. In our case, we needed to load a portion of data from the database, perform some conversions and calculations, and send it to the client's browser as JSON.
As a prototype of the solution I tried implementing materialization through PetaPoco and was very impressed with the test result: the difference in materialization time for the target group of queries was 4.6x (756 ms versus 3493 ms). Although it would be more accurate to say that I was disappointed with the performance of EF.
Because of strict StyleCop settings it was not possible to use PetaPoco in the project as-is, and adapting it to the task would have meant digging into it and making changes, so the idea of writing my own solution took shape.
The solution relies on the fact that, when EF generates a query, the field names in the result set match the property names of the objects it generated for the context. Alternatively, you can rely on the order of those fields, which also works. To extract the SQL text and parameters from a query, the ToObjectQuery method is used, and then the ToTraceString method and the Parameters property on the resulting object. This is followed by a simple reading loop taken from MSDN.

The materializers are the highlight of the solution. PetaPoco emits materializer code at runtime, but I decided to generate it with T4 templates. As a basis I took the file that generates the entities for the context by reading the .edmx, reused all of its auxiliary classes, and replaced the code that does the actual generation.
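Putting the pieces together, here is a minimal sketch of such a reading loop. It assumes that IMaterialize<T> is the interface implemented by the generated classes (shown below) and that ToObjectQuery() is the helper mentioned above that exposes the ObjectQuery<T> behind a LINQ query; both names follow the article, the surrounding code is illustrative.

using System;
using System.Collections.Generic;
using System.Data;
using System.Data.SqlClient;
using System.Linq;

public static class QueryExecutor
{
    public static List<T> ExecuteWith<T>(this IQueryable<T> query, string connectionString, IMaterialize<T> materializer)
    {
        var objectQuery = query.ToObjectQuery();   // helper from the solution, not shown here
        var sql = objectQuery.ToTraceString();     // the SQL text EF would send to the server

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            // Copy the parameters EF prepared into the ADO.NET command
            foreach (var parameter in objectQuery.Parameters)
            {
                command.Parameters.AddWithValue("@" + parameter.Name, parameter.Value ?? DBNull.Value);
            }

            connection.Open();
            var result = new List<T>();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    // No reflection here: the generated materializer does the mapping
                    result.Add(materializer.Materialize(reader));
                }
            }

            return result;
        }
    }
}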
An example of a generated class:
public class currencyMaterialize : IMaterialize<currency>, IEqualityComparer<currency>
{
    public currency Materialize(IDataRecord input)
    {
        var output = new currency();
        output.currency_id = (int)input["currency_id"];
        output.currency_code = input["currency_code"] as string;
        output.currency_name = input["currency_name"] as string;
        return output;
    }

    public bool Equals(currency x, currency y)
    {
        return x.currency_id == y.currency_id;
    }

    public int GetHashCode(currency obj)
    {
        return obj.currency_id.GetHashCode();
    }
}
The code that PetaPoco emits is essentially identical to this, which is also confirmed by the identical execution times.
As you can see, the class also implements the IEqualityComparer<T> interface. From this it should already be clear that objects materialized this way can no longer be compared by reference, unlike the objects EF materializes, and this addition was needed in order to be able to run Distinct in memory.
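A small usage note, reusing the hypothetical currency example: since the generated class doubles as an equality comparer, in-memory de-duplication of materialized rows is a single call.

// currencies is assumed to be the List<currency> produced by the reading loop above
var distinctCurrencies = currencies.Distinct(new currencyMaterialize()).ToList();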
I packaged the result of this work as an Item Template and published it in the Visual Studio Gallery; a brief description of how to use it is available there. I would be glad if someone finds the solution useful.