WebGL Application Performance
Kirill Dmitrenko (Yandex)
Hello! My name is Kirill Dmitrenko, for the last 4.5 years I have been working in Yandex as a front-end developer. And all this time I have been haunted by panoramas. When I joined the company, I did internal services for panoramas, after that I solved panorama tasks on large Yandex maps, and recently made a panorama web player on Canvas 2D, HTML and WebGL. Today I want to talk with you about the performance of WebGL applications.
To begin with, we will see what WebGL is, then we will discuss how to measure the performance of WebGL applications, and end with a discussion of some optimizations of WebGL applications and how I managed to apply them in the panorama player.
The structure of a WebGL application can be compared in some ways with how our ancestors saw the universe. Well, at least with one of those visions.
That is, it's a kind of world that rests on elephants, which in turn stand on a large turtle. In this picture, the world, where all the logic, all the meaning, everything happens, is our application. And our WebGL application rests on three main types of resources:
- Buffers are such large blocks of bytes where we put our geometry, which we want to show to the user, our models.
- Textures are pictures that we want to overlay on our models so that they look more beautiful, more natural, look like objects of the real world.
- Shaders are small programs that run directly on the GPU and tell the GPU how we want to show our geometry, how we want to overlay our textures on it, etc.
And all these resources live, are born and die within a WebGL context. The WebGL context is the big turtle in our picture: through it we get access to all resources, we create them, and, in addition, it holds a large state that describes how we connect these resources to each other, how we bind together these buffers, shaders and textures, and how we use the results computed by the shaders. In other words, a large state that affects rendering. And, of course, it would be cool if our application worked faster.
But before talking about optimizations, you first need to learn how to measure. It is impossible to improve what we cannot measure. And the simplest tool that allows us to do this is the frame per second counter.
It is available in Chrome, it is not difficult to implement yourself, and with it, in general, everything is clear: if we have many frames per second, that's good; few frames, bad. But we must remember one subtlety: the FPS of an application is limited by the refresh rate of the user's screen. For most users in the wild this is 60 Hz, i.e. 60 frames per second, and therefore it doesn't matter whether we generate our frame in 5 ms or in 10 ms: there will still be 60 frames either way. So the frames-per-second counter should be treated with a certain amount of skepticism. And yet, if we have few frames per second, the application is slow, and we start looking for bottlenecks in it. We start trying to optimize it.
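A minimal sketch of such a counter (the class and its fields are illustrative, not the player's real code): feed it `requestAnimationFrame` timestamps and it reports frames seen over a sliding one-second window.

```javascript
// A simple FPS counter: counts frames whose timestamps fall
// within the last 1000 ms.
class FpsCounter {
  constructor() {
    this.frames = []; // timestamps (ms) of recent frames
  }
  tick(now) {
    this.frames.push(now);
    // drop frames older than one second
    while (this.frames.length && now - this.frames[0] > 1000) {
      this.frames.shift();
    }
    return this.frames.length; // frames seen in the last second
  }
}

// In the browser it would be driven roughly like this:
// const counter = new FpsCounter();
// function loop(now) {
//   const fps = counter.tick(now);
//   requestAnimationFrame(loop);
// }
// requestAnimationFrame(loop);
```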
However, in the case of WebGL applications, the profiler runs into a problem.
The fact is that the WebGL API is partially asynchronous: all draw calls are asynchronous, and some other calls are, too. What does this mean for draw calls, for example? That by the time the call returns control to our JS, the drawing has not finished; it will finish sometime later. Moreover, it may not even have started by the time control returns to us. All a draw call does is generate a command, add it to a buffer, return to us and say: "Go on! Keep going!" Accordingly, such calls will not show up in the profiler, and if rendering is slow, we will not see it there.
But not everything is so bad. Some calls are still synchronous, and some calls are not only synchronous, but also maliciously synchronize our entire context. That is, they not only do the work we assigned to them synchronously; while blocking our application, they also wait for the completion of all the tasks we assigned to the context before that, i.e. all the rendering, all the state changes, and so on. The profiler will help us catch such problems.
And with rendering, another tool called EXT_disjoint_timer_query can help us.
This is an extension for WebGL, and it is still terribly experimental, i.e. it is still a draft extension. It allows us to measure the execution time of our commands on the GPU. What is very cool is that it does this asynchronously: it does not synchronize our context in any unwanted way and adds little runtime overhead, so all the code we write to measure the application's performance can be shipped straight to production, straight to users. We can, for example, collect statistics. Or, even better, based on these measurements, adjust the picture quality to the user's hardware: on smartphones show users one picture, on tablets a better quality one, because the hardware there is usually more powerful, and on desktops, where the hardware is very powerful, show the coolest picture with the coolest effects. But you do need to write some code around this, which is a small minus.
EXT_disjoint_timer_query allows you to measure the execution time of one or several calls directly on the GPU. It also allows you to place precise timestamps in the pipeline, i.e. it lets you measure, for example, how long our GL commands take to travel from our JS to the GPU, i.e. what our pipeline latencies are.
EXT_disjoint_timer_query is also an asynchronous API. It lets you create query objects through which the driver, the WebGL implementation, does the work. You create a query object and say: "Now, I want to put an exact timestamp on this object, or start using it to measure the execution time of a group of commands." After that, you finish building the frame, and usually on the next frame, though possibly later, the measurement results become available to you, in nanoseconds and quite accurate. And these measurements can be used.
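The flow above can be sketched like this, using the WebGL1 draft extension's API (error handling and pooling of query objects are omitted for brevity; the function names are illustrative):

```javascript
// Start measuring GPU time for a group of commands.
function beginGpuTimer(gl, ext) {
  const query = ext.createQueryEXT();
  ext.beginQueryEXT(ext.TIME_ELAPSED_EXT, query);
  return query;
}

// Stop measuring; all commands issued in between are covered.
function endGpuTimer(ext) {
  ext.endQueryEXT(ext.TIME_ELAPSED_EXT);
}

// Poll on a later frame: returns elapsed nanoseconds, or null if the
// result is not ready yet or the measurement was disjoint (invalid).
function pollGpuTimer(gl, ext, query) {
  const available = ext.getQueryObjectEXT(query, ext.QUERY_RESULT_AVAILABLE_EXT);
  const disjoint = gl.getParameter(ext.GPU_DISJOINT_EXT);
  if (!available || disjoint) return null;
  return ext.getQueryObjectEXT(query, ext.QUERY_RESULT_EXT);
}
```

In a real application `ext` would come from `gl.getExtension('EXT_disjoint_timer_query')`, and `pollGpuTimer` would be called on subsequent frames until it returns a number.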
Let's summarize the conclusions about measurement.
We have the frames-per-second counter as a qualitative characteristic of the application, which tells us whether we are slow or not. If we are slow, the JS profiler will help us find CPU bottlenecks in our JS, in some math, in some supporting code. And EXT_disjoint_timer_query will help us find GPU bottlenecks.
OK, we have learned how to measure performance and how to look for bottlenecks; now we need to understand what to do next. Here we will discuss only a few optimizations. Computer graphics is a vast field of hacks, tricks and strange optimizations. We will discuss only some general optimizations and how I applied them in the panorama player.
And the first rule to adopt on the way to a fast WebGL application is to work carefully with the WebGL state. As I said at the very beginning, the WebGL context holds a big chunk of state that affects rendering, and you need to work with it carefully. In particular, there is no need to make get* and read* calls, i.e. calls that request the current state of the context or read data back from the video card, unless we really need them. Why? Because these calls are not only synchronous, they can also cause synchronization of the whole context, thereby slowing down our application. A particular example is the getError() method, which checks whether an error occurred while working with the WebGL context. It should be called only in development, never in production. And you need to minimize state switching and not do it too often.
How can it look in our code?
In particular, it may look something like this: nested loops, where in the outer loops we switch the more expensive states, switching them less often, and in the inner loops we switch the states that are cheaper to switch. In particular, binding framebuffers or switching shader programs is quite an expensive switch; they are almost guaranteed to cause synchronization and generally take quite a while. Switching textures and shader parameters is faster, and can be done more often.
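One way to arrive at that loop structure is to sort the render list so that expensive switches happen as rarely as possible. A sketch, where `programId` and `textureId` are illustrative fields on illustrative render items:

```javascript
// Order draw items so that items sharing a program are adjacent,
// and within a program, items sharing a texture are adjacent.
function sortByStateCost(items) {
  return items.slice().sort((a, b) =>
    a.programId - b.programId || a.textureId - b.textureId);
}

// Helper: how many times a given state changes while walking the list.
function countSwitches(items, key) {
  let switches = 0;
  for (let i = 1; i < items.length; i++) {
    if (items[i][key] !== items[i - 1][key]) switches++;
  }
  return switches;
}
```

After sorting, the render loop naturally becomes "for each program, for each texture, draw", with the minimum number of expensive switches.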
But even in this wonderful design, you can optimize further.
You can make fewer draw calls and do more work in each one. Each draw call adds some overhead for state validation and for sending the data to the video card, and it would be nice to amortize that overhead. And you can amortize it in only one way: by making fewer of these calls, adding less overhead, while doing more work per call. In particular, you can pack several objects, several models that we want to draw, into one data buffer. And if they use textures, put all those textures into one large picture, usually called a texture atlas. Thus we bring several objects to the point where they require the same state for rendering, and we can draw them in one call.
There is a good technique called instancing, which allows you to draw multiple copies of the same object with different parameters in one call. A good example of applying instancing is a particle system. I wanted to show the logo of our wonderful conference drawn with a particle system, i.e. small objects, small models, of which there are a lot; there were 50 thousand of them. In the first case I drew each of them separately, with the naive approach. In the second case I used a special extension.
In general, using extensions is good practice; they often contain very cool features without which some things are difficult or impossible to do. On the other hand, you need to use them carefully, i.e. you should always leave a fallback path in the code that will work without extensions.
I used ANGLE_instanced_arrays there, which implements instancing for WebGL: I put the parameters for all copies of the objects into one large buffer, told WebGL that it was a buffer with per-copy parameters, and hop! It drew them. And there was a beautiful effect: everything went up from 10 to 60 FPS.
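The call sequence looks roughly like this (a sketch, not the player's code: the attribute location, component count and 6-vertices-per-quad geometry are illustrative assumptions, and `ext` is assumed to come from `gl.getExtension('ANGLE_instanced_arrays')`):

```javascript
// Draw many particles in one call: per-instance parameters (e.g. particle
// offsets) are assumed to be packed into instanceBuffer already.
function drawParticlesInstanced(gl, ext, instanceBuffer, attribLoc, instanceCount) {
  gl.bindBuffer(gl.ARRAY_BUFFER, instanceBuffer);
  gl.enableVertexAttribArray(attribLoc);
  gl.vertexAttribPointer(attribLoc, 2, gl.FLOAT, false, 0, 0);
  // advance this attribute once per instance, not once per vertex
  ext.vertexAttribDivisorANGLE(attribLoc, 1);
  // 6 vertices per particle (a quad of two triangles), all instances at once
  ext.drawArraysInstancedANGLE(gl.TRIANGLES, 0, 6, instanceCount);
}
```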
What did we have with this case in the panoramas?
In panoramas, we got the minimization of context switching virtually for free. Why did this happen? Because the task itself was structured that way: in the player we draw first the panorama, then the object labels, and then the controls, the transition arrows and the quick-transition control. And from this wording it follows that we draw identical objects next to each other in the code. On top of that, we applied instancing for markers, i.e. we draw all markers, all object labels, in one draw call. And we do not generate draw calls for sectors of the panorama that are not visible. Why generate a call for something that will not appear on the screen anyway? This technique is commonly called viewport culling or frustum culling.
And here it is important to say that at Yandex we have spherical panoramas, i.e. a picture mapped onto a sphere. When you open the player, you "stand" in the center of this sphere and look out. This is what it looks like if you look at the sphere from the outside rather than from the inside.
This is what the panorama picture coming from the server looks like: it is in an equirectangular projection, and we store it that way:
We also cut it into small sectors to make it convenient to work with in WebGL:
In WebGL you cannot create very large textures.
And for each such sector we can calculate whether it lands on our screen or not. If it does not, we not only exclude it from the render list, we also delete the resources it occupies, i.e. we save memory. As soon as it becomes visible again, we allocate an empty texture for it and reload the image from the browser cache.
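The visibility test for a sector can be sketched as pure angular math (an illustrative simplification, not the player's real check: only the horizontal extent is considered, and a sector counts as visible when its angular span overlaps the camera's field of view):

```javascript
const TWO_PI = 2 * Math.PI;

// Shortest angular distance between two directions on a circle.
function angularDistance(a, b) {
  const d = Math.abs(a - b) % TWO_PI;
  return d > Math.PI ? TWO_PI - d : d;
}

// A sector (center direction + angular width, radians) is visible when it
// overlaps the camera's horizontal field of view around `heading`.
function isSectorVisible(sectorCenter, sectorWidth, heading, fov) {
  return angularDistance(sectorCenter, heading) < (sectorWidth + fov) / 2;
}
```

Sectors failing this test are skipped in the render loop and their textures can be released.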
This is roughly what it looks like in geometry: each sector of the texture corresponds to a sector in the geometry of the sphere, and this is how it is all drawn:
Another big and terrible sin in WebGL applications is Overdraw.
What is overdraw? It is when we compute pixels several times. For example, as in the diagram on the slide: if we first draw the rectangle, then the circle, and then the triangle, it turns out that we computed part of the rectangle, and then part of the circle, in vain. We never see them, yet resources were spent on them. It would be nice to avoid this, and there are various techniques for it.
The first is to detect, as we do with texture sectors, the invisible objects, i.e. objects occluded by other objects, on the CPU, in our JS, and not draw them, not generate draw calls for them. This technique is called occlusion culling. It is not always easy to do in JS; JS is not very fast at math.
Simpler techniques can be applied, for example sorting objects by depth, i.e. drawing them from near to far; then the video card itself can exclude from the computation those pixels it knows for sure will not reach the screen, because they are farther away than the ones already in the frame.
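A sketch of that ordering (the `depth` field is an illustrative camera-space distance; for the GPU to reject hidden pixels the depth test must also be enabled with `gl.enable(gl.DEPTH_TEST)`):

```javascript
// Sort opaque objects front-to-back so the GPU's early depth test
// can discard fragments that are behind already-drawn geometry.
function sortFrontToBack(objects) {
  return objects.slice().sort((a, b) => a.depth - b.depth);
}
```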
And if that is difficult to do, for example because the objects are elongated or intersect intricately, you can do a depth pre-pass: using very simple shaders, fill the depth buffer, thereby telling the video card the distance of each pixel from the screen, and the video card will be able to apply exactly the same optimization as in the previous case.
In panoramas, we ran into overdraw while solving another problem. The thing is that the panorama picture is cut on the server into several sizes, several resolutions, and we want to show the user the size of the picture best suited to the current zoom of the player, the current zoom of the panorama. However, at the same time we want to show the best quality we already have, which is not always the optimal one and may be of lower quality. What does it look like?
First we draw the worst picture quality, the most blurry, the most soapy, because it arrives the fastest. After that, we draw on top of it everything we have in better quality.
And here you can see that we do not have all the sections of the picture; some are missing, they are transparent, and the lower quality shows through them.
And after that we draw the best quality we have, which may also be partially transparent, because some details of the picture have not arrived yet.
And again, we have gaps through which the previous layer is visible:
Thus, in four passes, we generate the panorama picture on the screen for the user. Why did it end up this way? Because, firstly, parts of the texture can be transparent, and it was not at all trivial to calculate which parts are transparent and which lower-quality parts of the texture are covered by parts of higher quality. So we had to suffer like this. On Retina screens it was very slow, because there are a lot of pixels, and it did not work well on mobile either.
How did we get around this? If you look closely at the diagram I just showed, you can see that, generally speaking, this operation can be done once: collapse these layers once into one intermediate buffer, call it a cache, and then only draw that buffer in the frame. This is exactly what we did.
Say poor-quality data has arrived: on the right we have the data, on the left the panorama texture, or rather a piece of it. Good-quality data arrives, and we draw it into the texture. Better data arrives, and we draw it on top of the worse data. Very good data arrives, and we draw it on top as well. We always have a single picture, assembled offscreen from the arriving data.
Of course, everything is not always so good, the network is unpredictable, and the data may arrive in the wrong order. But we solve the problem quite simply.
Again a bad picture arrives, then a better picture arrives, and we draw it on top. And then a picture of lower quality arrives; this one, of course, we do not draw. We keep the best quality we already have. Thus we reduce several panorama rendering passes to one, and this sped us up on various devices, including Retina devices.
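The "never downgrade" rule for out-of-order arrivals can be sketched as a tiny quality tracker (tile ids and quality levels are illustrative; the actual compositing into the cached texture is outside this sketch):

```javascript
// Tracks the best quality already composited for each tile; a newly
// arrived tile is drawn into the cache only if it is at least as good.
function makeTileCache() {
  const qualityByTile = new Map();
  return {
    // returns true if the tile should be drawn into the cached texture
    accept(tileId, quality) {
      const current = qualityByTile.has(tileId) ? qualityByTile.get(tileId) : -1;
      if (quality < current) return false;
      qualityByTile.set(tileId, quality);
      return true;
    },
  };
}
```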
Often our rendering code written in WebGL works very fast; it is capable of delivering 60 fps on smartphones, tablets and desktop computers. But all the code around it that is inevitably present in a web application, the code that updates the DOM, loads resources, processes user events and UI events, starts to slow us down: it holds control for a long time, causing delays in animation, delays in rendering, and slows down our application.
It seems nothing can be done with this code except to break it into small parts: if we are loading 100 images, we do not need to load all 100 at once; we can load them in batches of ten, or five, whatever is convenient. And write a scheduler that will run this code only when there is actually time for it, when it does not slow down the animation.
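Such a scheduler can be sketched as a queue drained only while no animation is in progress and only within a per-frame time budget (the 8 ms budget and the function names are illustrative; the clock is injectable to keep the sketch testable):

```javascript
// Runs queued chunks of work during idle frames, yielding back
// before exceeding a time budget.
function makeIdleScheduler(budgetMs = 8, now = () => Date.now()) {
  const queue = [];
  let animating = false;
  return {
    schedule(task) { queue.push(task); },
    setAnimating(flag) { animating = flag; },
    // call once per frame (e.g. from requestAnimationFrame)
    run() {
      if (animating) return 0; // never compete with an animation
      const start = now();
      let done = 0;
      while (queue.length && now() - start < budgetMs) {
        queue.shift()();
        done++;
      }
      return done; // tasks executed this frame
    },
  };
}
```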
For us in panoramas, such code was the loading of tiles and the eviction of invisible parts of the panorama, which I talked about at the very beginning. We did it very simply: we never do this while the panorama is moving, because if the picture is static, it does not matter what fps we have, how many frames per second we show; it can be 1 or 60, the user will not notice the difference, the picture does not move, nothing changes on the screen. So we wait until the animation ends, then recompute the visible tiles, the visible sectors and textures, create requests and manage resources.
In addition, the events of the player's API became a source of problems. The player is a widget embedded in the big Yandex.Maps, and the maps listen to some of its external events in order to show some of their own additional interface elements.
With API events we did something pretty simple. Firstly, we throttled them: the events that have to be generated constantly, for example while the user's gaze direction in the panorama is changing, we throttled, i.e. significantly reduced the frequency of their generation. But that was not enough; for example, the browser's History API created a lot of problems for us here. We tried to update the link in the address bar on every gaze-direction change event, but it slowed everything down terribly. So a special event was made that tells the external code that the panorama has stopped and hard work can now be done. The same scheme, by the way, is used in the Yandex.Maps API: there are special events during which you can do heavier work.
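A minimal throttle helper along those lines (a generic sketch, not the player's implementation; the clock is injectable so the sketch stays testable):

```javascript
// Returns a wrapper that invokes fn at most once per intervalMs.
function throttle(fn, intervalMs, now = () => Date.now()) {
  let last = -Infinity;
  return (...args) => {
    const t = now();
    if (t - last >= intervalMs) {
      last = t;
      fn(...args);
    }
  };
}
```

A gaze-direction handler wrapped this way fires at a bounded rate no matter how often the underlying event is generated.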
Let's draw conclusions about optimization.
The first and most important thing: you need to work carefully with the state, because malicious synchronization can occur in the most unexpected place and slow down our application. Do not make get* and read* calls unless there is an urgent need to read data from the video card, and do not call getError() in production. That is the most important thing.
Make fewer draw calls and do more work in each of them, i.e. amortize the overhead they add.
Avoid overdraw with a wide variety of techniques, any that suit your application.
And write a scheduler for the code around the rendering, i.e. for the code that supports rendering, so that it does not hold control for a long time and does not slow things down.
This report is a transcript of one of the best talks at HighLoad++, the conference for developers of high-load systems.
Yes, for two years now we have been holding a section called “Front End Performance”.
All HighLoad++ videos are published on our YouTube account, but we have not yet gotten around to organizing the whole section into a separate playlist :(
This year, four talks with the WebGL tag were submitted to Frontend Conf, the conference for frontend developers:
- Do-it-yourself interactive 3D maps / Alexander Amosov (Avito);
- Porting an existing web application into virtual reality / Denis Radin (Liberty Global);
- Components on GLSL shaders to control every pixel of a web application without losing performance, or the most technologically advanced spinner in the browser: how and why was it created? / Denis Radin (Liberty Global);
- Web dive, or virtual reality with WebVR / Tatyana Kuznetsova (DevExpress).
Interested? Then we are waiting for you at the conference! Though admission is paid.
Or join for free on the broadcast, which we are organizing together with Habr :). It will carry not Frontend Conf itself, but the best talks from the entire festival.