GPU, hexagonal accelerators and linear algebra

    All these words are much more connected to mobile development than it seems at first glance: hexagonal accelerators already help run neural networks on mobile devices; linear algebra and calculus come in handy for getting a job at Apple; and GPU programming not only lets you speed up applications, it also teaches you to see the essence of things.

    In any case, that is what Andrey Volodin, head of mobile development at Prisma, says. A productive episode of AppsCast is out, in which he also talks about how ideas flow into mobile development from GameDev, how the paradigms differ, why Android has no native blur, and much more. Under the cut, we discuss Andrey's AppsConf talk without spoilers.



    AppsCast is a podcast dedicated to AppsConf, the conference for mobile developers. Each episode features a new guest, and each guest is a conference speaker with whom we discuss their talk and related topics. The podcast is hosted by AppsConf program committee members Alexey Kudryavtsev and Daniil Popov.

    Alexey Kudryavtsev: Hello everyone! Andrey, please tell us about your experience.

    Andrey Volodin: At Prisma we develop products mainly related to photo and video processing. Our flagship app is Prisma, and now we are building another app, Lensa, with Facetune-like functionality.

    I lead mobile development, but my background is in GameDev. The whole core part is on me: I write the GPU pipelines for all these applications and develop the core frameworks so that the algorithms and neural networks built by the R&D team run on mobile devices in real time. In short, to kill server-side computing and all that.

    Alexey Kudryavtsev: That doesn't sound like regular iOS development.

    Andrey Volodin: Yes, that is my specialty: I write Swift every day, yet it is very far from what is usually considered iOS development.

    Daniil Popov: You mentioned GPU pipelines. What are they, exactly?

    Andrey Volodin: When you build photo editors, you still need to design the architecture and decompose the logic, because the application has different tools. For example, Lensa has a bokeh tool that blurs the background using a neural network, and a retouching tool that makes a person look better. All of this has to run efficiently on the GPU. Moreover, it is best not to shuttle data between the processor and the video card on every step, but to build up a set of operations in advance, execute them in a single run, and show the user the final result.

    GPU pipelines are the small building blocks from which instructions for the video card are assembled. The card then executes them very quickly and efficiently, and you take the result once at the end, not after each tool. I make sure our GPU pipelines are as fast and efficient as possible, and that they exist at all.
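
    To make the idea concrete, here is a minimal sketch (an illustration, not Prisma's actual code) of two filter stages encoded into a single Metal command buffer, so the intermediate image never leaves the video card. The source, temp, and result textures are assumed to be created elsewhere.

```swift
import Metal
import MetalPerformanceShaders

// Two GPU stages, one command buffer: the "take the result at a time" idea.
func encodePipeline(device: MTLDevice, queue: MTLCommandQueue,
                    source: MTLTexture, temp: MTLTexture, result: MTLTexture) {
    guard let commandBuffer = queue.makeCommandBuffer() else { return }

    // Stage 1: blur the image (think of the bokeh tool's background pass).
    let blur = MPSImageGaussianBlur(device: device, sigma: 8.0)
    blur.encode(commandBuffer: commandBuffer,
                sourceTexture: source, destinationTexture: temp)

    // Stage 2: a second filter chained directly on the intermediate texture.
    let edges = MPSImageSobel(device: device)
    edges.encode(commandBuffer: commandBuffer,
                 sourceTexture: temp, destinationTexture: result)

    // One commit for the whole chain: the GPU runs both stages back to back,
    // and the CPU reads the final result once, not after every tool.
    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()
}
```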

    Alexey Kudryavtsev: Tell me, how did you come to this? A regular iOS developer starts out hammering together screens, then calls some API and is happy. How did it happen that you do something completely different?

    Andrey Volodin: For the most part it is a coincidence. Before this job I made games for iOS. That had always interested me, but I understood that in Russia there was nowhere in particular to grow in that direction. It so happened that Prisma and I found each other. They needed an iOS developer who could write Swift and also knew the GPU, in particular Metal, which had just come out, and I fit that description exactly.

    I applied for the vacancy, we hit it off, and for the third year now I have been going deeper and deeper into this field. If something goes wrong now, I will have to start from scratch with all those VIPERs and MVVMs; I don't even know what the acronyms stand for.

    What does a GPU Engineer do


    Daniil Popov: Your AppsConf profile says GPU Engineer. What does a GPU Engineer do most of the day, besides drinking coffee?

    Andrey Volodin: Here it is worth mentioning how a processor fundamentally differs from a GPU. The processor executes operations essentially sequentially. Even the multithreading we have is often fake: the processor stops and switches between small pieces of different tasks, executing them in time slices. The GPU works in exactly the opposite way: there are n processors that genuinely work in parallel, so there is parallelism both across processors and within the GPU.

    My main job, besides commonplace things like optimizing memory operations and organizing code reuse, is porting algorithms written for the CPU to the video card so that they parallelize. This is not always a trivial task, because there are very efficient algorithms that are completely tied to the sequential execution of instructions. My job is to come up with, for example, an approximation of such an algorithm that does something perhaps not exactly the same, but whose result cannot be visually distinguished. That way we can get a 100x speedup while sacrificing a little quality.

    I also port neural networks. By the way, we will soon make a major open-source release. Even before Core ML appeared we had our own counterpart, and we have finally decided to open-source it. Its paradigm is slightly different from Core ML's. Among other things, I develop its core.

    In general, I do everything around Computer Vision algorithms and computation.

    Alexey Kudryavtsev: An interesting announcement.

    Andrey Volodin: It is no secret; we will not announce it with any particular fanfare. It will simply become possible to see an example of the frameworks used inside Prisma.

    Why optimize for GPU


    Alexey Kudryavtsev: Tell me, please, why do we optimize algorithms for the GPU at all? It might seem enough to add processor cores or to optimize the algorithm itself. Why specifically the GPU?

    Andrey Volodin: Running on the GPU can speed algorithms up tremendously. For example, we have neural networks that take 30 seconds on the Samsung S10's central processor, and on the GPU take one frame, i.e. 1/60 of a second. This changes the user experience beyond recognition: there is no eternal loading screen, you can see the algorithm's result on a live video stream, or drag a slider and see the effect right there.

    It is not at all that we are too cool to write for the CPU and therefore rewrite everything for the GPU. Using the GPU has a transparent goal: speeding things up.

    Alexey Kudryavtsev: The GPU handles similar operations well in parallel. Do you happen to have exactly that kind of operations, and is that why you manage to achieve such results?

    Andrey Volodin: Yes, the main difficulty is not writing the code but designing algorithms that transfer well to the GPU. This is not always trivial. Sometimes you figure out how to do everything nicely, but it requires too many synchronization points. For example, if everything writes into a single property, that is a clear sign the algorithm will parallelize poorly: when many threads write to one place, they all have to synchronize on it. Our task is to approximate algorithms so that they parallelize well.
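
    What such a synchronization point looks like in code (a sketch in Metal Shading Language, kept here as a Swift string purely for illustration):

```swift
let kernels = """
#include <metal_stdlib>
using namespace metal;

// Parallelizes poorly: every thread adds into ONE location, so all of
// them serialize on the atomic. This is the "everything writes to one
// property" situation.
kernel void sum_all(device atomic_uint *total  [[buffer(0)]],
                    device const uint  *input  [[buffer(1)]],
                    uint tid [[thread_position_in_grid]]) {
    atomic_fetch_add_explicit(total, input[tid], memory_order_relaxed);
}

// Parallelizes well: each thread owns its own output slot, so no
// synchronization is needed at all.
kernel void square_all(device uint       *output [[buffer(0)]],
                       device const uint *input  [[buffer(1)]],
                       uint tid [[thread_position_in_grid]]) {
    output[tid] = input[tid] * input[tid];
}
"""
```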

    Alexey Kudryavtsev: To me, as a mobile developer, this sounds like rocket science.

    Andrey Volodin: Actually, it is not that hard. For me, rocket science is VIPER.

    Third chip


    Daniil Popov: At the last Google I/O they seem to have announced hardware for TensorFlow and the like. When will a third chip finally appear in mobile phones, a TPU or whatever it ends up being called, that does all the ML magic on the device?

    Andrey Volodin: We have that very thing; it connects over USB, and you can run Google's neural networks on it. Huawei already has this too: we even wrote software for their hexagonal accelerators so that segmentation networks would run fast on the P20.

    I should say that iPhones actually have this already. For example, the latest iPhone XS has a coprocessor called the NPU (Neural Processing Unit), but so far only Apple has access to it. This coprocessor already outperforms the GPU in the iPhone: some Core ML models use the NPU and are therefore faster than bare Metal.

    This is significant, considering that on top of the raw neural network inference, Core ML requires a lot of extra work. First you convert the input data into Core ML's format, it processes the data and returns the result in its own format, you convert that back, and only then show it to the user. All of this takes real time. We write overhead-free pipelines that run on the GPU from start to finish, and yet Core ML models are sometimes faster purely because of that hardware.
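
    Roughly, those extra steps look like this (a sketch; SegmentationModel and the toPixelBuffer helper are hypothetical names, not our real pipeline):

```swift
import CoreML
import UIKit

// The overhead around Core ML inference: format in, inference, format out.
func runSegmentation(on image: UIImage) throws -> CVPixelBuffer? {
    let model = try SegmentationModel(configuration: MLModelConfiguration())

    // 1. Convert the input into Core ML's format (a CVPixelBuffer).
    //    toPixelBuffer(width:height:) is a hypothetical helper.
    guard let input = image.toPixelBuffer(width: 512, height: 512) else { return nil }

    // 2. The inference itself: the only part the NPU actually accelerates.
    let output = try model.prediction(image: input)

    // 3. Convert the result back from Core ML's format before showing it.
    return output.mask
}
```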

    Most likely, at WWDC in June they will show a framework for working with the NPU.

    So, as you said, the devices already exist; developers just cannot use them in full yet. My hypothesis is that the companies themselves do not yet understand how to expose this cleanly in the form of a framework. Or they simply do not want to give it away, to keep a market advantage.

    Alexey Kudryavtsev: As I recall, it was the same story with the fingerprint scanner in the iPhone.

    Andrey Volodin: It is not exactly fully accessible even now. You can use it at a high level, but you cannot get the fingerprint itself; you can only ask Apple to authenticate the user with it. It is still not full access to the scanner.

    Hexagonal Accelerators


    Daniil Popov: You mentioned the term hexagonal accelerators. I do not think everyone knows what that is.

    Andrey Volodin: It is just a piece of hardware architecture that Huawei uses, and a rather sophisticated one. Few people know this, but some Huawei phones contain these processors and do not use them, because they have a hardware bug. Huawei released them and then found the problem, so in some phones these special chips are dead weight. In fresh revisions everything works.

    In programming there is the SIMD (Single Instruction, Multiple Data) paradigm, where the same instruction executes in parallel on different data. The chip is designed so that it can process an operation on several data streams at once; hexagonal specifically means it works on six elements in parallel.
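
    The same paradigm is visible even from Swift's simd module: one operation, several lanes of data at once (four here, where the accelerator would have six).

```swift
import simd

let a = SIMD4<Float>(1, 2, 3, 4)
let b = SIMD4<Float>(10, 20, 30, 40)

let sum = a + b       // one "instruction", four additions: (11, 22, 33, 44)
let scaled = sum * 2  // the scalar is broadcast across all four lanes
print(sum, scaled)
```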

    Alexey Kudryavtsev: I thought the GPU works exactly like that: it vectorizes a task and performs the same operation on different data. What's the difference?

    Andrey Volodin: The GPU is more general-purpose. Although GPU programming is fairly low-level, compared to working with coprocessors it is quite high-level. For programming the GPU you use a C-like language, and on iOS that code is still compiled by LLVM into machine instructions anyway. Code for coprocessors is most often written truly hardcore: in assembler, straight machine instructions. That is why the performance gain there is much more noticeable; they are tailored to specific operations. You cannot compute just anything on them, only what they were originally designed for.

    Alexey Kudryavtsev: And what are they usually designed for?

    Andrey Volodin: These days, mainly for the most common operations in neural networks: convolutions and certain intermediate activations. They have hard-wired functionality that runs super fast. So on some tasks they are much faster than the GPU, and to everything else they simply do not apply.

    Alexey Kudryavtsev: That sounds like the DSP processors that used to be used for audio; all the plugins and effects ran on them very quickly. Special expensive hardware was sold, but then general processors grew up, and now we record and process podcasts right on laptops.

    Andrey Volodin: Yes, about the same.

    GPU not only for graphics


    Daniil Popov: Do I understand correctly that nowadays the GPU can process data that has no direct relation to graphics? It turns out the GPU is losing its original purpose.

    Andrey Volodin: Exactly; I often talk about this at conferences. NVIDIA was first, introducing CUDA, a technology that makes GPGPU (General-Purpose computing on Graphics Processing Units) simpler. You write in a superset of C++, and your algorithms are parallelized on the GPU.

    But people did this even earlier. For example, craftsmen on OpenGL, or on the even older DirectX, simply wrote data into a texture, with each pixel interpreted as data: the first 4 bytes in the first pixel, the next 4 bytes in the second. They processed the textures, then extracted the data back out of the texture and interpreted it. It was very hacky and complicated. Now video cards support general-purpose logic: you can feed the GPU any buffer, describe your own structures, even a hierarchy of structures that reference each other, compute something, and return it to the processor.
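
    A minimal sketch of that general-purpose path on iOS with Metal (the kernel is compiled from a source string; thread-count handling is simplified):

```swift
import Metal

// Doubles every float in a plain data buffer on the GPU: no textures involved.
func doubleOnGPU(_ values: [Float]) throws -> [Float] {
    let source = """
    #include <metal_stdlib>
    using namespace metal;
    kernel void doubler(device float *data [[buffer(0)]],
                        uint tid [[thread_position_in_grid]]) {
        data[tid] *= 2.0;
    }
    """

    let device = MTLCreateSystemDefaultDevice()!
    let library = try device.makeLibrary(source: source, options: nil)
    let pipeline = try device.makeComputePipelineState(function: library.makeFunction(name: "doubler")!)
    let buffer = device.makeBuffer(bytes: values,
                                   length: values.count * MemoryLayout<Float>.stride,
                                   options: [])!

    let commands = device.makeCommandQueue()!.makeCommandBuffer()!
    let encoder = commands.makeComputeCommandEncoder()!
    encoder.setComputePipelineState(pipeline)
    encoder.setBuffer(buffer, offset: 0, index: 0)
    let perGroup = min(values.count, pipeline.maxTotalThreadsPerThreadgroup)
    encoder.dispatchThreads(MTLSize(width: values.count, height: 1, depth: 1),
                            threadsPerThreadgroup: MTLSize(width: perGroup, height: 1, depth: 1))
    encoder.endEncoding()
    commands.commit()
    commands.waitUntilCompleted()

    // The computed data comes back to the CPU as ordinary floats.
    let pointer = buffer.contents().bindMemory(to: Float.self, capacity: values.count)
    return Array(UnsafeBufferPointer(start: pointer, count: values.count))
}
```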

    Daniil Popov: So we could say that the GPU is now a Data PU.

    Andrey Volodin: Yes; sometimes the GPU spends less of its time on graphics than on general-purpose computation.

    Alexey Kudryavtsev: The architectures of the CPU and the GPU are essentially different, yet you can compute on both.

    Andrey Volodin: Indeed, in some things the CPU is faster, in others the GPU. You cannot say the GPU is always faster.

    Daniil Popov: As far as I remember, if the task is to compute something with a lot of branching, it can be much faster on the CPU.

    Andrey Volodin: It also depends on the amount of data. There is always the overhead of transferring data from the CPU to the GPU and back. If you are processing, say, a million elements, using the GPU is usually justified. But a thousand elements can be computed on the CPU faster than they can even be copied to the graphics card. So you must always choose per task.
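
    As a toy back-of-the-envelope model of that trade-off (every constant here is an invented assumption, not a measurement):

```swift
// The GPU has a fixed dispatch/sync cost plus per-element copy costs, so
// below some element count the CPU wins even if the GPU computed for free.
func worthOffloading(elements: Int,
                     gpuFixedOverhead: Double = 1e-3,  // dispatch + sync, seconds
                     copyPerElement: Double = 1e-9,    // to the GPU and back
                     cpuPerElement: Double = 5e-8) -> Bool {
    let gpuTime = gpuFixedOverhead + 2 * copyPerElement * Double(elements)
    let cpuTime = cpuPerElement * Double(elements)
    return gpuTime < cpuTime
}

worthOffloading(elements: 1_000)      // false: copying alone dominates
worthOffloading(elements: 1_000_000)  // true: the parallel win pays for transfer
```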

    By the way, Core ML does this. According to Apple, Core ML can choose at runtime where a model will run faster: on the processor or on the video card. I do not know whether this works in reality, but they say it does.

    Hardcore GPU Engineer knowledge for a mobile developer


    Alexey Kudryavtsev: Let's get back to mobile development. You are a GPU Engineer with tons of hardcore knowledge. How can a mobile developer apply this knowledge? For example, what do you see in UIKit that others do not see?

    Andrey Volodin: I will talk about this in detail at AppsConf. The knowledge applies in many places. When I see, for example, how a UIKit API works, I can immediately understand why it was done that way. Observing a performance drop when rendering certain views, I can understand the reason, because I know how the renderer is written inside. I understand that to display an effect like Gaussian blur over the framebuffer, you first need to cache the entire texture, apply a heavy blur operation to it, return the result, finish rendering the rest of the views, and only then present it on screen. All of this must fit into 1/60 of a second, otherwise it will stutter.

    To me it is completely obvious why this takes so long, but to my colleagues it is not. That is why I want to share the design tricks we often use in GameDev, and my insights into how I look at problems and try to solve them. It will be an experiment, but I think it should be interesting.

    Why Android doesn't have native blur


    Daniil Popov: You mentioned blur, and I have a question that worries, I think, all Android developers: why does iOS have native blur and Android does not?

    Andrey Volodin: I think it is because of architecture. Apple platforms use a tiled rendering architecture (Tiled Shading). With this approach, the frame is not rendered all at once, but in small tiles: squares, parts of the screen. This lets you optimize the algorithm, because the main performance gain on the GPU comes from efficient use of the cache. On iOS the frame is often rendered so that the full framebuffer never takes up memory at all. For example, on the iPhone 7 Plus the resolution is 1920×1080, which is about 2 million pixels. Multiply by 4 bytes per pixel and you get roughly 8 MB per frame: 8 MB simply to store the system framebuffer.

    The Tiled Shading approach breaks this buffer into small pieces and renders them bit by bit. This greatly increases the number of cache hits, because to blur you need to read the already drawn pixels and compute a Gaussian-weighted sum over them. If you read across the entire frame, the cache hit rate will be very low, because each thread reads from different places. But if you read small tiles, the cache hit rate will be very high, and so will the performance.

    So it seems to me that the lack of native blur in Android is connected with architectural features. Although maybe it is a product decision.

    Daniil Popov: Android has RenderScript for this, but there you have to blur, draw, and embed everything by hand. That is much more complicated than ticking one checkbox on iOS.

    Andrey Volodin: Most likely, the performance is lower too.

    Daniil Popov: Yes, and to satisfy the designers' wishes we have to downscale the picture, blur it, and then upscale it back, just to save performance somehow.

    Andrey Volodin: By the way, you can play various tricks with that. A Gaussian blur gathers pixels in a blurred circle, and the Gaussian sigma determines how many pixels you gather. A common optimization is to downscale the picture and shrink the sigma proportionally; when you return to the original scale there is no visible difference, because sigma directly depends on the size of the picture. We often use this trick internally to speed up blur.
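
    A sketch of that trick with Core Image (the 0.5 scale factor is an arbitrary example):

```swift
import CoreImage

// Blur a half-size image with half the sigma, then scale back up: visually
// close to the full-size blur, because sigma scales with the image size.
func fastBlur(_ input: CIImage, sigma: Double, scale: CGFloat = 0.5) -> CIImage {
    let small = input.transformed(by: CGAffineTransform(scaleX: scale, y: scale))
    let blurred = small.applyingFilter("CIGaussianBlur",
                                       parameters: [kCIInputRadiusKey: sigma * Double(scale)])
    return blurred.transformed(by: CGAffineTransform(scaleX: 1 / scale, y: 1 / scale))
}
```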

    Daniil Popov: Still, RenderScript on Android does not let you use a radius greater than 30.

    Andrey Volodin: Honestly, a radius of 30 is already a lot. Then again, I understand why: gathering 30 pixels per thread on the GPU is very expensive.

    What are the similarities between mobile development and GameDev


    Alexey Kudryavtsev: In the abstract for your talk you say that mobile development and GameDev have a lot in common. Tell us a little: what exactly?

    Andrey Volodin: The architecture of UIKit is very reminiscent of game engines, and old ones at that. Modern engines have moved toward the Entity Component System; that will also be in the talk. It is reaching UIKit too: there are articles about designing views with components. But it was invented in GameDev; a component system was first used in the game Thief in 1998.

    Take Cocos2d, which I worked on for a long time: the ideas used in its first implementation and in UIKit are fundamentally very similar. Both use a scene graph, a tree of nodes where each node has child nodes, and rendering happens by accumulating affine transforms, which on iOS are specifically called CGAffineTransform. Those are just matrices that are multiplied together to change the coordinate system. Animation is done in roughly the same way everywhere too.
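
    The scene-graph idea itself fits in a few lines (a sketch; the drawing is omitted):

```swift
import CoreGraphics

final class Node {
    var local = CGAffineTransform.identity  // this node's own transform
    var children: [Node] = []

    // Rendering is a tree walk that accumulates transforms, which is what
    // both game engines and the UIKit view hierarchy do.
    func render(parent: CGAffineTransform = .identity) {
        let world = local.concatenating(parent)
        // draw(self, world) would happen here
        for child in children { child.render(parent: world) }
    }
}
```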

    In game engines and in UIKit alike, everything is built on interpolation over time. We just interpolate values between frames, whether colors or positions. The optimizations are the same: in GameDev it is customary not to do extra work, and UIKit has setNeedsLayout and layoutIfNeeded for exactly that.
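
    The interpolation itself is one line, whatever is being animated:

```swift
import CoreGraphics

// The same lerp drives positions, colors, opacity: anything animatable.
func lerp(_ a: CGFloat, _ b: CGFloat, _ t: CGFloat) -> CGFloat {
    a + (b - a) * t
}

// A one-second slide from x = 0 to x = 100 at 60 fps:
// on frame 30, lerp(0, 100, 30.0 / 60.0) == 50.
```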

    I draw these parallels for myself constantly, between what I once built and what I see in Apple's frameworks. I will talk about this at AppsConf.

    Daniil Popov: Indeed, the Cocos2d API is similar to iOS's UI API. Do you think the developers were inspired by each other's work, or did it just turn out that way architecturally?

    Andrey Volodin: I think there was some inspiration. Cocos2d appeared around 2008-2009, when UIKit was not yet the UIKit we know now. It seems to me that some techniques were deliberately echoed there, so that people would feel comfortable and could draw parallels.

    It is funny how the pendulum swung: at first the Cocos2d core team borrowed a few of Apple's ideas, and then Apple thoroughly copied Cocos2d, right down to the architectural decisions. SpriteKit is essentially a complete copy of the ideas that appeared in Cocos2d. In that sense, Apple took back what it was owed.

    Alexey Kudryavtsev: It seems to me the same tricks as in UIKit existed back in 2009 on macOS, which has been around since ancient times. There are the same setNeedsLayout and layoutIfNeeded, the same affine transforms.

    Andrey Volodin: Of course, but GameDev has been around even longer than macOS.

    Alexey Kudryavtsev: Can't argue with that!

    Andrey Volodin: That is why I am not so much comparing Cocos2d with Apple's frameworks as considering, in principle, the paradigms that originated in GameDev. It was in GameDev that people first realized that inheritance is bad. While the whole world was still enthusiastic about OOP, GameDev had already begun to see that inheritance causes problems, and came up with components. Mobile development as an industry has only arrived at this now.

    Alexey Kudryavtsev: It seems Alan Kay realized long ago that inheritance is bad.

    Andrey Volodin: Yes, but on the whole you must admit that a few years ago everyone was saying OOP is cool. Now Swift has Protocol-Oriented Programming, functional ideas; everyone keeps coming up with something new. In GameDev these moods appeared long ago.

    Alexey Kudryavtsev: A small remark: Alan Kay is the very person who invented OOP. He said that he did not invent inheritance, only message passing, and that in general he was misunderstood.

    Differences between mobile development and GameDev


    Alexey Kudryavtsev: Now tell us about the differences: how are GameDev and mobile development fundamentally different, and what from GameDev can we not use?

    Andrey Volodin: It seems to me that the fundamental difference is that product development is as lazy as possible. We try to write code on the principle of "don't get up until you're asked": until a callback fires, we do nothing. Even rendering in product development is lazy: not the entire frame is redrawn, only the parts that have changed.

    GameDev development is merciless in this sense. Everything is done every frame: 30 or 60 times per second the whole scene is redrawn from scratch, every object is updated, physics is simulated each frame. A lot of things happen, and this changes the paradigm very much. You start living inside a single frame; I have devoted a whole part of the talk to this. You need to fit absolutely everything into 1/60 or 1/30 of a second. So you start to get inventive: do as many preliminary calculations as possible, parallelize, and while the GPU renders one frame, prepare the next on the CPU. That is why games drain the battery much faster than ordinary applications.
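
    In skeleton form, that merciless loop looks like this (a sketch: a real engine drives it from the display link, and the three functions are placeholders):

```swift
import Foundation

func updatePhysics(dt: TimeInterval) { /* simulate every body, every frame */ }
func updateAnimations(dt: TimeInterval) { /* advance every animation */ }
func renderScene() { /* redraw the whole scene from scratch */ }

let frameBudget = 1.0 / 60.0
var lastTime = Date()

while true {
    let now = Date()
    let dt = now.timeIntervalSince(lastTime)
    lastTime = now

    updatePhysics(dt: dt)       // runs even if "nothing happened"
    updateAnimations(dt: dt)
    renderScene()

    // Whatever remains of the 1/60 s budget is all the headroom there is.
    Thread.sleep(forTimeInterval: max(0, frameBudget - dt))
}
```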

    Alexey Kudryavtsev: And why can't games be just as lazy?

    Andrey Volodin: The nature of games mostly does not allow it. Some games could definitely benefit from it, for example Tetris, where there is little dynamics and only some parts change. But overall, a game is a very complex thing. Even when the character is just standing, he sways, for example: animations play, some logic runs, physics is calculated. Each frame changes so much that reusing fragments becomes almost impossible, so economizing can do more harm than good.

    In addition, there are hardware restrictions. For example, the GPU works better with the float type than with the double type, so precision is much lower; because of this, if you redraw only part of the screen, noticeable artifacts may occur. On the CPU the precision is high, since everything is rendered there in double precision, so you can have beautiful fonts and neat curves, but on the GPU there will always be some approximation.
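
    The float-versus-double gap is easy to see even from Swift (ordinary 32-bit arithmetic, not GPU code):

```swift
let f: Float = 16_777_217    // 2^24 + 1 does not fit in float's 24-bit mantissa
print(f)                     // prints 1.6777216e+07: quietly rounded
let d: Double = 16_777_217
print(d)                     // prints 16777217.0: double still holds it exactly
```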

    The combination of these factors means that every frame requires heavy calculations and updating all objects: in effect, drawing from scratch.

    Classic development is much closer to GameDev than you think


    Daniil Popov: I want to discuss the provocative statement from your upcoming talk that "classic development is much closer to GameDev than you think." I immediately remembered a series of articles about crutches in games, hacks designed to speed up development when time was running out. Those articles give the impression that GameDev is crutch upon crutch for the sake of optimization. In regular development everyone is now obsessed with architecture and beautiful code. I cannot square that with GameDev.

    Andrey Volodin: Of course, enterprise companies do not work like that, but in indie GameDev it is roughly so. This particular thesis, though, is about something else. I often notice that developers use many concepts that come from GameDev without even realizing it.

    For example, affine transforms. Few people can say plainly that this is just matrix multiplication. More often, CGAffineTransform is an opaque data structure that stores something, and it is unclear how it makes a view scale.
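
    Making it less opaque takes a couple of lines (a sketch):

```swift
import CoreGraphics

// The "opaque" struct is just the six numbers of a 2D transform matrix:
//   x' = a*x + c*y + tx,   y' = b*x + d*y + ty
let scale = CGAffineTransform(scaleX: 2, y: 3)
let manual = CGAffineTransform(a: 2, b: 0, c: 0, d: 3, tx: 0, ty: 0)

let p = CGPoint(x: 10, y: 10)
print(p.applying(scale))   // (20.0, 30.0)
print(p.applying(manual))  // the same: scaling lives on the matrix diagonal
```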

    In the talk I will try to show the other side of the things we use every day but, perhaps, do not fully understand.

    About the benefits of mathematics


    Alexey Kudryavtsev: How can a mobile developer come to this understanding? How do you figure out what is under the hood of UIKit's rendering and how affine transforms are arranged inside, instead of getting scared once again? I understand that it is a matrix, but which specific number is responsible for what, I cannot say. Where do you get the information, so that you understand rather than fear?

    Andrey Volodin: The most obvious advice is to start a pet project.

    The main thing worth mentioning here: all the concepts of mobile GPU development are exactly like those on the desktop. GPU programming on iOS is not fundamentally different from the desktop environment. So if there is a lack of iOS material on a topic, you can always read something written for NVIDIA or AMD hardware and take inspiration from it. Ideologically they are exactly the same; the API differs a little, but it is usually clear how to carry existing desktop practices over to mobile.

    Alexey Kudryavtsev: When you use an API, for example the Cocos2d or Unity game engine, at first you do not understand anything; you just call some methods. How do you begin to understand, and what is better to look at or read, so that it can be carried over to UIKit?

    Andrey Volodin: Cocos2d is an open-source project, and a well-written one. I am not very objective, because I had a hand in it, but it seems to me the code there is good enough to read and take inspiration from. It is written in not-so-modern Objective-C, but there are detailed comments on many of the tricky places.

    But when I talk about a pet project, I do not mean something high-level like making a game; I mean writing an API that produces, for example, a glitch effect. You know, there are popular apps that fake a VHS effect. And do it not on the processor, but on the GPU. It is a relatively simple task that can be done over a weekend, but it is not so simple if you have never tried it. When I did it for the first time, I learned amazing things: so this is how contrast and saturation work in Instagram, or Lightroom presets! It turns out these are just shaders that multiply four numbers or raise them to a power, and that is all.

    It completely blows your mind how simple it is.

    You use it every day and take it for granted: it works, but you do not understand how. Then you start doing it yourself, and it becomes doubly fun: you are doing something supposedly complicated, and in reality it is so simple that it is almost funny.
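
    For instance, the whole of "contrast and saturation" is a few lines of per-pixel math. Here is a sketch of the standard formulas in Swift rather than in a shader (the Rec. 709 luma weights are the usual choice):

```swift
import simd

// Contrast pushes channels away from mid-grey; saturation mixes between
// the pixel's luminance (grey) and its original color.
func adjust(_ rgb: SIMD3<Float>, contrast: Float, saturation: Float) -> SIMD3<Float> {
    let contrasted = (rgb - 0.5) * contrast + 0.5
    let luma = simd_dot(contrasted, SIMD3<Float>(0.2126, 0.7152, 0.0722))
    return simd_mix(SIMD3<Float>(repeating: luma), contrasted,
                    SIMD3<Float>(repeating: saturation))
}

// saturation 0 gives greyscale, 1 returns the color unchanged.
```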

    Daniil Popov: Still, it seems to me that some mathematical grounding is needed. For example, some shaders in Cocos2d are literally 5 lines of code, and you sit and stare at them blankly, simply not understanding what is written there. It is probably not that easy to dive into a shader language without knowing the math and the basic concepts.

    Andrey Volodin: I agree about the math. Without basic knowledge of linear algebra it will be hard; you will have to figure that out first. But if you had a linear algebra course at university, and you can at least roughly say what a dot product is and its geometric meaning, what a cross product is and its geometric meaning, what an eigenvector, a normal, and a matrix are, and how matrix multiplication works, it will be quite simple to figure out.
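
    All of those notions are a few lines in Swift's simd module, so refreshing the geometric intuition is cheap:

```swift
import simd

let x = SIMD3<Float>(1, 0, 0)
let y = SIMD3<Float>(0, 1, 0)

print(simd_dot(x, y))    // 0: perpendicular vectors, the dot product's geometric meaning
print(simd_cross(x, y))  // (0, 0, 1): orthogonal to both, i.e. a normal
let scale2 = float3x3(diagonal: SIMD3<Float>(2, 2, 2))
print(scale2 * x)        // (2, 0, 0): matrix * vector, the heart of every transform
```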

    Daniil Popov: Computer science students often whine that they do not need physics and mathematics. Probably many people now hardly remember how matrix multiplication works.

    Andrey Volodin: This is a sore point for me. I was the same, arrogantly whining about why I needed functional analysis and the like. But I had a valuable life experience when I interviewed at Apple with the ARKit team. There was such a huge amount of mathematics in the interview that I later thanked myself for going to class. If it were not for the background I received at university, I would never have answered those questions, and would never have understood how it works.

    Now, when I teach at the university or come to an open house, I always say: "Friends, you will have plenty of time to sit in your IDE. Please go to your linear algebra and calculus classes and actually figure out what they are about. In the age of machine learning, this will definitely come in handy."

    Daniil Popov: Most importantly, did you pass the interview?

    Andrey Volodin: Yes, of course, and only because I had a mathematical background.

    Alexey Kudryavtsev: So now you know why to learn calculus, and where it can get you afterwards.

    Andrey Volodin: For example, without understanding affine transforms and knowing what a normal is, you will not get far in VR. Even when you create a Project Template in Xcode, things are already being multiplied there; there are cross products, something is transposed. If you do not understand the concepts, then even simple things cannot be done.

    Daniil Popov: On this moral note, I suggest we gradually wrap up.

    Parting words


    Alexey Kudryavtsev: Give some parting words to those who want to get to know GameDev and the GPU more closely.

    Andrey Volodin: Not everyone needs this. It is not classic knowledge that will definitely come in handy on the job market; it is rather for people as individuals. But it seems to me that if you have wandered a little into a dead end and do not know where to continue your self-development, if you have already gone through everything you can and explored all the possible approaches in UI, modularization, launch-time optimization, the Objective-C runtime, in short, you have already figured everything out, then this is a good new field. There is a lot of room for a challenge. Although I do this every day, sometimes I still have to open textbooks: you sit and remember, aha, X to Y, I get it!

    If you want a challenge, to strain your brain and do something new, then GameDev and GPU programming are for you.

    If you too dreamed of making computer games in your childhood, it is time to start. Come listen to Andrey Volodin at AppsConf, the conference for mobile developers, on April 22-23 in Moscow at Infospace.
