How we replaced a sports scout with a neural network
Yes, we really did manage to replace a sports scout with a neural network and started collecting game data automatically. We now know more about a match than the spectators in the hall, and sometimes more than the referee.
We (Constanta) specialize in developing IT products for betting: mobile applications, websites and, more recently, projects in computer vision and machine learning. This article is about one of them.
While athletes fight for victories big and small, the bookmaker needs to follow the course of events in real time in order to recalculate the odds at which bets are accepted. To make this possible, sports scouts at the venue collect and transmit large amounts of data using a special smartphone application. A scout is human like the rest of us, so risks associated with the human factor are unavoidable. Our goal is to minimize them while increasing the volume and timeliness of data collection and transmission and, on top of that, reducing the cost of the whole process. Whether a small ball flies over a table tennis table or a ball over a football pitch, the technical implementation of a computer vision system for data collection is conceptually the same. We decided to start with billiards.
I need the coordinates and speeds of all your balls and your cue.
Note that in many sports the result can only be determined correctly by accurately tracing the whole chain of events. This places high demands on the reliability of the components that detect these events. A simple example: suppose that in billiards players need 20 shots on average to pocket all the balls. If the outcome of each shot is determined with 99% reliability, the probability of correctly determining the winner of a rack is only about 82% (0.99^20 ≈ 0.818). A match is played to five wins, so it consists of 5 to 9 racks, 7 on average. With this reliability of event detection, the correct match result is therefore obtained with a probability of only about 24% (0.818^7 ≈ 0.245). And yet the initial probability of error was just 1%!
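The arithmetic above is easy to reproduce. The snippet below is only a sketch of the calculation: the function name and its parameters are ours, and it assumes that every event is detected independently and with the same reliability.

```python
def chain_reliability(p_event, events_per_rack, racks_per_match):
    """Probability that all events in a rack, and in a whole match, are detected correctly.

    Assumes independent events with equal per-event reliability (an idealisation).
    """
    p_rack = p_event ** events_per_rack
    p_match = p_rack ** racks_per_match
    return p_rack, p_match


# 99% per-shot reliability, 20 shots per rack, 7 racks per match on average
p_rack, p_match = chain_reliability(0.99, 20, 7)
print(f"rack: {p_rack:.3f}, match: {p_match:.3f}")
```

Changing the first argument shows how quickly the match-level figure improves as per-event reliability approaches 100%.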
Of all the varieties of billiards, we consider nine-ball pool (Pool-9). A rack is won by the player who pockets the ball numbered 9. At the start, the nine sits in the center of a diamond of colored balls. The object ball, which the cue ball must hit first, is the lowest-numbered ball on the table. If a player fails to pocket a single colored ball with a shot, or commits a foul (for example, misses the object ball or pockets the cue ball), the turn passes to the opponent. To keep score correctly, the system has to detect every ball falling into a pocket and every event that leads to a change of player.
First, let's talk about how the neural network receives its data. The input is a video stream from a single camera mounted above the table and shooting at 60 frames per second.
An example of a video stream frame processed by the system.
The key stage in processing the video stream is semantic segmentation by a neural network. This is a classic computer vision problem in which the algorithm must assign each image pixel to one or more classes. Simply put, it has to work out what is what in every video frame. The neural network produces “masks” that highlight the pixels belonging to, for example, a ball or a player. After a series of post-processing algorithms, the ball “masks” are turned into coordinates. These are smoothed by a filter, which determines the speed and trajectory of each ball. At this stage, low-level (intermediate) events are tracked, such as collisions of balls with each other and with the rails of the table. The resulting data is sent to the rule processing module, which implements all of the game logic. The output of this module goes to the end consumer, i.e. the bookmaker.
The general scheme of the system.
To solve the problem, we first need to locate the table in the frame and all the balls on it. Another important participant is the cue: it determines the direction of the shot and, accordingly, the trajectory of the cue ball. Players lean over the table and partially block it from the camera. From the point of view of game analysis they are “foreign objects”, just like the ball rack, mobile phones, gloves, napkins and other things that players choose to leave on the rails. This gives several target classes for semantic segmentation: the table, its rails, pockets, the cue, foreign objects and, of course, the balls. In addition, each ball is a separate class, depending on its color.
For semantic segmentation we use a fully convolutional neural network with the LinkNet-34 architecture. It is relatively fast and has proven itself in various computer vision competitions and production tasks. A single neural network detects the whole set of classes listed above and handles all the computer vision tasks.
LinkNet-34 network architecture (see arXiv).
Images are fed to the input, and the output is a stack of “masks” for all the required classes. “Prediction masks” are two-dimensional arrays of numbers between 0 and 1. The value of each element of a “mask” is the network's confidence that the corresponding pixel belongs to the class of that “mask”. For the final pixel classification, the predictions are binarized with a threshold filter.
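Threshold binarization of a stack of prediction masks can be illustrated in a few lines. This is a minimal sketch in plain Python, not the production code: the class names, the tiny 2x2 masks and the threshold of 0.5 are assumptions made for the example, since the article does not state the actual threshold.

```python
def binarize_masks(predictions, threshold=0.5):
    """Turn per-class confidence maps (values 0..1) into binary masks (0 or 1).

    The default threshold is an assumed value for illustration.
    """
    return {
        name: [[1 if value >= threshold else 0 for value in row] for row in mask]
        for name, mask in predictions.items()
    }


# Hypothetical network output for two classes on a 2x2 image
predictions = {
    "ball_9": [[0.02, 0.91], [0.40, 0.97]],
    "cue": [[0.88, 0.05], [0.10, 0.03]],
}
masks = binarize_masks(predictions)
print(masks["ball_9"], masks["cue"])
```

In a real pipeline the same comparison would be done on whole arrays at once rather than element by element.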
A neural network learns to classify pixels from a large number of examples with corresponding “masks”. To build such a set, we collected many videos and split them into frames, and the annotation team manually prepared “masks” for them. Difficult cases required additional data sets. For example, when a ball “dives” into a pocket or sits close to it next to the rail, a shadow falls on it and its colors look different. Or, when a player breaks the diamond, the balls fly fast along complicated trajectories and their images are blurred. If the neural network had “seen” few such examples, correct classification would be difficult.
An example of an image and its corresponding annotation. The task of the neural network is to produce such “masks” from the input image.
Fast, faster, even faster ...
The end consumer needs the information in real time (and, better still, faster than real time). We used several techniques to speed up the neural network, such as merging batch normalization with the 2D convolution (BatchNorm fusion), which yields an equivalent network with fewer layers. Preparing and loading a new frame in parallel with processing the previous one on the GPU also gives a good gain. In addition, the GPU performs part of the frame preparation and of the “mask” post-processing. Even a simple idea helped to reduce the total processing time per frame: transferring the network output from the GPU to RAM after binarization, as uint8 instead of the float32 produced by the network.
As a result, semantic segmentation of one frame with all the required pre- and post-processing takes only 17 ms on average! And a single gaming graphics card is enough to run the system.
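To show why BatchNorm fusion gives an equivalent network, here is a small sketch in plain Python. It uses a 1x1 convolution applied to a single pixel to keep the arithmetic visible; the same per-output-channel rescaling applies to larger kernels. All names and numbers are illustrative assumptions, and the fusion is only valid at inference time, when the normalization statistics are fixed.

```python
import math


def conv1x1(x, weight, bias):
    """1x1 convolution at one pixel: x holds the input channels."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization with fixed running statistics."""
    return [g * (yi - m) / math.sqrt(v + eps) + b
            for yi, g, b, m, v in zip(y, gamma, beta, mean, var)]


def fuse_conv_bn(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch-norm scale and shift into the convolution weights and bias."""
    fused_w, fused_b = [], []
    for row, b, g, bt, m, v in zip(weight, bias, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)
        fused_w.append([w * scale for w in row])
        fused_b.append((b - m) * scale + bt)
    return fused_w, fused_b


# Made-up parameters: 2 input channels, 2 output channels
weight, bias = [[0.5, -1.0], [2.0, 0.25]], [0.1, -0.3]
gamma, beta, mean, var = [1.5, 0.8], [0.2, -0.1], [0.4, -0.2], [2.0, 0.5]
x = [1.0, 3.0]

reference = batchnorm(conv1x1(x, weight, bias), gamma, beta, mean, var)
fused_w, fused_b = fuse_conv_bn(weight, bias, gamma, beta, mean, var)
fused = conv1x1(x, fused_w, fused_b)
print(reference, fused)
```

The two printed vectors agree up to floating-point rounding, but the fused version needs one layer instead of two.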
Was there a collision?
We determine the coordinates of the balls from the “masks”, but first we need to discard anything that merely resembles a ball, such as round stripes on players' T-shirts. This is where heuristics come in: we check the shape and size of each candidate and its position relative to the previous, already known one. If the “mask” passes these checks, its centroid is taken for further processing.
A billiards player as seen in the developers' nightmares.
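The checks described above (size, shape, distance from the previous position) followed by taking the centroid might look roughly like this. It is a hedged sketch only: the function name, the bounding-box shape test and every threshold are assumptions for illustration, not the values used in the real system.

```python
import math


def ball_centroid(mask, previous=None, min_area=4, max_area=400,
                  max_aspect=1.5, max_jump=20.0):
    """Return the (x, y) centroid of a binary mask, or None if it does not look like a ball.

    All thresholds are hypothetical and expressed in pixels.
    """
    pixels = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not min_area <= len(pixels) <= max_area:
        return None  # too small (noise) or too large to be a ball
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    width = max(xs) - min(xs) + 1
    height = max(ys) - min(ys) + 1
    if max(width, height) / min(width, height) > max_aspect:
        return None  # elongated blob, e.g. a stripe on a T-shirt
    centroid = (sum(xs) / len(xs), sum(ys) / len(ys))
    if previous is not None and math.dist(centroid, previous) > max_jump:
        return None  # implausible jump from the last known position
    return centroid


blob = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
stripe = [[1] * 8]
print(ball_centroid(blob), ball_centroid(stripe))
```

A bounding-box aspect ratio is a deliberately crude shape test; a circularity measure would be a natural refinement.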
It may seem strange at first, but the detected positions of the balls can differ from frame to frame even when the balls are stationary. The explanation is simple: “noise” in real video and compression artifacts in the video stream. Together with the error in locating the blurred images of moving balls, this makes it necessary to smooth the results.
From the ball coordinates received from the network for the current and previous frames, the velocity is estimated as a numerical derivative. The number of points used and the interval between them are chosen adaptively while the system is running, depending on the availability of data and on events such as collisions. The position and velocity of the balls are then passed to a sigma-point (unscented) Kalman filter. It smooths the noisy data, which is especially important for determining the speed and its direction. In addition, the filter's dynamic model can be used to predict the near future.
Smoothing of the measured ball positions and velocities by a Kalman filter.
Left: “Raw” is the result of direct measurement; the vectors on the balls show the velocity and the numbers give the speed estimate. “UKF” is the result of filtering.
Right: an example of smoothing the direction of a ball's velocity with a Kalman filter. Blue shows the measurements, red the result of filtering. The sharp jumps in direction correspond to collisions of the ball.
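The smoothing idea can be illustrated with a much simpler filter than the one used in the system. The sketch below is a plain linear Kalman filter for a single coordinate with a constant-velocity model, swapped in here for readability; it is not the sigma-point filter from the article. The class name, noise variances and test data are all assumptions.

```python
class ConstantVelocityKalman1D:
    """Linear Kalman filter for one coordinate with state (position, velocity)."""

    def __init__(self, position, pos_var=1e3, vel_var=1e3,
                 process_var=1e-3, meas_var=1.0):
        self.x, self.v = position, 0.0
        # Covariance matrix entries: [[p00, p01], [p01, p11]]
        self.p00, self.p01, self.p11 = pos_var, 0.0, vel_var
        self.q, self.r = process_var, meas_var

    def step(self, measured_position, dt):
        # Predict with the constant-velocity model
        x = self.x + self.v * dt
        p00 = self.p00 + 2 * dt * self.p01 + dt * dt * self.p11 + self.q
        p01 = self.p01 + dt * self.p11
        p11 = self.p11 + self.q
        # Update with the position measurement
        s = p00 + self.r
        k0, k1 = p00 / s, p01 / s
        residual = measured_position - x
        self.x = x + k0 * residual
        self.v = self.v + k1 * residual
        self.p00 = (1 - k0) * p00
        self.p01 = (1 - k0) * p01
        self.p11 = p11 - k1 * p01
        return self.x, self.v


# A ball moving at 120 cm/s filmed at 60 fps, with +/-0.2 cm alternating "noise"
dt = 1 / 60
kf = ConstantVelocityKalman1D(position=0.2)
for k in range(1, 121):
    measurement = 120 * k * dt + (0.2 if k % 2 == 0 else -0.2)
    position, velocity = kf.step(measurement, dt)
print(f"filtered speed after 2 s: {velocity:.1f} cm/s")
```

With this made-up noise, a raw frame-to-frame difference would swing by 24 cm/s around the true speed, which is the kind of jitter the filter is meant to suppress.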
Knowing the state and trajectory of the balls makes it possible to detect a so-called low-level event even when it happened “between frames”.
During a shot the balls move so fast that there is often no frame in which the event itself, for example a collision of balls, is visible. Therefore, for all types of interaction (balls colliding with each other or with a rail, or falling into a pocket), a list of possible events is built first. Two criteria are used. The first is critical proximity of the objects. At low speeds the relative error in the speed and trajectory is large, so the distance between the interacting objects is what matters. The second applies at high speeds: possible events are determined from the intersection of the trajectories given by the dynamic model. This approach comes with a very nice bonus: the ability to predict in advance that a ball is likely to fall into a pocket.
Sequential frames of the video stream during the initial break of a diamond of balls. Without a model that describes the trajectories of the balls, it is difficult to determine which ball the cue-ball collided with.
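One way to picture the trajectory-intersection criterion is to compute when two balls moving at constant velocity would first touch. The sketch below is an illustration under simplifying assumptions (straight-line motion, no friction or spin); the function name and the ball radius of 2.86 cm are ours, not taken from the system.

```python
import math

FRAME_INTERVAL = 1 / 60  # seconds between frames at 60 fps


def time_to_collision(p1, v1, p2, v2, radius=2.86):
    """Time in seconds until two equal balls touch, or None if they never do.

    Positions in cm, velocities in cm/s; the radius is an assumed value.
    """
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    dvx, dvy = v2[0] - v1[0], v2[1] - v1[1]
    a = dvx * dvx + dvy * dvy
    b = 2 * (dx * dvx + dy * dvy)
    c = dx * dx + dy * dy - (2 * radius) ** 2
    if c <= 0:
        return 0.0  # already touching
    if a == 0 or b >= 0:
        return None  # no relative motion, or the balls are moving apart
    disc = b * b - 4 * a * c
    if disc < 0:
        return None  # the paths pass too far from each other
    return (-b - math.sqrt(disc)) / (2 * a)


# Cue ball heading straight at a stationary ball 50 cm away at 100 cm/s
t = time_to_collision((0, 0), (100, 0), (50, 0), (0, 0))
between_frames = t is not None and t < FRAME_INTERVAL
print(t, between_frames)
```

A collision whose predicted time is shorter than one frame interval is exactly the case that no single frame will show.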
A change in the direction and magnitude of the velocity vector indicates that an event, namely a collision, has occurred. When a ball rolls into a pocket, it “disappears”. There is an important caveat, though: its trajectory must be used to check that the ball really was pocketed and has not simply been hidden from the camera by a player's hand or some other object.
And what if something goes wrong? For example, some events may be missed because of a dropped frame or a player leaning over the table. Such omissions are critical for the game logic. A heuristic auto-correction system comes to the rescue and makes the system more robust. For example, if a shot on the cue ball is detected and the object ball falls into a pocket, but no collisions of the cue ball are recorded and the other balls remain motionless, it is logical to add a collision of the cue ball with the object ball.
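The auto-correction rule from this example can be written down as a small function. This is a sketch under assumptions: the event dictionaries, the event type names and the function name are hypothetical and only serve to make the rule concrete.

```python
def autocorrect_missing_collision(events, object_ball, moving_balls):
    """Insert a cue ball / object ball collision when the evidence implies it was missed.

    events: time-ordered list of dicts, e.g. {"type": "pocketed", "ball": 1}.
    moving_balls: balls that moved during the shot ("cue" is the cue ball).
    The event format is hypothetical.
    """
    has_strike = any(e["type"] == "cue_strike" for e in events)
    pocket_index = next(
        (i for i, e in enumerate(events)
         if e["type"] == "pocketed" and e["ball"] == object_ball),
        None,
    )
    cue_collided = any(
        e["type"] == "collision" and "cue" in e["balls"] for e in events
    )
    others_still = set(moving_balls) <= {"cue", object_ball}
    if has_strike and pocket_index is not None and not cue_collided and others_still:
        inferred = {"type": "collision", "balls": ("cue", object_ball), "inferred": True}
        return events[:pocket_index] + [inferred] + events[pocket_index:]
    return list(events)


events = [{"type": "cue_strike"}, {"type": "pocketed", "ball": 1}]
fixed = autocorrect_missing_collision(events, object_ball=1, moving_balls={"cue", 1})
print(fixed)
```

Marking the inserted event as inferred keeps it distinguishable from events that were actually observed.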
So are we playing or not?
The balls roll, collide, fall into the pockets... Is the game really in progress at this moment? Or, on the contrary, nothing on the table is moving... So has the game stopped? Answering these questions correctly is probably as important as detecting collisions. While a player prepares for a shot, thinks it over and aims, nothing moves. But the opposite also happens: outside of play, life on the table can be quite lively. Pocketed balls are moved from one pocket to another, and the cue ball may roll across the table while it is being placed after a foul, sometimes pushed with the tip of the cue (which looks very much like a shot!). The end of a shot is easy to determine, since after a legal shot all the balls stop moving, but the start is not so clear-cut. Of course, a neural network could be trained to detect events in the video, including real cue shots. Alternatively, one can build a set of heuristics that analyze the position and angle of the cue, the trajectory and speed of its ends, and the trajectory of the cue ball after the presumed impact. We took the second route, and the result was a very fast and reliable algorithm that determines the current state of the game.
The system is trying to understand whether the game has started or not.
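A heuristic of this kind could, for instance, compare the cue axis with the direction in which the cue ball starts moving. The sketch below shows only that one idea; the function name, its inputs and all thresholds are assumptions, and the real system combines more signals than this.

```python
import math


def is_real_strike(cue_butt, cue_tip, cue_ball, ball_velocity,
                   max_tip_distance=6.0, min_ball_speed=5.0, max_angle_deg=20.0):
    """Guess whether the cue ball was set in motion by a shot along the cue axis.

    Positions in cm, velocity in cm/s; all thresholds are hypothetical.
    """
    axis_x, axis_y = cue_tip[0] - cue_butt[0], cue_tip[1] - cue_butt[1]
    axis_length = math.hypot(axis_x, axis_y)
    speed = math.hypot(ball_velocity[0], ball_velocity[1])
    if axis_length == 0 or speed < min_ball_speed:
        return False  # no usable cue direction, or the ball barely moves
    if math.dist(cue_tip, cue_ball) > max_tip_distance:
        return False  # the tip was not near the cue ball
    cos_angle = (axis_x * ball_velocity[0] + axis_y * ball_velocity[1]) / (axis_length * speed)
    return cos_angle >= math.cos(math.radians(max_angle_deg))


# Cue lying along the x axis with its tip 4 cm from the cue ball
print(is_real_strike((0, 0), (100, 0), (104, 0), (150, 5)))   # ball leaves along the cue
print(is_real_strike((0, 0), (100, 0), (104, 0), (0, 150)))   # ball rolls off sideways
```

A result of True would move the state from “wait” to “play”; the state returns to “wait” once all balls have stopped.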
And who won in the end?
All data about low-level events (shots on the cue ball, positions and collisions of balls, balls falling into pockets) is sent to a module that uses their sequence to determine whether there was a foul or a legally pocketed ball, a change of turn or the end of the rack. This impartial module keeps the score and announces the winner. Its distinctive feature is that it works without any auto-corrections or heuristics and simply applies the rules of the game formally. The rules block can be replaced completely, which makes it possible to adapt it to local tournament rules or to use it for other billiard games without significant changes to the rest of the system.
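A formal rule block for nine-ball, limited to the rules described earlier in this article, could be sketched as follows. The data structure, the function name and the outcome labels are hypothetical; real tournament rules contain more cases (for example, what happens to the nine after a foul), which this sketch does not cover.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple

CUE_BALL = 0  # assumed encoding: 0 is the cue ball, 1..9 are the colored balls


@dataclass
class Shot:
    first_contact: Optional[int]  # first ball hit by the cue ball, None if it hit nothing
    pocketed: List[int]           # balls pocketed during the shot


def apply_nine_ball_shot(on_table: Set[int], shot: Shot) -> Tuple[str, Set[int]]:
    """Apply one shot and return (outcome, colored balls left on the table)."""
    lowest = min(on_table)
    foul = shot.first_contact != lowest or CUE_BALL in shot.pocketed
    colored = [ball for ball in shot.pocketed if ball != CUE_BALL]
    remaining = on_table - set(colored)
    if foul:
        return "foul", remaining          # the turn passes to the opponent
    if 9 in colored:
        return "rack_won", remaining      # the nine was legally pocketed
    if colored:
        return "continue", remaining      # the same player shoots again
    return "turn_passes", remaining       # legal shot, nothing pocketed


outcome, left = apply_nine_ball_shot(set(range(1, 10)), Shot(first_contact=1, pocketed=[3]))
print(outcome, sorted(left))
```

Because the function only looks at the event summary of a shot, a different game could be supported by swapping in another function with the same inputs.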
Just as self-driving cars have not yet got rid of the test engineer in the cabin who monitors safety, our rule module allows external manual control through a web interface. Intervention may be needed if the automatic system fails. In addition, data that is absent from the video stream has to be entered manually: which player starts, special shots that are announced aloud during the game, and so on. One person can potentially supervise several games at once.
How it works
After launching and tuning the system, we not only started receiving the data we needed but also discovered a lot of interesting things. For instance, sometimes the referee standing at the table cannot tell for certain whether the cue ball hit the object ball or a different one. The objective view of our system, meanwhile, shows how the situation actually developed. In addition, the system collects a lot of information useful for further analysis that a person is simply unable to determine and transmit in real time: the positions and speeds of the balls and the parameters of each player's cue shots.
The system is currently in operation and is used by the bookmaker. We plan to improve it further, including automatic identification of the players and automatic determination of the result of the first shot.
Technical visualization of how the system works. The ball shown next to “Cue ball” is the first ball the cue ball hit; “State” is the state of the system, either “wait” (until the player has taken a shot) or “play” (while the balls are in motion); “Player” is the current player; the numbers next to the balls are their speeds in cm/s.