Automated sports production could be the next big thing in sports broadcasting.
Combined with OTT distribution, it could open the floodgates for around 200 million sporting events that are not broadcast due to limited resources. To reach mainstream adoption, automated production technology needs to meet the quality thresholds spectators expect.
Although many variations of automated capture technology exist in the market, this article focuses on the following two:
• Robotic capture (a.k.a. PTZ, Pan-Tilt-Zoom) – this technology uses a robotic camera to follow the action, panning and zooming to present a closer perspective. One key challenge is for the system to understand in real time what to focus on while maintaining smooth, human-like movements. If a goalkeeper in a soccer game takes a long-range kick at the beginning of a play, the robotic camera will have a hard time keeping the ball in frame without fast movements.
• Panoramic capture with latency – panoramic capture uses one wide-angle camera, or multiple wide-angle cameras whose images are stitched together. The technology uses advanced auto-tracking algorithms to follow the flow of play within the hi-res panoramic capture. These Artificial Intelligence (AI) decisions need to be made within a 5-second latency buffer to allow accurate, human-like capturing. Because many of today’s high-quality internet broadcasts already incur a latency of 20 seconds, this 5-second latency does not have a negative impact. The latency buffer, in essence, is like a camera operator who looks into the future to understand the flow of the play and then makes a conscious decision about how to shoot it.
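The "looking into the future" idea can be illustrated with a small delay buffer. This is only a sketch under stated assumptions, not any vendor's implementation: the function name delayed_targets, the lookahead parameter, and the choice of averaging future ball positions are all hypothetical, and frames stand in for seconds.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

LOOKAHEAD_FRAMES = 5  # hypothetical: 5 frames stand in for 5 seconds of video


def delayed_targets(ball_x: Iterable[float],
                    lookahead: int = LOOKAHEAD_FRAMES
                    ) -> Iterator[Tuple[int, float]]:
    """Yield (frame_index, target_x) once `lookahead` future frames are known.

    The target for a frame is the mean ball position over that frame and the
    frames that follow it in the buffer, so the virtual camera can start
    moving before the ball does.
    """
    window: Deque[float] = deque()
    emitted = 0
    for x in ball_x:
        window.append(x)
        if len(window) > lookahead:
            # The oldest frame now has `lookahead` frames of known future.
            yield emitted, sum(window) / len(window)
            window.popleft()
            emitted += 1
    # End of stream: flush the remaining frames with whatever future is left.
    while window:
        yield emitted, sum(window) / len(window)
        window.popleft()
        emitted += 1
```

With ball positions [0, 0, 0, 30, 30, 30] and lookahead=2, the target for frame 1 is already 10.0 even though the ball does not move until frame 3, which is the behavior the latency buffer is meant to allow.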
Automatic Capture Technology
Most automatic capture technologies share the following principles, although their specific algorithmic implementations vary. Because each sport has its own game logic, each sport requires completely different algorithms:
• Automatic ball detection – in ball-based sports (e.g. soccer and basketball), the ball is usually at the heart of the action. To capture the action, the algorithm tries to detect and follow the ball.
• Player detection – in more advanced technologies, automatic ball detection is complemented by the detection of players. This allows a better understanding of the action and serves as a basis for game state detection. Both the ball and player detections are based on the ability to analyze the images and distinguish between the background and the objects of interest (i.e. the ball and the players).
One of the challenges in player detection arises when a player stands still for a few seconds. For example, during a free kick in soccer, some of the players may not move for about 30 seconds. When this occurs, the algorithm must ensure that the player does not blend into the background.
In addition, the algorithm should be able to distinguish inactive players (e.g. those waiting on the sidelines) from active players, even when some of the active players are far from the ball. The algorithm should also be able to identify the referee, who is not part of the gameplay.
• Game state detection – based on the ball and player detection, the algorithm needs to identify the game state, which is the type of play currently happening. Examples include a corner kick (in soccer), a counterattack, a free throw, and a penalty kick. Each game state has its own visual characteristics. By understanding the game state, the algorithm can predict certain behaviors and make smarter decisions about how to best capture the action. Every sport has a long list of different game states, making this a challenging task for an automated system.
To overcome this challenge, game state detection may be based on deep learning algorithms, which can automatically learn how to identify a corner kick from a data set of perhaps 50 examples. With deep learning, the system trainer doesn't have to come up with rules for identifying the corner kick. The system automatically generates its own rules and selects its own characteristics to identify this specific game state.
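The stationary-player problem described above can be made concrete with a selective background update, a common background-subtraction technique. The article does not say which method real systems use, so this is a sketch under assumptions: the function name update_background, the threshold and rate values, and the flat list of pixel intensities are all hypothetical.

```python
from typing import List


def update_background(background: List[float], frame: List[float],
                      threshold: float = 25.0,
                      rate: float = 0.05) -> List[bool]:
    """Classify pixels, update the background in place, return the mask.

    A pixel is foreground when it differs from the background model by more
    than `threshold`. Only background pixels are blended into the model, so
    a player who stands still is never absorbed into the background.
    """
    mask = []
    for i, (bg, px) in enumerate(zip(background, frame)):
        is_foreground = abs(px - bg) > threshold
        mask.append(is_foreground)
        if not is_foreground:
            # Slow running average follows gradual lighting changes.
            background[i] = (1.0 - rate) * bg + rate * px
    return mask
```

If a player covers the same pixels for several hundred consecutive frames, those pixels stay marked as foreground, while the surrounding pitch pixels keep adapting to small lighting changes. A plain running average over all pixels would instead fade the player into the background.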
By taking all these parameters into account, the system can decide how to capture each frame. The image below shows a visual representation of this decision process. It is a panoramic image stitched together from multiple cameras. The red rectangle represents the desired frame to capture. The system recognized the play as an attack on the right. Notice how the players who are not participating in the play are marked with X's, while the players participating in the play are marked within the frame, with data such as their speed.
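One simple way to turn detections into a capture rectangle is to bound the ball and the active players, pad the box, and fit it to the output aspect ratio. The following is a minimal sketch, assuming pixel coordinates in the panorama. The function name choose_frame, the margin, and the 16:9 aspect ratio are illustrative assumptions, and a real system would also weigh the game state.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # left, top, width, height


def choose_frame(ball: Point, active_players: List[Point],
                 pano_w: float, pano_h: float,
                 margin: float = 50.0, aspect: float = 16 / 9) -> Rect:
    """Return a crop rectangle inside the panorama that covers the play."""
    xs = [ball[0]] + [p[0] for p in active_players]
    ys = [ball[1]] + [p[1] for p in active_players]
    width = (max(xs) - min(xs)) + 2 * margin
    height = (max(ys) - min(ys)) + 2 * margin
    # Grow the shorter side so the rectangle matches the output aspect ratio.
    if width / height < aspect:
        width = height * aspect
    else:
        height = width / aspect
    # Never ask for more pixels than the panorama has.
    scale = min(1.0, pano_w / width, pano_h / height)
    width, height = width * scale, height * scale
    # Center on the play, then clamp so the rectangle stays inside the image.
    cx = (max(xs) + min(xs)) / 2
    cy = (max(ys) + min(ys)) / 2
    left = min(max(cx - width / 2, 0.0), pano_w - width)
    top = min(max(cy - height / 2, 0.0), pano_h - height)
    return left, top, width, height
```

Inactive players and the spare ball are simply left out of active_players, which is why the classification steps described earlier matter for framing.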
The Baseline Characteristics
To provide an engaging viewing experience, the technology needs to simulate a human camera operator with smooth, non-robotic movements, preferably simulating the movement of a video tripod with a fluid head.
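One way to approximate fluid-head motion is to pass the desired camera position through a critically damped spring, which accelerates and decelerates gradually and does not overshoot. This is a sketch under assumptions, not a description of any product: the function name smooth_pan, the 30 fps time step, and the stiffness omega are hypothetical values chosen for illustration.

```python
from typing import List, Sequence


def smooth_pan(targets: Sequence[float], dt: float = 1 / 30,
               omega: float = 4.0) -> List[float]:
    """Follow `targets` with a critically damped spring.

    `omega` (rad/s) sets how quickly the virtual camera catches up; the
    damping term 2 * omega * vel is the critical value, so the camera
    eases in and out instead of snapping or oscillating.
    """
    pos = targets[0]
    vel = 0.0
    out = []
    for target in targets:
        accel = omega * omega * (target - pos) - 2.0 * omega * vel
        vel += accel * dt   # semi-implicit Euler: update velocity first,
        pos += vel * dt     # then move using the new velocity
        out.append(pos)
    return out
```

When the target jumps by 100 pixels in a single frame, the smoothed position moves only a few pixels per frame and settles on the new target without passing it. Lower omega values give a heavier, slower camera feel.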
Automated Capture Scenarios
Scenario #1 – additional unreal ball (not in play)
A common scenario in lower-tier leagues is the presence of a second ball that is not part of the game, such as a ball used for practice during the match. Focusing on this ball instead of the real ball is a mistake. In this scenario, the primary ball disappears (it is obstructed by something or someone) and a second, unreal ball appears.
• Human camera operator – the human will undoubtedly notice that this ball is not relevant.
• Robotic capture – a robotic camera that follows the ball can mistake the second ball for the real ball and jump quickly to follow it. When the real ball reappears, the camera may jump back to it.
• Panoramic with latency – using the 5-second latency, the system can learn that the second ball is not real. During this buffer, the AI may think the second ball is real, but when the real ball reappears and the second ball is out of play, the AI, like the human camera operator, will know that the first ball is the only real ball. From the spectators’ perspective, nothing happened, as this mistake is resolved within the 5-second latency buffer.
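The article does not describe how the AI decides which ball is real. As one hypothetical heuristic, the system could score each ball track over the whole buffer by how often it is close to an active player, and commit to the best track only after the buffer has been examined. The function name pick_game_ball, the radius, and the track format are assumptions made for this sketch.

```python
from math import hypot
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]


def pick_game_ball(tracks: Dict[str, List[Optional[Point]]],
                   players: List[List[Point]],
                   radius: float = 3.0) -> str:
    """Return the name of the track most often found near active players.

    `tracks[name][i]` is the ball position in buffered frame i, or None when
    that ball is not visible. `players[i]` lists active players in frame i.
    """
    def score(track: List[Optional[Point]]) -> int:
        hits = 0
        for pos, frame_players in zip(track, players):
            if pos is None:
                continue  # occluded frames neither help nor hurt
            if any(hypot(pos[0] - px, pos[1] - py) <= radius
                   for px, py in frame_players):
                hits += 1
        return hits

    return max(tracks, key=lambda name: score(tracks[name]))
```

A tracker that decides frame by frame would switch to the spare ball while the game ball is hidden. Scoring the whole buffer lets the brief occlusion be ignored, so the output never shows the switch.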
Scenario #2 – unpredictable ball movement
In sports such as basketball and soccer, the ball movement is very often unpredictable. For example, during an unexpected pass or a hard dribble, a camera that is zoomed in can lose track of the action in a split second.
• Human camera operator – for a human camera operator this scenario is challenging. This is where expertise comes in: how to take the shot, how to move the camera, and so on. Experienced camera operators have a better sense of the game and make split-second decisions on how to best capture the moment.
• Robotic capture – for a robotic camera, this is the most challenging situation. No matter how strong its algorithms are, it is impossible to predict the ball movement in these situations and, of course, jumping too quickly will yield an undesired experience.
• Panoramic with latency – by combining high-res panoramic views of the entire field with the 5-second latency, the algorithm can make accurate decisions based on a view of the future (the seconds held in the latency buffer), following the action even when the ball movement is unpredictable. An algorithmic model can be created to simulate the camera movement of a video tripod with a fluid head.
Scenario #3 – linear multiple simultaneous actions
In this scenario, a linear broadcast (no replay) may need to jump between multiple simultaneous actions. One action occurs near the ball, but the other does not. For example, during a free kick in soccer, as the shooter prepares for the kick, it is also interesting to show the goalkeeper at the other end preparing for the save.
• Human camera operator – it is impossible to capture the two moments with one camera.
• Robotic capture – it is impossible to capture the two moments with one robotic camera.
• Panoramic with latency – the latency allows a cut from one action to another in linear mode with one panoramic camera as if there were multiple cameras.
Scenario #4 – non-linear multiple simultaneous actions
Multiple actions occur at the same time, as in baseball, where, during a pitch, there can be movement between the bases. Usually, the broadcast will be non-linear, showing the first action live and the other as a replay.
• Human camera operator – it is impossible to capture the two moments with one camera.
• Robotic capture – it is impossible to capture the two moments with one camera.
• Panoramic with latency – the two moments cannot be shown live at the same time in a single output frame. However, because the panorama covers the entire field, the broadcast can show a replay of the other action as if there were multiple cameras in the field.
The combination of a high-res panoramic view of the entire field with an insignificant latency of 5 seconds provides the ability to simulate the same capture experience as a human camera operator. In addition, such a solution can go beyond single-camera capabilities, simulating multiple cameras with one camera head.
Today's extraordinary advancements in computing enable the placement of extensive AI and deep learning capabilities in the field (cloud resources are avoided because of the latency they add). The combination of computer vision, AI, and deep learning provides, for the first time, a viewing experience that is comparable to that of a human camera operator and, in some respects, surpasses the abilities of a single camera operator. These algorithms will get better and will be able to cover more types of sporting events. Nevertheless, the basic capabilities have already been proven and tested in the field.