Building training and testing playgrounds that help move sports analytics AI solutions out of the lab and into the real world is exceedingly challenging. In team-based sports, building the right playing strategy before the championship season is key to success for any professional coach or club owner. While coaches strive to give the best tips and point out mistakes during the game, they cannot notice every detail and behavioral pattern of both teams, even when rewatching matches. Sophisticated AI algorithms can be used to collect such data, analyze it and draw inferences about team behavior.
In particular, the tasks we would like to solve to support the analysis of a rugby team are locating each player during the match and estimating the 3D pose of each player on the field. Having this information in real time provides the evidence needed to build a better playing strategy.
In many sports analytics cases, an efficient solution to a class of problems already exists but cannot be applied effectively: the main bottleneck is missing data.
Gathering and, especially, labeling data can be extremely expensive and time-consuming. The images must be analyzed manually by humans, whose work on such repetitive tasks is not only slow and expensive but also less accurate than that of computers.
In addition, some cases require modern equipment to produce labeled data and highly qualified specialists to maintain the production process. This significantly increases the project cost or, in many cases, puts the sports analytics project out of reach for stakeholders.
Team-based sports – an attractive opportunity for Machine Learning and Computer Vision
What if we could automatically generate imagery and video data perfectly suited to the task at hand, with complete and always correct ground truth built in?
We would like to show our attempt to achieve exactly this, using 3D pose recognition of football or rugby players as an example. The goal is to train an AI model to accurately recognize the players and their poses, as human key points in 3D space, in real match footage. The AI models have been trained exclusively on artificial, synthetic data generated using the SKY ENGINE AI platform and NVIDIA RTX machines. The resulting images are simulated scenes that are fully controlled by SKY ENGINE’s renderer, so all kinds of ground truth can be provided, depending on the model's requirements.
The SKY ENGINE AI rendering engine with NVIDIA RTX cores provides physically based rendering for deep learning. A heterogeneous system consisting of NVIDIA Titan RTX and NVIDIA Tesla V100 GPUs makes a very productive and powerful configuration that can simultaneously generate labeled, multispectral (if needed) datasets and train the neural networks.
Main advantages of our approach include:
- Efficient dealing with unbalanced data
- Accurate detection of logotypes on uniforms and stadium
- Precise labels and 3D ground truth can be generated automatically
- AI models for human detection and pose estimation can be trained to operate in specific, unusual environments with just a few lines of code
- Noisy, low-quality data streams with compression artifacts do not significantly degrade the AI-driven inference accuracy
- Unknown parameters of broadcast cameras can be effectively derived
- High quality of 3D mapping available
- 3D pose estimation for small objects can be accurately carried out
- Complex structures of movements and formations can be accurately recognized
- Very efficient data processing and computation optimization with RTX architecture
Let’s take a look at the complete solution for the 3D pose estimation problem resolved in the SKY ENGINE AI platform.
Sport analytics case with SKY ENGINE AI platform
First, we have to configure the rendering engine, define a render datasource and train AI models for human detection and 3D pose estimation.
Assets loading and rendering engine configuration
We will start by loading the assets of the stadium's geometry. The assets are prepared in standard 3D modelling software and loaded into SKY ENGINE in Alembic format.
renderer_ctx.load_abc_scene('stadium')
renderer_ctx.setup()
Next, let’s display the loaded geometry of the stadium:
with example_assistant.get_visualizer() as visualizer:
    visualizer(renderer_ctx.render_to_numpy())
Figure 1. Preview of the 3D stadium geometry in the SKY ENGINE AI platform. A simplified preview of stadium geometry with simple Phong shader, without materials. Grayscale image.
The next step is to load textures for the geometries using the Python API:
stadium_base_textures = SubstanceTextureProvider(renderer_ctx, 'concrete')
stadium_base_params = PBRShader.create_parameter_provider(renderer_ctx, tex_scale=50)
renderer_ctx.set_material_definition('stadion_base_GEO', MaterialDefinition(stadium_base_textures, parameter_set=stadium_base_params))
As shown above, SKY ENGINE provides full support for procedural textures (which let users rapidly generate a wide variety of data) as well as physically based rendering (PBR shaders).
Next, let's define the environment map as follows:
renderer_ctx.define_env(Background(renderer_ctx,
                                   EnvMapMiss(renderer_ctx),
                                   HdrTextureProvider(renderer_ctx, 'light_sky')))
Figure 2. Preview of rendered 3D stadium with textures of grass, crowd, sky, goal posts etc. Images of commercial logos are overlaid on the pitch. Color image.
At this point, the stadium is already rendered in the scene, and the next step is to configure the entire scene and populate it with players. We can do that using a convenient instancing mechanism.
The SKY ENGINE renderer provides virtually endless possibilities to shuffle, multiply, randomize and organize the assets.
From a single Alembic animation of one player, we create two teams of 20 players each.
renderer_ctx.layout().duplicate_subtree(renderer_ctx, 'player_GEO_NUL', suffix='team2')
renderer_ctx.layout().get_node('player_GEO_NUL').n_instances = 20
renderer_ctx.layout().get_node('player_GEO_NUL_team2').n_instances = 20
Figure 3. Preview of rendered 3D stadium populated with players in uniformly randomised positions with clothes in randomised colors and patterns. Color image.
By default, all materials are drawn randomly. To create two proper teams, we ensure that every player in a given team has the same shirt color while keeping all other inputs random (hair, skin color, sock color, shirt number, etc.).
To achieve this, we need to put the players into separate randomization groups and define their drawing strategy. The Substance archive input that controls the shirt’s color is called "Colors_select". It needs to be the same (synchronized) inside the randomization group and different between the groups. All the other inputs are kept randomized by default.
shirt_sync = SynchronizedInput(SynchronizationDescription(in_strategy=Synchronization.DISTINCT_EQUAL_GROUPS))
player_material_strategy = DrawingStrategy(renderer_ctx, inputs_strategies={'Colors_select': shirt_sync})
renderer_ctx.instancers['player_GEO'].modify_material_definition(strategy=player_material_strategy)
renderer_ctx.instancers['player_GEO_team2'].modify_material_definition(randomization_group='team2', strategy=player_material_strategy)
Figure 4. Preview of the rendered 3D stadium populated with 3D players with synchronised team colors. Two teams are now visible, and the clothing textures are identical within each team. Color image.
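The synchronization rule described above (the same value inside a randomization group, different values between groups) can be illustrated outside the renderer with plain Python. This is only a sketch of the idea, not the SKY ENGINE implementation: the function `draw_team_colors`, the palette and the group names are hypothetical.

```python
import random


def draw_team_colors(groups, palette, rng):
    """Pick one shirt color per group: equal inside a group, distinct between groups.

    `groups` maps a (hypothetical) group name to its number of players.
    """
    if len(groups) > len(palette):
        raise ValueError("need at least one palette entry per group")
    # Sampling without replacement guarantees different colors between groups.
    chosen = rng.sample(palette, k=len(groups))
    # Every player in a group shares that group's color (the synchronized input).
    return {name: [color] * n for (name, n), color in zip(groups.items(), chosen)}


rng = random.Random(0)  # fixed seed so the draw is reproducible
shirts = draw_team_colors({"team1": 20, "team2": 20},
                          ["red", "blue", "green", "yellow"], rng)
```

All other inputs (hair, skin, socks) would simply be drawn independently per player, which is why only the shirt color needs a dedicated strategy.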
Looking closer at the picture above, one might notice that each player is in exactly the same pose. By default, SKY ENGINE plays animations from Alembic files frame by frame, so we need to randomize this parameter.
player_geometry_strategy = DrawingStrategy(renderer_ctx, frame_numbers_strategy=UniformRandomInput())
renderer_ctx.instancers['player_GEO'].modify_geometry_definition(strategy=player_geometry_strategy)
During a rugby game, players are not distributed uniformly; they tend to gather in groups, closer together. To make the scene look more natural, we can change the way the players' positions are drawn. Instead of drawing them uniformly, we can use a randomized Gaussian distribution. It is doubly random: first 𝜇 and 𝜎 are drawn, and then the players' positions are drawn from the Gaussian with these parameters.
gauss_strategy = DrawingStrategy(renderer_ctx, default_input_strategy=RandomGaussianRandomInput(sigma_relative_limits=(0.1, 0.2)))
renderer_ctx.layout().get_node('player_GEO_NUL').modify_locus_definition(strategy=gauss_strategy)
Figure 5. Preview of the rendered 3D stadium in full color. The stadium includes the pitch, stands with fans, advertisements and large video screens, and is populated with players whose positions are drawn from the Gaussian distribution. Color image.
We will skip the additional configuration of the camera, lights and postprocessing here, but we encourage the reader to look for the details in our GitHub repository. Now let’s move on to the configuration related to scene semantics and ground truth.
The keypoints are already present in the animation of the player. By default, SKY ENGINE calculates all the information about keypoints if it receives them in the input assets, so we just need to visualize them to be sure everything is configured correctly. Green keypoints are visible; red ones are hidden.
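The two-stage draw can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions rather than the renderer's actual algorithm: the function name `draw_clustered_positions` is hypothetical, 𝜎 is assumed to be expressed relative to the pitch dimension (mirroring `sigma_relative_limits`), 𝜇 is assumed to be drawn uniformly over the pitch, the pitch size is a made-up 100 by 70 units, and positions are clamped to the pitch.

```python
import random


def draw_clustered_positions(n_players, pitch_size, sigma_relative_limits, rng):
    """Draw 2D positions from a Gaussian whose mean and spread are themselves random."""
    coords_per_axis = []
    for length in pitch_size:
        # Stage 1: draw the distribution parameters for this axis.
        mu = rng.uniform(0.0, length)
        sigma = rng.uniform(*sigma_relative_limits) * length
        # Stage 2: draw the coordinates, clamped so that nobody leaves the pitch.
        coords = [min(max(rng.gauss(mu, sigma), 0.0), length)
                  for _ in range(n_players)]
        coords_per_axis.append(coords)
    # Combine the per-axis coordinates into (x, y) pairs.
    return list(zip(*coords_per_axis))


rng = random.Random(42)  # fixed seed so the draw is reproducible
positions = draw_clustered_positions(20, (100.0, 70.0), (0.1, 0.2), rng)
```

Because 𝜇 and 𝜎 change with every draw, each rendered frame places the cluster of players in a different part of the pitch and with a different spread.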
example_assistant.visualized_outputs = {SceneOutput.BEAUTY, SceneOutput.SEMANTIC, SceneOutput.KEYPOINTS}
Figure 6. Preview of stadium and players with visible overlays of 3D players and skeletons. Color image.
The scene looks correct, so we can create a renderer datasource for AI training.
datasource = MultiPurposeRendererDataSource(renderer_context=renderer_ctx, images_number=20, cache_folder_name='rugby_presentation_new')
The AI model training process
For the training phase, we will use models and trainers implemented in the DeepSky library, which is part of the SKY ENGINE AI platform.
main_datasource = SEWrapperForDistancePose3D(datasource, imgs_transform=transform)
train_data_loader = DataLoader(dataset,
                               batch_size=Constants.TRAIN_BATCH_SIZE,
                               num_workers=Constants.NUM_WORKERS,
                               drop_last=Constants.DROP_LAST,
                               shuffle=Constants.VALID_SHUFFLE,
                               collate_fn=collate_fn)
model = get_pose_3d_model(main_datasource.joint_num, backbone_pretrained=True)
trainer = DefaultTrainer(
    data_loader=train_data_loader, model=model, epochs=Constants.EPOCHS, save_freq=1,
    valid_data_loader=valid_data_loader, optimizer=optimizer, evaluator=evaluator,
    scheduler=scheduler, serializer=serializer)
trainer.train()
OK, now let's check the results achieved by training an AI model on synthetic data to validate that everything was configured correctly. After each epoch, we save a checkpoint and produce some inference examples so that we can follow the training progress.
show_jupyter_picture('gtc03_assets/trained/img2.png')
Figure 7. Inference of the AI model trained on 2000 synthetic images. The result after 90 epochs. The left image shows the ground truth, the right one the result of model inference. The difference between ground truth and inference is not visible to the eye. Color image.
Figure 8. AI models training in changing lighting, stadium branding and environmental conditions with different 3D poses and actions by the players. Video.
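A visual comparison of ground truth and inference can be complemented with a number. One metric commonly used for 3D pose estimation is the mean per-joint position error (MPJPE), the average Euclidean distance between predicted and ground-truth joints. The article does not state which metric the `evaluator` uses, so the sketch below is an assumption for illustration; the function name `mpjpe` and the sample joints are hypothetical.

```python
import math


def mpjpe(predicted, ground_truth):
    """Mean per-joint position error between two equally long lists of 3D joints."""
    if not predicted or len(predicted) != len(ground_truth):
        raise ValueError("both poses must contain the same, non-zero number of joints")
    # Average the Euclidean distance over all joints.
    distances = [math.dist(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(distances) / len(distances)


# Made-up three-joint skeleton: one joint is exact, the others are off by 2 and 1 units.
gt = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
pred = [(0.0, 0.0, 0.0), (0.0, 1.0, 2.0), (1.0, 1.0, 1.0)]
error = mpjpe(pred, gt)  # (0 + 2 + 1) / 3 = 1.0
```

With synthetic data, the ground-truth joints come directly from the renderer, so such a metric can be computed for every generated frame without manual annotation.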
Results of the AI model on the real images
In the next step, we will validate the results on a real video. First, let’s use a pretrained player detection model to find bounding boxes (the player detection tutorial was presented at GTC 2019 and is available on the SKY ENGINE AI GitHub: https://github.com/skyengineai).
checkpoint = torch.load('gtc03_assets/trained/rugby_detection.pth.tar')
# Prefix every key with '_model.' so the names match the detection model's state dict.
checkpoint = {'_model.' + k: v for k, v in checkpoint.items()}
detection_model.load_state_dict(checkpoint)
detection_model = detection_model.to(device)
real_dataset = ImageInferenceDatasource(dir='gtc03_assets/real_data', extension='png')
# `outputs` holds the detection results; the inference call itself is omitted here.
out = outputs.pop()
bboxes = out['boxes'].cpu().detach().numpy()
labels = out['labels'].cpu().detach().numpy()
# Keep only the boxes labelled 1 (labels must be read before they are used as a filter).
bboxes = bboxes[np.where(labels == 1)[0]]
bbox_image = bboxes_viz(orig_img, bboxes)
Figure 9. Rugby players detected on the real rugby game footage by the AI model trained in SKY ENGINE AI platform using the Nvidia RTX machine. Bounding boxes (in yellow color) around players detected on a real full color video frame captured during a real rugby game. Color image.
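Before the detected boxes are handed to the pose model, it can help to discard weak detections and clip the boxes to the frame. The sketch below shows one way to do this in plain Python. It is an assumption, not part of the original pipeline: the function `select_player_boxes`, the per-detection `scores`, the 0.5 threshold and the sample values are all hypothetical.

```python
def select_player_boxes(boxes, labels, scores, frame_size,
                        player_label=1, min_score=0.5):
    """Keep confident player detections as integer boxes clipped to the frame."""
    width, height = frame_size
    selected = []
    for (x1, y1, x2, y2), label, score in zip(boxes, labels, scores):
        if label != player_label or score < min_score:
            continue  # wrong class or not confident enough
        # Clip to the frame and round to whole pixels.
        x1, x2 = (int(round(min(max(x, 0), width))) for x in (x1, x2))
        y1, y2 = (int(round(min(max(y, 0), height))) for y in (y1, y2))
        if x2 > x1 and y2 > y1:  # drop boxes that collapsed after clipping
            selected.append((x1, y1, x2, y2))
    return selected


# Made-up detections on a 320x240 frame, in (x1, y1, x2, y2) format.
boxes = [(10.4, 20.6, 50.2, 120.9), (-5.0, 30.0, 40.0, 90.0),
         (300.0, 100.0, 340.0, 200.0), (60.0, 60.0, 80.0, 110.0)]
labels = [1, 1, 2, 1]
scores = [0.9, 0.8, 0.95, 0.3]
player_boxes = select_player_boxes(boxes, labels, scores, frame_size=(320, 240))
```

Here the third detection is dropped because of its label and the fourth because of its low score, while the second is clipped at the left edge of the frame.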
model.eval()
with torch.no_grad():
    results = model((img,), ({'boxes': torch.from_numpy(bboxes).int()},))
results = results.pop()
output_coords = results['pred_poses_coords'].cpu()
output_bboxes = results['boxes'].cpu()
Let’s look at a few examples. As shown below, despite the extremely low quality of the available data, a consequence of capturing a live TV broadcast with low resolution and strong compression, SKY ENGINE AI was able to train one of its key point AI models to detect players and correctly estimate the 3D coordinates of their skeleton joints. Without synthetic data and its perfect ground truth, such a task would be almost impossible with a conventional approach that uses real footage for model training.
Figure 10. Low-quality, cropped images of the players with accurately estimated 3D skeletons derived from this poor-quality imagery.
Conclusions
3D pose estimation is one of the most complicated computer vision tasks, and it usually requires high-quality images, calibrated cameras and perfect lighting conditions. On the other hand, training pose estimation algorithms for sports analytics would require extremely costly motion capture sessions with sophisticated equipment mounted on the pitch. We have just shown how the problem can be solved with simple 3D assets and the SKY ENGINE AI platform running on NVIDIA hardware.
The SKY ENGINE AI tools serve to build applications for team-based sports that will likely revolutionise these games. Players, coaches, clubs, decision-makers, fans and broadcasters can potentially benefit from further democratisation of these sports (e.g., instead of relying on arbitrary judgement from individual scouts or experts, one may use SKY ENGINE AI technology to rapidly assess and quantify the skills of players from under-represented regions or countries, or players from lower leagues). Moreover, this approach can be easily replicated to train models that detect humans, estimate their position and analyse their movements in any conditions, regardless of the environment, be it a factory, a workshop or a space station.
This article has also been published by NVIDIA on Developer Blog.