In this article, you'll learn how to think about your machine learning models from a data-centric standpoint, one that stresses the value of data in the process of creating AI models. Data-centric AI focuses on methodically iterating on the data to improve its quality, or on providing high-quality data (from the neural network's perspective) from the start, in order to increase performance. Whether you iterate on the data or generate it, this is a continuous process that you undertake not just at the start of a project but also after deployment to production. We add another critical component, synthetic data generation, to make the data-driven AI strategy more complete and to further raise dataset quality as measured by the AI model's accuracy.
An old paradigm in Machine Learning
The Data Science community has a long history of creating and deploying datasets for AI systems. However, this undertaking is frequently painful and costly. The community needs productive, efficient open data engineering tools that make creating, managing, and analysing datasets easier, less expensive, and more repeatable. The primary goal is therefore to democratize data engineering and assessment in order to speed up dataset development and iteration while also making datasets more efficient to use. If data preparation accounts for 80% of machine learning labor, then ensuring data quality is the most critical task of a data science team. Human-labeled data still fuels AI-based systems and applications, even though most innovation has concentrated on improvements to AI models and code. Because annotators are the source of data and ground truth, the increased emphasis on volume, speed, and cost of developing and enhancing datasets has affected quality, which is itself ambiguous and frequently circularly defined. The development of methods for repeatable and systematic tuning and balancing of datasets has also lagged. While dataset quality remains everyone's top priority, the methods by which it is assessed in practice are poorly understood and, in many cases, incorrect. A decade later, we see some cause for concern: fairness and bias issues in labeled datasets, quality issues, benchmark limitations, reproducibility issues in ML research, lack of documentation and replication of data, and unrealistic performance metrics.
Beyond a certain point, the present model-centric approach to data science projects tends to hit a brick wall. When your data does not let you go any further, for example because of its quantity or quality, there is only so much you can gain by trying many different AI model architectures or tweaking the current one. Experimenting with various models to determine which is best suited to the given data, machine learning problem, and business case does not solve the issue in the long term. If your best model does not reach the metric that the business requires to approve or roll out the project, it is probably wise to look into the data and dig deeper to determine where data quality is holding the model back.
In addition, when you train an AI model on data that is statistically different from the real data, this inconsistency makes the model fail to generalize to the real use case. The mismatch can appear in unexpectedly subtle ways, and it is frequently neglected when gathering truly representative data is considered too difficult. The problem is especially serious because the validation dataset, which is used to evaluate the quality of the AI model, will most likely carry the same issues as the training data. As a result, you believe the model is doing well when in reality it is not.
The SKY ENGINE AI platform brings synergy by adding value to both approaches, Model-centric and Data-centric AI, in a single platform for Data Scientists and Developers. On the one hand, SKY ENGINE AI makes it possible to generate a balanced, high-quality dataset that covers the edge cases. On the other hand, the iterative data generation process supports model architecture optimization, and the characteristics of specific AI models can be effectively detailed.
Even when all of the training data is consistent with the real data, some cases may still be missing from the training dataset. In reality, events do not occur with equal frequency. Some are uncommon while others are frequent, so during data collection a lot of data is acquired for typical scenarios and relatively little for unusual ones. In those rare circumstances, this directly leads to poor AI model performance. Moreover, once sufficient data for typical scenarios has been acquired, additional data for those scenarios no longer improves performance. In many AI-driven computer vision applications, the rare occurrences clearly have the largest influence on the metrics. For instance, an additional thousand hours of video recorded from cars driving in safe and unchallenging conditions is unlikely to improve an autonomous vehicle's self-driving performance, because the vehicle already performs relatively well in such conditions; two minutes of footage from immediately before crashes, however, could.
The data-centric strategy adds tremendous value to industry use cases
As AI spreads throughout sectors, a data-centric strategy is especially effective in use cases with a small amount of available representative and labelled data. Healthcare, manufacturing, and agriculture are examples of industries that frequently deal with relatively limited datasets or massive unlabeled datasets with few domain specialists. A data-driven approach is especially important when dealing with unstructured data (images and video data in computer vision applications), which is widespread in the sectors described above.
We expect to see more synergy: model-based techniques that reduce reliance on labelled data, combined with technologies that improve the quality of the limited, unbalanced labelled dataset you have. We also expect growing use of synthetic data generation tools to close the accuracy gap while accelerating the production of AI models. For example, a typical manufacturing company may have hundreds of possible defect detection use cases, each of which, if successfully resolved, might result in millions of dollars in savings. However, many businesses lack the data science resources to filter each and every dataset to ensure good quality. Methods such as learning on synthetic data can be used within a data-centric framework to reach the desired performance.
A new order in the AI models training
Over the last few years, nearly all academic research in deep learning has focused on AI model architecture, design, and related advancements. Pre-processing, cleaning, and annotating data are also regarded as tedious and dull by many data scientists. The adoption of a data-centric strategy within the deep learning community is intended to spur greater innovation in academia and business in areas such as data collection and generation, labeling, augmentation, data quality evaluation, data debt, and data management.
When you have some leeway in the data gathering process, you can make considerable improvements by paying attention to how you collect the data. In general, acquiring high-quality data that accurately reflects your use case is far more essential than just getting as much as possible from a more convenient source. The ideal situation would be to produce data using the same technique that would be utilized in the final model. The SKY ENGINE AI platform perfectly serves this purpose, boosting the accuracy of the AI models by synthetic data generation in multiple modalities including ground truths and by applying advanced domain adaptation algorithms.
Making labelling standards consistent, discarding noisy samples, and using error analysis to focus improvement on a subset of the data are all methods that we can apply today. Annotation conventions may differ between labellers. This alone can result in inconsistent labelling, which lowers the AI model's performance.
In the automotive industry, we have seen indications of a data-centric AI strategy, although they may not have been labelled as such. During the peak of the pandemic, when lockdowns were in effect, several autonomous driving businesses were unable to go out and collect new data (they deploy two safety drivers per car to prevent crashes due to the lapse of one driver, and social distancing would have been impossible). These companies revisited their initial approach and worked on generating synthetic data to improve performance instead. At SKY ENGINE AI we have also served them with a synthetic data approach that simulates modalities other than visible light, such as infrared (IR), radar, or lidar.
“Improving the data is not a preprocessing step that you do once. It’s part of the iterative process of model development.”
Andrew Ng
In general, the more experienced a Data Scientist or Data Engineer is, the better they are at detecting anomalies in data and iteratively improving it. The goal of this data-driven endeavor, however, is to build concepts, reliable metrics on quality and tools that may make this process more systematic, resulting in more efficient and successful AI/ML development across industries.
In addition, covering the corner cases is critical and can be a key differentiator, e.g., in homeland security tenders. For this purpose, the SKY ENGINE AI platform makes it possible to check generated samples automatically or semi-automatically, to detect where the model is drifting in production, to balance the dataset, and to ensure proper error evaluation. The system can then increase the probability that such a scene configuration occurs, so that the edge cases are well represented in the dataset in further stages of training.
Some early innovations in this space are very promising. They include SKY ENGINE AI's approach, which pioneers data simulation, synthesis, and generation together with AI model training in a closed feedback loop to ensure a well-balanced dataset, potentially leading to very high accuracy in computer vision tasks. We are at the forefront of the approach proposed here, which combines Model-centric and Data-centric AI into a single strategy, with the SKY ENGINE AI platform and its evolutionary AI engine as the central point.
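The idea of raising the probability of poorly handled scene configurations can be illustrated with a minimal sketch. It is not the platform's actual mechanism: the function `reweight_scenarios`, the scenario names, and the error rates are all hypothetical, and the rule (sampling probability proportional to validation error, with a small floor) is just one simple choice.

```python
def reweight_scenarios(error_rates, floor=0.05):
    """Turn per-scenario validation error rates into sampling probabilities.

    error_rates: dict mapping a scenario name to an error rate in [0, 1].
    floor: minimum weight, so well-handled scenarios are still generated.
    """
    weights = {name: max(rate, floor) for name, rate in error_rates.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical error rates measured on a validation set.
errors = {"daylight": 0.02, "night": 0.20, "heavy_rain": 0.25}
probabilities = reweight_scenarios(errors)
print(probabilities)
```

The scenarios with the highest error receive the largest share of the next generation batch, while the floor keeps the common cases from disappearing from the dataset.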
How much data is enough (to train an accurate AI model)?
We frequently try to determine the quantity of data required to develop useful AI models. The notion that 'the more, the better' is not always correct: quality matters more than quantity. A dataset may contain millions of images, yet if many of them are noisy or inaccurately annotated, the learning process suffers. In contrast, a smaller dataset with high-quality annotations and ground truths often produces far better results. At SKY ENGINE AI we provide a full-stack synthetic data simulation and generation platform to address this, one where quality data is available in endless quantities.
Based on: DeepLearningAI
Deep learning models are hungry, or rather extremely hungry, for data, and a sufficient amount of representative data must be available for the problem at hand. Deep neural networks are low-bias, high-variance machines, and we also believe that more data addresses the variance problem. However, mindlessly accumulating additional data can be inefficient and expensive. Error analysis of the AI model can show which type of data needs to be added.
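As a minimal sketch of such an error analysis, the snippet below groups validation results by a slice tag and ranks the slices by error rate. The tags ("day", "night", "fog") and the records are made up for illustration, and `error_rate_by_slice` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict


def error_rate_by_slice(records):
    """records: iterable of (slice_tag, is_correct) pairs from a validation run."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for tag, is_correct in records:
        totals[tag] += 1
        if not is_correct:
            errors[tag] += 1
    rates = {tag: errors[tag] / totals[tag] for tag in totals}
    # Worst slices first: these are the candidates for targeted new data.
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Made-up validation results, tagged by recording condition.
validation = [
    ("day", True), ("day", True), ("day", True), ("day", False),
    ("night", False), ("night", False), ("night", True),
    ("fog", False), ("fog", True),
]
ranking = error_rate_by_slice(validation)
print(ranking)
```

In this toy example the "night" slice has the highest error rate, so adding night-time data is likely to pay off more than adding further daytime samples.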
Ensure Data Quality is High
Real data is usually untidy, inaccurate, and noisy, and it needs (sometimes very complicated) annotation, which, if done incorrectly, can deepen the AI model's performance issues. To get the desired outcomes, the process must be carried out with great care. While enormous amounts of real data can alleviate some of the problems associated with low-quality data, this is often ineffective. When your model sees only relevant data, it is much easier for it to detect patterns. Including a large amount of useless or inconsistently prepared data obscures the patterns your model needs to detect. And even though additional hours spent curating the datasets might pay off, the process is still laborious and costly.
Data labeling cannot be neglected as it's of very high importance to keep labels accurate and consistent. Unfortunately, various data engineers may have different viewpoints on the appropriate data annotations and the quality of their work is often questionable. For instance, which are the correct bounding boxes for the situation in the images below?
Source: MLOps: From Model-centric to Data-centric, DeepLearningAI, Andrew Ng
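One way to make disagreement between annotators measurable is to compare their bounding boxes with intersection over union (IoU) and to flag images where the overlap is low. The sketch below assumes boxes in `(x_min, y_min, x_max, y_max)` format; the threshold of 0.5 and the helper `flag_disagreements` are illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def flag_disagreements(annotations, threshold=0.5):
    """annotations: dict image_id -> (box from annotator 1, box from annotator 2)."""
    return [
        image_id
        for image_id, (first, second) in annotations.items()
        if iou(first, second) < threshold
    ]


# Made-up annotations from two labellers.
annotations = {
    "img_1": ((10, 10, 50, 50), (12, 11, 49, 52)),  # nearly identical boxes
    "img_2": ((10, 10, 50, 50), (30, 30, 90, 90)),  # different conventions
}
flagged = flag_disagreements(annotations)
print(flagged)
```

Images flagged this way are good candidates for review and for tightening the labelling guidelines.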
A new study finds that around 3.4 percent of instances in popular and frequently used datasets were wrongly labelled, and that larger AI models are more affected by this. Observation of such errors raises serious concerns about the multitude of research publications that outperform prior standards by a percentage or two!
To address such issues, make sure you have clear labeling guidelines, and use synthetic data generation for images and ground truths: accurate semantic masks, instances, and bounding boxes. Synthetic training data then needs less screening for annotation errors. However, validation datasets should always undergo careful inspection, as they mimic real-world scenarios.
Another aspect of data quality is related to the fact that training data should be representative and diverse, sufficiently describing the problem under consideration and covering as many variations as possible that can be found in validation and deployment data. Preferably, any data qualities that are not causal features should be sufficiently randomised. Typical issues with datasets can be:
- Spurious correlations: Neural networks often learn the wrong thing when non-causal attributes are correlated with the annotations. For instance, if sheep usually appear on meadows, ML models may learn to associate the background with the object class.
- Lack of variation: When a non-causal attribute, e.g., image brightness, fails to vary enough in the dataset, a neural network can overfit to the distribution of that attribute and fail to generalize well. Models trained on data gathered during the day will likely fail to perform well in darkness, and vice versa.
Apart from acquiring new real-world data with more diversity, synthetic data simulation and generation provides a viable strategy to deal with spurious correlations and lack of variation issues.
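A simple check for spurious correlations is to measure how strongly a non-causal attribute predicts the label on its own. The sketch below is an illustration with made-up counts: for each attribute value (here the image background) it computes the share of samples carrying the most common label. The helper name `label_purity_by_attribute` is hypothetical.

```python
from collections import Counter, defaultdict


def label_purity_by_attribute(samples):
    """samples: iterable of (attribute_value, label) pairs.

    Returns, for each attribute value, the share of its samples that carry
    the most common label. A purity close to 1.0 means the attribute alone
    almost determines the label, which is a warning sign.
    """
    by_attribute = defaultdict(Counter)
    for attribute, label in samples:
        by_attribute[attribute][label] += 1
    return {
        attribute: counts.most_common(1)[0][1] / sum(counts.values())
        for attribute, counts in by_attribute.items()
    }


# Made-up dataset: background versus annotated class.
samples = (
    [("meadow", "sheep")] * 9 + [("meadow", "dog")] * 1
    + [("street", "dog")] * 5 + [("street", "sheep")] * 5
)
purity = label_purity_by_attribute(samples)
print(purity)
```

A high purity for "meadow" suggests generating or collecting more meadow images with other classes, so that the background stops being a shortcut.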
How to get to robust deep learning?
In a commonly accepted workflow for training AI models, analysis of training and inference results can lead to more data collection and to re-training of the model. Observing and rectifying errors in validation data is a useful approach. However, it only fixes the problems identified by error analysis and does not protect against future problems. The same spurious correlation that is present in the training data may also exist in the validation data. Here, the SKY ENGINE AI platform's Evolve engine plays a critical role by bringing a more proactive approach to the AI model’s robustness in real-world environments. It can spot the cases where the AI model's performance has deteriorated, simulate and generate extra data to better cover those ambiguous situations, and finally re-train the AI models to improve them in these cases. In addition, the platform makes high-quality data available through all stages of the ML project lifecycle.
Full stack SKY ENGINE AI workflow for accelerated AI models training and synthetic data simulation and generation embracing Model- and Data-centric AI strategy in a single platform for Data Scientists.
In general, deep learning problems often sit within the data-centric area: datasets are unstructured or simply do not exist, domain experts are lacking, and labeled data is difficult and costly to obtain. In industrial applications there is usually a lot of unlabeled data available, but few domain specialists can be found. Such a situation is common in medical diagnostics, e.g., radiology using X-rays, ultrasound, or magnetic resonance imaging (MRI).
The SKY ENGINE AI platform is equipped with a set of tools and methods from both model-based and data-centric approaches. The aim is to relax the dependency on labeled real-world data by taking advantage of existing or solved AI tasks, to use synthetic data in a closed feedback loop between AI model creation and data generation, and to increase the amount of training data with perfect labels and ground truths. The following approaches and methodologies for training AI models on synthetic data are possible with the SKY ENGINE AI platform:
Transfer learning
Involves the transfer of knowledge across domains or tasks. It challenges the common assumption that both training and test data should be drawn from the same distribution (Zhuang et al., 2020). Transfer of knowledge includes instances, feature representations, AI model parameters or relations.
A common use case involves adopting an architecture like AlexNet, pretraining it on ImageNet, and replacing the last fully connected layers with new ones for the target domain. The modified architecture can then be fine-tuned on target-domain labels for classification.
SKY ENGINE AI is equipped with pretrained AI model architectures that can be trained for the final tasks on the synthetic data (or a mix of real and synthetic data) generated by the system. This significantly reduces the number of assets required to prepare the 3D scene for data generation.
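The "freeze the backbone, train a new head" pattern can be sketched without any deep learning framework. In the toy example below, `pretrained_features` is a hypothetical stand-in for a frozen pretrained network (a real setup would use a pretrained model from a framework instead), and only a small logistic-regression head is trained on made-up target-domain data.

```python
import math


def pretrained_features(x):
    """Stand-in for a frozen, pretrained backbone (hypothetical fixed mapping)."""
    return [x[0] + x[1], x[0] - x[1]]


def train_head(samples, labels, epochs=200, lr=0.5):
    """Fit only a new logistic-regression head; the backbone is never updated."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            features = pretrained_features(x)
            z = sum(w * f for w, f in zip(weights, features)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            error = p - y  # gradient of the log loss with respect to z
            weights = [w - lr * error * f for w, f in zip(weights, features)]
            bias -= lr * error
    return weights, bias


def predict(x, weights, bias):
    features = pretrained_features(x)
    z = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if z > 0 else 0


# Tiny made-up target-domain dataset.
samples = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 0.8)]
labels = [0, 0, 1, 1]
weights, bias = train_head(samples, labels)
print([predict(x, weights, bias) for x in samples])
```

Because only the head is trained, very few labelled target samples are needed; fine-tuning the backbone as well is the natural next step when more target data is available.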
Data augmentation
Data modifications can provide new data points that help create a robust AI model. We can apply multiple transformations to imagery data that change it while preserving its structure, which can yield a near-infinite number of examples. In this way we can teach an AI model aspects of the problem that are obvious to us, for instance, that rotations do not deform the real-world object depicted in an image. In the SKY ENGINE AI platform, synthetic 3D objects or even entire scenes can easily be augmented at the generation step, and the ground truth simply follows. In fact, the entire virtual environment in which the AI models are trained can be continuously augmented in this process.
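The point that the ground truth has to follow the transformation can be shown with the simplest possible case, a horizontal flip (used here instead of a rotation to keep the arithmetic short). The sketch assumes images as lists of rows and boxes as `(x_min, y_min, x_max, y_max)` with exclusive maxima; both helpers are illustrative.

```python
def hflip_image(image):
    """Mirror a 2D image (list of rows) left to right."""
    return [list(reversed(row)) for row in image]


def hflip_box(box, width):
    """Mirror a box (x_min, y_min, x_max, y_max); x_max and y_max are exclusive."""
    x_min, y_min, x_max, y_max = box
    return (width - x_max, y_min, width - x_min, y_max)


# A 5x3 toy image with a 2x2 object in the lower-left corner.
image = [
    [0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
]
box = (0, 1, 2, 3)

flipped_image = hflip_image(image)
flipped_box = hflip_box(box, width=len(image[0]))
print(flipped_image, flipped_box)
```

Applying the same transformation to the image and to its annotation keeps the pair consistent, so the augmented sample needs no re-labelling.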
Self-supervised learning
Self-supervised learning is a machine learning method that learns from unlabeled sample data and is very useful for problems with a very limited amount of labeled data. Self-supervised learning algorithms consist of two stages: in the first stage, the model is pre-trained by solving pretext tasks, which leads to useful self-supervised representations; in the second stage, the model is trained to solve the final task. In self-supervised learning, a synthetic data approach enables a precise implementation of pretext tasks, in which specific 3D objects can be viewed from a variety of angles or under a variety of lighting settings. A model pre-trained this way can then be used for final training on a limited subset of real data. Efficient architectures for self-supervised learning in computer vision include, e.g., those based on Siamese networks.
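A classic pretext task is rotation prediction: each unlabeled image is rotated by a multiple of 90 degrees and the model has to predict which rotation was applied. The sketch below only builds such a pretext dataset from toy images (lists of rows); the model and its training are omitted, and the helper names are illustrative.

```python
def rotate90(image, k):
    """Rotate a 2D image (list of rows) clockwise by k * 90 degrees."""
    rotated = [list(row) for row in image]
    for _ in range(k % 4):
        rotated = [list(row) for row in zip(*rotated[::-1])]
    return rotated


def rotation_pretext_dataset(images):
    """Build (rotated_image, rotation_class) pairs from unlabeled images."""
    return [(rotate90(image, k), k) for image in images for k in range(4)]


# Two tiny unlabeled "images".
unlabeled = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
]
pretext = rotation_pretext_dataset(unlabeled)
print(pretext[1])
```

The rotation class serves as a free label: no human annotation is needed, yet solving the task forces the model to learn about object shape and orientation.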
Semi-supervised learning
Is a combination of supervised and unsupervised learning techniques. It takes advantage of the information found in the labels or clusters to improve performance (Zhu, 2005). For a supervised classification task, we use the implicit cluster information found in unlabeled data. For unsupervised classification, we harness the existing label information found in the dataset. Semi-supervised learning (SSL) requires that the distribution of the input contains some information about its output (Van Engelen and Hoos, 2020).
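One common semi-supervised technique is pseudo-labeling (self-training): a model trained on the labeled data labels the unlabeled samples it is confident about, and these are added to the training set. The sketch below uses a nearest-centroid classifier on one-dimensional made-up data; the `margin` confidence rule is an assumption chosen for simplicity.

```python
def centroid(values):
    return sum(values) / len(values)


def self_train(labeled, unlabeled, margin=1.0):
    """One round of pseudo-labeling with a nearest-centroid classifier (1D).

    labeled: list of (value, label) pairs with labels 0 or 1.
    unlabeled: list of values.
    margin: minimum gap between the distances to the two centroids for a
            prediction to count as confident.
    """
    c0 = centroid([value for value, label in labeled if label == 0])
    c1 = centroid([value for value, label in labeled if label == 1])
    augmented = list(labeled)
    for value in unlabeled:
        d0, d1 = abs(value - c0), abs(value - c1)
        if abs(d0 - d1) >= margin:  # keep only confident pseudo-labels
            augmented.append((value, 0 if d0 < d1 else 1))
    return augmented


labeled = [(1.0, 0), (9.0, 1)]
unlabeled = [2.0, 8.5, 5.2]
augmented = self_train(labeled, unlabeled)
print(augmented)
```

The ambiguous sample near the middle is left unlabeled, which limits the risk of reinforcing the model's own mistakes.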
Multi-task learning
Assumes the concurrent learning of two or more related tasks (Ruder, 2017b; Crawshaw, 2020). The goal is to “improve generalization by leveraging domain-specific information found in the training signals of related tasks” (Caruana, 1997). Multi-task learning (MTL) forces the AI models to regularize via (implicit) data augmentation, focusing attention, eavesdropping, and adding representation bias (Ruder, 2017b).
Active learning
The conventional active learning approach attempts to overcome the labelling bottleneck by asking a domain expert to annotate selected instances, which is expected to increase the AI model's accuracy. The motivation is to let the learning algorithm choose the data from which it learns, thus lowering training costs (Cohn et al., 1996). In practice, this approach faces several challenges, including lack of batch querying, noisy and unreliable domain experts, and variable learning costs (Settles, 2011).
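A common way to select instances for annotation is uncertainty sampling: the samples on which the current model is least sure are sent to the expert first. The sketch below assumes a binary classifier that outputs a probability per sample; the sample ids, the probabilities, and the function name are made up for illustration.

```python
def select_for_annotation(probabilities, budget):
    """Pick the samples whose predicted probability is closest to 0.5.

    probabilities: dict sample_id -> predicted probability of the positive class.
    budget: number of samples the domain expert can annotate in this round.
    """
    ranked = sorted(probabilities, key=lambda sample: abs(probabilities[sample] - 0.5))
    return ranked[:budget]


# Made-up model outputs on an unlabeled pool.
pool = {"a": 0.98, "b": 0.52, "c": 0.10, "d": 0.41, "e": 0.75}
queries = select_for_annotation(pool, budget=2)
print(queries)
```

Returning a whole batch of queries per round, rather than one sample at a time, is one pragmatic answer to the batch querying issue mentioned above.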
Weakly supervised learning
A class of methods in which the supervision in the training data can be incomplete, inexact, or inaccurate (Zhou, 2018). Inexact supervision occurs when only coarse-grained information is provided. For example, whether a new molecule can make a special drug depends on its specific configuration. However, domain experts only know whether a molecule is “qualified” to produce one, not necessarily which shape is decisive (Dietterich, 1997).
Synthetic Data simulation and generation
Currently, nearly all of the AI/ML models are developed using actual or representative data. There is not enough unique data available to create performant models (e.g. it takes roughly 50M pieces of data to create a 60-70% performant model). Additionally, this data must be labeled; synthetically generated data has the ability to be labeled as it is generated, reducing human data labeling effort for real-world data and data generated from an external source.
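The claim that synthetic data is labeled as it is generated can be illustrated with a toy generator. This is a deliberately simplified sketch and not how a ray-tracing platform renders: it "renders" a single rectangle into a small grid, and because the generator knows where it placed the object, the segmentation mask and bounding box come for free. All names and sizes are illustrative.

```python
import random


def generate_sample(rng, width=8, height=6):
    """Render one rectangle into an empty image and return image, mask and box."""
    w = rng.randint(2, 4)
    h = rng.randint(2, 3)
    x0 = rng.randint(0, width - w)
    y0 = rng.randint(0, height - h)
    # The generator knows the object position, so the mask is exact.
    mask = [
        [1 if x0 <= x < x0 + w and y0 <= y < y0 + h else 0 for x in range(width)]
        for y in range(height)
    ]
    image = [[255 * value for value in row] for row in mask]  # toy "rendering"
    box = (x0, y0, x0 + w, y0 + h)  # x_max and y_max are exclusive
    return image, mask, box


rng = random.Random(0)  # seeded so that the dataset can be regenerated
dataset = [generate_sample(rng) for _ in range(100)]
image, mask, box = dataset[0]
print(box)
```

Every sample arrives with pixel-exact ground truth, so the human labeling effort per generated image is zero.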
Sensor Synthetic Data Generation in the SKY ENGINE AI platform enables sensor simulations that can augment the limited labeled training data available for AI/ML model development. SKY ENGINE AI can create case-tailored synthetic data in the following modalities: visible light, stereoscopic systems, infrared sensors, radars, or lidars.
Synthetic data simulations and processing by the SKY ENGINE AI platform deliver balanced and perfectly labelled datasets. See the examples below:
Driver Monitoring System (DMS) synthetic data simulation for AI models training in Computer Vision in-cabin safety system (automotive industry case)
Virtual warehouse and synthetic data simulation, including semantic segmentation masks, for the AI models training in Computer Vision and Robotics (inventorying, supply chain, logistics)
Radio box on the telecommunication pole – 5G network performance optimization driven by AI and Computer Vision with Synthetic Data and virtual environments generation
Key features of SKY ENGINE AI Platform data generation modules
- Hyperspectral, physics-based ray tracing, optimised for latest GPU architectures
  Hyperspectral imaging simulations take several images of a scene, each in a different wavelength range. Combined, the images provide a greater depth of information. This technology is applied in areas where ingredients or substances need to be identified and discerned but are not recognizable in a standard color or monochrome picture, for instance, in the food or wood industry, in recycling, mining, or agriculture.
- Render passes dedicated for deep learning
  Along with the standard render pass, which simulates a scene captured by a certain detector (like a visual or IR camera), SKY ENGINE generates render passes specific to certain machine learning tasks, like 2D or 3D semantic masks, 3D keypoints, depth fields, importance heatmaps, etc.
- Determinism and advanced machinery for randomisation strategies of scene parameters for active learning approach
  Determinism and stability of random number generators and of distribution sampling for stochastic processes are very important in the optimisation of machine learning models and are impossible to achieve with conventional rendering solutions. The entire infrastructure of the SKY ENGINE platform and all its modules are designed to deliver stability and reproducibility of machine learning experiments.
- Support for Nvidia Material Definition Language and Adobe Substance textures
  SKY ENGINE has implemented the Nvidia Material Definition Language to achieve photorealistic material quality. It has integrated Adobe Substance libraries to render generative materials created in Substance Designer software.
- Animation and motion capture data support
  SKY ENGINE supports well-known animation formats, including the Sony Pictures Alembic format. It also provides modules and helpers that enable quick definition of animations and trajectories directly from Python code, along with easy connection to parameter randomisation, differentiation, and active learning tools.
- GAN-based materials and images post processing, domain adaptation of rendered images
  SKY ENGINE provides a variety of GAN networks and deep autoencoder based modules dedicated to learning the properties of materials from samples of real data. Moreover, the entire machinery dedicated to domain adaptation and to estimating a detector's characteristics is implemented and automated to make sure that models trained on synthetic data are ready for inference in the real world.
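The role of determinism in scene randomisation can be sketched with Python's standard library alone. This is only an illustration of the principle, not the platform's API: the parameter names (`sun_elevation_deg`, `camera_distance_m`) and their ranges are hypothetical. Each experiment owns a seeded generator, so the same seed always reproduces the same sequence of scene parameters.

```python
import random


def sample_scene_parameters(seed, count=3):
    """Draw `count` hypothetical scene configurations from a seeded generator."""
    rng = random.Random(seed)  # local generator, independent of global state
    return [
        {
            "sun_elevation_deg": rng.uniform(0.0, 90.0),
            "camera_distance_m": rng.uniform(1.0, 10.0),
        }
        for _ in range(count)
    ]


first_run = sample_scene_parameters(seed=42)
second_run = sample_scene_parameters(seed=42)
print(first_run == second_run)
```

Using a dedicated `random.Random` instance instead of the module-level functions keeps the sampling reproducible even when other code draws random numbers in between.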
Multi-person crowd tracking for identity estimation AI model trained on synthetic data in the SKY ENGINE AI platform using built-in neural network architecture
Conclusion
Data, rather than models, is frequently what distinguishes successful machine learning or AI-driven projects from those that never make it to the roll-out phase. The main priority should be collecting useful, high-quality data. Main takeaways:
- Data for AI models training should be well balanced
- Data must cover edge cases
- Data must be representative to reflect real-world case
- Data synthesis can be successfully applied to cases with data scarcity and to corner cases
- Combination of the Model-based and Data-centric AI strategies provides the most reliable results with much less effort
- Validation data should be consistently labeled
- Problem scope does not have to be narrowed to lead to acceptable results with SKY ENGINE AI Model- and Data-centric strategy
- Synthetic data approach enables efficient and effective application of the Data-centric paradigm
Learn more about SKY ENGINE AI Synthetic Data-centric AI
To get more information on synthetic data, tools, methods, and technology, check out the following resources:
- A talk on Ray tracing for Deep Learning to generate synthetic data at NVIDIA GTC 2020 by Jakub Pietrzak, CTO SKY ENGINE AI (Free registration required)
- Example synthetic data videos generated in SKY ENGINE AI platform to accelerate AI models training
- Presentation on using synthetic data for 5G network performance optimization platform
- Working example with code for synthetic data generation and AI models training for team-based sport analytics in SKY ENGINE AI platform
- Explore warehousing and inventorying solutions with AI training in virtual environments and synthetically generated data – more examples