What is Synthetic Data? Insights

Synthetic data plays an increasingly important role in computer vision. Data generated from computer simulations in particular provides a competitive alternative to real-world data and is gaining momentum across a wide range of use cases for building accurate AI models.

Synthetic data is one of the most useful data-driven AI approaches, enabling endless data streams for any organization looking to increase the performance of its AI models or to test its AI products. As data is the new oil in today's AI model development and production, we are running an AI fuel refinery for synthetic data.

What is synthetic data in AI computer vision applications?

Acquiring training data, balancing datasets with enough variety, and exact labeling are three of the most difficult tasks in creating accurate and efficient AI models for computer vision. Accelerated AI models trained in virtual environments with synthetic datasets give viable answers to these issues by eliminating the need for costly and time-consuming data collection and labeling.

Instead of being gathered or measured in the real world, synthetic data is produced in digital reality.

By 2030, Synthetic Data will completely overshadow real data in AI models.

- Gartner

Although it is artificial, synthetic data mathematically or statistically reflects real-world data. According to recent work it can be as useful as, if not better than, data collected from actual objects, events, or people for training an AI model.

Synthetic human created and generated in the SKY ENGINE AI platform for facial recognition and gaze estimation. (Left) Face fragment with a high level of detail, including detailed eyes with pupil, iris and microstructures, eyebrows, eyelashes, hair, skin and wrinkles; (Right) Detailed view of the eye with its surroundings and 3D mesh.

The growth of synthetic data coincides with the call of AI pioneer Andrew Ng for a wide shift to a more data-centric approach to machine learning.

Most benchmarks provide a fixed set of data and invite researchers to iterate on the code …
perhaps it’s time to hold the code fixed and invite researchers to improve the data...

- Andrew Ng

As a result, data scientists and developers are increasingly using synthetic data to train AI models, especially in computer vision. According to a survey of the field, the use of synthetic data is "one of the most promising general approaches on the rise in current deep learning, notably computer vision," which depends on unstructured data such as images and video.

Actually, back in 2017 Deloitte predicted in their "Machine Learning and five vectors of progress" report: "A number of promising techniques for reducing the amount of training data required for machine learning are emerging. One involves the use of synthetic data, generated algorithmically to mimic the characteristics of the real data. This can work surprisingly well."

Synthetic data will become the main form of data used in AI. Source: Gartner, “Maverick Research: Forget About Your Real Data – Synthetic Data Is the Future of AI,” Leinar Ramos, Jitendra Subramanyam, 24 June 2021.

Gartner stated in a June 2021 study on synthetic data that by 2030, most of the data used in AI will be created artificially by rules, statistical models, simulations, or other techniques.

The fact is you won’t be able to build high-quality, high-value AI models without synthetic data.

- Gartner

SKY ENGINE AI has been pioneering the training of AI models in virtual reality using synthetic data for computer vision. These methods have reached a new level with AI-powered, physically-based simulations and deep adaptive learning. Such technology can create an unlimited amount of photorealistic synthetic image and video data that is free of privacy concerns. SKY ENGINE AI is the only full-stack expert platform that combines the simulation and generation of multimodal synthetic data for computer vision with an adaptive engine for training AI models.

Value and importance of synthetic data

Deep learning developers require large, accurately labelled and well-balanced datasets to train AI models efficiently. The main challenge in producing AI models is that acquiring such datasets is very expensive and time-consuming, and annotating the data is very costly as well.

The cost of synthetic data generation, however, can be 100x lower. And this is just the beginning. Synthetic data plays a fundamental role in deploying AI in the real world because, through physics-based simulations, it can represent edge cases well. Edge cases are known for lacking representative data, and simulation provides the only solution to them. In addition, privacy concerns are no longer an issue.

Synthetic Data Use Cases

There are several use cases for synthetic data in banking, finance, language processing, drones, automotive industry, healthcare, retail, manufacturing or robotics. In this article, we are focusing on synthetic data for AI models training in computer vision applications across all these industries.

Main applications of synthetic data include:

Balancing of training datasets
(Accurate) AI models training
Data and AI models validation and benchmarking
Ensuring data privacy
Reducing the cost of real training data acquisition
Developing deepfake detection systems
Building metaverse(s)
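
As an illustration of the first application in the list, dataset balancing, the short Python sketch below counts how many synthetic samples each class would need so that all classes reach the same size. The function name, the class names and the counts are hypothetical and chosen only for this example; they do not describe any SKY ENGINE AI API or dataset.

```python
from collections import Counter


def synthetic_samples_needed(labels, target_per_class=None):
    """Return, per class, how many synthetic samples are needed to reach the target count."""
    counts = Counter(labels)
    if not counts:
        return {}
    # Default target: bring every class up to the size of the largest one.
    target = target_per_class if target_per_class is not None else max(counts.values())
    return {cls: max(0, target - n) for cls, n in counts.items()}


# Hypothetical, heavily imbalanced set of real-world labels.
real_labels = ["car"] * 900 + ["pedestrian"] * 80 + ["cyclist"] * 20
print(synthetic_samples_needed(real_labels))
# {'car': 0, 'pedestrian': 820, 'cyclist': 880}
```

The resulting per-class counts could then be used as a generation budget for whichever simulation or generative tool produces the synthetic images.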

At SKY ENGINE AI we have been working with healthcare providers and vendors in the field of diagnostic medical imaging to deliver synthetic data and accurate AI models. Polyp detection performed by a deep learning model trained on synthetic data achieves 95% precision and recall, a very promising approach to accurately automating the screening task in the diagnostic process. To train the AI model we created thousands of images in a few days, and the data was already labelled.
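
For readers less familiar with the two metrics quoted above, the sketch below shows how precision and recall are computed from detection counts. The counts are made up purely to illustrate the arithmetic and are not taken from the polyp detection study.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative counts only: 95 correct detections, 5 false alarms, 5 missed objects.
p, r = precision_recall(tp=95, fp=5, fn=5)
print(f"precision={p:.2f}, recall={r:.2f}")
# precision=0.95, recall=0.95
```

Precision answers "how many of the reported detections were real", while recall answers "how many of the real objects were found", which is why screening applications usually report both.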

We are also working with telecommunication companies (network operators and tower providers) to continuously optimize 5G RANs (Radio Access Networks), using synthetic data to train AI models for equipment detection and 3D reconstruction of telco sites.

In agriculture, our company helped one of the largest UK research institutes by generating thousands of images and videos for training AI drones to operate autonomously, using visual data alone, in vineyard aisles and in GPS-denied environments. In this case, we used visible light simulations, recreating the vineyard environment synthetically and generating a dataset of a thousand images. Such a dataset makes it possible to put the drone in a virtual world for training.

How to create Synthetic Data?

Synthetic datasets can be generated using numerous methods, but the most accurate are physically-based simulations, which are especially useful when the images require a high level of detail. GANs (Generative Adversarial Networks) are also on the rise. A limitation of current GAN-generated datasets, however, is that the models cannot yet produce images in modalities other than visible light, e.g. full X-ray images. Moreover, the quality of the generated images is still far from realistic.

At SKY ENGINE AI we use both methods, but the data is mostly generated using physically-based simulations and rendering of multiple modalities, including visible light, near infrared (NIR), thermal vision, X-rays, lidars, radars and sonars. Render passes are dedicated to deep learning, with support for animation and motion capture systems. Determinism and advanced machinery for randomizing scene parameters are built into the SKY ENGINE AI platform and support an active learning approach. There are also GAN-based materials, and GAN algorithms are used for image postprocessing.

Simulations offer a variety of features for segmenting and classifying static images and videos, resulting in flawless labeling. They may also rapidly produce diverse data (including ground-truths) of objects and settings with varying environmental conditions, colors, lighting, materials, and postures. It is also possible to process that data with domain adaptation to accurately represent characteristics of real-world objects in synthetic data which is crucial for accurate AI inference.
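
The idea of varying scene parameters while keeping generation deterministic can be sketched in a few lines of Python. This is a minimal illustration and not the SKY ENGINE AI API: the `SceneParams` fields, the value ranges and the material names are assumptions made up for this example. A seeded random generator guarantees that the same seed always reproduces the same set of scenes.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneParams:
    """Hypothetical scene description; fields and ranges are illustrative only."""
    light_intensity: float    # relative units, 0.2 (dim) to 1.0 (bright)
    light_angle_deg: float    # direction of the main light source
    material: str             # surface material of the object of interest
    camera_distance_m: float  # camera-to-object distance in metres


MATERIALS = ("metal", "plastic", "wood")


def sample_scenes(n, seed):
    """Sample n randomized scene configurations, reproducibly for a given seed."""
    rng = random.Random(seed)  # local generator: same seed, same scenes
    return [
        SceneParams(
            light_intensity=rng.uniform(0.2, 1.0),
            light_angle_deg=rng.uniform(0.0, 360.0),
            material=rng.choice(MATERIALS),
            camera_distance_m=rng.uniform(0.5, 5.0),
        )
        for _ in range(n)
    ]


for scene in sample_scenes(3, seed=42):
    print(scene)
```

In a real pipeline each sampled configuration would be passed to the renderer, and because the renderer places every object itself, the ground-truth labels come for free with each image.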

Featured image with the examples of synthetic data simulated and generated in SKY ENGINE AI platform for computer vision AI models training in numerous industries.

Domain adaptation: what is it for?

Domain adaptation is essential for accurate inference by deep learning models in unfamiliar contexts and in new, unseen environments. It enables neural networks to better represent the feature space of the real object in synthetic data. It can be applied in a variety of visual recognition and prediction settings across multiple adaptation tasks, including digit classification and semantic segmentation of real-world scenes, through a transfer between the synthetic and real-world domains.
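
One simple, well-known way to illustrate the idea is to align the statistics of synthetic features with those of real features, in the spirit of correlation alignment (CORAL). The NumPy sketch below is only an illustration under stated assumptions: it is not the method used by the SKY ENGINE AI platform, the function names are hypothetical, the added mean matching is a choice made for this example, and the random arrays stand in for feature vectors extracted from synthetic and real images.

```python
import numpy as np


def _sym_matrix_power(mat, power):
    """Raise a symmetric positive-definite matrix to a real power via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * vals**power) @ vecs.T


def coral_align(source, target, eps=1e-6):
    """Transform source features so their mean and covariance match the target's.

    source, target: arrays of shape (n_samples, n_features), n_features >= 2.
    eps: small ridge term that keeps both covariance matrices invertible.
    """
    d = source.shape[1]
    cov_s = np.cov(source, rowvar=False) + eps * np.eye(d)
    cov_t = np.cov(target, rowvar=False) + eps * np.eye(d)
    centered = source - source.mean(axis=0)
    whitened = centered @ _sym_matrix_power(cov_s, -0.5)  # remove source correlations
    recolored = whitened @ _sym_matrix_power(cov_t, 0.5)  # apply target correlations
    return recolored + target.mean(axis=0)


# Stand-in data: "synthetic" features and differently distributed "real" features.
rng = np.random.default_rng(0)
synthetic_feats = rng.normal(size=(500, 3))
mixing = np.array([[2.0, 0.5, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 0.5]])
real_feats = rng.normal(size=(400, 3)) @ mixing + 5.0

aligned = coral_align(synthetic_feats, real_feats)
print(np.round(np.cov(aligned, rowvar=False) - np.cov(real_feats, rowvar=False), 3))
```

Note that the real features need no labels here, only their statistics, so a model trained on the aligned synthetic features sees inputs that look statistically closer to the real domain.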

The SKY ENGINE AI platform provides tools that can learn the distribution of key parameters of the scene of interest in an unsupervised way from a very small sample of unlabelled data. Moreover, our engine can model the characteristics of the target sensor in order to adapt the domain of the synthesised data efficiently.

Enabling a market for Synthetic Data

Although the synthetic data sector is relatively new, there are already several providers in this fast-growing market, with SKY ENGINE AI being one of the first pioneers. These companies can be split into two groups: 1) providers of synthetic data for structured data (tabular data); and 2) providers of synthetic data for unstructured data (image & video).

There are also a dozen open-source tools or datasets, including the Synthetic Data Vault, a set of libraries, projects and tutorials developed at the Massachusetts Institute of Technology (MIT).

At SKY ENGINE AI we aim to work with enterprise partners and blue-chip companies like NVIDIA or Microsoft to enable efficient use of synthetic data at scale. SKY ENGINE AI is already an official supplier of synthetic data and AI models to NVIDIA Metropolis and TAO Toolkit, as well as a research partner and supplier of Microsoft.

SKY ENGINE AI's goal with its platform is to enable an ever-expanding galaxy of data scientists and developers interested in creating or working on AI models in virtual environments across all industries. Today, many businesses can incorporate our synthetic data generation platform into their routines as a new standard.

Video example of synthetic data and AI inference by a drone, simulated and generated in the SKY ENGINE AI platform for training computer vision AI models in predictive maintenance.

With these tools, developers can create a digital twin of any sensor, drone or robot and put it through testing and training in a virtual (digital) reality, with synthetic data simulation and generation, domain adaptation and randomisation, prior to real-world deployment.