The terms "supervised learning" and "unsupervised learning" come up often in conversations about data science, machine learning, and related topics. Being able to tell the two apart is fundamental knowledge that you will need throughout a data science career.
What is Supervised Learning?
Supervised learning is a machine learning approach in which a model is fed labelled training data. The labels help the model learn to classify data such as images. Predicting housing prices, detecting whether an image contains explicit content, and assessing whether a screw is present in a photo are all examples of problems that can be handled with supervised learning.
Consider the following scenario: you are showing a child a car for the first time. You want them to understand that what they are seeing is a car. "Hey, this is a car," you say.
After an hour, you ask the child to name the object you taught them earlier. The child is unlikely to remember it. So, to teach the child, you repeat "this is a car" — perhaps a million times.
Your aim is that the child will learn what a car is through repetition. And if the child sees another car in the future (different from the one you showed them earlier), you hope they will still recognise it as a car.
A child learning what a car is.
The same phenomenon occurs with supervised machine learning methods. Because we have labelled datasets (i.e. we know the meaningful information about each data point), we train the algorithm by presenting the data to it many times.
In jargon, the data is denoted by X and the label by Y.
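As a tiny illustration of this notation, here is a hypothetical labelled dataset in Python (the feature values and labels are made up):

```python
# A labelled dataset: each row of X is a data point, y holds its label.
# Hypothetical features: [weight_kg, wingspan_or_height_cm] of animals.
X = [
    [190.0, 120.0],  # lion
    [0.02, 10.0],    # bird
    [160.0, 110.0],  # lion
    [0.03, 12.0],    # bird
]
y = ["lion", "bird", "lion", "bird"]  # one label per row of X

assert len(X) == len(y)  # every data point must have a label
```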
Classification and regression are the two most prevalent supervised learning techniques.
- Classification: Such algorithms answer questions like "is it a lion or a bird?" or, more generally, "to which class does this data point belong?" In multi-label classification use cases, the answer can be both a lion and a bird.
Classification algorithms attempt to learn an appropriate decision boundary that divides the classes.
If we plot the characteristics of both groups, we will most likely end up with something like this chart. Of course, there are infinitely many decision boundaries that could be learnt, but based on the chart, the dotted one appears to be the most promising.
Logistic Regression, Support Vector Machines, Decision Tree Classification, and K-Nearest Neighbours are some common classification techniques.
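To make this concrete, here is a minimal classification sketch using scikit-learn's LogisticRegression (assuming scikit-learn is installed; the feature values and class labels are invented for illustration):

```python
# Minimal classification sketch: learn a decision boundary from labelled data.
from sklearn.linear_model import LogisticRegression

# Two cleanly separated classes in a 2D feature space (toy values).
X = [[0.0, 0.2], [0.3, 0.1], [0.2, 0.4],   # class 0
     [2.8, 3.0], [3.1, 2.7], [2.9, 3.3]]   # class 1
y = [0, 0, 0, 1, 1, 1]

clf = LogisticRegression()
clf.fit(X, y)  # fit the decision boundary to the labelled examples
print(clf.predict([[0.1, 0.3], [3.0, 2.9]]))  # → [0 1]
```

New, unseen points are assigned to whichever side of the learnt boundary they fall on — the analogue of the child recognising a car they have never seen before.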
- Regression: Assume we have a dataset whose two input features are the size of a home and its location, and we want to forecast the property price. Here we have a continuous value to predict, not a discrete class as in classification tasks. Classification cannot be used, since we would have a huge number of possible labels and very few data points per label. Instead, we employ regression to model the relationship between the dependent variable (the outcome to be predicted) and the independent variables (the input features).
Linear Regression, Polynomial Regression, Decision Tree Regression, and Lasso Regression are some examples of typical regression techniques.
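A minimal regression sketch along the same lines, assuming scikit-learn is available (the house sizes and prices are made-up toy values):

```python
# Minimal regression sketch: fit a line to (house size → price) pairs.
from sklearn.linear_model import LinearRegression

sizes = [[50], [80], [100], [120], [150]]               # m^2
prices = [150_000, 240_000, 300_000, 360_000, 450_000]  # exactly 3000 per m^2

model = LinearRegression()
model.fit(sizes, prices)                 # learn the size → price relationship
predicted = model.predict([[90]])[0]     # continuous output, not a class
print(round(predicted))                  # → 270000
```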
Challenges in Supervised Learning
In general, the more data we have, the better. Machine learning algorithms are now trained on millions of data points. However, gathering and labelling such a large volume of data is frequently time-consuming. Synthetic data solutions, like the SKY ENGINE AI Synthetic Data Cloud for Vision AI, help make supervised learning more accessible.
Some forms of data are easier to label than others, depending on the task at hand. To complete a task like semantic segmentation, you cannot just give a single word as a label; instead, each pixel of the image must be appropriately labelled. Because this sort of labelling is a time-consuming job, some datasets are modest in size, which is undesirable because we want as much data as feasible.
Another disadvantage is that the labels people attach to the data may contain mistakes. COCO is a frequently cited dataset with several well-known labelling errors.
Occasionally, the data we have is less than optimal for the given use case. Are we certain that all of the images in the ImageNet dataset, which contains millions of images, are "good" photographs?
When developing models, we frequently have to trust whoever provided the dataset, while bearing in mind that certain images or labels may contain inaccuracies. It is not ideal, but as long as you keep this source of noise in mind, it will do.
Supervised Learning Cases
Because we now have plenty of processing power and plenty of data, we can employ supervised learning models to build and advance various commercial applications, such as the following.
- Medical diagnostics/Healthcare: Medicine is one of the most significant applications of supervised learning. These models can provide services that help clinicians identify or forecast disease in patients at an early stage.
- Autonomous vehicles: We can now better develop self-driving cars owing to significant advances in computer vision, particularly in image classification, object identification, semantic segmentation, and depth estimation.
- Spam detection: Spam delivered over email and social networks is frequently detected using supervised learning algorithms. In the instance of email, the outcome of a trained spam classifier (is this email spam or not spam?) determines whether or not an email appears in your spam mailbox.
- Sentiment recognition: Is a given message positive or negative? That is the question sentiment analysis seeks to answer. To identify sentiment in a corpus of text, the discipline typically employs supervised learning approaches trained on labelled examples.
What is Unsupervised Learning?
Unsupervised learning is the use of statistical models on data that has no labels attached. It is frequently used to cluster the data points in a dataset. With unsupervised learning, patterns can be discovered in a dataset without the need to label every item in the sample.
Clustering is a data mining technique that organises unlabeled data into groups based on similarities or differences. K-means is a well-known clustering technique. K-means clustering is a popular example of an exclusive clustering approach in which data points are assigned to one of K groups depending on their distance from the centroid of each group. The data points closest to a specific centroid will be grouped together.
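A minimal K-means sketch, assuming scikit-learn and NumPy are available (the points are toy values forming two obvious blobs):

```python
# K-means sketch: group unlabelled 2D points into K = 2 clusters.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],    # blob near the origin
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])   # blob near (5, 5)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Points in the same blob receive the same cluster id
# (which id each blob gets is arbitrary).
print(km.labels_)
```

Note that no labels were supplied: the grouping comes purely from each point's distance to the two learnt centroids.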
Dimensionality reduction: while more data typically gives more accurate findings, it can also have an influence on the performance of machine learning algorithms (e.g., overfitting) and make dataset visualisation challenging. When the number of characteristics, or dimensions, in a given dataset is too large, dimensionality reduction is performed. It minimises the amount of data inputs to a reasonable quantity while keeping the dataset's integrity as much as feasible.
Principal component analysis (PCA) is a typical dimensionality reduction method: PCA uses a linear transformation to construct a new data representation, generating a collection of "principal components." The first principal component is the direction that maximises the dataset's variance. The second principal component captures the most remaining variance while being uncorrelated with the first, which makes its direction perpendicular, or orthogonal, to the first.
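The PCA idea can be sketched from first principles with NumPy: the principal components are the eigenvectors of the data's covariance matrix, ordered by eigenvalue (the data here is synthetic, spread mostly along the y = x direction):

```python
# PCA from first principles: eigenvectors of the covariance matrix.
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(size=200)
# Synthetic 2D data, spread mostly along the y = x direction.
X = np.column_stack([t, t + 0.1 * rng.normal(size=200)])

Xc = X - X.mean(axis=0)                 # centre the data
cov = np.cov(Xc, rowvar=False)          # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
pc1 = eigvecs[:, -1]                    # direction of maximum variance
pc2 = eigvecs[:, 0]                     # orthogonal to pc1

print(np.abs(pc1))  # both components close to 0.71, i.e. the y = x direction
```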
An association rule is a rule-based mechanism for determining associations between variables in a given dataset. These techniques are commonly employed in market basket analysis, helping businesses to better understand the linkages between various items. Understanding client consumption habits allows organisations to create stronger cross-selling tactics and recommendation engines.
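A hand-rolled illustration of the two basic association-rule quantities, support and confidence, on a toy set of market baskets (the items and baskets are invented):

```python
# Support and confidence for one association rule, e.g. {bread} → {butter}.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "jam"},
]

def support(itemset):
    """Fraction of baskets containing every item in `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) over the baskets."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "butter"}))       # → 0.5 (2 of 4 baskets)
print(confidence({"bread"}, {"butter"}))  # ≈ 0.67 (2 of 3 bread baskets)
```

Algorithms such as Apriori search for all rules whose support and confidence exceed chosen thresholds, rather than checking one rule by hand as above.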
Challenges in Unsupervised Learning
While unsupervised learning has numerous advantages, it may also pose some issues since it allows machine learning models to run without human interaction. Among these difficulties are:
- Because there is no ground-truth output to compare against, unsupervised learning is inherently more challenging than supervised learning.
- Because input data is not labelled and algorithms do not know the precise output in advance, the outcome of the unsupervised learning method may be less accurate.
- It can be costly since it may necessitate human intervention to comprehend the patterns and link them with domain knowledge, when we would want to have as little human interaction as possible.
Unsupervised Learning Cases
Machine learning techniques are now widely employed to improve the product user experience and to test systems for quality assurance. Compared to manual observation, unsupervised learning offers an exploratory approach to evaluating data, allowing firms to uncover patterns in enormous amounts of data more quickly.
The following are some of the most popular real-world uses of unsupervised learning:
- Anomaly detection: Unsupervised learning algorithms can find anomalies in a dataset by detecting unusual data points. These anomalies can raise awareness of malfunctioning equipment, human error, security breaches, and fraud. By flagging anomalies early, organisations can prevent incidents from happening, or at least limit the resulting losses as much as possible. An example of this would be detecting anomalies in a video stream from a security camera.
- Data synthesis: If more data is required, generative models (such as Variational Autoencoders or Generative Adversarial Networks) can be used to produce new synthetic realistic data. The primary benefit of creating synthetic images is that labels are known ahead of time.
- Image compression: Unsupervised learning methods are used in dimensionality reduction to lower the number of dimensions in a dataset while retaining the majority of the information in the data.
- Data labelling: Assume we have an image dataset and want to label it so that we can do image classification on it later. We can apply clustering and hope that the algorithm identifies distinct groups, so that every image in the same cluster receives the same label. Of course, the clusters will never be perfectly separated: certain images may lie between two or more groups. In such cases, one can inspect the image and assign a label manually.
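The anomaly-detection case above can be sketched with a simple statistical rule: flag readings that fall more than two standard deviations from the mean. Real systems use richer models, but the underlying idea is the same (the sensor readings below are made up):

```python
# Minimal anomaly-detection sketch using a z-score rule: a reading is
# anomalous if it lies more than 2 standard deviations from the mean.
import statistics

readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2]  # 25.0 is the outlier

mean = statistics.mean(readings)
std = statistics.stdev(readings)
anomalies = [x for x in readings if abs(x - mean) / std > 2]
print(anomalies)  # → [25.0]
```

No labels were needed: the notion of "anomalous" comes entirely from the statistics of the data itself.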
Supervised vs. Unsupervised Learning
When comparing supervised vs unsupervised learning, one rule of thumb to remember is that you use supervised learning algorithms when your labels are known and unsupervised algorithms when they are not.
Supervised learning techniques offer the benefit of labels: in theory, they let you teach a model whatever you want and define metrics that, thanks to the labels, can quantify the model's success. Their drawback, however, is the cost of labelling.
Unsupervised techniques are powerful precisely because they work without labels: they can automatically group (and thus label) your data, saving an enormous amount of time, and generative approaches can even produce fresh, realistic synthetic data, which is critical when you do not have enough.
Learn more about SKY ENGINE AI offering
To get more information on synthetic data, tools, methods, technology check out the following resources:
- Telecommunication equipment AI-driven detection using drones
- A talk on Ray tracing for Deep Learning to generate synthetic data at GTC 2020 by Jakub Pietrzak, CTO SKY ENGINE AI (Free registration required)
- Example synthetic data videos generated in SKY ENGINE AI platform to accelerate AI models training
- Presentation on using synthetic data for building 5G network performance optimization platform
- Working example with code for synthetic data generation and AI models training for team-based sport analytics in SKY ENGINE AI platform