Despite the success of supervised machine learning and deep learning, there’s a school of thought that says that unsupervised learning has even greater potential. The learning of a supervised learning system is limited by its training; i.e., a supervised learning system can learn only those tasks that it’s trained for. By contrast, an unsupervised system could theoretically achieve “artificial general intelligence,” meaning the ability to learn any task a human can learn. However, the technology isn’t there yet.

If the biggest problem with supervised learning is the expense of labeling the training data, the biggest problem with unsupervised learning (where the data is not labeled) is that it often doesn’t work very well. Nevertheless, unsupervised learning does have its uses: It can sometimes be good for reducing the dimensionality of a data set, exploring the pattern and structure of the data, finding groups of similar objects, and detecting outliers and other noise in the data.

In general, it’s worth trying unsupervised learning methods as part of your exploratory data analysis to discover patterns and clusters, to reduce the dimensionality of your data, to discover latent features, and to remove outliers. Whether you then need to move on to supervised learning or to using pre-trained models to do predictions depends on your goals and your data.

## What is unsupervised learning?

Think about how human children learn. As a parent or teacher you don’t need to show young children every breed of dog and cat there is to teach them to recognize dogs and cats. They can learn from a few examples, without a lot of explanation, and generalize on their own. Oh, they might mistakenly call a Chihuahua “Kitty” the first time they see one, but you can correct that relatively quickly.

Children intuitively lump groups of things they see into classes. One goal of unsupervised learning is essentially to allow computers to develop the same ability. As Alex Graves and Kelly Clancy of DeepMind put it in their blog post, “Unsupervised learning: the curious pupil,”

Advertisement

Unsupervised learning is a paradigm designed to create autonomous intelligence by rewarding agents (that is, computer programs) for learning about the data they observe without a particular task in mind. In other words, the agent learns for the sake of learning.

The potential of an agent that learns for the sake of learning is far greater than a system that reduces complex pictures to a binary decision (e.g. dog or cat). Uncovering patterns rather than carrying out a pre-defined task can yield surprising and useful results, as demonstrated when researchers at Lawrence Berkeley Lab ran a text processing algorithm (Word2vec) on several million material science abstracts to predict discoveries of new thermoelectric materials.

## Clustering methods

A clustering problem is an unsupervised learning problem that asks the model to find groups of similar data points. There are a number of clustering algorithms currently in use, which tend to have slightly different characteristics. In general, clustering algorithms look at the metrics or distance functions between the feature vectors of the data points, and then group the ones that are “near” each other. Clustering algorithms work best if the classes do not overlap.

### Hierarchical clustering

Hierarchical cluster analysis (HCA) can be agglomerative (you build the clusters bottom-up starting with individual points and ending with a single cluster) or divisive (you start with a single cluster and break it up until you wind up with individual points). If you’re lucky you can find an intermediate stage of the clustering process that reflects a meaningful classification.

The clustering process is usually displayed as a dendrogram (tree diagram). HCA algorithms tend to take a lot of compute time [*O*(n^{3})] and memory [*O*(n^{2})] resources; these limit the applicability of the algorithms to relatively small data sets.

HCA algorithms can use various metrics and linkage criteria. Euclidian distance and squared Euclidian distance are both common for numeric data; Hamming distance and Levenshtein distance are common for non-numeric data. Single-linkage and complete linkage are common; both of these can simplify the clustering algorithms (SLINK and CLINK respectively). SLINK is one of the few clustering algorithms guaranteed to find an optimum solution.

### K-means clustering

The k-means clustering problem attempts to divide *n* observations into *k* clusters using the Euclidean distance metric, with the objective of minimizing the variance (sum of squares) within each cluster. It is a method of vector quantization, and is useful for feature learning.

Lloyd’s algorithm (iterative cluster agglomeration with centroid updates) is the most common heuristic used to solve the problem, and is relatively efficient, but doesn’t guarantee global convergence. To improve that, people often run the algorithm multiple times using random initial cluster centroids generated by the Forgy or Random Partition methods.

K-means assumes spherical clusters that are separable so that the mean converges towards the cluster center, and also assumes that the ordering of the data points does not matter. The clusters are expected to be of similar size, so that the assignment to the nearest cluster center is the correct assignment.

The heuristics for solving k-means clusters are usually similar to the expectation-maximization (EM) algorithm for Gaussian mixture models.

### Mixture models

Mixture models assume that the sub-populations of the observations correspond to some probability distribution, commonly Gaussian distributions for numeric observations or categorical distributions for non-numeric data. Each sub-population may have its own distribution parameters, for example mean and variance for Gaussian distributions.

Expectation maximization (EM) is one of the most popular techniques used to determine the parameters of a mixture with a given number of components. In addition to EM, mixture models can be solved with Markov chain Monte Carlo, moment matching, spectral methods with singular value decomposition (SVD), and graphical methods.

The original mixture model application was to separate two populations of shore crabs by forehead to body length ratios. Karl Pearson solved this problem in 1894 using moment matching.

A common extension of mixture models is to connect the latent variables defining the mixture component identities into a Markov chain instead of assuming that they are independent identically distributed random variables. The resulting model is called a hidden Markov model and is one of the most common sequential hierarchical models.

### DBSCAN algorithm

Density-based spatial clustering of applications with noise (DBSCAN) is a non-parametric data-clustering algorithm that dates from 1996. It is optimized for use with databases that can accelerate geometric region queries using an R* tree or some other geometric index structure.

Essentially, DBSCAN clusters *core points* that have more than some minimum number of neighbors within some distance Epsilon, discards as outliers points that have no neighbors within Epsilon, and adds points that are within Epsilon of a core point to that cluster. DBSCAN is one of the most common clustering algorithms, and can find arbitrarily shaped clusters.

### OPTICS algorithm

Ordering points to identify the clustering structure (OPTICS) is an algorithm for finding density-based clusters in spatial data. OPTICS is similar to DBSCAN, but handles the case of varying point density.

Variations of the ideas in DBSCAN and OPTICS can also be used for simple outlier and noise detection and removal.

## Latent variable models

A latent variable model is a statistical model that relates a set of observable variables to a set of latent (hidden) variables. Latent variable models are useful for revealing hidden structures in complex and high-dimensional data.

### Principal component analysis

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated numeric variables into a set of values of linearly uncorrelated variables called principal components. Karl Pearson invented PCA in 1901. PCA can be accomplished by eigenvalue decomposition of a data covariance (or correlation) matrix, or singular value decomposition (SVD) of a data matrix, usually after a normalization step of the initial data.

### Singular value decomposition

The singular value decomposition (SVD) is a factorization of a real or complex matrix. It’s a common technique in linear algebra, and is often computed using Householder transformations. SVD is one way to solve for principal components. While it’s perfectly possible to code SVD from scratch, there are good implementations in all of the linear algebra libraries.

### Method of moments

The method of moments uses the moments of the observed data sample (mean, variance, skewness, and kurtosis) to estimate population parameters. The method is quite simple, can often be calculated by hand, and usually achieves global convergence. In the case of low statistics, however, the method of moments can sometimes produce estimates that are outside the parameter space. The method of moments is an easy way to solve mixture models (above).

### Expectation-maximization algorithms

An expectation–maximization (EM) algorithm is an iterative method to find maximum likelihood estimates of parameters in models that depend on unobserved latent variables. The EM iteration alternates between performing an expectation step (E), which creates a function for the expectation of the log-likelihood evaluated using the current estimate for the parameters, and a maximization step (M), which computes parameters maximizing the expected log-likelihood found on the E step.