- Regression is fitting a line or curve through data points so you can predict a continuous value (see the short sketch after Figure 12).
- Classification is determining to what group something belongs. Binary classification (two groups) is determining whether something belongs to a class or not, such as whether the animal in the picture is a dog or not. Sticking with the animal example, multiclass classification (more than two groups) is determining whether the animal is a dog, cat, bird, etc.
- Clustering is similar to classification, but you don’t know the classifications ahead of time. Again using the examples of animal pictures, you may determine that there are three types of animals, but you don’t know what those animals are, so you just divide them into groups. Generally speaking, clustering is used when there is insufficient supervised data or when you want to find natural groupings in the data without being constrained to specific groups, such as dogs, cats, or birds.
- Time series assumes that the sequence of data is important (that the data points taken over time have an internal structure that should be accounted for). For example, sales data could be considered time-series because you may want to trend revenue over time to detect seasonality and to correlate it with promotion events. On the other hand, the order of your animal pictures doesn’t matter for classification purposes.
- Optimization is a method of finding the best combination of values for multiple variables when the variables compete with one another; improving one tends to worsen another, such as maximizing production throughput while minimizing energy cost.
- NLP (natural language processing) is the general category of algorithms that try to mimic human use and understanding of language. Examples include chatbots, scrubbing unstructured writing like doctors’ notes for key data fields, and autonomous writing of news articles.
- Anomaly detection is used to find outliers in the data. It is similar to control charts but uses lots more variables as inputs. Anomaly detection is especially useful when “normal” operating parameters are difficult to define and change over time, and you want your detection of abnormalities to adjust automatically.
Figure 12: Modeling: types of algorithms.
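To make the regression idea concrete, here’s a minimal sketch in Python (using NumPy and made-up data, so the numbers are purely illustrative) that fits a straight line through a handful of points and uses it to predict a new value.

```python
import numpy as np

# Made-up example data: advertising spend (x) vs. sales (y)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 7.9, 10.1])

# Fit a straight line (degree-1 polynomial) through the points
slope, intercept = np.polyfit(x, y, deg=1)

# Use the fitted line to predict a value we haven't seen
x_new = 6.0
y_pred = slope * x_new + intercept
print(f"y = {slope:.2f}x + {intercept:.2f}, prediction at x=6: {y_pred:.2f}")
```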
Deep learning models
Deep learning is based on the concept of artificial neural networks (ANNs), which are loosely modeled on the human brain: synapses become stronger or weaker based on feedback of some sort, and neurons fire when specified conditions are met. Hard problems are being solved with deep learning models, including self-driving cars, image recognition, video analysis, and language processing. Figure 13 shows their key characteristics.
Lest you think that deep learning models are the only things that should be used, there are some caveats:
- First, they require large amounts of data—generally much more than machine learning models. Without large amounts of data, deep learning usually does not perform as well.
- Second, because deep learning models require large amounts of data, the training process takes a long time and requires a lot of computational processing power. This is being addressed by ever faster and more powerful CPUs and memory as well as newer GPUs and FPGAs (field-programmable gate arrays).
- Third, deep learning models are usually less interpretable than machine learning models. Interpretability is a major area of deep learning research, so perhaps this will improve.
Figure 13: Modeling: deep learning.
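To connect the neuron analogy to code, here’s a minimal sketch (NumPy, made-up data) of a single artificial neuron learning the logical OR function. The weights play the role of synapse strengths, the prediction error is the feedback, and each update strengthens or weakens the weights accordingly; real deep learning models stack many such neurons in layers.

```python
import numpy as np

# Training data for logical OR: two inputs -> one output
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1], dtype=float)

rng = np.random.default_rng(0)
weights = rng.normal(size=2)   # the "synapse strengths"
bias = 0.0
learning_rate = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(1000):
    output = sigmoid(X @ weights + bias)                 # forward pass: does the neuron "fire"?
    error = output - y                                   # feedback signal
    weights -= learning_rate * (X.T @ error) / len(y)    # strengthen/weaken the weights
    bias -= learning_rate * np.mean(error)

print(np.round(sigmoid(X @ weights + bias), 2))  # trends toward [0, 1, 1, 1]
```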
How to measure machine learning model performance
Models, just like people, have their performance assessed. Here are a few ways to measure the performance of a relatively simple regression model. The mean absolute error (MAE), root mean squared error (RMSE), and R² (R-squared) performance metrics are fairly straightforward, as Figure 14 shows.
All these can be considered a type of cost function, which helps the model know if it’s getting closer or farther away from the “right” answer, and if it’s gotten “close enough” to that answer. The cost function tells the model how far it has to go before it can take new data it hasn’t seen before and output the right prediction with a high enough probability. When training the model, the goal is to minimize the cost function.
Figure 14: Modeling: performance assessment: example of error calculation for regression.
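As an illustration, here’s a minimal sketch (NumPy, with made-up actual and predicted values) that computes MAE, RMSE, and R-squared directly from their definitions; the numbers are illustrative, not the ones in Figure 14.

```python
import numpy as np

# Made-up example: actual values vs. a model's predictions
y_true = np.array([3.0, 5.0, 7.5, 9.0, 11.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.6, 10.5])

errors = y_pred - y_true

# Mean absolute error: average size of the errors, ignoring sign
mae = np.mean(np.abs(errors))

# Root mean squared error: like MAE, but large errors are penalized more
rmse = np.sqrt(np.mean(errors ** 2))

# R-squared: fraction of the variance in y_true explained by the model
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.3f}, RMSE={rmse:.3f}, R2={r2:.3f}")
```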
Precision versus recall in classification models
Once the cost function has done its job of helping the model head in the direction of the “right answer” based on training data (data it is being shown), you need to evaluate how well the model performs on data it hasn’t yet seen. Let me explain this in the context of classification models (models that determine whether something is in one group or another, such as if the picture is a dog, cat, rat, etc.).
To assess the performance of classification models (see Figure 15), you use the equation for accuracy (as detailed below). However, it’s generally accepted that when the training data exhibits class imbalance, the accuracy metric might be misleading, so you use metrics called precision and recall instead. Here’s what these terms mean:
- Class imbalance: The data is skewed in one direction versus other directions. Consider the example of predicting whether a credit card transaction is fraudulent. The vast majority of transactions are not fraudulent, and the data set will be skewed in that direction. So, if you predicted that a given transaction is not fraud, you’d probably be right, even if you know nothing about the transaction itself. Applying the accuracy metric in this example would mislead you into thinking the model is doing a great job, even though it may be catching few or none of the fraudulent transactions.
- Precision is a measure of relevance. Pretend you use your search engine to find the origin of the tennis score “love.” Precision measures how many of the items returned are really about this versus links to how much people love tennis, how people fell in love playing tennis, etc.
- Recall is a measure of completeness. Using the same example of the tennis score “love,” recall measures how well the search engine captured all the references that are available to it. Missing zero references is amazing, missing one or two isn’t too bad, missing thousands would be terrible.
Unfortunately, in the real world, precision and recall are traded off; that is, when one metric improves, the other metric deteriorates. So, you’ve got to determine which metric is more important to you.
Figure 15: Modeling: performance assessment: confusion matrix for classification.
Consider the example of a dating app that matches you with compatible people. If you’re great-looking, rich, and have a sparkling personality, you might lean toward higher precision because you know there will be a lot of potential matches, but you only want the ones that are a real fit, and the cost for you to screen potential matches is high (hey, you’re busy building an empire—you’ve got millions of things to do). On the other hand, if you’ve been looking for someone for a long time and your mother’s been on your back, you might lean toward recall to get as many potential matches as possible. The cost of sorting through potential suitors is relatively low compared to the constant nagging from your mother! To assess how well the model balances precision and recall, the F1 score is used.
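To tie accuracy, precision, recall, and the F1 score together, here’s a minimal sketch (plain Python, made-up labels) that computes them from the true/false positive and negative counts, the same counts a confusion matrix summarizes. Note how accuracy looks respectable even though this imaginary fraud detector catches only half of the fraud.

```python
# Made-up example: 1 = fraud, 0 = not fraud (note the class imbalance)
actual    = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
predicted = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)

accuracy  = (tp + tn) / len(actual)          # looks good even for a poor fraud detector
precision = tp / (tp + fp)                   # of the flagged transactions, how many were fraud?
recall    = tp / (tp + fn)                   # of the actual frauds, how many did we catch?
f1 = 2 * precision * recall / (precision + recall)  # balance of precision and recall

print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, "
      f"recall={recall:.2f}, F1={f1:.2f}")
```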
These metrics can be plotted on a graph, as Figure 16 shows; one plot is called the ROC curve (receiver operating characteristic curve) and the other is the PR curve (precision-recall curve). A perfect curve (which you will never get unless you cheat!) is one that goes straight up the Y axis to 1 and then across the top. In the case of the ROC curve, a straight line along the diagonal is bad; it means the model’s true positive rate is no better than its false positive rate at every threshold (no better than random guessing). These curves are frequently summarized as an area under the curve (AUC), so you’ll see terms like AUC ROC and AUC PR.
Figure 16: Modeling: performance assessment: ROC and PR curves.
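Here’s a minimal sketch (scikit-learn, made-up labels and scores) of how the ROC and PR curves and their AUC values are typically computed; the data and numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import roc_curve, precision_recall_curve, auc

# Made-up example: true labels and the model's predicted probabilities
y_true  = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9, 0.65, 0.3])

# ROC curve: true positive rate vs. false positive rate at every threshold
fpr, tpr, _ = roc_curve(y_true, y_score)
auc_roc = auc(fpr, tpr)

# PR curve: precision vs. recall at every threshold
precision, recall, _ = precision_recall_curve(y_true, y_score)
auc_pr = auc(recall, precision)

print(f"AUC ROC = {auc_roc:.2f}, AUC PR = {auc_pr:.2f}")
```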
Why building machine learning models can be hard
Now that you understand what a model is and how to judge a model’s performance, let’s explore why building a well-performing model can be hard. There are several reasons, as Figure 17 shows. Among them: problem formulation, data issues, selecting the appropriate model algorithms and architectures, selecting the right features, adjusting hyperparameters, training models, cost (error) functions, and underfitting (bias) and overfitting (variance).
Be aware that data science, like any other science, is both an art and a science. Of course, there are always brute-force ways to do things, but those approaches can be time-consuming, may miss insights, and may just plain get things wrong. The current approach of data science is to pool the knowledge of subject matter experts (such as lines of business, operations, and transformation and improvement specialists) and data scientists to create models that fulfill the business needs.
Figure 17: Modeling: common obstacles leading to poor performance.
Overfitting versus underfitting
Overfitting and underfitting are particularly common problems, so let’s delve into them a bit. As Figure 18 shows, they involve bias and variance.
Overfitting (high variance) means that the model responds too much to variations in the data, such that it hasn’t really learned the underlying pattern and has instead “memorized” the data. It would be the same as reading a math book in school and, when given a test on it, knowing the answers only to the three examples given in the book. When the teacher asks you these math problems (say, 2+1=3, 7+2=9, and 4+2=6), you get them right. But when she asks you something different—say, 1+1=?—you don’t know the answer. That’s because you haven’t learned what addition is, even though you know the answers to the examples. (By the way, don’t tell my professors, but this method saved my bacon in college back in the day!)
Underfitting (high bias) is the opposite problem in that you refuse to learn something new. Maybe you know how to do addition in base 10. But now circumstances have changed, and you’re asked to do addition in base 16. If you exhibit high bias, you’ll continue to do base-10 addition and not learn base-16 addition, and so you get the wrong answers.
Both are problems, and data science has mechanisms to help mitigate them.
Figure 18: Modeling: obstacles: bias and variance.
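Here’s a minimal sketch (NumPy, made-up noisy data) of what bias and variance look like in practice: it fits the same points with a straight line (too simple, so it underfits) and with a degree-5 polynomial (flexible enough to memorize the training points), then compares errors on held-out data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up data: a gentle curve plus noise
x = np.linspace(0, 1, 12)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Alternate points: half for training, half held out as "unseen" data
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

def rmse(pred, truth):
    return np.sqrt(np.mean((pred - truth) ** 2))

for degree in (1, 3, 5):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = rmse(np.polyval(coeffs, x_train), y_train)
    test_err = rmse(np.polyval(coeffs, x_test), y_test)
    print(f"degree {degree}: train RMSE={train_err:.2f}, test RMSE={test_err:.2f}")

# Degree 1 is too rigid, so its error is high on both sets (underfitting/bias).
# Degree 5 passes through every training point, so its train error is near zero
# while the held-out error stays noticeably larger (overfitting/variance).
```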
Machine learning model examples
Let’s go through a couple of machine learning examples using two types of algorithms: lazy algorithms and eager algorithms. Figure 19 shows examples of both.
Lazy algorithms don’t use explicit training (the first path in the diagram), whereas eager algorithms are explicitly trained (the second path in the diagram). Because lazy algorithms aren’t explicitly trained, their training phase is fast (nonexistent, actually), but their execution (or inference) phase is slower than that of trained eager algorithms. Lazy algorithms also use more memory because the entire data set needs to be stored, whereas the data used to train an eager algorithm can be discarded once training is completed, using less overall memory.
Figure 19: Machine learning examples with and without training.
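As a minimal illustration of the trade-off (scikit-learn with made-up data, not the examples in Figure 19), the sketch below contrasts a lazy learner, k-nearest neighbors, which keeps the whole training set and does its work at prediction time, with an eager learner, logistic regression, which spends effort training but afterwards only needs its learned coefficients.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

# Made-up two-class data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Lazy learner: "training" just stores the data; the work happens at query time
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Eager learner: training fits a compact model (a handful of coefficients),
# after which the training data could be discarded
logreg = LogisticRegression().fit(X, y)

X_new = np.array([[0.5, -0.2], [-1.0, -1.0]])
print("k-NN predictions:     ", knn.predict(X_new))
print("logistic predictions: ", logreg.predict(X_new))
print("logistic coefficients:", logreg.coef_, logreg.intercept_)
```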
Example: Document search using TF-IDF
In this first example, a lazy algorithm applied to text analytics, I’m using an algorithm called TF-IDF. I’ll explain what TF and IDF mean shortly, but let’s first be clear on the goal of this example. There are five simple, short documents (Documents 1 to 5), as Figure 20 shows. There’s also a dictionary of keywords for these documents; the dictionary is used for keyword searches. Finally, there’s a user who has a query. The goal is to retrieve the documents that best fit the user’s query; in this example, you want to return all five documents in order of prioritized relevance.
Figure 20: Text analytics example: TF-IDF problem.
First, let me clarify the TF and IDF acronyms. TF stands for term frequency, or how often a term appears (that is, the density of that term in the document). The reason you care is that you assume that when an “important” term appears more frequently, the document it’s in is more relevant; TF helps you map terms in the user’s query to the most relevant documents.
IDF stands for inverse document frequency. This is almost the opposite thinking—terms that appear very frequently across all documents have less importance, so you want to reduce the importance weight of those terms. Obvious words are “a,” “an,” and “the,” but there will be many others for specific subjects or domains. You can think of these common terms as noise that confuses the search process.
Once TF and IDF values are calculated for the documents and the query, you just calculate the similarity between the user’s query and each document. The higher the similarity score, the more relevant the document. Then you present those documents to the user in order of relevance. Easy, right?
Now that you understand how it’s done, you just have to do the calculations. Figure 21 shows the solution.
Figure 21: Document search (text analytics): solution.
Let’s walk through the calculations. By the way, you’ll see there are several matrices. Machine learning and deep learning models do a lot of their calculations using matrix math. You’ll want to be aware of that as you work with data scientists; you’ll want to help them get the data into these types of formats in a way that makes sense for the business problem. It’s not hard, but it’s part of the art of the data science preprocessing stage.
In the first TF matrix, you calculate the normalized (“relative”) frequency of each keyword (as specified in the dictionary) for each document. The numerator is the count of that word in that document, and the denominator is the maximum number of times that word appeared in any given document; in other words, it’s the maximum value across all the numerators for that word.
In the second matrix, you add an IDF vector in the last row for each term in the dictionary. You just apply the equation you’ve been given, IDF(t) = log(N/n(t)), where (a short sketch of these calculations follows the list):
- N = total number of documents in the collection (five in this example)
- n(t) = number of documents in which keyword t appears
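Here’s that sketch: a minimal version of the TF and IDF calculations (NumPy, with hypothetical toy documents and a hypothetical dictionary, not the ones in Figure 20), where TF is each term’s count divided by that term’s maximum count in any document and IDF is log(N/n(t)), along with the TF-IDF product used in the next step.

```python
import numpy as np

# Hypothetical toy collection (not the documents from the figure)
documents = [
    "the cat sat on the mat near the cat door",
    "the dog chased the cat",
    "dogs and cats make good pets",
    "the bird sat in a tree",
    "a dog and a bird became friends",
]
dictionary = ["cat", "dog", "bird", "sat"]

# Raw counts: one row per document, one column per dictionary term
counts = np.array([[doc.split().count(term) for term in dictionary]
                   for doc in documents], dtype=float)

# TF: normalize each term's count by its maximum count in any document
tf = counts / counts.max(axis=0)

# IDF: log(N / n(t)), where n(t) = number of documents containing term t
N = len(documents)
n_t = (counts > 0).sum(axis=0)
idf = np.log(N / n_t)

tfidf = tf * idf   # TF-IDF weight matrix for the documents
print(np.round(tfidf, 2))
```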
The next step is to create the TF-IDF matrix for the documents by multiplying each document’s row, element by element, by the last IDF row. Now you’re done with the document matrix. Repeat the same process to create the user-query matrix.
Finally, combine the two matrices and calculate the similarity between each document and the user query. In this case, you use an equation to calculate similarity called cosine similarity (there are other similarity calculations you can use as well). The equation is represented in the figure, and the values are in the last column. Notice that the similarity value between the user query and itself is 1—as it should be because it’s being compared to itself.
From here, you can sort the similarity values (in the last column of the matrix) from highest to lowest, thus presenting the user with documents from most to least relevant. Now you’re done! Notice there was no “training” of the model; you just applied a few equations.
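Finally, here’s a minimal sketch (NumPy, with hypothetical TF-IDF vectors rather than the values from Figure 21) of the cosine-similarity and ranking step: each row stands for a document’s TF-IDF vector, the query is treated the same way, and the documents are sorted from most to least similar to the query.

```python
import numpy as np

# Hypothetical TF-IDF vectors (rows = documents, columns = dictionary terms);
# these are illustrative numbers, not the values from the figure.
doc_vectors = np.array([
    [0.00, 0.92, 0.00, 0.22],
    [0.46, 0.00, 0.00, 0.22],
    [0.00, 0.00, 0.92, 0.00],
    [0.92, 0.00, 0.00, 0.00],
    [0.46, 0.46, 0.00, 0.22],
])
query_vector = np.array([0.92, 0.00, 0.00, 0.22])

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

similarities = np.array([cosine_similarity(d, query_vector) for d in doc_vectors])

# Present documents from most to least relevant
ranking = np.argsort(similarities)[::-1]
for rank, idx in enumerate(ranking, start=1):
    print(f"{rank}. Document {idx + 1} (similarity {similarities[idx]:.2f})")
```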