You may have heard how companies like Google and Facebook use machine learning to drive cars, recognize human speech, and classify images. Very cool, you think, but how does that relate to your business? Well, consider how these companies use machine learning today:
- A payments processing company detects fraud hidden among more than a billion transactions in real time, reducing losses by $1 million per month.
- An auto insurer predicts losses from insurance claims using detailed geospatial data, enabling them to model the business impact of severe weather events.
- Working with data produced by vehicle telematics, a manufacturer uncovers patterns in operational metrics and uses them to drive proactive maintenance.
Two themes unify these success stories. First, each application depends on big data: a large volume of data, in a variety of formats and at high velocity. Second, in each case, machine learning uncovers new insights and drives value.
The technical foundations of machine learning are more than 50 years old, but until recently few people outside of academia were aware of its capabilities. Machine learning requires a lot of computing power; early adopters simply lacked the infrastructure to make it cost-effective.
Several converging trends contribute to the recent surge of interest and activity in machine learning:
- Moore’s Law radically reduced computing costs; massive computing power is now widely available at minimal cost.
- New and innovative algorithms provide faster results.
- Data scientists have accumulated the theory and practical knowledge to apply machine learning effectively.
Above all, the tsunami of big data creates analytic problems that simply cannot be solved with conventional statistics. Necessity is the mother of invention, and old methods of analysis no longer work in today’s business environment.
Machine learning techniques
There are hundreds of different machine learning algorithms. A recent paper benchmarked more than 150 algorithms for classification alone. This overview covers the key techniques that data scientists use to drive value today.
Data scientists distinguish between techniques for supervised and unsupervised learning. Supervised learning techniques require prior knowledge of an outcome. For example, if we work with historical data from a marketing campaign, we can classify each impression by whether or not the prospect responded, or we can determine how much they spent. Supervised techniques provide powerful tools for prediction and classification.
Frequently, however, we do not know the “ultimate” outcome of an event. For example, in some cases of fraud, we may not know that a transaction is fraudulent until long after the event. In this case, rather than attempting to predict which transactions are frauds, we might want to use machine learning to identify transactions that are unusual and flag these for further investigation. We use unsupervised learning when we do not have prior knowledge about a specific outcome, but still want to extract useful insights from the data.
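To make the distinction concrete, here is a minimal sketch in Python with scikit-learn, using synthetic transaction data; the features, labels, and model choices are illustrative assumptions, not a recommended design.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # synthetic transaction features
y = (X[:, 0] + X[:, 1] > 2).astype(int)   # known outcome: fraud or not

# Supervised: the outcome is known for historical data, so we can train
# a classifier to predict it for new transactions.
clf = GradientBoostingClassifier(random_state=0).fit(X, y)
print(clf.predict(X[:5]))

# Unsupervised: no labels are available, so we instead flag the most
# unusual transactions for further investigation.
detector = IsolationForest(random_state=0).fit(X)
flags = detector.predict(X)               # -1 marks anomalies
print((flags == -1).sum(), "transactions flagged for review")
```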
The most widely used supervised learning techniques include the following; a short comparison sketch follows the list:
- Generalized linear models (GLM) -- a generalization of linear regression that supports different probability distributions and link functions, enabling the analyst to model the data more effectively. Enhanced with a grid search over model parameters, GLM is a hybrid of classical statistics and modern machine learning.
- Decision trees -- a supervised learning method that learns a set of rules that split a population into progressively smaller segments that are homogeneous with respect to the target variable.
- Random forests -- a popular ensemble learning method that trains many decision trees, then averages across the trees to develop a prediction. This averaging process produces a more generalizable solution and filters out random noise in the data.
- Gradient boosting machine (GBM) -- a method that produces a prediction model by training a sequence of decision trees, where successive trees adjust for prediction errors in previous trees.
- Deep learning -- an approach that models high-level patterns in data as complex multilayered networks. Because such networks can represent a very broad class of patterns, deep learning has the potential to solve the most challenging problems in machine learning.
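Since it is natural to ask how these techniques compare, here is a minimal sketch that trains four of them on the same synthetic dataset with scikit-learn (logistic regression stands in for a GLM with a binomial distribution and logit link; deep learning is omitted for brevity, and the data and parameters are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

models = {
    "GLM (logistic regression)": LogisticRegression(max_iter=1000),
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "GBM": GradientBoostingClassifier(random_state=0),
}

# Cross-validated AUC gives a like-for-like comparison.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```

Comparing cross-validated scores this way previews the try-many-techniques workflow described at the end of this section.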
Key techniques for unsupervised learning include the following; a combined sketch follows the list:
- Clustering -- a technique that groups objects into segments, or clusters, that are similar to one another on many metrics. Customer segmentation is an example of clustering in action. There are many different clustering algorithms; the most widely used is k-means.
- Anomaly detection -- the process of identifying unexpected events or outcomes. In fields like security and fraud, it is not possible to exhaustively investigate every transaction; we need to systematically flag the most unusual transactions. Deep learning, a technique discussed previously under supervised learning, can also be used for anomaly detection.
- Dimension reduction -- the process of reducing the number of variables being considered. As organizations capture more data, the number of possible predictors (or features) available for prediction expands rapidly. Simply identifying which data provides information value for a particular problem is a significant task. Principal components analysis (PCA) evaluates a set of raw features and reduces them to a smaller set of composite indices that are independent of one another.
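Here is a minimal sketch that chains all three techniques on synthetic customer data with scikit-learn; a simple distance-to-centroid rule stands in for more sophisticated anomaly detectors, and the feature counts and thresholds are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))   # synthetic customer metrics
X = StandardScaler().fit_transform(X)

# Dimension reduction: compress 12 raw features into 3 independent components.
components = PCA(n_components=3).fit_transform(X)

# Clustering: segment customers with k-means.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(components)

# Anomaly detection: flag the records farthest from their cluster center.
centers = kmeans.cluster_centers_[kmeans.labels_]
distances = np.linalg.norm(components - centers, axis=1)
flagged = distances > np.quantile(distances, 0.99)
print(flagged.sum(), "unusual records flagged for review")
```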
While some machine learning techniques tend to consistently outperform others, it is rarely possible to say in advance which one will work best for a particular problem. Hence, most data scientists prefer to try many techniques and choose the best model; high-performance software is essential because it enables the data scientist to try more options in less time.
Machine learning in action
Across industries and business disciplines, businesses use machine learning to increase revenue or reduce costs by performing tasks more efficiently than humans can unaided. The seven examples below demonstrate the versatility and wide applicability of machine learning.
Preventing fraud. With more than 150 million active digital wallets and more than $200 billion in annual payments, PayPal leads the online payments industry. At that volume, even low rates of fraud can be very costly; early in its corporate history, the company was losing $10 million per month to fraudsters. To address the problem, PayPal built a top team of researchers, who used state-of-the-art machine learning techniques to build models that can identify fraudulent payments in real time.
Targeting digital display. Ad-tech company Dstillery uses machine learning to help companies like Verizon and Williams-Sonoma target digital display advertising on real-time bidding platforms. Using data collected about an individual’s browsing history, visits, clicks, and purchases, Dstillery runs predictions thousands of times per second while handling hundreds of campaigns at a time; this enables the company to significantly outperform human marketers at targeting ads for optimal impact per dollar spent.
Recommending content. For customers of Comcast’s X1 interactive TV service, Comcast provides personalized real-time recommendations for content based on each customer’s prior viewing habits. Working with billions of viewing-history records, Comcast uses machine learning techniques to develop a unique taste profile for each customer, then groups customers with common tastes into clusters. For each cluster of customers, Comcast tracks and displays the most popular content in real time, so customers can see what content is currently trending. The net result: better recommendations, higher utilization, and more satisfied customers.
Building better cars. New cars built by Jaguar Land Rover have 60 onboard computers that produce 1.5GB of data every day across more than 20,000 metrics. Engineers at the company use machine learning to distill the data and understand how customers actually use their vehicles. With this real-world usage data, designers can predict part failure and potential safety issues, which helps them engineer vehicles appropriately for expected conditions.
Targeting best prospects. Marketers use “propensity to buy” models as a tool to determine the best sales and marketing prospects and the best products to offer. With a vast array of products to offer, from routers to cable TV boxes, Cisco’s marketing analytics team trains 60,000 models and scores 160 million prospects in a matter of hours. By experimenting with a range of techniques from decision trees to gradient-boosted machines, the team has greatly improved the accuracy of the models. That translates into more sales, fewer wasted sales calls, and more satisfied sales reps.
Optimizing media. NBC Universal stores hundreds of terabytes of media files for international cable TV distribution; efficient management of this online resource is necessary to support distribution to international clients. The company uses machine learning to predict future demand for each item based on a combination of measures. Based on these predictions, the company moves media with low predicted demand to low-cost offline storage. The predictions from machine learning are far more effective than arbitrary rules based on single measures, such as file age. As a result, NBC Universal reduces its overall storage costs while maintaining client satisfaction.
Improving health care delivery. For hospitals, patient readmission is a serious matter, and not simply out of concern for the patient’s health and welfare. Medicare and private insurers penalize hospitals with a high readmission rate, so hospitals have a financial stake in making sure they discharge only those patients who are well enough to stay healthy. The Carolinas Healthcare System (CHS) uses machine learning to construct risk scores for patients, which case managers factor into their discharge decisions. This system enables better utilization of nurses and case managers, prioritizing patients according to the risk and complexity of each case. As a result, CHS has lowered its readmission rate from 21 percent to 14 percent.
Machine learning software requirements
Software for machine learning is widely available, and organizations seeking to develop a capability in this area have many options. The following requirements should be considered when evaluating machine learning software:
- Speed
- Time to value
- Model accuracy
- Easy integration
- Flexible deployment
- Usability
- Visualization
Let’s review each of these in turn.
Speed. Time is money, and fast software makes your highly paid data scientists more productive. Practical data science is often iterative and experimental; a project may require hundreds of tests, so small differences in speed translate to dramatic improvements in efficiency. Given today’s data volumes, high-performance machine learning software must run on a distributed platform, so you can spread the workload over many servers.
Time to value. Runtime performance is only one part of total time to value. The key metric for your business is the amount of time needed to complete a project from data ingestion to deployment. In practical terms, this means that your machine learning software should integrate with popular Hadoop and cloud formats, and it should export predictive models as code that you can deploy anywhere in your organization.
Model accuracy. Accuracy matters, especially when the stakes are high. For applications like fraud detection, small improvements in accuracy can produce millions of dollars in annual savings. Your machine learning software should empower your data scientists to use all of your data, rather than forcing them to work with samples.
Easy integration. Your machine learning software must co-exist with a complex stack of big data software in production. Ideally, look for machine learning software that runs on commodity hardware and does not require specialized HPC machines or exotic hardware such as GPUs.
Flexible deployment. Your machine learning software should support a range of deployment options, including co-location in Hadoop or in a freestanding cluster. If cloud is part of your architecture, look for software that runs in a variety of cloud platforms, such as Amazon Web Services, Microsoft Azure, and Google Cloud Platform.
Usability. Data scientists use many different software tools to perform their work, including analytic languages like R, Python, and Scala. Your machine learning platform should integrate easily with the tools your data scientists already use. In addition, well-designed machine learning software includes time-saving features such as the following, illustrated in the pipeline sketch after the list:
- Ability to treat missing data
- Ability to transform categorical data
- Regularization techniques to manage complexity
- Grid search capability for automated test and learn
- Automatic cross-validation (to avoid overfitting)
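As a sketch of how these features fit together in practice, the following pipeline uses scikit-learn; the column names, data, and parameter grid are hypothetical assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical customer data with gaps in one numeric column.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 80, n).astype(float),
    "income": rng.normal(50_000, 15_000, n),
    "region": rng.choice(["north", "south", "east", "west"], n),
    "channel": rng.choice(["web", "store", "phone"], n),
})
df.loc[df.sample(frac=0.1, random_state=0).index, "income"] = np.nan
y = rng.integers(0, 2, n)

preprocess = ColumnTransformer([
    # Treat missing data: fill numeric gaps with the median, then scale.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age", "income"]),
    # Transform categorical data: one-hot encode, tolerating unseen levels.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region", "channel"]),
])

# Regularization: the C parameter manages model complexity.
model = Pipeline([("prep", preprocess),
                  ("clf", LogisticRegression(max_iter=1000))])

# Grid search for automated test and learn, with cross-validation
# (cv=5) to guard against overfitting.
search = GridSearchCV(model, {"clf__C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(df, y)
print(search.best_params_)
```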
Visualization. Successful predictive modeling requires collaboration between the data scientist and business users. Your machine learning software should provide business users with tools to visually evaluate the quality and characteristics of the predictive model.
Introducing H2O
H2O is a scalable machine learning platform for data scientists and business analysts. Unlike conventional software, H2O provides a combination of extraordinary math and high performance in a free and open source platform.