In providing the ability to discover patterns buried deep within data, machine learning has the potential to make applications more powerful and more responsive to users’ needs. Well-tuned algorithms allow value to be extracted from immense and disparate data sources without the limits of human thinking and analysis. For developers, machine learning offers the promise of applying business critical analytics to any application in order to accomplish everything from improving customer experience to providing product recommendations to serving up hyper-personalized content.
Cloud providers like Amazon and Microsoft have made headlines of late by offering cloud-enabled machine learning solutions that promise an easy way for developers to integrate the power of machine learning into their applications. While the promise seems great, developers should be cautious.
Cloud-based machine learning tools give developers a way to dip their toes into the possibilities that machine learning creates, and they can offer novel functionality. Used incorrectly, however, these tools produce poor results, which can be frustrating for users. As anyone who tested Microsoft’s age-detecting machine learning tool probably discovered, the plug-and-play ease of use came with major accuracy problems, which is not something one should trust for critical applications or when making important decisions.
Developers looking to incorporate machine learning in their applications need to be aware of some keys to success:
1. The more data an algorithm has, the more accurate it becomes, so avoid subsampling if possible. Machine learning theory has a very intuitive characterization of the prediction error. In brief, the gap in prediction error between a machine learning model and the optimal predictor (the one that achieves the best possible error in theory) can be decomposed into three parts:
- The error due to not having the right functional form for the model
- The error due to not finding the optimal parameters for the model
- The error due to not feeding enough data to the model
If the training data is limited, it may not be able to support the model complexity needed for the problem. Foundational laws of statistics tell us we should use all the data that we have if we can, rather than a subsample.
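The three-part decomposition above is commonly written as follows. The notation is an assumption introduced here for illustration, not the article's own: $R(f)$ is the expected prediction error of a predictor $f$, $f^{*}$ is the optimal predictor, $f_{\mathcal{F}}$ is the best predictor within the chosen model family $\mathcal{F}$, $f_{n}$ is the best fit within $\mathcal{F}$ to the $n$ training examples, and $\hat{f}_{n}$ is the model the training procedure actually returns.

```latex
R(\hat{f}_{n}) - R(f^{*})
  = \underbrace{R(f_{\mathcal{F}}) - R(f^{*})}_{\text{functional form}}
  + \underbrace{R(\hat{f}_{n}) - R(f_{n})}_{\text{parameter search}}
  + \underbrace{R(f_{n}) - R(f_{\mathcal{F}})}_{\text{limited data}}
```

The three terms telescope to the left-hand side. The last term is the one that typically shrinks as $n$ grows, which is why subsampling tends to cost accuracy.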
2. Selecting the machine learning method that works best for the given problem is key and often determines success or failure. For example, Gradient Boosting Trees (GBT) is a popular supervised learning algorithm widely used by industry practitioners due to its accuracy. However, despite its high popularity, it should not be blindly treated as the algorithm for every problem. Instead, one should always use the algorithm that best fits the characteristics of the data for the most accurate results.
To demonstrate this concept, one can run an experiment comparing the accuracy of GBT and the linear Support Vector Machine (SVM) algorithm on the popular text categorization dataset rcv1. We observed that linear SVM is superior to GBT in terms of error rate on this problem. This is because text data is often high-dimensional. A linear classifier can perfectly separate N examples in N − 1 dimensions (provided the examples are in general position), and thus a simple model is likely to work well on such data. Moreover, the simpler the model, the easier it is to learn its parameters from a finite number of training examples without overfitting.
On the other hand, GBT is highly nonlinear and more powerful, but more difficult to learn and more prone to overfitting in such a setting. It often ends up with inferior accuracy.
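The claim that a linear classifier can separate N examples in N − 1 dimensions can be checked directly on a small case. The following is a minimal sketch, not the rcv1 experiment described above: it assumes randomly drawn points (which are in general position with probability one), assigns them arbitrary labels, and solves for a hyperplane that reproduces every label. The function names are hypothetical and only the standard library is used.

```python
import random

def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(matrix)
    # Work on an augmented copy so the inputs are left untouched.
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("points are not in general position")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def separating_hyperplane(points, labels):
    """Return (weights, bias) with weights . p + bias equal to each label."""
    # Append a constant 1 to each point so the bias is solved for as well.
    system = [list(p) + [1.0] for p in points]
    solution = solve_linear_system(system, [float(y) for y in labels])
    return solution[:-1], solution[-1]

def predict(weights, bias, point):
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score > 0 else -1

rng = random.Random(0)
n_examples = 6
# Six examples in five dimensions: N examples in N - 1 dimensions.
points = [[rng.gauss(0, 1) for _ in range(n_examples - 1)]
          for _ in range(n_examples)]
labels = [rng.choice([-1, 1]) for _ in range(n_examples)]  # arbitrary labels

weights, bias = separating_hyperplane(points, labels)
predictions = [predict(weights, bias, p) for p in points]
```

Because the labels are arbitrary, the same construction works for any labeling of these points. That is why high-dimensional data often needs nothing more expressive than a linear model.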
3. To get a great model, the method and the parameters pertaining to the method must be chosen well. This may not be simple for the non-data scientist. Modern machine learning algorithms have a number of knobs to tweak. For example, the popular GBT algorithm alone can have up to a dozen parameter settings, including how to control tree size, the learning rate, the sampling methodology for rows or columns, the loss function, the regularization options, and more. A typical project requires finding the best values for each of those parameters to get the highest possible accuracy for a given data set, and this is no easy feat. Intuition and experience help, but for best results, a data scientist needs to train a large number of models, looking at their cross-validated scores and putting some thought into deciding what parameters to try next.
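The loop of training many models and comparing cross-validated scores can be sketched in a few lines. To stay self-contained, this example swaps GBT for a k-nearest-neighbor classifier with a single knob, `k`. The toy data, the candidate grid, and all function names are assumptions made for illustration.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (1-D inputs)."""
    nearest = sorted(train, key=lambda item: abs(item[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes > 0 else -1

def cross_validated_accuracy(data, k, n_folds=5):
    """Average held-out accuracy of k-NN over n_folds folds."""
    accuracies = []
    for fold in range(n_folds):
        # Every n_folds-th row is held out; the rest is used for training.
        test = data[fold::n_folds]
        train = [item for i, item in enumerate(data) if i % n_folds != fold]
        correct = sum(knn_predict(train, x, k) == y for x, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / n_folds

# Toy data: the label is the sign of x, with 15% of labels flipped as noise.
rng = random.Random(0)
data = []
for _ in range(200):
    x = rng.uniform(-1, 1)
    y = 1 if x > 0 else -1
    if rng.random() < 0.15:
        y = -y
    data.append((x, y))

candidate_ks = [1, 3, 5, 9, 15, 25]   # odd values avoid tied votes
scores = {k: cross_validated_accuracy(data, k) for k in candidate_ks}
best_k = max(scores, key=scores.get)
```

With a dozen parameters instead of one, the grid grows multiplicatively, which is where the intuition about what to try next (or an automated search) becomes necessary.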
4. Machine learning models can only be as good as the data. Improper data collection and cleaning will hurt your ability to build predictive, generalizable machine learning models. Experience recommends carefully reviewing the data with subject matter experts to gain insights into the data and the data generation process behind the scenes. Often this process can identify data quality issues related to records, features, values, or sampling.
5. Understanding features in the data and improving upon them (by creating new features and eliminating existing ones) has a high impact in terms of predictability. One fundamental task of machine learning is to represent the raw data in a rich feature space that can be effectively exploited by the machine learning algorithm. For example, feature transformation is a popular method that achieves this by developing new features based on the original ones through mathematical transformations. The resulting feature space (that is, the collection of features used to characterize the data) better captures various complex characteristics of the data (such as nonlinearity and interaction between multiple features), which are important for the succeeding learning processes.
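As a small illustration of how a transformed feature can capture an interaction, consider a toy target that depends on whether two raw features share the same sign. The sketch below is an assumption-laden example, not a recipe: the data is synthetic, and `best_threshold_accuracy` is a hypothetical stand-in for a very simple learner that can only threshold a single feature.

```python
import random

def best_threshold_accuracy(values, labels):
    """Accuracy of the best single-threshold rule on one feature."""
    best = 0.0
    for threshold in sorted(set(values)):
        for sign in (1, -1):
            predictions = [sign if v >= threshold else -sign for v in values]
            accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
            best = max(best, accuracy)
    return best

rng = random.Random(0)
rows = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(400)]
# XOR-like target: positive when both raw features share the same sign.
labels = [1 if x1 * x2 > 0 else -1 for x1, x2 in rows]

accuracy_x1 = best_threshold_accuracy([x1 for x1, _ in rows], labels)
accuracy_x2 = best_threshold_accuracy([x2 for _, x2 in rows], labels)
# Engineered interaction feature: the product of the two raw features.
accuracy_product = best_threshold_accuracy([x1 * x2 for x1, x2 in rows], labels)
```

Neither raw feature carries much signal on its own, while the product feature makes the problem trivially separable for the same simple learner.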
6. Selecting the appropriate objective/loss function inspired by the business value is important for ultimate success in the application. Almost all machine learning algorithms are formulated as optimization problems. Based on the nature of the business, appropriately setting or adjusting the objective function of the optimization is a key step to the success of machine learning.
SVM, as an example, optimizes the generalization error for a binary classification problem by assuming all types of errors are equally weighted. This is not appropriate for cost-sensitive problems, such as failure detection, in which certain types of errors cost more than others. In this case, it is recommended to adjust the SVM loss function by assigning larger penalties to the costlier types of error.
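One way to picture the adjustment is a hinge loss in which each class carries its own cost. The sketch below covers only the data-fit term of an SVM objective (no regularization, no training loop), and the function name, parameters, and cost values are assumptions chosen for illustration.

```python
def weighted_hinge_loss(labels, scores, cost_positive=1.0, cost_negative=1.0):
    """Mean hinge loss where each class carries its own misclassification cost.

    labels: +1 / -1 ground truth; scores: real-valued classifier outputs.
    cost_positive scales the loss on true +1 examples (e.g. missed failures);
    cost_negative scales the loss on true -1 examples (e.g. false alarms).
    """
    total = 0.0
    for y, score in zip(labels, scores):
        hinge = max(0.0, 1.0 - y * score)
        total += (cost_positive if y == 1 else cost_negative) * hinge
    return total / len(labels)

labels = [1, 1, -1, -1]
scores = [-0.5, 2.0, 0.5, -2.0]   # one missed failure, one false alarm

equal_cost = weighted_hinge_loss(labels, scores)                          # 0.75
failure_heavy = weighted_hinge_loss(labels, scores, cost_positive=10.0)   # 4.125
```

With equal costs the two mistakes contribute the same amount. Raising `cost_positive` makes the missed failure dominate the objective, so an optimizer would trade a few extra false alarms for fewer missed failures.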
7. Ensure proper handling of training and testing data so the testing data mimics incoming data when the model is deployed in production. We can see, for example, how important this is for time-dependent data. In this case, using the standard cross-validation approach for training, tuning, and testing models would result in misleading or even inaccurate outputs. This is because it doesn’t properly mimic the nature of incoming data in the deployment stage. To correct this, one must mimic how the model is used when deployed. One should use a time-based cross-validation in which the trained model is validated on newer data in terms of time.
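A time-based cross-validation can be sketched as an expanding window, in which each fold trains on everything older than its test block. The generator below is a minimal illustration that assumes the rows are already sorted from oldest to newest. The function name and parameters are hypothetical.

```python
def time_based_splits(n_samples, n_folds, test_size):
    """Yield (train_indices, test_indices) with every test index after train.

    Assumes rows are already sorted from oldest to newest.
    """
    first_test_start = n_samples - n_folds * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested folds")
    for fold in range(n_folds):
        test_start = first_test_start + fold * test_size
        train = list(range(0, test_start))                        # everything older
        test = list(range(test_start, test_start + test_size))    # the next block
        yield train, test

splits = list(time_based_splits(n_samples=10, n_folds=3, test_size=2))
# splits[0] == ([0, 1, 2, 3], [4, 5])
# splits[2] == ([0, 1, 2, 3, 4, 5, 6, 7], [8, 9])
```

Unlike a shuffled k-fold split, no fold ever trains on data that is newer than what it is validated on, which matches how the model will be used once deployed.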
8. Understand the generalization error of the model before deployment. Generalization error measures how well a model performs on unseen data. Just because a model performs well on training data doesn’t necessarily mean it will generalize well on unseen data. A carefully designed model evaluation process, which mimics the real deployment usage, is needed to estimate the generalization error of the model.
It’s easy to violate the rules of cross-validation without noticing, and there are non-obvious ways to perform cross-validation incorrectly, which often happens when you attempt to take computational shortcuts. It is essential to pay careful attention to proper and diligent cross-validation before deploying any models to obtain a scientific estimation of the deployment performance.
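The gap between training performance and performance on unseen data is easy to reproduce. The following sketch uses a 1-nearest-neighbor classifier, which memorizes its training set, on synthetic data with noisy labels. The data generator, noise rate, and function names are assumptions for illustration only.

```python
import random

def nearest_neighbor_predict(train, x):
    """Return the label of the training point closest to x (1-D inputs)."""
    return min(train, key=lambda item: abs(item[0] - x))[1]

def accuracy(train, evaluation):
    """Fraction of evaluation points whose label the model reproduces."""
    correct = sum(nearest_neighbor_predict(train, x) == y for x, y in evaluation)
    return correct / len(evaluation)

def make_noisy_data(rng, n, flip_rate=0.3):
    """Label is the sign of x, but a fraction of labels is flipped at random."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1, 1)
        y = 1 if x > 0 else -1
        if rng.random() < flip_rate:
            y = -y
        data.append((x, y))
    return data

rng = random.Random(0)
train = make_noisy_data(rng, 300)
held_out = make_noisy_data(rng, 300)   # drawn the same way, never trained on

training_accuracy = accuracy(train, train)      # the model has memorized these
held_out_accuracy = accuracy(train, held_out)   # closer to deployment behavior
```

The training score looks perfect because each point is its own nearest neighbor, while the held-out score reflects the noise the model memorized. Only the second number says anything about deployment.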
9. Know how to treat unstructured and semistructured data, such as text, time series, spatial data, graph data, and images. Most machine learning algorithms deal with data in a feature space, where an object is represented by a set of features, each of which describes a characteristic of the object. In practice, data rarely arrives in this format; it often comes in raw form and must be molded into the desired format for consumption by machine learning algorithms. For example, one has to know how to use various computer vision techniques for extracting features from images or how to apply natural language processing techniques for featurizing text.
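For text, one of the simplest featurizations is a bag-of-words count vector. The sketch below is deliberately minimal (lowercasing and alphabetic tokens only, no stemming or weighting), and the example documents and function names are made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(documents):
    """Map each distinct token to a column index, in sorted order."""
    tokens = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {tok: index for index, tok in enumerate(tokens)}

def bag_of_words(text, vocabulary):
    """Return a count vector; tokens outside the vocabulary are ignored."""
    counts = Counter(tokenize(text))
    return [counts.get(tok, 0) for tok in vocabulary]

documents = ["The pump failed", "The pump restarted after the pump failed"]
vocabulary = build_vocabulary(documents)
# vocabulary columns: after, failed, pump, restarted, the
vectors = [bag_of_words(doc, vocabulary) for doc in documents]
# vectors == [[0, 1, 1, 0, 1], [1, 1, 2, 1, 2]]
```

Each document becomes a fixed-length numeric row, which is the format most learning algorithms expect. Real vocabularies have tens of thousands of columns, which is the high-dimensional setting where linear models tend to do well.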
10. Learn to translate business problems into machine learning algorithms. Some important business problems, such as fraud detection, product recommendation, and ad targeting, have “standard” machine learning formulations that have met with reasonable success in practice. Even for these well-known problems, there are lesser-known but more powerful formulations that lead to higher predictive accuracy. For business problems outside the small set of examples typically discussed in blogs and forums, the proper machine learning formulations are less obvious.
If you are a developer, learning these 10 keys to success might seem like a tall task, but don’t be discouraged. In truth, developers are not data scientists. It would be unfair to think a developer can fully exploit all the tools machine learning offers. But that doesn’t mean developers don’t have the opportunity to learn some level of data science in order to improve their applications. With proper enterprise solutions and increased automation, developers can do everything from building models to deploying them, using machine learning best practices to maintain high accuracy.
Automation is key to the proliferation of machine learning within applications. Even if you could afford a small army of data scientists to work hand in hand with developers, there isn’t enough talent to go around. Advances like Skytree’s AutoModel can help developers automatically determine optimal parameters and algorithms for maximum model accuracy. An easy-to-use interface can guide developers through the process of training, tuning, and testing models, while preventing statistical mistakes.
Automation within the machine learning process, in many ways, incorporates the principles of artificial intelligence for the data scientist or developer, allowing the algorithms to think, learn, and carry much of the model-building burden. That said, it is a mistake to think that data scientists can be decoupled from machine learning, especially for mission-critical models. Beware of the promise of simple-to-use machine learning functionality that can be applied without thought to the correctness, sophistication, or scalability of the technology under the hood. This typically does not yield the high predictive accuracy, and consequently the high business value, that machine learning has to offer. Worse, delivering poor models in an application may actually backfire and quickly build distrust in the product or service among its users.
Alexander Gray, Ph.D., is CTO at Skytree and associate professor in the College of Computing at Georgia Tech. His work has focused on algorithmic techniques for making machine learning tractable on massive datasets. He began working with large-scale scientific data in 1993 at NASA’s Jet Propulsion Laboratory in its Machine Learning Systems Group. He recently served on the National Academy of Sciences Committee on the analysis of massive data as a Kavli Scholar, and a Berkeley Simons Fellow, and is a frequent adviser and speaker on the topic of machine learning on big data in academia, science, and industry.
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content.