Demand for data among today’s business users is growing fast, for two reasons. First, business users have exhausted the opportunities in the data they already hold. They want more data sources to find new value, and they want that data to be accurate enough to support analytic outcomes. Second, the number of data-savvy business analysts is larger than ever and still growing. To satisfy the increasing demand, IT departments must field a continuous stream of data requests, big and small.
While business execs see tremendous potential in putting more data at more users’ fingertips, organizations struggle to deliver that data. Traditional data integration tools, developed in the 1990s for populating data warehouses, are too brittle to scale to many data sources. They’re also prohibitively time-consuming and costly.
At the same time, new big data platforms solve only the data collection part of the problem. The data loaded into Hadoop is never enterprise-ready; it requires preparation before it reaches a business user. The reality is that most enterprises are dealing with hundreds or even thousands of sources. The only path to delivering clean, unified data across all of these sources is automation.
The critical importance of good data
Let’s take a simple example from the insurance industry. Business analysts want to understand the claim risk posed by an upcoming flood. They pull customers in the affected geographic region and filter for those with active home insurance. Immediately they need to drill into high property values and sparsely populated areas, which requires classifying individual policies into more granular categories. Then they want a broad, accurate view so that they can identify customers who hold both home and car insurance (tracked in a different system, of course).
They also want to enrich the data with up-to-date property values, so they need to bring in an external benchmark of real estate pricing. Finally, the analysis needs to be done in near-real time so the business can act on it. Waiting a month, let alone a quarter, is out of the question. In all of these steps, the analysts need to work from clean and trusted data in order to draw correct conclusions and make data-driven decisions. To summarize the challenges that need to be met:
- Granular categories (drilling down into actionable categories)
- Broad, accurate view (correlating and segmenting across systems and silos)
- External benchmarks (pulling third-party pricing, market, or performance indicators)
- Near-real time (deriving continuous and actionable insight rather than quarterly and monthly historical analysis)
Data mastering and organizing
The two most challenging aspects of automating the delivery of data across many different sources are mastering and classification. Competent ETL engineers can do basic transforms like look-ups or minor calculations quickly and easily. But advanced tasks such as identifying global corporate entities or product categories across millions of records can’t be easily scripted and maintained. We’ve all heard the example of matching “I.B.M.” to “International Business Machines,” but the problem is actually much more difficult. IBM has hundreds of subsidiaries, brands, and products. Mastering all of those together and bundling them up for use by a businessperson is no small matching feat.
At Tamr, we’ve built a solution to automate these complex tasks to ensure rich and accurate data. Tamr uses a machine-driven but human-guided workflow to ensure the automation is efficient, accurate, and trustworthy. Tamr uses machine learning algorithms to predict how individual records should be classified and matched (as products, organizations, or individuals). For instance, if an invoice comes in with the description “Latex Gloves,” our algorithm might classify it as “Laboratory Supplies” and match it to the product “Rubber Gloves.” Tamr uses the entire record to “predict” these classifications and matches—everything from the description to the price to less obvious indicators like who created that invoice.
Machine-driven plus human-guided
To ensure that these predictions are accurate, we have a workflow for experts and users to give feedback. Tamr’s algorithms are built to iterate on the feedback. Under the covers, we use supervised learning techniques to tune weights and improve accuracy. A user might provide the feedback that “Latex Gloves” are not “Laboratory Supplies” because they are too expensive. The next time our system sees a record that looks similar, it will take this feedback into account when making a prediction.
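The feedback loop described above can be illustrated with a minimal sketch. It assumes a single logistic model over two hypothetical features (`description_has_gloves`, `price_is_high`) and one gradient step per piece of expert feedback; the feature names, the learning rate, and the model itself are illustrative assumptions, not Tamr’s actual implementation.

```python
import math

def predict(weights, features):
    """Probability that the record belongs to the category (logistic model)."""
    score = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-score))

def apply_feedback(weights, features, expert_label, learning_rate=0.5):
    """Return new weights after one gradient step toward the expert's label.

    expert_label is 1.0 if the expert confirms the category, 0.0 if not.
    """
    error = expert_label - predict(weights, features)
    updated = dict(weights)  # leave the caller's weights untouched
    for name, value in features.items():
        updated[name] = updated.get(name, 0.0) + learning_rate * error * value
    return updated

# Hypothetical starting weights: the description alone suggests "Laboratory Supplies".
weights = {"description_has_gloves": 1.5, "price_is_high": 0.0}
invoice = {"description_has_gloves": 1.0, "price_is_high": 1.0}

before = predict(weights, invoice)
# The expert rejects the classification for this expensive item.
weights_after = apply_feedback(weights, invoice, expert_label=0.0)
after = predict(weights_after, invoice)

print(f"confidence before feedback: {before:.3f}, after: {after:.3f}")
```

After the update, the weight on `price_is_high` turns negative, so a similar expensive record is less likely to be classified the same way.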
Finally, we intelligently sample questions from the data to accelerate the workflow. From an afternoon of one expert answering key questions about the data, Tamr can build a broadly applicable algorithm to rapidly integrate the sources. Our sampling can also generate a prioritized queue of data exceptions for review. We train on every piece of feedback to reduce the number of questions we need to ask and maintain high levels of accuracy.
Active learning systems
While most machine learning systems are either supervised or unsupervised, our workflow of asking experts to tune an algorithm requires a combined approach. The underlying machine learning technique that allows us to use expert input is called “active learning.” Under the hood in Tamr, the machine learning process has four stages, each named for what it produces: parsed tokens, signals, recommendations, and linkage decisions. Our system uses static and heuristic modules for generating parsed tokens (that is, feature extraction) and computing linkage decisions (clustering). For generating signals and recommendations, the system uses mainly out-of-the-box supervised algorithms with some adjustments I’ll describe below.
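To make the four stages concrete, here is a toy sketch of such a pipeline. Every function name is hypothetical, and the signal and recommendation stages use simple stand-ins (Jaccard token overlap and a fixed threshold) in place of the supervised algorithms the text describes; it shows only how the stages hand data to one another.

```python
import re
from itertools import combinations

def parse_tokens(text):
    """Stage 1, parsed tokens: lowercase alphanumeric tokens (feature extraction)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def signal(tokens_a, tokens_b):
    """Stage 2, signal: Jaccard overlap, a stand-in for a learned similarity."""
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

def recommend(score, threshold=0.5):
    """Stage 3, recommendation: a fixed threshold, a stand-in for a trained classifier."""
    return score >= threshold

def link(n_records, matched_pairs):
    """Stage 4, linkage decisions: cluster recommended matches with union-find."""
    parent = list(range(n_records))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in matched_pairs:
        parent[find(a)] = find(b)
    clusters = {}
    for i in range(n_records):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

records = [
    "Latex Gloves, box of 100",
    "latex gloves box 100",
    "Nitrile gloves (box)",
    "Printer paper A4",
]
tokens = [parse_tokens(r) for r in records]
matches = [
    (a, b)
    for a, b in combinations(range(len(records)), 2)
    if recommend(signal(tokens[a], tokens[b]))
]
clusters = link(len(records), matches)
print(clusters)  # [[0, 1], [2], [3]]
```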
These steps are relatively common for a machine learning system. However, three challenges arise when dealing with expert-provided input, where we’ve made significant changes.
The first challenge is biased training data (we use bias in the statistical sense: the labeled sample is not random). If Tamr’s algorithms were trained directly on all the labels provided by the experts, they would inevitably assign more data points to the more popular categories. We have noticed that when experts label points, they have a bias toward positive labeling and will often bulk-label data points. To ensure that these biases do not overwhelm the algorithms, we’ve built a weighting scheme into our training labels.
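The article does not say which weighting scheme is used. One common choice, shown here purely as an assumption, is inverse-frequency weighting: each label is weighted so that every class contributes equally to training, however many times experts bulk-labeled it. The function name and label values are hypothetical.

```python
from collections import Counter

def balanced_label_weights(labels):
    """Weight each class by total / (n_classes * class_count).

    Over-represented classes get weights below 1, rare classes above 1,
    and the weighted counts of all classes come out equal.
    """
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}

# Experts bulk-labeled eight positive matches but only two non-matches.
expert_labels = ["match"] * 8 + ["non-match"] * 2
label_weights = balanced_label_weights(expert_labels)
print(label_weights)  # {'match': 0.625, 'non-match': 2.5}
```

With these weights, the eight positive labels and the two negative labels each carry a total weight of 5, so the positive-labeling bias no longer dominates.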
The second challenge is identifying questions that will have the greatest impact. In Tamr’s workflow, the algorithm has the opportunity to ask experts to label specific data points, rather than a random sample. In selecting specific points we consider two criteria: Either the data point must be close to a decision surface, or the data point must be in an underrepresented area or category. We use stratified sampling techniques to do this selection optimally.
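The two selection criteria can be sketched as follows. This is an illustrative simplification, not Tamr’s selection logic: it assumes a binary model that outputs a probability per record, treats "close to a decision surface" as a probability within a hypothetical `margin` of 0.5, treats "underrepresented" as a category with fewer than `min_labels` expert labels, and stratifies by taking the most uncertain candidates from each category.

```python
from collections import defaultdict

def select_questions(records, labeled_counts, per_category=1, margin=0.1, min_labels=2):
    """Pick records to send to experts.

    records: iterable of (record_id, predicted_category, probability).
    labeled_counts: number of expert labels already collected per category.
    """
    by_category = defaultdict(list)
    for rec_id, category, prob in records:
        uncertainty_gap = abs(prob - 0.5)
        near_boundary = uncertainty_gap <= margin
        underrepresented = labeled_counts.get(category, 0) < min_labels
        if near_boundary or underrepresented:
            by_category[category].append((uncertainty_gap, rec_id))

    questions = []
    for category in sorted(by_category):
        # Stratify: take the most uncertain candidates from each category.
        most_uncertain = sorted(by_category[category])[:per_category]
        questions.extend(rec_id for _, rec_id in most_uncertain)
    return questions

records = [
    ("r1", "Laboratory Supplies", 0.95),
    ("r2", "Laboratory Supplies", 0.52),
    ("r3", "Office Supplies", 0.90),
    ("r4", "Office Supplies", 0.45),
    ("r5", "Safety Equipment", 0.85),
]
labeled_counts = {"Laboratory Supplies": 40, "Office Supplies": 25, "Safety Equipment": 0}

questions = select_questions(records, labeled_counts)
print(questions)  # ['r2', 'r4', 'r5']
```

Records `r2` and `r4` qualify because the model is nearly undecided about them; `r5` qualifies because its category has no expert labels yet, even though the model is confident.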
The third challenge is what we call the “N^2 problem.” Data integration mapping and matching tasks are usually combinatorial. If you want to ensure there are no duplicates in a data set of companies, you have to compare every company to every other company in the system, which for N companies means N(N-1)/2 comparisons. The same goes if you’re comparing schemas. Tamr has developed a “binning” step in our system to ensure we are not running our algorithms over the entire space of possible record pairs and the algorithms are not asking the experts redundant or skewed questions. We do this by dynamically creating a large hashing function and index. Rather than compare every single pair, we significantly limit the number of comparisons by looking only at records that hash to the same value.
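A minimal sketch of the binning idea follows. The article does not describe the hashing function, so the key used here (the first three characters of the normalized first token of the name) is a deliberately crude, hypothetical stand-in; the point is only that pairs are generated within bins rather than across the whole data set.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(name):
    """Hypothetical bin key: first three characters of the first normalized token."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    tokens = cleaned.split()
    return tokens[0][:3] if tokens else ""

def candidate_pairs(names):
    """Return index pairs whose records fall into the same bin."""
    bins = defaultdict(list)
    for idx, name in enumerate(names):
        bins[blocking_key(name)].append(idx)
    pairs = set()
    for members in bins.values():
        pairs.update(combinations(members, 2))
    return pairs

companies = ["I.B.M. Corp", "IBM Corporation", "Tamr Inc", "Tamr, Inc.", "Acme Ltd"]
pairs = candidate_pairs(companies)
all_pairs = len(companies) * (len(companies) - 1) // 2

print(sorted(pairs))  # [(0, 1), (2, 3)]
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

The trade-off is recall: a key this simple would never place “International Business Machines” in the same bin as “IBM,” which is one reason a production system needs a far richer hashing scheme than this sketch.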
Rome wasn’t built in a day, and the intelligent unification of enterprise data isn’t going to happen overnight. Large enterprises are weighed down by old systems and processes. In some cases there are decades of legacy that stand in the way of truly data-driven decision making. Organizations must recognize not only that data plays a key role in delivering new business outcomes, but also that getting to good data takes effort.
New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to email@example.com.