When Satya Nadella made machine learning the centerpiece of the Microsoft Build conference, I think it became official: 2016 is the year of machine learning.
All the major clouds now have (or will soon have) machine learning APIs. In fact, InfoWorld's Martin Heller has already reviewed the machine learning services offered by AWS, Azure, and IBM Cloud. Even more telling, a couple of years ago only a handful of machine learning startups were out of stealth. Now there are -- what -- a thousand?
Oddly, Nadella's Build talk revolved around intelligent bots, which operate a little like today's highly annoying interactive voice response systems, whereas the prevailing wisdom is that the greatest potential of machine learning arises when you apply it to big data.
Recently I encountered a machine learning startup with the grand ambition to structure the big kahuna of all unstructured data, the Web itself. A well-structured Web is not a new idea. In 2001, Tim Berners-Lee first proposed the Semantic Web, a framework that everyone would use to make the Web machine-readable, but it failed to catch on because it demanded too much manual labor from content creators.
Diffbot opts to create structure after the fact -- by emulating the way humans read and parse Web pages. To do this, Diffbot first crawls and indexes the Web using the open source search engine Gigablast. It then applies computer vision and natural language processing to convert Web pages and images into structured, machine-readable XML or JSON documents.
Started on a shoestring in 2009, Diffbot recently became one of the few profitable machine learning pure-plays, claims CEO Mike Tung. Customers include Cisco, eBay, Adobe, Sears, and Salesforce. Cisco, for example, uses Diffbot to scour user forums to determine what people are saying -- often in voluminous technical detail -- about Cisco routers and switches. The most common use, however, is to grab structured product information from across the Web for competitive analysis.
Because it renders Web pages as matrices of pixels and applies computer vision, Diffbot easily supports multiple languages and is not limited to extracting structured information from Web pages. According to Tung, Diffbot's AI algorithms can operate on anything that can be rendered into a 2D plane. He claims accuracy levels of 95 to 97 percent -- greater than the average accuracy of humans charged with the same information extraction task.
Today, customers typically feed Diffbot a range of URLs from which to extract information. But Diffbot is already moving toward the point where it can enable customers to run natural language queries. Tung claims the following:
"We have something like 1.2 billion entities. This is larger than Google's knowledge graph. We have such comprehensiveness and coverage of structured data in this space that we're allowing people to make knowledge queries. Instead of passing in directly URLs or sites, you can say: Give me all of the products that are shoes that are below $70.
"Our goal is to analyze the entire Web… You just ask Diffbot direct questions and get back the structured knowledge."
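To picture what Tung's example query means in practice, think of it as a filter over structured records instead of a search over pages. The Python sketch below is purely illustrative: the sample records, the `Product` fields, and the `query_products` helper are all hypothetical stand-ins, not Diffbot's API or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    """Hypothetical structured record for one product page."""
    title: str
    category: str
    price_usd: float


# Made-up records standing in for entities extracted from the Web.
CATALOG = [
    Product("Trail running shoe", "shoes", 64.99),
    Product("Leather dress shoe", "shoes", 129.00),
    Product("Canvas sneaker", "shoes", 45.50),
    Product("Wool scarf", "accessories", 30.00),
]


def query_products(records, category, max_price):
    """Return records in `category` priced strictly below `max_price`."""
    return [
        r for r in records
        if r.category == category and r.price_usd < max_price
    ]


if __name__ == "__main__":
    # "Give me all of the products that are shoes that are below $70."
    for product in query_products(CATALOG, "shoes", 70):
        print(f"{product.title}: ${product.price_usd:.2f}")
```

Once the data is structured, the question becomes a two-condition filter that a program can run directly, with no page for a human to read.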
That might sound a lot like Google, but remember, Google's results are intended for humans to consume, not machines. Tung frames the potential in dramatic terms: "If you could build an AI that actually can, in a general way, synthesize knowledge just from viewing the sum total of human knowledge on the Web, then you could potentially build a transformative company that could enable an entire wave of intelligent applications."
Of course, Diffbot is one of several startups in this field. Plus, it's perfectly conceivable that Google could turn its attention to providing similar services for business customers as part of its enterprise push.
Who wins isn't the point. The realization for me is that, finally, the largest and most unruly assemblage of knowledge ever created is about to become a lot more useful, as machine learning gradually erases the practical distinction between unstructured and structured data. Parsing the Web is the ultimate schema-on-read project. I can't imagine a better use of machine learning.