Machine learning: A practical introduction

How new tools and techniques are extracting business insights from massive data sets


H2O is fast. H2O runs on a distributed in-memory framework. Your machine learning workload runs entirely in memory, avoiding the disk I/O bottleneck. Plus, you can distribute the workload across as many servers as necessary for the performance you need.
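To make the idea of a distributed in-memory workload concrete, here is a minimal sketch in plain Python. It is not H2O code and does not use H2O's API. It assumes a toy data set split into three partitions, each standing in for one server's memory. Each partition is summarized independently (the "map" step), the summaries are merged (the "reduce" step), and a simple least-squares line is fit from the merged sums. The function names and the data are illustrative assumptions.

```python
from functools import reduce


def partial_stats(rows):
    """Map step: summarize one in-memory partition of (x, y) rows."""
    n = len(rows)
    sx = sum(x for x, _ in rows)
    sy = sum(y for _, y in rows)
    sxx = sum(x * x for x, _ in rows)
    sxy = sum(x * y for x, y in rows)
    return (n, sx, sy, sxx, sxy)


def merge_stats(a, b):
    """Reduce step: combine two partition summaries element-wise."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(stats):
    """Fit y = slope * x + intercept by least squares from the merged sums."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical data following y = 2x + 1, split across three "servers".
data = [(float(x), 2.0 * x + 1.0) for x in range(12)]
partitions = [data[0:4], data[4:8], data[8:12]]

# On a real cluster each partial_stats call would run on a different node;
# here they run one after another in a single process.
merged = reduce(merge_stats, map(partial_stats, partitions))
slope, intercept = fit_line(merged)
print(slope, intercept)
```

Because only the small summaries travel between partitions, adding servers adds memory and compute without moving the raw rows, which is the general reason this style of computation scales horizontally.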

H2O is easy to implement. H2O runs on a variety of platforms: Windows, Mac, and Linux clusters; on Cloudera, MapR, and Hortonworks Hadoop under YARN; on Apache Spark; on Amazon EC2, Google Compute Engine, and Microsoft Azure. To deploy H2O, you simply download the software to your preferred platform and install it.  

H2O builds better models. H2O’s machine learning algorithms detect complex interactions that would be difficult to find using conventional methods, such as linear regression. Since H2O is horizontally scalable, you can perform analysis with all of your data in a single pass.

One of the leading insurance carriers in the United States used to perform retention analysis in SAS/STAT. To fit the data into SAS, they had to run the analysis separately for each state, which took an entire weekend. With H2O, they run the analysis only once. By modeling their entire book of business at once, they identify patterns that cannot be detected from state-level analysis. The result: more accurate models and a more effective retention program.

H2O integrates easily with your big data stack. H2O is open source software, which means you can examine the source code and, if necessary, modify it to work in your environment. H2O works with the leading Hadoop distributions, and it runs under YARN.

For example, PayPal uses H2O because it works seamlessly with other big data frameworks, including Hadoop distributions and open source languages.

Integral Ad Science uses H2O as part of a complex stack of applications -- including Cloudera Hadoop, Spark, HBase, MySQL, Kafka, Storm, Hive, Impala, Pig, Java, JavaScript, Python, and R -- to understand how consumers interact with digital advertising.

And Comcast uses H2O together with Spark to deliver personalized recommendations for video content to its subscribers. The system updates program recommendations every 20 seconds through Spark Streaming.

H2O puts insight into production. Predictive models provide value to the organization when they drive operational decisions. Unfortunately, commercial software bottles up those insights in a proprietary package that can take months to put into production. H2O exports POJOs -- Plain Old Java Objects -- that are easy to integrate into an operational pipeline.
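The appeal of a plain exported object is that scoring needs nothing from the training environment. H2O's export is Java; the sketch below uses Python only to illustrate the idea and is not what H2O generates. The class name, feature names, and coefficients are all hypothetical.

```python
import math


class ChurnScorer:
    """Hypothetical exported model: the fitted coefficients are plain constants."""

    INTERCEPT = -1.5
    WEIGHTS = {"tenure_years": -0.4, "claims_last_year": 0.9}

    def predict(self, row):
        """Return a probability from a logistic model; missing features count as 0."""
        z = self.INTERCEPT + sum(
            weight * row.get(name, 0.0) for name, weight in self.WEIGHTS.items()
        )
        return 1.0 / (1.0 + math.exp(-z))


# The operational system only needs this class, not the training framework.
scorer = ChurnScorer()
probability = scorer.predict({"tenure_years": 5.0, "claims_last_year": 1.0})
print(round(probability, 4))
```

An object like this can be called from a request handler or a batch job the same way as any other class, which is what makes it easy to drop into an operational pipeline.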

H2O simplifies machine learning. Machine learning used to require a lot of custom programming -- even building algorithms from scratch. In addition to prebuilt and pretested algorithms, H2O includes many other features that save the data scientist valuable time. They include missing value treatments, categorical data handling features, regularization capability, automated grid search, and automatic cross-validation.
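To show what these features automate, here is a small hand-rolled sketch in plain Python. It is not H2O's API, and the data, helper names, and candidate values are assumptions chosen for illustration. It fills a missing value with the column mean, then runs a grid search over a regularization strength, scoring each candidate with k-fold cross-validation.

```python
import statistics


def impute_mean(values):
    """Missing value treatment: replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def fit_ridge(xs, ys, lam):
    """One-feature ridge regression through the origin: w = sum(xy) / (sum(xx) + lam)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def mse(w, xs, ys):
    """Mean squared error of the prediction w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_val_mse(xs, ys, lam, k):
    """Average held-out error over k folds; fold i holds out rows where index % k == i."""
    scores = []
    for i in range(k):
        train = [j for j in range(len(xs)) if j % k != i]
        test = [j for j in range(len(xs)) if j % k == i]
        w = fit_ridge([xs[j] for j in train], [ys[j] for j in train], lam)
        scores.append(mse(w, [xs[j] for j in test], [ys[j] for j in test]))
    return sum(scores) / k


def grid_search(xs, ys, lambdas, k=3):
    """Try every candidate lambda and keep the one with the lowest cross-validated error."""
    results = {lam: cross_val_mse(xs, ys, lam, k) for lam in lambdas}
    best = min(results, key=results.get)
    return best, results


# Hypothetical data with one missing input value.
raw_x = [1.0, 2.0, None, 4.0, 5.0, 6.0]
ys = [3.0, 6.0, 9.0, 12.0, 15.0, 18.0]

xs = impute_mean(raw_x)
best_lam, results = grid_search(xs, ys, lambdas=[0.0, 1.0, 10.0])
print(best_lam, results)
```

Even for a one-feature model this takes several helper functions to write and test, which is the kind of routine work that prebuilt tooling saves.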

H2O is true open source software. H2O is an open source project distributed under an Apache license. There are no gimmicks, such as stripped-down “community editions” or “freemium” software you have to pay for after an evaluation. Commercial support is available to enterprises seeking a defined SLA, private JIRA, access to the company’s team of data scientists, H2O Quick Start, and H2O DevOps. (If you’re interested in taking part in our mission to bring machine learning to the masses, please check out our GitHub repositories.)

At H2O, we believe that machine learning will become as ubiquitous, easy to use, and powerful as search. Google, Yahoo, and others helped unleash the power of the Web for ordinary users by making it easy to find relevant results from a seemingly limitless number of pages. Similarly, machine learning will allow businesses of all kinds to tap into the power of modern data sets by making it easy to get to valuable insights.

However, we’re obviously not there yet. Getting there will require further investments -- both from machine learning developers like H2O, and from business users whose volumes of data and needs for analysis outstrip conventional methods. 

SriSatish Ambati is co-founder and CEO of

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to

Copyright © 2015 IDG Communications, Inc.
