Evernote buys into big data analytics for a song

To collect and analyze data on 200 million daily events, Evernote transitioned from a MySQL data warehouse to a hybrid environment of Hadoop and ParAccel


Finally, as one of the most popular open source reporting solutions, JasperReports was an easy call. The team chose Jaspersoft's open source JasperReports Server for querying the ParAccel server and generating dozens of daily reports in a variety of formats. (Recently, the combination of ParAccel and JasperReports Server received a de facto endorsement from Amazon, which uses the two to power its Redshift hosted analytics environment.)

Evernote uses JasperReports Server to generate dozens of charts and reports every day.

For security reasons, the analytics environment is on a separate network with no connections to the production application servers. Daily online data is securely pushed into the reporting environment through a one-way network connection.

Building the Hadoop deployment
All raw data first goes to Hadoop, where it is both archived and prepared for loading into ParAccel for daily reporting as well as ad hoc analysis. Evernote uses the Cloudera Hadoop distribution, with Puppet employed for configuration management.

The Hadoop cluster includes six data nodes, each with eight 500GB drives, for a total of 24TB of raw storage. Each node has two eight-core processors and 64GB of RAM, and the cluster runs 132 MapReduce tasks with more than 2GB of RAM for each task.
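The sizing figures can be checked with a little arithmetic. The sketch below assumes the drive, processor, and memory figures are per data node (the article does not say so explicitly, but 132 tasks at more than 2GB each would not fit in 64GB cluster-wide) and that tasks are spread evenly across nodes; the function and constant names are illustrative only.

```python
# Figures quoted in the article; treating them as per-node values is an assumption.
DATA_NODES = 6
DRIVES_PER_NODE = 8
DRIVE_GB = 500
RAM_PER_NODE_GB = 64
MAPREDUCE_TASKS = 132


def raw_storage_tb(nodes: int, drives_per_node: int, drive_gb: int) -> float:
    """Total raw disk capacity in decimal terabytes (1TB = 1000GB)."""
    return nodes * drives_per_node * drive_gb / 1000


def ram_per_task_gb(nodes: int, ram_per_node_gb: int, tasks: int) -> float:
    """Average RAM available to each task if memory is shared evenly."""
    return nodes * ram_per_node_gb / tasks


storage_tb = raw_storage_tb(DATA_NODES, DRIVES_PER_NODE, DRIVE_GB)
tasks_per_node = MAPREDUCE_TASKS // DATA_NODES
ram_gb = ram_per_task_gb(DATA_NODES, RAM_PER_NODE_GB, MAPREDUCE_TASKS)

print(f"raw storage: {storage_tb:.0f}TB")
print(f"tasks per node: {tasks_per_node}")
print(f"RAM per task: {ram_gb:.1f}GB")
```

Under those assumptions the numbers line up with the article: 24TB of raw storage, 22 tasks per node, and a little under 3GB of RAM per task. Raw storage is not usable storage; HDFS replication would reduce the effective capacity by whatever replication factor Evernote configured, which the article does not state.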

In addition, Evernote runs a single Hadoop Job Tracker on a pair of servers for redundancy, along with one client node for running Hive and Hue, two key open source tools for Hadoop. The Hadoop cluster is accessed through the Hive abstraction layer, which provides a SQL-like interface for querying. Hue is a Web-based interface for Hadoop that includes a number of utilities, such as a file browser, a job tracker interface, a cluster health monitor, and more -- as well as an environment for building custom Hadoop applications.

Working together
User activity data prepared in Hive is loaded into ParAccel every night, along with the reference tables from the online production database. The team also uses Hive to create derived tables that hold presliced data shaped for common reports. For example, a country summary table contains just one row per country each day with a sum of the daily, weekly, and monthly active users as of that date.
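To make the idea of a presliced country summary concrete, here is a minimal sketch in Python rather than Hive. The event layout (user ID, country, date), the sample records, and the trailing 1-, 7-, and 30-day window definitions are all assumptions for illustration; the article does not describe Evernote's actual schema or how it defines weekly and monthly activity.

```python
from collections import defaultdict
from datetime import date

# Hypothetical window lengths in days; the article does not define them.
WINDOWS = {"daily": 1, "weekly": 7, "monthly": 30}


def country_summary(events, as_of):
    """Return one summary row per country for the given date.

    events: iterable of (user_id, country, event_date) tuples.
    Result: {country: {"daily": n, "weekly": n, "monthly": n}}, where each
    value counts distinct users active in the trailing window ending as_of.
    """
    users = defaultdict(lambda: {name: set() for name in WINDOWS})
    for user_id, country, day in events:
        age = (as_of - day).days
        if age < 0:
            continue  # ignore events dated after the summary date
        for name, length in WINDOWS.items():
            if age < length:
                users[country][name].add(user_id)
    return {
        country: {name: len(ids) for name, ids in windows.items()}
        for country, windows in users.items()
    }


# Invented sample events, not Evernote data.
events = [
    ("u1", "DE", date(2013, 1, 31)),
    ("u2", "DE", date(2013, 1, 28)),
    ("u3", "DE", date(2013, 1, 5)),
    ("u1", "DE", date(2013, 1, 30)),
    ("u4", "US", date(2013, 1, 31)),
    ("u5", "US", date(2012, 12, 1)),
]
summary = country_summary(events, as_of=date(2013, 1, 31))
for country in sorted(summary):
    print(country, summary[country])
```

The point of such a table is that a report asking for active users by country reads one small row per country per day instead of rescanning the raw event log.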

This ParAccel database and its tables are tuned for quick aggregation of data, so Evernote can answer many types of questions much faster than using Hive alone. For example, it takes three seconds to see which versions of Evernote Windows were most widely used in Germany during a particular week.
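The kind of question described above reduces to a filtered aggregation over a small, pre-aggregated table. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for ParAccel; the table name, column names, version numbers, and user counts are all invented for illustration and do not reflect Evernote's schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the warehouse

# Hypothetical pre-aggregated table: one row per day, country, platform, version.
conn.execute(
    """
    CREATE TABLE client_version_daily (
        day TEXT, country TEXT, platform TEXT,
        version TEXT, active_users INTEGER
    )
    """
)
rows = [  # invented sample data
    ("2013-01-07", "DE", "windows", "4.6", 120),
    ("2013-01-07", "DE", "windows", "4.5", 80),
    ("2013-01-08", "DE", "windows", "4.6", 130),
    ("2013-01-08", "DE", "mac", "5.0", 60),
    ("2013-01-08", "US", "windows", "4.6", 500),
    ("2013-01-20", "DE", "windows", "4.5", 999),
]
conn.executemany("INSERT INTO client_version_daily VALUES (?, ?, ?, ?, ?)", rows)


def top_versions(conn, country, platform, start_day, end_day):
    """Rank client versions by summed daily active users in a date range.

    Summing daily counts measures user-days, not distinct users.
    """
    query = """
        SELECT version, SUM(active_users) AS users
        FROM client_version_daily
        WHERE country = ? AND platform = ? AND day BETWEEN ? AND ?
        GROUP BY version
        ORDER BY users DESC, version
    """
    return conn.execute(query, (country, platform, start_day, end_day)).fetchall()


result = top_versions(conn, "DE", "windows", "2013-01-07", "2013-01-13")
for version, users in result:
    print(version, users)
```

A columnar analytic database is built to run this sort of grouped scan quickly over far larger tables, which is why the same question is slower when it must go through Hive and MapReduce.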

Now the team has a modern analytic environment with room to grow. Thanks to Hadoop, the team can archive unprecedented quantities of operational and log data -- and, more important, load and transform hundreds of millions of records in two hours instead of the 10 or more hours that were once required. Thanks to ParAccel, Evernote can perform much more complex analyses of user trends, with JasperReports Server delivering the final, polished results.

The ability to store all historical data, achieve fast ad hoc queries, and automate quality reports on a daily basis is giving Evernote new insight into how its customers use its products -- and how those products can be continually improved.

This article, "Evernote buys into big data analytics for a song," was originally published at InfoWorld.com. Read more of Andrew Lampitt's Think Big Data blog, and keep up on the latest developments in big data at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.
