When a flood of data threatened to swamp Evernote's analytics system, the company modernized its analytic environment to handle big data -- without breaking the budget. The provider of popular personal organization and productivity applications has moved from a conventional data warehouse to a modern hybrid of Hadoop and ParAccel, a massively parallel processing (MPP) analytic database.
Evernote collects and analyzes enormous quantities of data about its users. Since 2008, more than 36 million people have created Evernote accounts, generating hundreds of terabytes of data, including 1.2 billion "notes" and 2 billion attachments. Notes can be text, Web pages, photos, voice memos, and so on -- and they can be tagged, annotated, edited, sorted into folders, and manipulated in other ways.
To determine how to optimize the Evernote user experience, the company now analyzes 200 million events per day through a combination of Hadoop and the ParAccel database. In addition, the open source JasperReports Server Community Edition is used for generating reports and charts.
Hitting the data volume limit
Evernote's original data warehouse was based on an OLTP (online transaction processing) relational database. The data warehouse used a star schema, meaning the data is arranged with respect to queries as opposed to transactions. This solution is usually good enough for reporting and some analysis as long as data volumes on MySQL top out at several terabytes. But as data spills over that limit, less history can be retained and data warehouse query speeds, flexibility, and affordability suffer.
This was the case with Evernote, whose data warehouse was powered by MySQL on a large RAID10 array on the same network as the application's main servers. Every night, over a period of 9 to 18 hours, a series of batch processes dumped increments of several operational database tables combined with parsed structured event logs from the application servers. After manual creation and tuning, reports were then distributed by email.
By the beginning of 2012, however, the analytics team realized its existing solution could no longer handle the load. With a primary table of more than 40 billion rows, it was impractical to access more than a few days' data at a time. The reporting database was slow, difficult to maintain, and did not lend itself to ad hoc queries.
The Evernote team set out to build an analytics environment that could efficiently store the full history of the data, generate dozens of standard daily reports, easily manage ad hoc queries, and continue to scale in the future. And it needed to respect a tight budget.
Baking a hybrid analytic environment
Evernote settled on a new approach composed of Hadoop and ParAccel.
A 10-node Hadoop cluster now stores all historical data and handles data preparation for analytics. Hadoop was seen as an affordable solution, thanks to open source licensing and ability to scale out using commodity hardware.
As an MPP analytic database, ParAccel excels at superfast ad hoc queries. At Evernote, a three-node ParAccel columnar analytic database handles queries over sets of derived tables. The nodes are SuperMicro boxes, each with dual L5630 quad-core processors, 192GB of RAM, 10Gbps networking, and a RAID5 of solid-state drives manually provisioned and configured with Red Hat Enterprise Linux.