MapD's GPU-powered database is now open source

Horizontal-scaling and high-availability features are still for-pay only, but MapD wants its number-crunching platform to be integral to data science

MapD, creator of a GPU-accelerated database that scales both up and out, has open-sourced its core technology.

As announced in a press release and blog post, the core database and its "associated visualization libraries" are available under the Apache 2.0 license. But enterprise-level features such as high availability, LDAP, ODBC, and horizontal scaling, many of which debuted in the 3.0 version released earlier this month, will be kept close to the chest.

Core concerns

Of the key pieces being open-sourced, the first and most crucial is the MapD Core Database, since it includes the basic bits needed to perform SQL processing on however many GPUs are available from a single server.

"We wanted the community to be able to take advantage of our core technological innovations, and that meant open sourcing essentially the entire core of the system," explained Todd Mostak, co-founder and CEO of MapD Technologies. "We are also giving away a free binary with a noncommercial license to our server-side GPU rendering technology and our Immerse visual analytics client, but we are not open-sourcing those pieces."

Mostak maintained that even a single node outfitted with multiple GPUs can provide enterprise-class performance. "We have customers doing subsecond queries on 5+ billion record datasets on a single node," he wrote.

However, developers may be irked that initial Core Database releases are for Linux and MacOS, and there are currently no official build instructions for Microsoft Windows. Mostak stated that most of MapD's customers are on Linux, so any near-term port to Windows would have to be a community effort.

That said, if customers wanted to run the project as a desktop application on Windows, MapD "may consider pitching in ourselves to help build it if there is enough demand," said Mostak.

The other key open source release is a set of front-end, browser-based visualization libraries, "so that users can build custom visualization apps on top of MapD," said Mostak.

Getting to open

MapD's strategy is in line with that of other companies wooing enterprises with products based on open source: Offer the core technology as-is, build on top of it, and send changes upstream—but sell the most monetizable, enterprise-centric parts as proprietary bits.

The approach reflects the realities of building a business atop open source, as it's become clear that pure open source warms the hearts of fellow advocates but can be strategically naïve. Enterprise Hadoop outfit Hortonworks long trumpeted its pure-play approach, with support and services as the big moneymakers a la Red Hat. But of late, Hortonworks has been toying with the idea of adding proprietary offerings to its lineup, along the lines of competing Hadoop solutions that haven't been shy about such moves.

Mostak was optimistic that an open source release "will only accelerate our traction." "In terms of customer base," he wrote, "I'll say we are rapidly ramping, doing a dozen deals in Q1 alone, not counting AWS cloud customers."

Some of the features only available in the commercial edition, like the ability to scale horizontally, would be theoretically possible to duplicate with open source, but Mostak is skeptical it could be done without sacrificing performance. "I'm not saying it couldn't be done," he wrote, "but it would be difficult to use an existing framework and still be fast."

Going open source, Mostak claimed, had been a goal of the project from the beginning. The move also helps it establish a common platform for GPU-powered analytics that others could participate in. The company is partnering with two outfits, Continuum Analytics (creators of the stats-and-analytics-centric Anaconda distribution of Python) and machine learning/AI specialists, to create the GPU Open Analytics Initiative, with the goal of "[creating] common data frameworks enabling developers and statistical researchers to accelerate data science on GPUs."