New big data tools for machine learning spring from home of Spark and Mesos

RISELab, the successor to UC Berkeley's AMPLab, also has projects for predictive analytics, security, and context management for data lakes

If the University of California, Berkeley's AMPLab doesn't ring bells, perhaps some of its projects will: Spark and Mesos.

AMPLab was planned all along as a five-year computer science research initiative, and it closed down as of last November after running its course. But a new lab is opening in its wake: RISELab, another five-year project at UC Berkeley with major financial backing and the stated goal of "focus[ing] intensely for five years on systems that provide Real-time Intelligence with Secure Execution [RISE]."

AMPLab was created with "a vision of understanding how machines and people could come together to process or to address problems in data -- to use data to train rich models, to clean data, and to scale these things," said Joseph E. Gonzalez, Assistant Professor in the Department of Electrical Engineering and Computer Science at UC Berkeley.

RISELab's web page describes the group's mission as "a proactive step to move beyond big data analytics into a more immersive world," where "sensors are everywhere, AI is real, and the world is programmable." One example cited: Managing the data infrastructure around "small, autonomous aerial vehicles," whether unmanned drones or flying cars, where the data has to be processed securely at high speed.

Other big challenges Gonzalez singled out include security, but not the conventional focus on access controls. Rather, it involves concepts like "homomorphic" encryption, where encrypted data can be worked on without first having to decrypt it. "How can we make predictions on data in the cloud," said Gonzalez, "without the cloud understanding what it is it's making predictions about?"
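To make the idea of computing on encrypted data concrete, here is a minimal toy sketch in Python of the Paillier scheme, which is only additively homomorphic: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. This is an illustration chosen for this article, not something RISELab or Gonzalez described. The function names and the tiny primes are assumptions, and the key size is far too small to be secure.

```python
import math
import random

# Toy Paillier cryptosystem (additively homomorphic).
# The primes are illustrative assumptions; real keys use primes of 1024+ bits.

def keygen(p=293, q=433):
    n = p * q
    # lambda = lcm(p - 1, q - 1)
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    # With generator g = n + 1, mu is the inverse of lambda modulo n.
    # pow() with a negative exponent needs Python 3.8 or later.
    mu = pow(lam, -1, n)
    return n, (lam, mu)

def encrypt(n, m):
    n2 = n * n
    # Pick a random r that is invertible modulo n.
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, priv, c):
    lam, mu = priv
    n2 = n * n
    l_value = (pow(c, lam, n2) - 1) // n
    return (l_value * mu) % n

def add_encrypted(n, c1, c2):
    # Multiplying ciphertexts adds the underlying plaintexts (mod n).
    return (c1 * c2) % (n * n)

n, priv = keygen()
c_sum = add_encrypted(n, encrypt(n, 20), encrypt(n, 22))
print(decrypt(n, priv, c_sum))
```

The party that calls `add_encrypted` never sees 20 or 22, only ciphertexts; only the holder of the private key can read the total. Fully homomorphic schemes, which support arbitrary computation, are much more involved and much slower.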

Though the lab is in its early days, a few projects have already started to emerge:

Clipper

Machine learning involves two basic kinds of work: Creating models from which predictions can be derived and serving up those predictions from the models. Clipper focuses on the second task and is described as a "general-purpose low-latency prediction serving system" that takes predictions from machine learning frameworks and serves them up with minimal latency.

Clipper has three aims that ought to draw the attention of anyone working with machine learning: One, it accelerates serving up predictions from a trained model. Two, it provides an abstraction layer across multiple machine learning frameworks, so a developer only has to program to a single API. Three, Clipper's design makes it possible to adapt dynamically to how individual models respond to requests -- for instance, by giving priority to a model that works better for a particular class of problem. Right now there's no explicit mechanism for this, but it is a future possibility.
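The single-API idea can be sketched in a few lines of Python. What follows is not Clipper's actual interface; the class name `PredictionFrontend`, its methods, and the use of a simple in-memory cache to cut latency are all assumptions made for illustration. Each model, whatever framework produced it, is wrapped as a plain callable and served through one `predict` method.

```python
from typing import Callable, Dict, Hashable, Tuple

class PredictionFrontend:
    """Hypothetical single entry point over models from different frameworks."""

    def __init__(self) -> None:
        self._models: Dict[str, Callable[[Hashable], object]] = {}
        self._cache: Dict[Tuple[str, Hashable], object] = {}
        self.cache_hits = 0

    def register(self, name: str, predict_fn: Callable[[Hashable], object]) -> None:
        # predict_fn hides the framework-specific call behind one signature.
        self._models[name] = predict_fn

    def predict(self, name: str, x: Hashable) -> object:
        # Inputs must be hashable so repeated queries can be answered from cache.
        key = (name, x)
        if key in self._cache:
            self.cache_hits += 1
            return self._cache[key]
        result = self._models[name](x)
        self._cache[key] = result
        return result

# Stand-in "models"; in practice these would wrap real framework calls.
frontend = PredictionFrontend()
frontend.register("doubler", lambda x: 2 * x)
frontend.register("parity", lambda x: "even" if x % 2 == 0 else "odd")

print(frontend.predict("doubler", 3))
print(frontend.predict("parity", 3))
print(frontend.predict("doubler", 3), frontend.cache_hits)
```

A layer like this is also a natural place to record how each model performs on incoming requests, which is the kind of information the dynamic prioritization described above would need.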

Opaque

It seems fitting that a RISELab project would complement work done by AMPLab, and one does: Opaque works with Apache Spark SQL to enable "very strong security for DataFrames." It uses Intel SGX processor extensions to allow DataFrames to be marked as encrypted and have all their operations performed within an "SGX enclave," where data is encrypted in-place using the AES algorithm and is only visible to the application using it via hardware-level protection.

Gonzalez says this delivers the benefits of homomorphic encryption without the performance cost. The performance hit for using SGX is around 50 percent, but the fastest current implementations of homomorphic algorithms run 20,000 times slower. On the other hand, SGX-enabled processors are not yet offered in the cloud, although Gonzalez said this is slated to happen "in the near future." The biggest stumbling block, though, may be the implementation, since in order for this to work, "you have to trust Intel," as Gonzalez pointed out.

Ground

Ground is a context management system for data lakes. It provides a mechanism, implemented as a RESTful service in Java, that "enables users to reason about what data they have, where that data is flowing to and from, who is using the data, when the data changed, and why and how the data is changing."

Gonzalez noted that data aggregation has moved away from strict, data-warehouse-style governance and toward "very open and flexible data lakes," but that makes it "hard to track how the data came to be." In some ways, he pointed out, knowing who changed a given set of data and how it was changed can be more important than the data itself. Ground provides a common API and meta model for tracking such information, and it works with many data repositories. (The Git version control system, for instance, is one of the supported data formats in the early alpha version of the project.)
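The kind of context described above (who changed the data, why, and what it was derived from) can be pictured as a chain of version records. The Python sketch below is a hypothetical illustration, not Ground's API or meta model; `VersionRecord`, `LineageLog`, and every field name are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass(frozen=True)
class VersionRecord:
    """One change to a dataset: who made it, why, and what it was derived from."""
    version_id: str
    dataset: str
    author: str
    reason: str
    parent_id: Optional[str] = None

class LineageLog:
    """Hypothetical in-memory store of version records."""

    def __init__(self) -> None:
        self._records: Dict[str, VersionRecord] = {}

    def record(self, rec: VersionRecord) -> None:
        # A version may only point at a parent that is already known.
        if rec.parent_id is not None and rec.parent_id not in self._records:
            raise KeyError(f"unknown parent version: {rec.parent_id}")
        self._records[rec.version_id] = rec

    def history(self, version_id: str) -> List[VersionRecord]:
        # Walk parent links from the given version back to the original.
        chain: List[VersionRecord] = []
        current: Optional[str] = version_id
        while current is not None:
            rec = self._records[current]
            chain.append(rec)
            current = rec.parent_id
        return chain

log = LineageLog()
log.record(VersionRecord("v1", "clickstream", "alice", "initial load"))
log.record(VersionRecord("v2", "clickstream", "bob", "removed bot traffic", "v1"))

for rec in log.history("v2"):
    print(rec.version_id, rec.author, rec.reason)
```

Asking for the history of `v2` answers the "how did this data come to be" question for that version, newest change first.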

Gonzalez admitted that defining RISELab's goals can be tricky, but he noted that "at its core is this transition from how we build advanced analytics models, how we analyze data, to how we use that insight to make decisions -- connecting the products of Spark to the world, the products of large-scale analytics."