Use the cloud to create open, connected data lakes for AI, not data swamps

There needs to be a material change in the way people think of solving complex data problems


Produced by every single organization, data is the common denominator across industries as we look to advance how cloud and AI are incorporated into our operations and daily lives. Before the potential of cloud-powered data science and AI is fully realized, however, we first face the challenge of grappling with the sheer volume of data. This means figuring out how to turn its velocity and mass from an overwhelming firehose into an organized stream of intelligence.

To capture all the complex data streaming into systems from various sources, businesses have turned to data lakes. Often hosted in the cloud, these are storage repositories that hold enormous amounts of data, raw or refined, structured or unstructured, until it’s ready to be analyzed. The concept seems sound: the more data companies collect, the less likely they are to miss important patterns and trends in it.

However, a data scientist will quickly tell you that the data lake approach is a recipe for a data swamp, and there are a few reasons why. First, a good amount of data is often hastily stored, without a consistent strategy in place around how to organize, govern and maintain it. Think of your junk drawer at home: Various items get thrown in at random over time, until it’s often impossible to find something you’re looking for in the drawer, as it’s gotten buried.

This disorganization leads to the second problem: users often cannot find a dataset once it has been ingested into the data lake. Without an easy way to search for data, it’s nearly impossible to discover and use it, and difficult for teams to ensure that it stays in compliance or reaches the right knowledge workers. These problems combine to create a breeding ground for dark data: unorganized, unstructured, and unmanageable data.

Many companies have invested in growing their data lakes, but what they soon realize is that having too much information is an organizational nightmare. Multiple channels of data in a wide range of formats can cause businesses to quickly lose sight of the big picture and how their datasets connect.

Compounding the problem, incomplete or inadequate datasets add even more noise when data scientists are searching for the specific datasets they need. It’s like trying to solve a riddle without a critical clue. This leads to a major issue: data scientists spend on average only 20 percent of their time on actual data analysis, and 80 percent of their time finding, cleaning, and reorganizing data.

The power of the cloud

One of the most promising elements of the cloud is that it offers capabilities to reach across open and proprietary platforms to connect and organize all a company’s data, regardless of where it resides. This equips data science teams with complete visibility, helping them to quickly find the datasets they need and better share and govern them.

Accessing and cataloging data via the cloud also offers the ability to use and connect into new analytical techniques and services, such as predictive analytics, data visualization and AI. These cloud-fueled tools help data to be more easily understood and shared across multiple business teams and users—not just data scientists.

It’s important to note that the cloud has evolved. Preliminary cloud technologies required some assembly and self-governance, but today’s cloud allows companies to subscribe to an instant operating system in which data governance and intelligence are native. As a result, data scientists can get back to what’s important: developing algorithms, building machine learning models, and analyzing the data that matters.

For example, an enterprise can augment its data lake with cloud services that use machine learning to classify and cleanse incoming datasets. This helps organize the data and prepare it for ingestion into AI apps. The metadata from this process builds an index of all data assets, and data stewards can apply governance policies to that index to ensure that only authorized users can access sensitive resources.
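To make that workflow concrete, here is a minimal sketch in Python of an index of data assets with a governance policy on top. It is an illustration under stated assumptions, not any vendor's service: the names `DatasetEntry`, `DataCatalog`, and `classify`, the role names, and the `s3://example-lake/...` locations are all hypothetical, and simple keyword rules on column names stand in for the machine learning classifier.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Metadata recorded for one dataset at ingestion (hypothetical schema)."""
    name: str
    location: str
    tags: set[str] = field(default_factory=set)
    sensitive: bool = False


class DataCatalog:
    """Toy in-memory index of data assets with a simple access policy."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetEntry] = {}
        # Governance policy: roles that are allowed to see sensitive datasets.
        self._sensitive_roles: set[str] = set()

    def register(self, entry: DatasetEntry) -> None:
        self._entries[entry.name] = entry

    def allow_sensitive(self, role: str) -> None:
        self._sensitive_roles.add(role)

    def search(self, tag: str, role: str) -> list[str]:
        """Return names of datasets carrying `tag` that `role` may access."""
        can_see_sensitive = role in self._sensitive_roles
        return sorted(
            entry.name
            for entry in self._entries.values()
            if tag in entry.tags and (can_see_sensitive or not entry.sensitive)
        )


# Column names that mark a dataset as sensitive (assumed rule, for illustration).
SENSITIVE_HINTS = {"email", "ssn", "phone"}


def classify(columns: list[str]) -> tuple[set[str], bool]:
    """Stand-in for an ML classifier: tag a dataset from its column names."""
    lowered = {column.lower() for column in columns}
    tags: set[str] = set()
    if lowered & {"order_id", "amount", "price"}:
        tags.add("sales")
    if lowered & {"customer_id", "email"}:
        tags.add("customer")
    return tags, bool(lowered & SENSITIVE_HINTS)


catalog = DataCatalog()
catalog.allow_sensitive("data_steward")

incoming = [
    ("orders_2017", "s3://example-lake/raw/orders.csv",
     ["order_id", "amount", "customer_id"]),
    ("crm_contacts", "s3://example-lake/raw/contacts.csv",
     ["customer_id", "email", "phone"]),
]
for name, location, columns in incoming:
    tags, sensitive = classify(columns)
    catalog.register(DatasetEntry(name, location, tags, sensitive))

print(catalog.search("customer", role="analyst"))       # ['orders_2017']
print(catalog.search("customer", role="data_steward"))  # ['crm_contacts', 'orders_2017']
```

Both searches use the same tag, but the analyst sees only the non-sensitive dataset while the data steward sees both. The point of the sketch is that classification happens once, at ingestion, and both findability and access control then work from the same index.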

These actions set a data-driven culture in motion by giving teams the ability to access the right data at the right time. In turn, this gives them the confidence that all the data they share will only be viewed by appropriate teams.

Disillusioned with data? You’re not the only one

Even with cloud services and the right technical infrastructure, different teams are often reluctant to share their data. It’s all about trust. Most data owners are worried about a lack of data governance—the management of secure data—since they have no way of knowing who will use their data, or how they will use it. Data owners don’t want to take this risk, so they choose to hold onto their data, rather than share it or upload it into the data lake.

This can change. By shifting the focus from restricting the use of data to enabling access, sharing and reuse, organizations will realize the value that good governance and strong security deliver to a data lake, which can then serve as the intelligent backbone of every decision and initiative a company undertakes.

Overall, the amount of data that enterprises need to collect and analyze will continue to grow unabated. If nothing is done differently, so will the problems associated with it. Instead, there needs to be a material change in the way people think of solving complex data problems. It starts by solving data findability, management and governance issues with a detailed data index. This way, data scientists can navigate even the deepest parts of their data lakes and unlock the value of organized, indexed data, the foundation for AI innovation.

Copyright © 2017 IDG Communications, Inc.