Data lakes: A better way to analyze customer data

Early adopters share their experiences using data lakes to find patterns in 'good enough' information


Lakes aren't about data perfection

Synchrony Financial, a consumer finance company that provides private-label and co-branded credit cards through Synchrony Bank, currently operates data warehouses and a data lake. Although the firm's data lake is in pilot mode, CTO Greg Simpson expects heavier usage in the near future.

Simpson says he has the good fortune that most of the data coming from other financial sources into the lake is clean and standardized since the industry is highly regulated. The data lake will be instrumental in integrating social media data to foster a deeper dive into customer behavior and market trends.

"I would prefer to have clean data, but I've gotten over that," he says. "The reality is that we need to be able to do analysis to optimize our current business and find adjacent businesses. That means, no, we're not going to normalize and massage and create this master data model and data mart."

As an example, in analyzing a customer's shopping habits to figure out how to market to them, Synchrony doesn't need to know with precision if the customer shopped more on a 78-degree than on a 79-degree day. "We just need to know that it was a pretty nice day," Simpson says. With that information, Synchrony can determine when a customer would want to see a store's offer pop up on his or her smartphone.
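
To make the "good enough" idea concrete, here is a minimal Python sketch of coarse bucketing. The function name, the labels, and the temperature cutoffs are all hypothetical, chosen only for illustration; nothing here reflects how Synchrony actually processes its data.

```python
def weather_bucket(temp_f: float) -> str:
    """Map a Fahrenheit temperature to a coarse label (hypothetical cutoffs)."""
    if temp_f < 50:
        return "cold"
    if temp_f < 70:
        return "mild"
    if temp_f < 85:
        return "pretty nice"
    return "hot"


# 78 and 79 degrees fall into the same bucket, so the one-degree
# difference never reaches the analysis.
readings = [78, 79, 45, 92]
labels = [weather_bucket(t) for t in readings]
print(labels)
```

The design choice is to give up precision deliberately: once readings are reduced to a handful of labels, small measurement differences in the raw data stop mattering.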

Synchrony uses Hadoop as the framework and analysts use SQL as the interface to pull in data from the company's ecosystem of merchants as well as external sources. "Hadoop has become commercialized to the point where tool sets are available to easily implement it," Simpson says.

For now, the data lake will grow project by project, with more data being brought in as needed. Eventually, though, there will be enough data in there that data scientists can study the lake as a whole and find even more value.

For instance, Project A might focus on how Synchrony targets and markets to consumers and Project B on analyzing call center data and optimizing it. "Project C, which we didn't even think of, could arise from data from both of these projects as well as other sources in the lake," he says.

Simpson is a fan of the data lake concept because it can avoid the more complex and long-term tasks associated with data warehouses. "Our data warehouses are often used for highly repeatable, less frequent things like monthly financial reporting. These are point-in-time questions that we are not going to go back to again, such as whether something is a popular color right now," he says. "If it takes you six weeks to act on it, then you'll lose out on that business."

Understanding lakes' context and metadata

One issue perplexing Simpson is how to understand context once the data is in the lake. For instance, if he pulls in Facebook posts and wants to assess the level of negativity in them, he would want to know whether one out of 10 posts was bad, or one out of 1,000, and where the comments were posted.
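
The kind of context Simpson describes can be sketched as a negativity rate broken out by where each post appeared. This is an illustrative Python sketch only: the field names `source` and `negative`, the sample posts, and the assumption that posts already carry a negative/not-negative label are all hypothetical.

```python
from collections import defaultdict


def negativity_by_source(posts):
    """Return {source: (negative_count, total_count, rate)} for labeled posts."""
    counts = defaultdict(lambda: [0, 0])  # source -> [negative, total]
    for post in posts:
        tally = counts[post["source"]]
        tally[1] += 1
        if post["negative"]:
            tally[0] += 1
    return {src: (neg, total, neg / total) for src, (neg, total) in counts.items()}


# Made-up sample data: the same single negative post means something
# different when it is one of four posts than when it is the only one.
posts = [
    {"source": "brand_page", "negative": True},
    {"source": "brand_page", "negative": False},
    {"source": "brand_page", "negative": False},
    {"source": "brand_page", "negative": False},
    {"source": "public_group", "negative": True},
]
summary = negativity_by_source(posts)
print(summary)
```

Keeping the total alongside the negative count is the point: a bare count of bad posts cannot distinguish one in 10 from one in 1,000.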

Eric Fegraus, senior director of technology and external relations at environmental nonprofit Conservation International, has similar concerns about metadata now that he plans to create a data lake.

Data that currently gets siloed in government agencies, universities, and nonprofit institutions would be shared in the data lake. "Traditionally in the natural resources world of forestry, biodiversity, ecology and marine ecosystems, there is a tremendous lack of data," Fegraus said. That is starting to change because of sensors, cameras and other IoT devices that can feed data captured remotely back to a central repository.

Fegraus wants to develop and implement best practices for data capture soon so that information gathered by scientists doesn't disappear when their funding ends or they switch projects. "We are actively building a system that will enable data repositories to share and integrate data. It will function like a lake but with many interconnected nodes," he says.

To pilot the project, Fegraus intends to populate an initial node of the data lake with images, sound, and metadata from the organization's thousand cameras set up in the wild across the world. "The data enables us to understand what is happening with wildlife populations and provides land managers with data-driven insights into the status and trends of wildlife populations on their lands," he explains.

"We can also start to tease apart what could be driving the trends we are seeing," Fegraus says. In one wildlife park in Uganda, after cameras were set up, personnel started noticing a decline in the golden cat species.

"We could also tell that there was a strong signal that human presence could be impacting this particular cat," he explains. "Well, it turns out that this park is sustained by gorilla ecotourism and tourists hike along trails to go see the gorillas. They connected the golden cat decline with shifts in their use of trails in the park, and they now have insight into how to better manage their park so as to not impact the golden cat."

While his team will use the data for their projects, other organizations will have access to the same data set.

But, like Simpson, Fegraus anticipates the metadata being tricky. Data use agreements among participating organizations most likely will stipulate the use of metadata to maintain the integrity of the data. For example, whether the scientist used bait or a flash with the camera is essential information that could impact outcomes, yet keeping it with the raw data could prove challenging.
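
One way to picture keeping capture conditions attached to the raw data is to store them in the same record as the pointer to the image. The Python sketch below assumes a hypothetical `CameraCapture` record with made-up field names and a made-up file path; it is not Conservation International's schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CameraCapture:
    """One camera-trap image plus the conditions it was taken under."""
    image_path: str   # hypothetical location of the raw image
    site: str         # hypothetical site identifier
    used_bait: bool   # whether bait was used at the camera
    used_flash: bool  # whether the camera fired a flash


def to_record(capture: CameraCapture) -> str:
    """Serialize the capture and its metadata as a single JSON line."""
    return json.dumps(asdict(capture), sort_keys=True)


def from_record(line: str) -> CameraCapture:
    """Rebuild the capture, metadata included, from a JSON line."""
    return CameraCapture(**json.loads(line))


capture = CameraCapture("raw/cam_017/0001.jpg", "site_a", used_bait=False, used_flash=True)
restored = from_record(to_record(capture))
print(restored)
```

Because the bait and flash flags travel in the same record as the image reference, another organization reading the record gets both or neither.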

Dealing with the 'bottomless' notion

Another hurdle: What to collect and how long to keep it. "There's so much data you could collect, but you'll run out of space and there's a cost to it," Fegraus says. Therefore, the data lake would likely be filled with project-driven data and not just any available data.

While data lakes seem bottomless, they are not, according to Svetlana Sicular, research director for data management strategies at Gartner. "People get nervous that they might lose something -- so they collect everything they can. Then they also get very nervous because they need to show the value of the rapidly growing data lake. But the value is in the analytics," she says. And companies that treat data lakes as "write-only" will fail; in other words, people need to read from the lake as well as write to it to make the best use of the data.

"There is a common notion that data warehouses will go away and you will do everything in the data lake," Sicular says. "That is a fallacy. Why would you do something with technology that it was not designed for?" She adds that data lakes are only cost-effective if they are used the right way.

And she cautions organizations to look more carefully at their data warehouse and make sure it is not the right tool before going the lake route. Many people think of data warehousing as it was five years ago, Sicular says, but a lot of data warehouses today are capable of processing unstructured data. Also, she advises companies to consider how much of the data they want to analyze actually is unstructured.

The competitive advantage

So when should companies use data lakes? "If you need to analyze data of various types that does not make sense to store in the data warehouse," Sicular says. Another use case: "If taking the time to cleanse data would put you behind competition, then that is the perfect use case for a data lake," she adds.

That's exactly the rationale that drives International Trucks' use of a data lake. Andy Minteer, director of Internet of Things analytics and machine learning at the Navistar-owned truck maker, says data flowing into a Hadoop-based data lake enables the company to stay one step ahead of its competitors.

International Trucks has more than 160,000 vehicles enrolled in its OnCommand Connection program, which uses data from sensors streaming in from vehicles, including trucks and school buses, every 15 to 20 seconds to assess a fleet's health.

For instance, by analyzing the raw data in the data lake, the company was able to help a school bus fleet manager determine the threshold voltage for batteries to be replaced so the buses wouldn't break down on a cold day, leaving kids stranded outside.

His team also has developed an algorithm to comb through more than 40,000 combinations of vehicle types and fault codes (unstructured data) to assist smaller fleets with preventative maintenance schedules. Minteer studied the raw data of these on-highway fleets, which tend to drive higher-mileage vehicles cross-country, and figured out which issues were likely to arise and when -- so they could schedule repairs and avoid downtime.

"It's a race to get the value and opportunity, but the data lake tools do that now easily and cost effectively," he says. "We now know that data being available is more important than data being in a certain format."

This story, "Data lakes: A better way to analyze customer data" was originally published by Computerworld.


Copyright © 2016 IDG Communications, Inc.
