The 80/20 data science dilemma

Most data scientists spend only 20 percent of their time on actual data analysis; the other 80 percent goes to finding, cleaning, and reorganizing huge amounts of data. That is an inefficient data strategy.


The emergence of cloud has led to an explosion of data that has left data scientists in high demand. A job that didn’t exist a decade ago has topped Glassdoor’s ranking of best roles in America for two years in a row, based on salary, job satisfaction, and number of job openings. It was even dubbed the “sexiest job of the 21st century” by the Harvard Business Review.

Though growing in number, data scientists remain scarce and busy. A recent study projects that demand for data scientists and analysts will grow by 28 percent by 2020, on top of current market need. According to LinkedIn, there were more than 11,000 data scientist job openings in the US as of late August. Unless something changes, this skills gap will continue to widen.

Against this backdrop, helping data scientists work more efficiently should be a key priority. That is why it's a problem that, currently, most data scientists spend only 20 percent of their time on actual data analysis.

The reason data scientists are hired in the first place is to develop algorithms and build machine learning models—and these are typically the parts of the job that they enjoy most. Yet in most companies today, 80 percent of a data scientist’s valuable time is spent simply finding, cleaning and reorganizing huge amounts of data. Without the right cloud tools, this task is insurmountable.

Hard work behind the scenes

When beginning to grapple with and make sense of the many different data streams coming in via cloud-connected devices and systems, data scientists must identify relevant data sets within their data storage repositories, otherwise known as data lakes, which is no small task.

Unfortunately, many organizations’ data lakes have turned into dumping grounds, with no easy way to search for data and unclear strategies and policies around what data is safe to share more broadly. Data scientists often find themselves contacting different departments for the data they need and waiting weeks for it to be delivered, only to find that it doesn’t provide the information they need, or worse, that it has serious quality issues. At the same time, responsibility for data governance (or data-sharing policies) often falls on data scientists, since corporate-level governance policies can be confusing, inconsistent, or difficult to enforce.

Even when they can get their hands on the right data, data scientists need time to explore and understand it. The data may be in a format that can’t be easily analyzed, and with little to no metadata to help, the data scientist may need to seek advice from the data owner. After all this, the data still needs to be prepared for analysis. This involves formatting, cleaning, and sampling the data. In some cases, scaling, decomposition, and aggregation transformations are required before data scientists are ready to start training their models.
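To make those preparation steps concrete, here is a minimal sketch in Python using pandas. The column names, values, and thresholds are entirely made up for illustration; real pipelines would pull data from a lake or warehouse rather than an inline table.

```python
import pandas as pd

# Stand-in for raw, messy data pulled from a data lake (hypothetical columns)
df = pd.DataFrame({
    "device_id": ["a", "a", "b", "b", "b", "b"],
    "temp":      [20.1, 20.1, 35.5, None, 31.0, 29.8],
    "target":    [0, 0, 1, 1, None, 0],
})

# Cleaning: drop exact duplicates and rows missing the label
df = df.drop_duplicates().dropna(subset=["target"])

# Formatting: cast the label to an integer type
df["target"] = df["target"].astype(int)

# Scaling: fill gaps, then standardize the numeric feature so values are comparable
df["temp"] = df["temp"].fillna(df["temp"].mean())
df["temp"] = (df["temp"] - df["temp"].mean()) / df["temp"].std()

# Aggregation: summarize readings per device
summary = df.groupby("device_id")["temp"].mean()
```

Even this toy version shows why the work is tedious: each step embeds a judgment call (what counts as a duplicate, how to impute a missing reading) that the data scientist must justify later.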

Organizational structure can also cause inefficiencies in the analysis process. Data scientists and developers traditionally work in silos, with each group performing a related but isolated task. This creates bottlenecks, increases the potential for error, and drains resources. A unified approach, which leverages cloud platforms and includes proper data governance, boosts efficiency and helps data scientists collaborate both internally and with developers.

Why it’s such a conundrum

These processes can be time-consuming and tedious, but they are crucial. Since models generally improve as they are exposed to increasing amounts of data, it’s in data scientists’ best interests to include as much data as they can in their analysis.

However, due to deadlines and time crunches, data scientists can often be tempted to make compromises on the data they use, aiming for “good enough” rather than optimal results.

Yet making hasty decisions during model development can lead to widely different outputs and potentially render a model unusable when it’s put into production. Data scientists are constantly making judgment calls, and starting out with incomplete data can easily lead them down the wrong path.

To balance quality against time constraints, data scientists are generally forced to focus on one model at a time. If something goes wrong, they are forced to start all over again. In effect, they’re obliged to double down on every hand, turning data science into a high-stakes game of chance.

Escaping these pitfalls

Using cloud data services to automate many of the tedious processes associated with finding and cleansing data helps to give data scientists back more time for analysis, without compromising the quality of the data they use, and enables them to build the best foundation for AI and cognitive apps.

A solid cloud data platform features intelligent search capabilities to help data scientists find the data they need, while metadata such as tags, comments and quality metrics help them decide whether a data set will be useful, and how best to extract value from it. Integrated data governance tools also give data scientists confidence that they are permitted to use a given data set, and that the models and results they produce will be used responsibly by others.

As a result, data scientists gain the time they need to build and train multiple models simultaneously. This spreads out the risk of analytics projects, encouraging experimentation that yields breakthroughs without focusing resources on a single approach that may turn out to be a dead end.

Cloud platforms can also equip data scientists with services to save, access, and extend models, enabling them to use existing assets as templates for new projects instead of starting from scratch every time. The concept of transfer learning—which focuses on preserving the knowledge gained while solving one problem and applying it to a different but related problem—is a hot topic in the machine learning world. Developing visualizations with data science tools helps communicate how models work while saving time and reducing risk.
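The save-and-extend pattern can be sketched in a few lines. This is an illustrative toy, not any particular platform's service: the data is synthetic, and the model choice (scikit-learn's `SGDClassifier`, which supports incremental fitting via `partial_fit`) simply stands in for "a model that can be serialized and trained further."

```python
import pickle
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)

# "Old" task: synthetic points labeled by the sign of the first feature
X_old = rng.normal(size=(200, 2))
y_old = (X_old[:, 0] > 0).astype(int)

model = SGDClassifier(random_state=0)
model.partial_fit(X_old, y_old, classes=[0, 1])

# Save the trained model so it can serve as a template later
blob = pickle.dumps(model)

# "New" related task: same rule, slightly shifted data
X_new = rng.normal(loc=0.3, size=(50, 2))
y_new = (X_new[:, 0] > 0.3).astype(int)

# Reload and continue training instead of starting from scratch
model2 = pickle.loads(blob)
model2.partial_fit(X_new, y_new)

accuracy = model2.score(X_new, y_new)
```

The reloaded model starts from the old task's learned weights, so the new task needs far less data and training than a fresh model would, which is the core appeal of treating existing models as templates.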

Data scientists play an essential role in pushing forward innovation and garnering competitive advantage for companies. By giving data science teams the cloud data tools needed to flourish today, the 80/20 dilemma becomes a thing of the past.

This article is published as part of the IDG Contributor Network.