Key steps to model creation: data cleaning and data exploration

By following best practices and philosophies around these processes, an organization can enable successful collaboration and iteration between data science and IT teams


The explosion of data in the modern world has created many novel business problems when it comes to applying modeling and analysis. Businesses are starting to recognize the value that a mature, robust analytics practice can bring to both their understanding of the industry and their bottom line.

As I’ve detailed in previous articles, such as “Handing off models from data science to IT,” the relationship between IT teams and data scientists can lead to complications in model creation and deployment. One of the key themes I used to illustrate the needs of these two groups is collaboration. But while coming together as an organized unit can be incredibly beneficial, so can focusing on the specific needs and wants of each group. For the IT and data science teams to collaborate, each needs to be able to perform well in its own lane. This means letting the IT team work on IT and letting the data scientists be scientists. It’s time we talk about the data science role.

Data cleaning

To get us started, let’s focus on one of the lesser-known steps: data cleaning. Data cleaning, sometimes referred to as data munging or exploratory data analysis, is the process of examining raw data and condensing it down to a more usable form. I’d argue that this is actually one of the most essential aspects of a successful data science project.

You can explain its importance in two ways: the value it provides the data scientist in terms of useful inputs for subsequent models, and the knowledge it provides both the data scientist and the client about the data itself and the underlying process that generated it, whether the data comes from a web-generated survey, an instrument sensor, or a credit card transaction.

The process of data cleaning is instrumental in revealing insights into the data that will eventually translate into real value for the end user. Understanding what is going on in the data is key to the development and delivery of the data science solution.

In a real-life scenario, let’s assume the data scientist receives data for analysis and examination. This can be a plain text file (.txt or .csv), JSON or some other readable format, or a binary format that can be read and converted to a more readable one. Rarely is data at this stage in a form that is directly usable by the data scientist.

Once the data comes through, the first step is to characterize the nature of the fields. The data is usually arranged as records, one per line, with several fields or variables per record; this is typically a matrix format. Characterize each field by asking the following:

  • Is it character data (e.g., name and address), logical, or numeric?
  • If numeric, is it integer or floating point?
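The questions above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a prescribed method: the sample data, the column names, and the helper names (infer_type, characterize) are all hypothetical, and the rule that only "true" and "false" count as logical values is an assumption made for the example.

```python
import csv
import io

# Hypothetical raw extract; column names and values are made up for illustration.
RAW = """name,age,balance,active
Ana,34,1520.50,true
Ben,41,310.00,false
Cy,29,87.25,true
"""

# Assumption: these are the only spellings treated as logical values.
LOGICAL = {"true", "false"}


def _parses(cast, text):
    """Return True if cast(text) succeeds without a ValueError."""
    try:
        cast(text)
        return True
    except ValueError:
        return False


def infer_type(values):
    """Classify a column as logical, integer, floating point, or character."""
    present = [v.strip() for v in values if v.strip() != ""]
    if not present:
        return "unknown"
    if all(v.lower() in LOGICAL for v in present):
        return "logical"
    if all(_parses(int, v) for v in present):
        return "integer"
    if all(_parses(float, v) for v in present):
        return "floating point"
    return "character"


def characterize(text):
    """Map each field name in a CSV string to its inferred type."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return {field: infer_type([row[field] for row in rows]) for field in rows[0]}


field_types = characterize(RAW)
for field, kind in field_types.items():
    print(f"{field}: {kind}")
```

A check like this only suggests a type; a field that looks numeric may still be a code or an identifier, so the result is a starting point for questions to the data owner rather than a final answer.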

Getting through this stage should uncover and address two potential problems with the raw data: outliers and missing values. Outliers can be miscodings (e.g., a character value where a numeric one is expected) or genuine observations that lie far from the bulk of the data. Missing values can be recoded to a null value, recoded to a data value (typically the mean of the field), or removed from the data set, although removal is usually not recommended.
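As a rough sketch of how these checks might look in Python, the example below recodes missing and miscoded values to a null, flags outlying observations, and fills nulls with the field mean. The readings and function names are hypothetical. The outlier rule (distance from the median greater than a multiple of the median absolute deviation) is one common choice picked for illustration; the cutoff of 5 is an assumption, not a recommendation.

```python
import statistics

# Hypothetical sensor readings as raw strings; "" is missing, "n/a" is a miscoding.
raw_readings = ["10.1", "9.8", "", "10.3", "n/a", "55.0", "10.0", "9.9"]


def parse_reading(text):
    """Return a float, or None when the value is missing or miscoded."""
    try:
        return float(text)
    except ValueError:
        return None


def flag_outliers(values, k=5.0):
    """Return observed values farther than k median absolute deviations from the median.

    Assumption: k=5.0 is an arbitrary illustrative cutoff. If the median
    absolute deviation is zero, every value that differs from the median is flagged.
    """
    observed = [v for v in values if v is not None]
    center = statistics.median(observed)
    mad = statistics.median([abs(v - center) for v in observed])
    return [v for v in observed if abs(v - center) > k * mad]


def impute_mean(values):
    """Replace None with the mean of the observed values in the field."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


parsed = [parse_reading(r) for r in raw_readings]   # nulls where parsing failed
outliers = flag_outliers(parsed)                    # candidates for review
imputed = impute_mean(parsed)                       # nulls filled with the mean

print("outliers:", outliers)
print("imputed:", imputed)
```

Note that the flagged reading pulls the field mean upward, so in practice it may be worth reviewing outliers with the client before choosing a fill value.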

The next step is to identify potential aggregations of data that will be useful in the modeling process. This is where the goal of the project should take center stage and inform what is being identified. It is important for the data scientist to be in close contact with the client or end user at this stage. Questions that arise concerning variable formats, the nature of the variables, how they were generated and collected, and possible data dependencies should be resolved quickly so as not to impede progress.
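To make the idea of an aggregation concrete, here is a small Python sketch that rolls individual credit card transactions up to one summary row per customer. The transaction data, the field names, and the choice of count, total, and mean as summaries are all assumptions for illustration; which aggregations are actually useful depends on the goal of the project.

```python
from collections import defaultdict

# Hypothetical credit card transactions: (customer_id, amount).
transactions = [
    ("c1", 20.0),
    ("c2", 5.5),
    ("c1", 30.0),
    ("c3", 12.0),
    ("c2", 4.5),
]


def aggregate_by_customer(rows):
    """Summarize transactions as count, total, and mean amount per customer."""
    amounts = defaultdict(list)
    for customer_id, amount in rows:
        amounts[customer_id].append(amount)
    return {
        customer_id: {
            "count": len(values),
            "total": sum(values),
            "mean": sum(values) / len(values),
        }
        for customer_id, values in amounts.items()
    }


summary = aggregate_by_customer(transactions)
for customer_id, stats in summary.items():
    print(customer_id, stats)
```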

Data exploration

So now you have the data, and it’s in a format that’s suitable to begin working with. Now what?

While the goal of data cleaning is to prepare the data for use in modeling, the goal of data exploration is to uncover patterns and relationships in the data. This process is invaluable to the business and can surface previously unknown relationships between features, other actionable phenomena, or potentially even the finding that the goal of the modeling project cannot be achieved with the data available.
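One simple way to start looking for relationships between features is to compute pairwise correlations and rank them. The sketch below does this with a hand-written Pearson correlation in plain Python. The feature names and values are invented for the example, and a linear correlation is only one of many possible lenses; it will miss nonlinear relationships.

```python
import math
from itertools import combinations

# Hypothetical cleaned features, one list of values per field.
features = {
    "ad_spend": [1.0, 2.0, 3.0, 4.0, 5.0],
    "visits": [12.0, 19.0, 33.0, 41.0, 48.0],
    "returns": [5.0, 4.0, 4.0, 2.0, 1.0],
}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Correlation for every unordered pair of fields.
correlations = {
    (a, b): pearson(features[a], features[b])
    for a, b in combinations(features, 2)
}

# Rank pairs by strength of relationship, ignoring sign.
ranked = sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)
for (a, b), r in ranked:
    print(f"{a} vs {b}: {r:+.3f}")
```

A ranking like this is a prompt for further investigation, not a conclusion: a strong correlation still needs to be interpreted against what is known about how the data was generated.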

A critical feature of success at this stage is the data science team’s capability to rapidly iterate both in data manipulations and generation of model prototypes. By necessity, data exploration involves experimentation by the modeler. As a consequence, any limitations, either in tools available or system performance, may negatively impact the modeling effort.

At the same time, the modeling environment should be checkpointed so that every step in this iterative process is captured and reproducible. Containerized modeling environments (such as a Jupyter notebook server running in a Docker container) strike a good balance between flexibility and reproducibility, and there are already Docker container images for many common modeling tools, such as Python, R, Scala/Spark, and even SAS.
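Inside such an environment, capturing each step can be as lightweight as logging what was done and a fingerprint of the data it produced. The following Python sketch is one hypothetical way to do that; the ExperimentLog class, its fields, and the choice of a truncated SHA-256 hash are illustrative assumptions, not features of Jupyter or Docker.

```python
import hashlib
import json


def fingerprint(obj):
    """Return a short, stable hash of JSON-serializable data."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


class ExperimentLog:
    """Hypothetical in-memory record of the steps taken during exploration."""

    def __init__(self):
        self.steps = []

    def record(self, name, params, data):
        """Append one step with its parameters and a hash of its output data."""
        entry = {
            "step": len(self.steps) + 1,
            "name": name,
            "params": params,
            "data_hash": fingerprint(data),
        }
        self.steps.append(entry)
        return entry


log = ExperimentLog()

data = [3.0, 1.0, 2.0]
log.record("load", {"source": "example"}, data)

scaled = [v / max(data) for v in data]
log.record("scale", {"method": "divide by max"}, scaled)

# The log can be saved next to the notebook so a run can be compared later.
print(json.dumps(log.steps, indent=2))
```

Rerunning the same steps on the same data should reproduce the same hashes, which gives a quick check that an earlier result can still be regenerated.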

Depending on the background and experience of the data science team, this can also be an excellent time to involve subject matter experts from elsewhere in the organization. Intuition and experience from subject matter experts can help direct data exploration, as well as interpret patterns and relationships discovered in the data. The best models incorporate intuition and knowledge about the underlying mechanisms relating the data to the response.

Both data cleaning and data exploration are key steps in the model creation process, and by following best practices and philosophies around these processes, an organization can enable successful collaboration and iteration between data science and IT teams.

Copyright © 2018 IDG Communications, Inc.