Manage data responsibly, even if it's big

In the Wild West of big data, the rules of traditional information management don't always apply easily. But should you give up so quickly on data quality, governance, proper metadata, and security?


Modern data management technologies open up so many new possibilities that it sometimes becomes all too easy -- and incredibly tempting -- to cross the guardrails that usually surround IT projects. These newfound possibilities carry multiple implications, and their impact can be far-reaching.

Here are a few areas to watch for.

Unclear levels of data quality

A school of thought among Hadoop practitioners advocates the concept of the "data lake." Basically, this consists of throwing into a Hadoop cluster any data point one can get one's hands on and sorting things out later (quality controls, cleansing, enrichment, etc.). This approach stands in contrast to traditional information management techniques, which encompass careful filtering, cleansing, and enriching -- essentially vouching for any data stored in the enterprise data warehouse, which is therefore a trusted data source.

Organizations implementing a data lake need to be especially careful to document the level of quality (or suspected non-quality) of any data stored in the lake. But even with these warnings, users will probably end up consuming unclean data.
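One way to make that risk explicit is to record a declared quality level alongside every dataset that lands in the lake. The sketch below is purely illustrative -- the catalog, dataset names, and quality levels are assumptions, not a prescription of any particular tool:

```python
# Minimal, illustrative sketch: tag each data-lake dataset with a declared
# quality level so consumers can check what they are trusting before use.
# The registry, dataset names, and quality levels are hypothetical.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class DatasetRecord:
    path: str                               # location in the lake, e.g. an HDFS path
    quality: str                            # "raw", "cleansed", or "certified"
    caveats: list = field(default_factory=list)
    ingested_on: date = field(default_factory=date.today)

catalog: dict[str, DatasetRecord] = {}

def register(name: str, record: DatasetRecord) -> None:
    """Record what we know (and don't know) about a dataset's quality."""
    catalog[name] = record

def require_quality(name: str, minimum: str) -> DatasetRecord:
    """Fail fast if a consumer asks for more quality than was declared."""
    order = ["raw", "cleansed", "certified"]
    record = catalog[name]
    if order.index(record.quality) < order.index(minimum):
        raise ValueError(f"{name} is only '{record.quality}' quality: {record.caveats}")
    return record

# Example: raw social feeds land untouched; a report should refuse to use them.
register("social_feed_2016", DatasetRecord(
    path="/lake/raw/social_feed/2016",
    quality="raw",
    caveats=["no deduplication", "unknown completeness"],
))

try:
    require_quality("social_feed_2016", "cleansed")
except ValueError as err:
    print(err)  # the consumer is told the data is only 'raw' quality
```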

Loose data governance

Because many big data projects are still exploratory and driven by agile/iterative methods, existing data governance processes typically don't apply well. It is challenging, for example, to identify a business owner and designate data stewards for data the organization does not own (such as social feeds).

Data stewardship can also be difficult to apply, and often involves a tradeoff with responsiveness and real-time delivery. Tools for data science, data preparation, and the like are still immature and focus more on ease of use than on processes and traceability/auditability. As a result, data governance of big data often takes a back seat, which may be acceptable at the prototyping stage but clearly needs to be remedied when projects become mission critical.

Unclear metadata

The schema-on-read model, made possible in the big data world by NoSQL databases and of course Hadoop itself, offers added flexibility to developers. It also creates fertile ground for data science, which is based on exploring data and its relationships. But it allows (almost) anything to be stored (almost) anywhere, leaving the application responsible for understanding and managing metadata. This is in some ways similar to the early days of data processing on mainframes, but also very different, since data is typically no longer owned by a single application.

Metadata, whether enforced by a relational database engine or merely documented, is what guarantees that all applications will treat data the same way and apply the same semantic rules. It may not be desirable for big data projects to replicate the constraining nature of RDBMS metadata, but centralizing and maintaining a metadata repository is a must-have.
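To illustrate what "centralized" can mean in practice, the sketch below assumes PySpark is used on top of the cluster; the schema, field names, and paths are hypothetical. The point is that the schema applied on read lives in one shared place rather than being re-invented by every application:

```python
# Minimal sketch (PySpark assumed; paths and field names are illustrative):
# keep the schema in one shared module or repository instead of letting each
# application interpret the raw files its own way at read time.
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, StringType, DoubleType, TimestampType,
)

# In practice this definition would live in a central metadata repository
# (or shared library) that every consuming application imports.
TRANSACTIONS_SCHEMA = StructType([
    StructField("account_id", StringType(), nullable=False),
    StructField("amount", DoubleType(), nullable=True),
    StructField("posted_at", TimestampType(), nullable=True),
])

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Schema-on-read: the JSON files in the lake carry no enforced structure,
# but every application applies the same agreed-upon schema when reading.
transactions = (
    spark.read
    .schema(TRANSACTIONS_SCHEMA)
    .json("/lake/raw/transactions/")   # hypothetical path in the data lake
)
transactions.printSchema()
```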

Deficient security

Much has been written about security in Hadoop -- or rather the lack thereof -- and it is indeed scary to see how little is being done to secure big data projects. Nonetheless, security is one of the keys to proper data governance. Securing a big data project takes many forms, but probably the most sensitive question today is who gets access to which data.

In a traditional business intelligence infrastructure, a business analyst only gets access to the records, or parts of records, that they have clearance to view. But what's the point of this clearance system if the same records are fully visible in Hadoop to the same business analyst who has just been promoted to data scientist? I am not only talking about new data types and sources (social media, for example, can contain sensitive or private information) but also very traditional sources such as account details and transactions, which are inherently part of big data projects too.

Stringent data access rules can block a data scientist's exploration process -- that much is true. But in that case, the data scientist must be specifically trained and certified in the handling of sensitive information. And when a project goes live, the weak or nonexistent security layer of Hadoop must be reinforced and hardened.
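To make the point concrete, here is a minimal sketch, again assuming PySpark, of masking sensitive columns before a dataset is exposed to an exploration environment. The column names and paths are illustrative, and a real deployment would lean on the cluster's own authorization layer rather than ad hoc jobs:

```python
# Minimal sketch (PySpark assumed; column names and paths are illustrative):
# apply in the lake the same clearance rules the BI layer already enforces,
# by dropping or pseudonymizing sensitive columns before analysts get a copy.
from pyspark.sql import SparkSession
from pyspark.sql.functions import sha2, col

spark = SparkSession.builder.appName("masking-demo").getOrCreate()
accounts = spark.read.parquet("/lake/curated/accounts/")  # hypothetical path

SENSITIVE = ["ssn", "email"]       # columns the analyst is not cleared to see
PSEUDONYMIZE = ["account_id"]      # keep joinable, but not directly readable

masked = accounts.drop(*SENSITIVE)
for c in PSEUDONYMIZE:
    masked = masked.withColumn(c, sha2(col(c).cast("string"), 256))

# Only the masked view is exposed to the exploration sandbox.
masked.write.mode("overwrite").parquet("/lake/sandbox/accounts_masked/")
```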
