There's a bit of absurdity here. If you throw data away, you can't get it back; if you keep it, you can eventually organize and purge what you don't need. The choice between establishing perfect governance and becoming a "data hoarder" is a false dilemma: those who store data now while getting their governance in place are not automatically hoarders.
The idea that you need to come up with a perfect plan before keeping any data or bringing in any new sources is a little like saying we need perfect social justice for everyone before we can address police killings of African-Americans.
Instead, get started now. Stop throwing out the baby with the bathwater and begin finding your use cases. Meanwhile, make data the point rather than a side effect of your processes and govern it accordingly. These aren't "steps," but initiatives you need to undertake, usually in parallel.
That said, how do you go about planning? How do you start cataloging your data and establish some structure around its evolution? There are traditional solutions like those covered in last year's Forrester Wave report -- Informatica, various IBM offerings, SAS, and Collibra, among others -- but some of these come with a lot of baggage and form part of a vendor's overall platform play.
Meanwhile, a new class of data governance tools is being developed specifically for Hadoop. These tools have less of a legacy, but are also less mature. They are focused on the Hadoop ecosystem rather than your whole organization, allowing you to integrate them more closely with your new data architecture.
Navigator is Cloudera's closed-source data governance offering. It incorporates security auditing and metadata management, provides automated data lineage tracking, and integrates with traditional data governance products like Informatica.
At its core, it tracks where data came from, what transformations were applied to it, where it landed, and where the heck it's located now. You can even set up rules (policies) for automatically tagging data based on its type and origin.
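Conceptually, lineage tracking of this kind boils down to recording edges between datasets and the transformations that produced them, then walking the graph backward. The sketch below uses entirely hypothetical structures and names (it is not Navigator's actual model or API), just to make the idea concrete:

```python
# Hypothetical lineage store: maps each dataset to the inputs and the
# transformation that produced it -- the kind of graph a lineage tool builds.
lineage = {}

def record(output, inputs, transform):
    """Record that `output` was produced from `inputs` by `transform`."""
    lineage[output] = {"inputs": list(inputs), "transform": transform}

def upstream(dataset):
    """Walk the graph back to the original sources of a dataset."""
    node = lineage.get(dataset)
    if node is None:
        return {dataset}          # no recorded parents: it's a source
    sources = set()
    for parent in node["inputs"]:
        sources |= upstream(parent)
    return sources

# Illustrative entries only -- paths and job names are made up.
record("/warehouse/sessions", ["/landing/clicks"], "sessionize.pig")
record("/reports/daily", ["/warehouse/sessions", "/landing/geo"],
       "daily_rollup.hive")

print(sorted(upstream("/reports/daily")))
# -> ['/landing/clicks', '/landing/geo']
```

A real tool captures these edges automatically from job execution rather than via explicit calls, but the answer it gives you is the same: which raw sources fed a given report.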
Navigator also allows you to trigger actions based on these policies, some of which aren't necessarily best done in Navigator (for example, triggering actions to archive or move data). Among the biggest concerns is that you can trigger auditing with or without Sentry, Cloudera's authorization module for Hadoop.
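The tag-and-trigger model described above is easy to picture as a list of policies, each pairing a predicate with a tag and an optional follow-up action. This is a minimal sketch with made-up class and field names, not Navigator's actual policy engine (which is configured through its own interface):

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical record of a dataset's metadata -- not Navigator's real model.
@dataclass
class Entity:
    path: str          # where the data is located
    source: str        # where it came from
    data_type: str     # e.g. "clickstream", "tabular"
    tags: set = field(default_factory=set)

# A policy pairs a predicate with a tag and an optional triggered action.
@dataclass
class Policy:
    matches: Callable[[Entity], bool]
    tag: str
    action: Optional[Callable[[Entity], None]] = None

def apply_policies(entity: Entity, policies: list) -> Entity:
    """Tag the entity per matching policies and fire any triggered actions."""
    for p in policies:
        if p.matches(entity):
            entity.tags.add(p.tag)
            if p.action:
                p.action(entity)   # e.g. archive or move the data
    return entity

# Example: tag anything landing from a (hypothetical) CRM feed as PII
# and queue it for review; tag clickstream data as raw.
policies = [
    Policy(lambda e: e.source == "crm_feed", "pii",
           action=lambda e: print(f"queued {e.path} for review")),
    Policy(lambda e: e.data_type == "clickstream", "raw"),
]

e = apply_policies(
    Entity("/data/landing/crm/2015-08-01", "crm_feed", "tabular"), policies)
print(sorted(e.tags))
# -> ['pii']
```

As the column notes, the triggered actions (archiving, moving data) are arguably better handled by a workflow tool than by the governance layer itself; the sketch just shows how tagging and triggering hang together.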
On the one hand, "choice is good," but on the other hand, if you go to the condiment counter at a fast food joint and find 15 brands of generic ketchup packages, which do you choose? I don't really need multiple paths for an audit implementation because...I just want to log the stuff already and I don't care about choice for that.
Hortonworks is newer to the data governance game. It has proposed Apache Atlas, which was accepted into Apache's incubator -- sometimes, but not always, a precursor to a mature project. The rise to top-level Apache project status is a very political process.
Atlas has high hopes, but it's pretty early on in its development. It integrates with Apache Ranger according to the README.txt, though that's the only use of the word "Ranger" in the whole source repository, and it isn't a lot of code. While Atlas is part of Hortonworks' recent 2.3 release, it's clearly an early cut, and probably not the core of your master-data-management or data governance initiative at this point.
The buyer's lament
With Sentry versus Ranger and Navigator versus Atlas, you're seeing a real split. On one hand, Cloudera offers a more mature, more complete offering; on the other hand, it's proprietary and already diverging from the less mature, less-thought-out Sentry product. Hortonworks answers with an open source offering, but obviously it integrates with Hortonworks' own preferred security implementation.
In other words, we're seeing a sort of Hadoop distribution lock-in with each new layer we add. Part of why we pick an open source technology is to put the choice back in the user's hands.
Neither Navigator nor Atlas is a particularly complete offering, and while it's nice that Navigator can work with existing data governance products such as Informatica, those products have their own plug-ins anyhow.
You have to ask: Do I need a Hadoop data governance solution or do I need a complete data governance solution that includes Hadoop? In many cases, I'd say the latter.
It would be nice to see full-on open source data governance software. But for now, if you compare against a mature, proprietary tool like Collibra, which offers a complete vision, you're unlikely to be happy even with Navigator. It would probably be easier for Collibra to deepen its Hadoop integration and offer better data lineage than for Cloudera to turn Navigator into a complete offering. If you're using a proprietary product anyhow, you might as well use a complete one that covers all of your data (and if you have a lot of it, you probably have Informatica anyhow).
Someday a complete open source data governance or master data management tool will emerge. But it can't be aligned with a single technology vertical. I mean, I don't really want Data Governance for Hadoop, Data Governance for MongoDB, Data Governance for Oracle and a freaking data lake project just to tie back together my metadata from my data governance tools.
The catch with many existing tools is they are heavy duty and suited to bureaucratic organizations that hold long-winded data governance committee meetings. For organizations just getting into data governance, who simply need to stop digging, the implementation costs can be daunting.
Whichever governance software you choose, remember that owning a hammer doesn't make you a carpentry business, just as having a data governance tool doesn't make your initiative happen. Governance is really about your processes -- the actual gathering and cataloging of data and how you think about data.
Meanwhile, whatever you do, don't listen to the naysayers and throw your data away because you haven't figured out how to govern it yet. That's like killing the patient because treatment is a lot of work.