Governance and Discovery
Data governance sounds like a candidate for the most boring topic in technology: something dreamed up by middle managers to add friction to data scientists’ lives. The funny thing about governance, though, is that it’s closely related to data discovery. And data discovery is neither dull nor a source of friction; it’s an exciting process that enables great data projects, ranging from traditional reporting to artificial intelligence and machine learning.
The idea of data governance originated in regulation and compliance. Not that long ago, data was a “wild west”: there were few rules and regulations about how it could be used or transferred, and most were industry-specific. That started to change with HIPAA, which covered medical data (though not much else). It changed in a big way with Europe’s GDPR, which enacted stringent requirements for how data is used and how individuals control the use of data about themselves; it also provided significant penalties for organizations that disobeyed the rules. In the US, California enacted a data privacy law (CCPA) that is similar to GDPR in many ways, and other states are likely to follow.
The need for data governance is simple. People who work with data need to take those regulations into account. They need to track the data they have, where it came from, who was allowed to modify it, and how it was modified. If their dataset merges multiple data sources, they have to track those other sources. They need to be able to find and delete data on short notice if a customer requests it (for example, by exercising the GDPR’s “right to be forgotten”). They need to know how the data was collected—not just whether consent was requested and granted, but how their data sources were chosen. Who (or what) appears in the dataset? Are the data sources biased, and how might those biases affect results? And this requires a set of tools that is more sophisticated than dumping the data into a data warehouse or submerging it in a data lake.
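To make the tracking requirement concrete, here is a minimal sketch of following data from a source into the datasets derived from it, then deleting one customer's rows everywhere the data flowed. Everything in it is an assumption made for illustration: the `Dataset` class, the idea that rows are keyed by customer ID, and the in-memory `catalog` dictionary stand in for whatever storage and lineage tooling an organization actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical dataset: rows keyed by customer ID, plus the names of its upstream sources."""
    name: str
    sources: list[str] = field(default_factory=list)
    rows: dict[str, dict] = field(default_factory=dict)


def downstream_of(origin: str, catalog: dict[str, Dataset]) -> set[str]:
    """Return the origin dataset plus every dataset derived from it, directly or indirectly."""
    affected = {origin}
    changed = True
    while changed:  # repeat until no new derived datasets are discovered
        changed = False
        for ds in catalog.values():
            if ds.name not in affected and affected.intersection(ds.sources):
                affected.add(ds.name)
                changed = True
    return affected


def erase_customer(customer_id: str, origin: str, catalog: dict[str, Dataset]) -> list[str]:
    """Delete one customer's rows from the origin and all derived datasets.

    Returns the names of the datasets that actually held a row for the customer.
    """
    removed_from = []
    for name in sorted(downstream_of(origin, catalog)):
        if catalog[name].rows.pop(customer_id, None) is not None:
            removed_from.append(name)
    return removed_from
```

The sketch only works because each dataset records its sources. Without that lineage metadata there is no way to know which merged or derived tables still hold a customer's data. A real erasure request would also have to start from every source that holds the customer, not from a single origin.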
But a funny thing happened. At the same time that companies had to prepare for increased regulation and scrutiny, they were also becoming more sophisticated about how they were using data. They were experimenting with machine learning and artificial intelligence; they were building models that could easily go astray (with embarrassing repercussions) if they were based on data that was out of date or erroneous. And they realized that their data science teams were spending an inordinate amount of time searching for data, which was frequently locked up in a departmental silo or submerged in a data swamp. They often compounded the time spent searching when they realized, after starting their analysis, that the data they had found was unusable. It was stale, incorrect, incomplete, badly described, or subject to any of a dizzying number of problems. If your data isn’t trustworthy, the results you get from that data won’t be trustworthy, either.
What did the data scientists need? Tools to help them find relevant data, understand its schema, understand how it was collected, understand how and where it was used and whether they could trust it. What did the compliance experts need? Tools to help them find relevant data, understand its schema so they knew just what was included in the data, understand how the data was collected, how and where it was used, and whether they could trust it. Pretty soon, people realized that these were almost exactly the same problem.
The problem boils down to managing metadata—the data about the data, the data that describes the data. Companies need to manage their metadata so they know what their data means: how it was collected, how the data is represented, how the columns in a table are defined, when the data was updated, and even how frequently it is accessed. Data that hasn’t been used for a few years probably hasn’t been used for a good reason. Companies also need to track restrictions on data’s use, who is allowed to access it, who is allowed to change it, and much more. Datasheets for Datasets describes some of the metadata that has to be tracked to manage data effectively. Managing this metadata has often been handled by a “data steward”; but as data scales, delegating metadata management to a single person becomes ineffective. It’s impossible to keep up with all the data flowing into an organization—even a small one.
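As an illustration of what such a metadata record might hold, here is a minimal sketch. The field names, the `DatasetMetadata` and `ColumnInfo` classes, and the two-year staleness threshold are all assumptions chosen for this example; they are not taken from Datasheets for Datasets or from any particular catalog tool.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ColumnInfo:
    """Hypothetical description of one column in a table."""
    name: str
    dtype: str
    description: str


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record covering the items listed above."""
    name: str
    owner: str
    collected_how: str          # how the data was collected, including consent
    last_updated: date          # when the data was last changed
    last_accessed: date         # when the data was last read
    columns: list[ColumnInfo] = field(default_factory=list)
    readers: set[str] = field(default_factory=set)    # who may access it
    editors: set[str] = field(default_factory=set)    # who may change it
    use_restrictions: list[str] = field(default_factory=list)

    def is_stale(self, today: date, max_idle_days: int = 730) -> bool:
        """True when nobody has accessed the dataset for more than max_idle_days."""
        return (today - self.last_accessed).days > max_idle_days

    def undocumented_columns(self) -> list[str]:
        """Names of columns whose description is empty or only whitespace."""
        return [c.name for c in self.columns if not c.description.strip()]
```

For example, a record for a hypothetical `rides` table whose `fare` column has an empty description would report `["fare"]` from `undocumented_columns()`, which is the kind of gap that makes a dataset "badly described."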
And that’s what makes data governance interesting. It’s not just a requirement that’s imposed by external regulators. It’s about the process of understanding what data you have, what that data means, and how to use it. And it’s surprising (well, not to any data scientist) how few companies actually understand the data they have, and what they can do with it. And once you understand your data—what you have, what it means, where it came from—you’re finally in a position to use it effectively.
How does this look in practice? The open source project Amundsen was started to enable data discovery at Lyft. It enables data scientists to search for data in Lyft’s “data lake,” and implements something like Google’s PageRank algorithm to rank relevant data sources. It also tracks who has accessed the data, how often, and when; data that is used frequently is more likely to be well-maintained. Although Amundsen was built to solve a supposedly different problem, it has also become a tool for data governance. It’s really about metadata management, and that’s at the heart of data governance.
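The idea of letting usage influence search results can be sketched in a few lines. This is not Amundsen's algorithm or API, and it is far simpler than PageRank: it is a stand-in that counts how many query terms appear in a table's name or description and boosts the result by the logarithm of recent reads. The `TableEntry` class and the `reads_last_90_days` field are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class TableEntry:
    """Hypothetical catalog entry with a usage counter."""
    name: str
    description: str
    reads_last_90_days: int


def score(entry: TableEntry, query: str) -> float:
    """Text-match count, boosted by log-scaled usage; 0.0 when nothing matches."""
    terms = query.lower().split()
    text = f"{entry.name} {entry.description}".lower()
    matches = sum(1 for term in terms if term in text)  # naive substring match
    if matches == 0:
        return 0.0
    # log1p dampens the boost so heavy usage cannot swamp relevance entirely
    return matches * (1.0 + math.log1p(entry.reads_last_90_days))


def search(entries: list[TableEntry], query: str) -> list[str]:
    """Return the names of matching entries, best score first."""
    scored = [(score(entry, query), entry) for entry in entries]
    hits = [(s, entry) for s, entry in scored if s > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [entry.name for _, entry in hits]
```

With this scoring, two tables that match a query equally well are ordered by how often they are read, which reflects the assumption in the text that frequently used data is more likely to be well maintained.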
It’s also important to think about what metadata management tools like Amundsen don’t provide. Amundsen tracks data access, so there’s a virtual “paper trail” about how data was used, but it doesn’t implement any kind of access control. It won’t prevent someone from accessing data they shouldn’t; it just lets you document what happened after the fact. It’s better at tracking down a violation than preventing one. It also doesn’t track data lineage (at least, not yet), although users can add metadata about how data is modified and remixed. So it’s not a complete solution—but it’s a step toward a solution.
Going beyond metadata management, many data governance platforms are designed to enforce data access policies. They go beyond leaving a “paper trail” by restricting data access to those who have appropriate credentials, even on a record-by-record basis. They can also track data lineage, build data catalogs, and search for relevant data. Most of the commercial tools provide explicit support for compliance with regulations such as GDPR and CCPA.
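The difference between recording access and enforcing it can be shown with a small sketch. It is not modeled on any specific governance product: the `GovernedDataset` class, the per-user `allowed_regions` policy, and the `region` field on each record are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AccessLogEntry:
    """One line of the audit trail."""
    user: str
    dataset: str
    allowed: bool


@dataclass
class GovernedDataset:
    """Hypothetical dataset that both logs and enforces access."""
    name: str
    rows: list[dict]
    # Hypothetical policy: each user maps to the regions whose records they may read
    allowed_regions: dict[str, set[str]] = field(default_factory=dict)
    log: list[AccessLogEntry] = field(default_factory=list)

    def read(self, user: str) -> list[dict]:
        """Log the attempt, refuse users with no grant, and filter record by record."""
        regions = self.allowed_regions.get(user, set())
        allowed = bool(regions)
        # The paper trail: every attempt is recorded, whether or not it succeeds
        self.log.append(AccessLogEntry(user, self.name, allowed))
        if not allowed:
            # Enforcement: the request is refused before any data is returned
            raise PermissionError(f"{user} may not read {self.name}")
        # Record-level restriction: return only rows the user's policy covers
        return [row for row in self.rows if row["region"] in regions]
```

A tool that only tracks access would keep the `log.append` line and drop the rest; the `PermissionError` and the per-row filter are what turn an audit trail into a policy that prevents a violation.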
Regardless of the tools, data governance and data discovery go together. You can’t use your data if you can’t find it. You can’t use your data if you don’t even know what data you have. And you’re still at risk of data breaches, legal liability, and violating customers’ trust, even if (especially if) you don’t know what data you have. Data governance starts with metadata. And once you understand that, you understand that by requiring you to manage your metadata, data governance is an enabler, not a hindrance. That’s when you can really think productively about how to use your data.