Shining a Light on Dark Data

Data is exploding. The variety of data being created by workers inside and outside of the workplace and the velocity at which that data is being shared make corporate compliance officers sleep with one eye open, because uncontrolled data equals unknown risk, and the unknown is scary. Think about it – in addition to the terabytes of data lurking in companies’ disparate systems, organizations today are creating new content that is expected to drive 60 percent growth in enterprise data stores (Worldwide Big Data Technology and Services 2012-2015 Forecast, Mar 2012, IDC).

Most corporate compliance officers are concerned with the latter – newly created data is the shiny object grabbing attention. However, equal focus needs to be placed on legacy data (sometimes known as dark data), which is often unknown and unmanaged, and may be out of compliance with internal or external requirements. Many organizations today are dealing with information sprawl by throwing more storage at the problem, by accepting the risk as a cost of doing business, or by simply ignoring it. None of these approaches is an ideal way to protect the organization. In fact, 31 percent of organizations report that poor electronic recordkeeping is causing problems with regulators and auditors (Information Governance: Records, Risks, and Retention in the Litigation Age, AIIM 2013). Further, an individual data breach costs organizations an average of $5.5 million (2011 Cost of Data Breach Study, Ponemon Institute 2011). There are also countless examples of fines, sanctions or adverse inference decisions being triggered by data being accidentally lost or mishandled.

To get a handle on dark data, it is first important to understand what it is. Dark data can take many forms, including both structured data (machine-created information that typically fits in rows and columns) and unstructured data (human-generated information that is much more difficult to search). It can also come in many formats and reside in many places, making it more difficult to access. It can be amassed simply because of our reliance on cheap storage or because of special circumstances like M&A. In virtually all cases, legacy data poses legal, regulatory and internal risk if it isn’t managed effectively.

Technology Provides the Light

Fortunately, legacy data doesn’t have to be left in the dark. There are technology solutions now available to help organizations access, understand, control and take action on this data. Typically, organizations choose to focus first on either legacy unstructured data or legacy structured data, as each has slightly different characteristics.

Legacy unstructured information often rests in email repositories, file shares and SharePoint sites. Technology can be applied to access this information and identify redundant, obsolete and trivial data that may be a candidate for defensible disposition. The remainder (the data an organization wants to retain and manage) can then be migrated into an active repository, such as an archiving or records management system, so that legacy unstructured data is managed with the same policies and rigor as the organization’s live data. This allows an organization to minimize its storage footprint, consolidate information for more efficient and cost-effective investigations and eDiscovery, simplify the legal hold process and make data more readily available for collaboration and knowledge management.
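To make the idea of flagging redundant, obsolete and trivial data concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular product works: the `FileRecord` type, the seven-year cutoff and the list of "trivial" file suffixes are all assumptions made up for this example, and a real deployment would take its rules from the organization's own retention schedule.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FileRecord:
    """Hypothetical stand-in for a file found on a share or in a mailbox."""
    path: str
    content: bytes
    last_modified: datetime


# Assumed thresholds, chosen only for illustration.
OBSOLETE_AFTER = timedelta(days=7 * 365)
TRIVIAL_SUFFIXES = (".tmp", ".bak", ".log")


def classify_rot(records, now):
    """Label each record 'redundant', 'obsolete', 'trivial' or 'retain'."""
    seen_hashes = set()
    labels = {}
    # Oldest first, so the earliest copy of a file counts as the original.
    for rec in sorted(records, key=lambda r: r.last_modified):
        digest = hashlib.sha256(rec.content).hexdigest()
        if digest in seen_hashes:
            labels[rec.path] = "redundant"  # byte-for-byte copy of an earlier file
        elif rec.path.lower().endswith(TRIVIAL_SUFFIXES):
            labels[rec.path] = "trivial"
        elif now - rec.last_modified > OBSOLETE_AFTER:
            labels[rec.path] = "obsolete"
        else:
            labels[rec.path] = "retain"
        seen_hashes.add(digest)
    return labels
```

Anything labeled "retain" would be a candidate for migration into the active repository; the other three labels mark candidates for review and defensible disposition, not automatic deletion.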

Legacy structured information often rests in production databases or legacy applications that are no longer being used. The inactive data in both places can create significant problems for the organization, including slower application performance and longer backup times as more data accumulates. Technology can be applied to migrate this information to active repositories where policies can be applied, which can serve as the backbone of a disposition process. This allows organizations to achieve many of the same benefits described above for unstructured information, plus improve performance and enable the retirement of certain applications that are being kept alive and spinning (at superfluous cost) simply in case data is required for a legal or regulatory matter.
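As a rough illustration of moving inactive rows out of a production table, the sketch below uses Python's built-in `sqlite3` module. The `orders` and `orders_archive` tables, the `closed_on` column and the cutoff date are hypothetical, and an archive table in the same database merely stands in for the separate, policy-managed repository described above.

```python
import sqlite3


def archive_inactive_rows(conn, cutoff):
    """Move orders closed before `cutoff` (an ISO date string) into orders_archive."""
    with conn:  # one transaction: the copy and the delete succeed or fail together
        conn.execute(
            "INSERT INTO orders_archive SELECT * FROM orders WHERE closed_on < ?",
            (cutoff,),
        )
        moved = conn.execute(
            "DELETE FROM orders WHERE closed_on < ?", (cutoff,)
        ).rowcount
    return moved


# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, closed_on TEXT)")
conn.execute("CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, closed_on TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2009-04-01"), (2, "2011-08-15"), (3, "2023-02-10")],
)
conn.commit()

moved = archive_inactive_rows(conn, "2015-01-01")
```

Dates are stored as ISO strings here so that plain string comparison orders them correctly. Once the inactive rows live in the archive, the production table carries only active data, which is the effect the paragraph above attributes to this kind of migration.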

In both cases, getting control of dark data serves as a pathway to information governance. The unstructured and structured efforts can proceed in sequence (the second beginning after ROI has been proven on the first) or in parallel. As illustrated, they can also be combined with other information governance activities for an end-to-end solution in which the organization strategically governs all enterprise data in a holistic manner.

Regardless of what path you choose to control dark data, keep these tips in mind:

  • Cast a wide net – Today, information that is subject to regulations, eDiscovery or legal holds is broad-based. Be cautious of focusing just on one data type (e.g., email) to avoid unexpected “red alerts.”
  • Don’t do it manually – Information is growing so fast that data stores are getting increasingly out of hand. Relying just on manual processes, in which a human determines the value of each object, simply doesn’t make business sense. Look for a technology that provides a pathway to apply automated policies to data in order to optimize efficiency.
  • Think long term – Often, information governance is used synonymously with data disposition. While it is true that deleting some portion of the data is an objective for getting control of dark data, it’s generally not the only objective. By consolidating the remaining and valuable data in an active repository, organizations can more efficiently search and leverage this data over the long term – and also ensure that this data remains accessible even as technology changes.
  • Focus on defensibility – Keep audit trails of the decisions that affected how data was managed. This will keep organizations from having to reconstruct those decisions later on and help ensure the organization is protected if its practices are questioned by the courts or regulatory bodies.
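The tips on automation and defensibility can be sketched together: apply retention policies by rule, and record every decision as it is made. The Python below is a hedged illustration only. The `Policy` and `AuditEntry` types, the item dictionaries and the retention periods are invented for this example, and item age is approximated as 365 days per year.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Policy:
    """Hypothetical retention rule: keep items of one category for N years."""
    name: str
    category: str
    retain_years: int


@dataclass
class AuditEntry:
    """One recorded decision, kept so the outcome can be explained later."""
    item_id: str
    policy: str
    action: str
    decided_on: str


def apply_policies(items, policies, today):
    """Decide retain/dispose/review for each item and log every decision."""
    by_category = {p.category: p for p in policies}
    audit_log = []
    for item in items:
        policy = by_category.get(item["category"])
        if policy is None:
            # No rule covers this data type: route it to a human reviewer.
            action, policy_name = "review", "none"
        elif (today - item["created"]).days >= policy.retain_years * 365:
            action, policy_name = "dispose", policy.name
        else:
            action, policy_name = "retain", policy.name
        audit_log.append(
            AuditEntry(item["id"], policy_name, action, today.isoformat())
        )
    return audit_log
```

Because the function returns the full log instead of acting silently, the same record that drives disposition also documents which rule produced each outcome and when. Items that match no policy are flagged for review, which is one way to notice data types that a narrowly cast net would have missed.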

HP Autonomy’s ControlPoint and Application Information Optimizer (AIO) technologies are solid choices for shining a light on both unstructured and structured data. Both leverage HP’s IDOL technology to access and understand data without bias to data type, repository or language. Both also can help automate disposition and migration to active repositories – including HP Autonomy’s market-leading archiving and records management systems. And both deliver audit trails for maximum defensibility.

I’ve heard it over and over: “Dark data is not an issue for my organization.” My perspective is that dark data is an issue for EVERY organization. There is data lurking everywhere in the organization, and rather than give it the power to ruin your night’s sleep – or worse, have serious legal or regulatory implications – take back your power and control with technology.

About the Author

Joe Garber

Joe Garber is Vice President of Information Governance at HP Autonomy. In this role, he leads product messaging and go-to-market efforts for the organization’s eDiscovery, information archiving, and ECM market offerings.

Garber has more than 10 years of experience in information governance and eDiscovery. He most recently served as Vice President of Marketing for RenewData where he managed all product and corporate marketing efforts for this eDiscovery service provider. He also previously served as Director of Market Strategy for ZANTAZ (and subsequently Autonomy ZANTAZ) where he led analyst relations, field marketing, market analysis, and thought leadership programs.

During his 20-year career, Garber has also served as a management consultant for IBM, led marketing and product management for a variety of successful technology startups, and served as a press secretary for a U.S. Senator. He holds a Bachelor of Arts degree from Pepperdine University and a Master of Business Administration (MBA) from Cornell University, where he was named a “Park Leadership Fellow.”