Data Retention In A Post-Breach World



More often than I care to count I've heard the phrase "Disk is cheap". It is invoked in an attempt to shorten an informal cost/benefit or risk/reward analysis related to the storage of data in large systems. Often the speaker is well-intentioned and trying to absolve the team of the need to complete the analysis in the face of what appears to be a simple trade-off. I think this is misguided.

In this article, I’ll argue that data can be and often is just as much of a liability as it is an asset. If you consider only the CAPEX (Capital Expenditure) of purchasing the physical disks, they are in fact relatively cheap. When you consider all of the factors, there are many data-sets that easily pass muster and are clearly worthy of longer-term storage. Other data-sets show just how quickly their intrinsic value falls below the costs of operations, or becomes a liability.

Costs continue to grow while data is at rest

The CAPEX costs of long-term storage include more than just the disks. You need to account for physical and/or virtual servers, SAN components or RAID controllers. There are also costs attributable to high availability, such as a redundant set of equipment in another data-center. Let's not forget about security: what equipment will be required to protect this data?

Adding the OPEX (Operating Expenditure) component to the story only makes things harder. You have more line-items to consider, like rack-space, power, climate control, operations and support staff, hardware maintenance and software support. Also be sure to add bandwidth or cloud service. Now multiply these costs by the number of months or years the data is expected to be in service. Plans for high-availability systems will account for replacement parts over time; systems with lower availability might plan for possible data-recovery costs if a drive were to fail.
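
As a rough sketch of that arithmetic, the Python below adds one-time and recurring costs and scales them for a mirrored site. The class name, its fields and every figure are hypothetical placeholders chosen for illustration, not real prices or a standard costing method.

```python
from dataclasses import dataclass


@dataclass
class StorageCost:
    """Hypothetical cost model; every figure passed in is a placeholder."""

    capex: float                    # one-time: disks, servers, controllers, security gear
    monthly_opex: float             # recurring: rack space, power, cooling, staff, support
    redundancy_factor: float = 1.0  # e.g. 2.0 if a second site mirrors everything

    def total(self, months: int) -> float:
        """Total cost of keeping the data in service for `months` months."""
        return self.redundancy_factor * (self.capex + self.monthly_opex * months)


if __name__ == "__main__":
    # Made-up numbers: 10,000 up front, 500 per month, fully mirrored.
    archive = StorageCost(capex=10_000.0, monthly_opex=500.0, redundancy_factor=2.0)
    for years in (1, 3, 5):
        print(f"{years} year(s): {archive.total(12 * years):,.0f}")
```

Even with invented numbers, the shape of the result is the point: the recurring term grows with every month the data stays in service, while the purchase price of the disks does not.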

Out of an abundance of caution, we often find that one set of data has been copied to another location. Possibly this is done as a quick backup before an upgrade step, possibly as part of a migration between servers that was never fully cleaned up. Either way, you will often find duplicate copies of data scattered around an IT operation. Which copies should be backed up? Which copies can be deleted? Questions like these burn lots of IT resources, adding precious little value while needlessly increasing risk.
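
One way to start answering those questions is to group files by a hash of their contents. The sketch below is a minimal illustration using Python's standard library; the function names are hypothetical, and it assumes the data lives in ordinary files under a single directory tree that you are allowed to read in full.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> list[list[Path]]:
    """Group regular files under `root` whose contents are identical."""
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in root.rglob("*"):
        # Skip directories and symlinks so each real file is counted once.
        if path.is_file() and not path.is_symlink():
            by_digest[file_digest(path)].append(path)
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Each group returned is a set of byte-for-byte identical files. Deciding which copy is the one worth keeping is still a human decision.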

Storing Sensitive Data: Weighing Risks vs. Value

Like it or not, we live in a post-breach world. Losing sleep over being compromised is less effective than thinking about how to detect attacks, discover what the attackers already have, stop the leak, and minimize damage. Most organizations have a requirement to retain data for some period of time. Whether this is an operational or a regulatory requirement, it is usually possible to project how long a given file will continue to have value.

Depending on the data in question, both the value to the enterprise and the value to an attacker vary. For sensitive data, requirements for protection and eventual destruction are often covered in a data retention plan. I tend to worry more about the lower-value data that is often not covered by such a plan, and sometimes carries just as much value to an attacker. I have seen data like this retained until the disks that hold it fail, triggering an emergency response to triage and recover data that may or may not have significant value in the first place.

We can use web-server access logs as an example. How long does it make sense to store them? They can have value as a debugging tool, for measuring the impact of sales or PR activity, and for detecting and tracing attacks. Transactional logs should likely be kept longer than logs from static servers, certainly long enough for all associated transactions to clear and beyond any applicable auditing timeframes. One strategy I use is to favor summarized data over raw logs. Instead of summarily deleting archival weblogs, run them through a log aggregation tool and archive the aggregated output. Consider summarizing data at the time of collection and scheduling deletion of the raw material at an appropriate time in the future.
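
To make the idea of summarizing concrete, here is a small Python sketch that reduces access-log lines to per-day request counts. It assumes the logs are in Common Log Format, which may not match your server's configuration; the function name and the choice of what to count are illustrative only, and a real log aggregation tool will do far more.

```python
import re
from collections import Counter

# Assumes Common Log Format; adjust the pattern for your server's format.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def summarize_log(lines):
    """Count requests per (day, path, status); client addresses are not kept."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not parse rather than guess
        # The day exactly as written in the log, e.g. "10/Oct/2000".
        day = match["time"].split(":", 1)[0]
        counts[(day, match["path"], match["status"])] += 1
    return counts
```

The summary in this sketch deliberately drops the client address and the exact timestamp. Whether that is the right trade depends on what you still need the data for, such as tracing an attack back to its source.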

Somewhere in this process indexing becomes a concern. How does one keep track of what they have? I’m sure I still have the first five-line Perl script I wrote way back in 1994. I can’t tell you what it does, but I bet if I check enough tapes and old disks I could find it. But how does the time spent searching compare to the time required to recreate the script from scratch? A detailed catalog will help monitor age and access frequency of data in your collection while also enabling the detection of duplicate data.
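
A catalog does not have to be elaborate to be useful. The sketch below walks a directory tree and records size, age and idle time for each file. The names are hypothetical, and it assumes the filesystem records access times at all; many systems are mounted with options that update them rarely or never, so treat the idle figure as a hint rather than a fact.

```python
import time
from dataclasses import dataclass
from pathlib import Path

SECONDS_PER_DAY = 86_400


@dataclass
class CatalogEntry:
    path: Path
    size_bytes: int
    age_days: float   # time since the file was last modified
    idle_days: float  # time since last access, if the filesystem tracks it


def build_catalog(root, now=None):
    """Return a CatalogEntry for every regular file under `root`."""
    now = time.time() if now is None else now
    entries = []
    for path in Path(root).rglob("*"):
        if path.is_file():
            info = path.stat()
            entries.append(
                CatalogEntry(
                    path=path,
                    size_bytes=info.st_size,
                    age_days=(now - info.st_mtime) / SECONDS_PER_DAY,
                    idle_days=(now - info.st_atime) / SECONDS_PER_DAY,
                )
            )
    return entries


def stale(entries, max_idle_days):
    """Entries that nobody has read for more than `max_idle_days` days."""
    return [entry for entry in entries if entry.idle_days > max_idle_days]
```

A list like the one `stale` returns is a starting point for a review, not a deletion list. Some data is rarely read and still must be kept.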

One more thing to consider is PII and discovery. Any data that is stored can be subpoenaed, leaked or subjected to discovery by parties that may or may not be looking out for your best interests. Having a clear policy on what gets kept and when things get deleted will reduce the overall exposure of your data to actors who may or may not be authorized to access it.

Security Intelligence: It’s All About Context

Then we have time-sensitive data. Security Intelligence is all about context. IP address "X" performs action "Y" at time "Z". If the activity happened today, there is value. As the record ages, its value as an atomic entity decreases. How long will that data remain fresh or valid? Some Internet providers rotate IP assignments daily, so data referring to dynamic IPs will age quickly. Compromised hosts get remediated, so the same applies there. Would you pull either off of a disk five years, two years or even six months later and begin blocking traffic to the addresses listed? I suspect not. Depending on your application, stale IP data may or may not be useful for correlation down the road. Think carefully about when to start thinning out your archive, and maybe replace the full details with summary data as records age and the finer details lose accuracy.
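
One possible shape for that thinning is sketched below: records newer than a cutoff are kept whole, and older ones are collapsed into a count with first-seen and last-seen times. The record layout, the function name and the 30-day default are all assumptions made for illustration; the right cutoff depends entirely on your application.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import NamedTuple


class Observation(NamedTuple):
    ip: str           # address "X"
    action: str       # action "Y"
    seen: datetime    # time "Z"


def thin(observations, now, keep_full=timedelta(days=30)):
    """Keep recent records whole; reduce older ones to per-(ip, action) summaries.

    The 30-day default is an arbitrary placeholder, not a recommendation.
    """
    recent = []
    summary = defaultdict(lambda: {"count": 0, "first": None, "last": None})
    for obs in observations:
        if now - obs.seen <= keep_full:
            recent.append(obs)
            continue
        entry = summary[(obs.ip, obs.action)]
        entry["count"] += 1
        entry["first"] = obs.seen if entry["first"] is None else min(entry["first"], obs.seen)
        entry["last"] = obs.seen if entry["last"] is None else max(entry["last"], obs.seen)
    return recent, dict(summary)
```

The summaries keep enough to support later correlation ("this address was seen doing this between these dates") without preserving every stale record in full.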

At Farsight, streaming is our preferred model of operation. In the vast majority of cases, data flows into a chain of loosely coupled processes, where we perform some analysis. The output of that analysis may generate new flows of filtered and summarized data. Some of these flows may be archived for a period of time, or populate a database or index. In most cases these streams are a "use it or lose it" operation. Moving from a model where you keep everything to one where you only keep what you know you can use can be a hard transition for each new team member.
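
To illustrate the general pattern, and not Farsight's actual implementation, here is a toy chain of loosely coupled stages written with Python generators. The input format ("ip action" per line), the stage names and the window size are all invented for this sketch.

```python
from collections import Counter
from typing import Iterable, Iterator


def parse(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Turn 'ip action' lines into (ip, action) pairs, skipping malformed ones."""
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            yield parts[0], parts[1]


def only(events, action):
    """Pass through only the events with the given action."""
    return (event for event in events if event[1] == action)


def windowed_counts(events, window):
    """Emit one Counter of per-IP totals for every `window` events seen."""
    counts = Counter()
    for seen, (ip, _action) in enumerate(events, start=1):
        counts[ip] += 1
        if seen % window == 0:
            yield counts
            counts = Counter()
    if counts:
        yield counts  # flush the final, partial window
```

A chain such as `windowed_counts(only(parse(stream), "scan"), window=1000)` holds at most one window of counts in memory. The raw events are gone once each stage has handled them, which is the "use it or lose it" property in miniature.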

So if you have the resources and the inclination to "store all the things", knock yourself out. Who am I to tell you not to? However, in my opinion, the really elegant solution to these big, Internet-scale problems is to think early and often about how long any piece of data maintains its value and usefulness.

Ben April is the Director of Engineering for Farsight Security, Inc.