
Uberdatabase dreams and the harsh reality

Digital Forensics and Security

After a couple of insane weeks filled with a Progression Report to be delivered, loads of lecturing work and a nice bout of flu that has me sitting somewhat sleepless at the keyboard this lovely Friday morning, GrayHat Forensics is back.

Today’s topic? Uberdatabases: State/Country-wide government databases storing phone-call-related information (and conversations), texts, e-mail, IM messages, web browsing histories etc., all to protect citizens from the threat of the T-word and all those other lesser evils which governments believe they can purge from society through monitoring and witch-hunting.

Recently, a study by the National Research Council in the US (http://news.cnet.com/8301-13578_3-10059987-38.html?part=rss&subj=news&tag=2547-1_3-0-20 and http://www.theregister.co.uk/2008/10/08/us_gov_data_mining_report/) essentially stated what a large number of people in the NetSec community have been saying for a number of years: Pattern-matching/data mining to spot T-word people will not really work.

This comes at the same time that the UK government is making yet another big push for the full-scale interception of pretty much everything in the way of communication and storing it in an uberdatabase, as reported in TheRegister (http://www.theregister.co.uk/2008/10/07/detica_interception_modernisation/), despite having earlier decided not to go ahead with it after all (http://www.theregister.co.uk/2008/09/25/interception_modernisation_bill/).

The problem? The rate of false positives and false negatives, known in statistics as Type I and Type II errors respectively. This Wikipedia article (http://en.wikipedia.org/wiki/Type_I_and_type_II_errors) explains the whole concept rather nicely, and is backed by a most informative body of research literature.

Essentially: a Type I error is the improper rejection of a null hypothesis that is actually true (a false positive), and a Type II error is the failure to reject a null hypothesis that is actually false (a false negative) (http://en.wikipedia.org/wiki/Null_hypothesis).
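To make the two error types concrete, here is a minimal sketch in Python. It assumes a null hypothesis of "this person is NOT a person of interest", so a Type I error is an innocent person being flagged and a Type II error is a real one being missed. The function name error_rates and the toy data are my own illustration, not taken from any of the reports linked above.

```python
def error_rates(truth, flagged):
    """Return (type_i_rate, type_ii_rate) for a set of yes/no decisions.

    truth[i]   -- True if person i really is a person of interest
    flagged[i] -- True if the (hypothetical) system flagged person i
    Null hypothesis: person i is NOT a person of interest.
    """
    # Type I: the null hypothesis was true, but we rejected it (false positive).
    false_positives = sum(1 for t, f in zip(truth, flagged) if f and not t)
    # Type II: the null hypothesis was false, but we kept it (false negative).
    false_negatives = sum(1 for t, f in zip(truth, flagged) if t and not f)

    innocents = sum(1 for t in truth if not t)
    targets = sum(1 for t in truth if t)

    type_i = false_positives / innocents if innocents else 0.0
    type_ii = false_negatives / targets if targets else 0.0
    return type_i, type_ii


# Toy data, entirely made up: 8 innocent people, 2 real targets.
truth = [False] * 8 + [True] * 2
flagged = [True, True] + [False] * 6 + [True, False]

type_i, type_ii = error_rates(truth, flagged)
print("Type I rate (innocents flagged):", type_i)
print("Type II rate (targets missed):", type_ii)
```

In this toy set, two of the eight innocents get flagged and one of the two real targets slips through.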

So, then, trying not to turn this into either a statistics lecture or a conference paper while still passing on the gist of the whole process: the aim is to create a model that fits the data in a reasonable fashion, and that can then infer relationships and forecast future trends based on past experience (data). To get there we need to understand the subject and formulate a sound and reasonable null hypothesis. Then we sit down, study the data and the subject area, and remove all non-explanatory variables (the variables which do not help us explain what we see). That leaves us with a simplified model which SHOULD (but might not, in which case it’s “back to the drawing board”) fit the data to some degree.

Problem in this instance is: What on earth is the null hypothesis? And what can we consider to be the explanatory variables in the model that we cannot as yet have because we don’t know the null hypothesis OR the subject area?

Oh, I’ll grant you, there’s any number of “experts” in the subject area we are considering, and a search in Amazon will give us a distressingly large number of “books” on the subject area…All of which are a rather complete and utter waste of time and money for those of us who have been educated and trained to sit down and work with data.

In his paper on the base-rate fallacy and the difficulty of intrusion detection (http://portal.acm.org/citation.cfm?id=357849), S. Axelsson compares anomaly and signature detection Intrusion Detection Systems against a Bayesian base-rate fallacy model, which models the rate of occurrence of false positives/negatives. While his conclusions deal with Intrusion Detection Systems, his Bayesian base-rate fallacy model is a more-than-acceptable standard by which to determine the effectiveness of an Intrusion Detection System. The standard this paper sets can and should be adapted to the current problem of data mining and pattern analysis, and it can and should form the basis of a benchmark against which all these different “solutions” that aim to find the perfect model of what constitutes a T-word activity are measured.
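For those who prefer numbers to words, the base-rate fallacy can be sketched with nothing more than Bayes' theorem. The snippet below is a rough illustration and not Axelsson's own code: the function name posterior_given_alarm is hypothetical, and the figures (one real target per million people, a 99% detection rate, a 1% false-alarm rate) are numbers I have made up purely for the sake of the example.

```python
def posterior_given_alarm(base_rate, detection_rate, false_alarm_rate):
    """P(real target | alarm raised), computed via Bayes' theorem.

    base_rate        -- P(target): fraction of the population that is a real target
    detection_rate   -- P(alarm | target): 1 minus the Type II error rate
    false_alarm_rate -- P(alarm | not target): the Type I error rate
    """
    true_alarms = detection_rate * base_rate
    false_alarms = false_alarm_rate * (1.0 - base_rate)
    total_alarms = true_alarms + false_alarms
    if total_alarms == 0:
        return 0.0  # no alarms are ever raised, so there is nothing to condition on
    return true_alarms / total_alarms


# Made-up figures, for illustration only.
p = posterior_given_alarm(
    base_rate=1e-6,         # one real target per million people
    detection_rate=0.99,    # the system catches 99% of real targets
    false_alarm_rate=0.01,  # and wrongly flags 1% of everyone else
)
print(f"Chance that a flagged person is a real target: {p:.6%}")
print(f"Innocents flagged per real target caught: {(1 - p) / p:,.0f}")
```

With these assumed figures, even a detector that sounds excellent on paper produces roughly ten thousand false alarms for every genuine hit, simply because the thing it is looking for is so rare.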

But to do that we need to properly and SCIENTIFICALLY (and I MEAN mathematically and statistically) define the subject area!!! And the problem here, of course, is that we really, really cannot do that, because the subject area is far too abstract to be defined.

Result: These uberdatabases will do pretty much nothing to detect occurrences related to the subject area. Ahhh, but they (these uberdatabases) CAN & WILL find other uses, extending the surveillance capabilities to include any number of extraneous things and thus do other interesting things.

Any network administrator of moderate intellect knows that putting all of your eggs in one basket just creates a single point of failure. So, not only will these UK and US Uberdatabases do pretty much nothing to infer and forecast T-word occurrences, but they will also be badly susceptible to abuse, and an inviting target for the real bad guys to get at.

Note: All those with a maths and stats background, please forgive my oversimplified explanations of Type I & II problems and the statistical analysis process, but I found no better way of reducing the complexity of the subject area to allow the public to understand the whole concept without turning this into a conference paper.

DarkSYN @ October 10, 2008
