Data Theft Prevention: Why Data Science Is Not Enough


Only unskilled thieves use the front door.

According to a report released by A Secure Life last month, 66% of burglars use something other than the front door when robbing a home. They sneak in, steal as much as possible, and leave quickly (within 8-12 minutes) to avoid getting caught. The longer they're in a house, or the more attention they draw to themselves, the more likely they are to have the police called on them. This is common sense, but you may be asking: why does it matter?

As it turns out, stealing seems to have quite a few parallels between the analog and digital worlds.

A cyber thief takes a very similar approach to your everyday home burglar. Using the front door, or port 80 as it's known in the cyber world, would expose them to all sorts of watching eyes. Their best chance of stealing data without getting caught is to move it through a window that nobody is watching.

According to Dr. Rhiannon Weaver of Carnegie Mellon University in her paper on anomalous port-specific network behavior, those windows are ports 1025 to 65,535. These are unreserved ports with no fixed purpose, so they are often left uncovered when firewall rules are written. Because traffic on these ports tends to slip past firewall systems, they are frequently abused by peer-to-peer file sharing or instant messaging services as a channel for stealing confidential data.
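As a rough illustration (not taken from Dr. Weaver's paper), a first pass at spotting this kind of traffic might simply flag outbound flows on unreserved ports that your firewall rules do not already cover. The flow-record format and the allowlist below are hypothetical, a minimal sketch rather than a real detection rule:

```python
# Sketch: flag outbound flows on unreserved ports (1025-65535) that are not
# explicitly allowed. The flow-record format and allowlist are hypothetical.

ALLOWED_HIGH_PORTS = {8080, 8443}  # example: high ports your firewall rules already cover

def flag_unreserved_port_flows(flows):
    """Return flows whose destination port falls in the unreserved range
    (1025-65535) and is not on the allowlist."""
    suspicious = []
    for flow in flows:
        port = flow["dst_port"]
        if 1025 <= port <= 65535 and port not in ALLOWED_HIGH_PORTS:
            suspicious.append(flow)
    return suspicious

# Example flow records (made up for illustration)
flows = [
    {"src": "10.0.0.5", "dst": "203.0.113.7",  "dst_port": 80,    "bytes_out": 1_200},
    {"src": "10.0.0.5", "dst": "198.51.100.9", "dst_port": 51413, "bytes_out": 9_800_000},
]

for f in flag_unreserved_port_flows(flows):
    print(f"Unreserved-port flow: {f['src']} -> {f['dst']}:{f['dst_port']} ({f['bytes_out']} bytes)")
```

Of course, a rule this blunt produces far too much noise on its own, which is exactly where the data science comes in.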

Knowing this is happening is one thing; stopping it is quite another. Even with some of the best data science tools available for classification and clustering (again outlined by Dr. Weaver), all you can do is identify outliers and potential anomalies, and that is not the same as confirming malicious intent. It simply gives you certain pieces to look at. What you really need in order to identify malicious behavior is a bridge between the outliers and their intent.
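To make that limitation concrete, here is a minimal sketch of the kind of unsupervised outlier detection the data science layer performs, using scikit-learn's IsolationForest over hypothetical per-host features (data volume sent to unreserved ports and the number of distinct high ports contacted). Note that the output is just an outlier flag; it says nothing about intent:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical per-host features:
# [MB sent to unreserved ports, distinct high ports contacted]
hosts = ["10.0.0.5", "10.0.0.6", "10.0.0.7", "10.0.0.8", "10.0.0.9"]
features = np.array([
    [0.2,   1],
    [0.1,   1],
    [0.3,   2],
    [0.2,   1],
    [950.0, 40],   # one host moving far more data over far more ports
])

# Unsupervised outlier detection: -1 marks an outlier, 1 marks an inlier
labels = IsolationForest(contamination=0.2, random_state=0).fit_predict(features)

for host, label in zip(hosts, labels):
    if label == -1:
        print(f"{host} is an outlier worth investigating -- intent still unknown")
```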

As a human expert, I would build this bridge by taking the events the data science surfaces and researching each of them using Google, Bing, and other available tools. Using what I learned, I would then estimate how likely it is that each event was malicious. The problem is that this is time-consuming, and there is no guarantee that my expertise and intuition have led me in the right direction.

This is where natural language processing (NLP) machine learning algorithms can help significantly. They automate the human research previously required, and they do so in a way that never tires and operates consistently. By training on web searches for millions of threat types, NLP algorithms can quickly scan thousands of internet pages to determine whether a particular item of interest, say an IP address and User Agent combination, could be associated with known threatening behavior. Even better, based on the feature set generated by this search, a smart system can classify exactly what types of threats a given anomalous pattern is associated with.
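As a rough illustration of that idea, a text classifier can be trained on page snippets labeled as threat-related or benign and then used to score the pages returned by a search for a specific indicator. The training snippets, labels, and search results below are all made up, and a production system would train on far more data, so treat this as a sketch of the approach rather than the actual method:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training snippets scraped from web searches, labeled 1 if the
# page discusses threatening behavior, 0 otherwise.
snippets = [
    "IP reported for botnet command and control activity",
    "user agent string associated with credential-stealing malware",
    "address flagged for repeated brute-force login attempts",
    "official mirror for open source package downloads",
    "company homepage describing enterprise support plans",
    "community forum thread about printer driver installation",
]
labels = [1, 1, 1, 0, 0, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(snippets, labels)

# Score pages returned by a search for a specific indicator of interest,
# e.g. an IP address / User Agent combination (results here are invented).
search_results = [
    "multiple reports link this IP and user agent to data exfiltration malware",
    "this IP belongs to a university mirror hosting Linux ISOs",
]
for text, score in zip(search_results, model.predict_proba(search_results)[:, 1]):
    print(f"threat likelihood {score:.2f}: {text}")
```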

By using NLP to augment traditional data science, artificial intelligence algorithms can truly take on the time-consuming responsibilities traditionally reserved for highly specialized human experts.

As I called out in my last post, using traditional anti-virus is no longer enough to guarantee security. In fact, according to this article recently published by PC World, traditional anti-virus can actually be counterproductive.
