Sunday, October 28, 2012

Reversing Malware Protocols With Machine Learning

Last week took place the 5th ACM Workshop on Artificial Intelligence and Security,  co-located with the 19th Conference on Computer and Communications Security in Raleigh, NC. This workshop is one of the few focused exclusively on the application of machine learning and AI based methods to privacy and computer security problems, so it was a great event to introduce our latest research, which I have done together with Tammo Krueger, Nicole Krämer and Konrad Rieck. We had the opportunity to present our method to automatically build stateful models of network protocols using machine learning.

Reverse engineering network protocols has been a popular strategy for people wanting to develop open implementations of proprietary protocols. For security researchers it is an effective way to understand how malware that is used to control infected machines in a botnet communicates with its command and control server. Sometimes, these messages are exchanged using communication channels built on top of known protocols like IRC, HTTP or P2P (some of them very quirky, as encrypted blog posts or steganographic image uploads). If the infected machine is a mobile phone is not uncommon for malware to receive and sent instructions over SMS messages.

As the complex time-consuming task it is, there have been many efforts to automate this reversing process. Some of them have focused on how to extract the complete protocol specification, which is very effective if done through taint analysis but not possible relaying only on network traces. Others have focused on understanding the protocols enough to simulate vulnerable services in honeypots (i.e. ScriptGen). Such honeypots can automatically infer parts of a protocol from network traffic but they have not been designed to track more evolved attacks that require a longer sequence of stateful communication. Close to this line of research, we have designed a probabilistic approach to model both the message content and the state machine of the protocol relying only on network traffic. Our method, called PRISMA: Protocol Inference and State Machine Analysis, is able to learn a stateful model that can be used to simulate valid communication of an unknown protocol. To construct this model, our method infers the message format, the state machine as well as rules for propagating information between states using machine learning techniques.

PRISMA builds on special tailored embedding and clustering techniques that allow for large-scale applications with millions of network messages. After preprocessing the network traces and embedding the messages, we define a similarity measure in order to find common structures in the data. We have explored part-based and position-based clustering through non-negative matrix factorization (NMF) to group individual messages into events, but other options for clustering algorithms can be chosen, as long as the procedure assigns a cluster label to each message.

Messages which occur at specific events in the flow of communication often exhibit similar structural features. Thus, to extract event information we exploit this structural dependency. Each of the extracted sequence of events can be seen as a path through the protocol’s state machine. To infer an approximation of this state machine, we use Markov models, where transitions are associated with probabilities. Every state in the model is linked to a set of automatically generated templates of the messages observed at a certain position in a sequence. The information flow between the different states during a communication is also automatically inferred and characterized as a set of rules. Finally, we have developed a network level module called LENS, which is able to load the inferred model and simulate both sides of a real communication session.

After evaluating the system with real protocols, we have used our method on network traffic collected from malicious software. The following figure shows an example of the state machine inferred from traffic of the Koobface worm:

This method is specially interesting for honeypots, as simulating malware communication can help to deceive an attacker and obtain more information about its behavior. By inspecting the extracted state-machine and the associated templates and rules a malware analyst can also gain insights into the inner workings of a sample from the collected network traces alone. This also makes PRISMA a valuable method for dynamic analysis of malware beyond honeypot applications.

Attacks like call fraud and identity theft often involve sophisticated, stateful attack patterns which on top of normal communication try to harm systems on a higher semantic level than usual attack scenarios. To detect these kind of threats via specially deployed honeypots, at least a minimal understanding of the inherent state machine of a specific service is needed to lure potential attackers and to keep a communication for a sufficiently large number of steps. To this end we propose PRISMA, a method for protocol inspection and state machine analysis, which infers a functional state machine and message format of a protocol from network traffic alone. We apply our method to three real-life network traces ranging from 10.000 up to 2 million messages of both binary and textual protocols. We show that PRISMA is capable of simulating complete and correct sessions based on the learned models. A use case on malware traffic reveals the different states of the execution, rendering PRISMA a valuable tool for malware analysis.

Tammo Krueger, Hugo Gascon, Nicole Krämer and Konrad Rieck
ACM Workshop on Security and Artificial Intelligence (AISEC) October 2012