50 Billion Triples: Digging a Bit Deeper

In the last two installments of this series (part 1 and part 2), we discussed some higher-level thoughts on striking a balance between easy ingest and more prep work as well as some initial queries to get a sense of an unknown graph’s structure and other characteristics.

The queries that we have run to this point were intended to discover structure within our dataset at a macro level. The algorithm that we will run now requires us to consider the dimensionality of our graph. In other words, using algorithms such as centrality or community detection on the entire graph without context is meaningless; we need to run these algorithms on subsets of the data.

Prior to delving into the following queries, a quick note: All of the algorithms used are built into the Cray Graph Engine (CGE) using syntax that deviates from the SPARQL 1.1 query specification. This means that the following queries will only work using CGE. Most of these algorithms can, however, be implemented as iterative algorithms using a combination of “vanilla” SPARQL 1.1 and your programming language of choice such as Java, Perl, Python or a Bash script. An example of such an approach can be found here.

Community detection can be done using many different algorithms. For the purposes of this work we will be using a label propagation algorithm that is built into the CGE.  Given the data at hand, which is network traffic, we will be considering a few of the possible relationship types that exist for network flows.

First, we must decide which entities we want to place into clusters: source IP addresses seem appropriate for this example. From our previous queries, we found that network flow events had several properties such as a source and destination port, timestamp, protocol, number of packets and volume of data. For this exercise, let’s consider destination ports, protocols and destination IP addresses. This particular implementation of label propagation allows for weights to be assigned, so we can exploit that to influence the outcome by biasing the clusters toward the destination IP addresses followed by the protocol.

So here is how we might approach this:

PREFIX cray: <http://cray.com/>

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

 

SELECT ?ip ?community

WHERE {

CONSTRUCT {

?sip1 ?weight ?sip2

}

WHERE

{

{

SELECT ?sip1 ?weight ?sip2

WHERE {

{

SELECT DISTINCT ?sip1 ?dip

WHERE {

?flow a <urn:ontology/event_netflow> ;

<urn:p/srcaddr> ?sip1 ;

<urn:p/dstaddr> ?dip .

}

}

{

SELECT DISTINCT ?sip2 ?dip

WHERE {

?flow a <urn:ontology/event_netflow> ;

<urn:p/srcaddr> ?sip2 ;

<urn:p/dstaddr> ?dip .

}

}

FILTER(?sip1 != ?sip2)

BIND(“3″^^xsd:integer AS ?weight) .

}

}

UNION

{

SELECT ?sip1 ?weight ?sip2

WHERE {

{

SELECT DISTINCT ?sip1 ?protocol

WHERE {

?flow a <urn:ontology/event_netflow> ;

<urn:p/srcaddr> ?sip1 ;

<urn:p/dstaddr> ?dip ;

<urn:p/protocol> ?protocol .

}

}

{

SELECT DISTINCT ?sip2 ?protocol

WHERE {

?flow a <urn:ontology/event_netflow> ;

<urn:p/srcaddr> ?sip2 ;

<urn:p/dstaddr> ?dip ;

<urn:p/protocol> ?protocol .

}

}

FILTER(?sip1 != ?sip2)

BIND(“2″^^xsd:integer AS ?weight) .

}

}

UNION

{

SELECT ?sip1 ?weight ?sip2

WHERE {

{

SELECT DISTINCT ?sip1 ?dport

WHERE {

?flow a <urn:ontology/event_netflow> ;

<urn:p/srcaddr> ?sip1 ;

<urn:p/dstaddr> ?dip ;

<urn:p/dport> ?dport .

}

}

{

SELECT DISTINCT ?sip2 ?dport

WHERE {

?flow a <urn:ontology/event_netflow> ;

<urn:p/srcaddr> ?sip2 ;

<urn:p/dstaddr> ?dip ;

<urn:p/dport> ?dport .

}

}

FILTER(?sip1 != ?sip2)

BIND(“1″^^xsd:integer AS ?weight) .

}

}

}

INVOKE cray:graphAlgorithm.community_detection_LP(2)

PRODUCING ?ip ?community

}

ORDER BY ?community

 

So, let’s break this down: First and foremost, even on a supercomputer large intermediate result sets can be problematic. The first thing we must do to avoid running out of memory is to break down the query into more manageable sub-queries, then we can take the UNION of each of those result sets to feed a list of IP addresses along with our previously determined properties of interest (e.g., IP communications, protocols and destination ports). So, the algorithm will assign an integer value to all closely related IP addresses based on the properties of interest.

Now, it’s all well and good to know which IP addresses are closely related, but it might be a bit more useful to see what properties are most common for each community. We can do this in two ways. The first option is to use the previous query to insert the community ID of each IP address back into the dataset and work from there, or we can wrap the previous query in a new one, which is inefficient because each time we want to run this query, we have to rediscover the communities. On the other hand, the latter method is more flexible because we can play with the community detection algorithm settings if we feel the results are not quite right. Another thing we will do is look for other similarities, so rather than only look at how the communities are similar based on the first list of properties we used, we will add additional properties so that we can discover what other similarities can be found.

PREFIX cray: <http://cray.com/>

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

 

SELECT ?community ?predicate ?property (COUNT(?property) AS ?count)

WHERE {

{

?flow a <urn:ontology/event_netflow> ;

<urn:p/srcaddr> ?ip ;

?predicate ?property .

FILTER(?predicate NOT IN (<urn:p/srcaddr>,<urn:p/start>,<urn:p/end>,<urn:p/packets>,<urn:p/sport>,<urn:p/duration>,<urn:p/tcp_flags>,rdf:type,<urn:p/bytes>)) .

}

UNION

{

?flow a <urn:ontology/event_netflow> ;

<urn:p/dstaddr> ?ip ;

?predicate ?property .

FILTER(?predicate NOT IN (<urn:p/srcaddr>,<urn:p/start>,<urn:p/end>,<urn:p/packets>,<urn:p/sport>,<urn:p/duration>,<urn:p/tcp_flags>,rdf:type,<urn:p/bytes>)) .

}

{

SELECT ?ip ?community

WHERE {

CONSTRUCT {

?sip1 ?weight ?sip2

}

WHERE

{

{

SELECT ?sip1 ?weight ?sip2

WHERE {

{

SELECT DISTINCT ?sip1 ?dip

WHERE {

?flow a <urn:ontology/event_netflow> ;

<urn:p/srcaddr> ?sip1 ;

<urn:p/dstaddr> ?dip .

}

}

{

SELECT DISTINCT ?sip2 ?dip

WHERE {

?flow a <urn:ontology/event_netflow> ;

<urn:p/srcaddr> ?sip2 ;

<urn:p/dstaddr> ?dip .

}

}

FILTER(?sip1 != ?sip2)

BIND(“3″^^xsd:integer AS ?weight) .

}

}

UNION

{

SELECT ?sip1 ?weight ?sip2

WHERE {

{

SELECT DISTINCT ?sip1 ?protocol

WHERE {

?flow a <urn:ontology/event_netflow> ;

<urn:p/srcaddr> ?sip1 ;

<urn:p/dstaddr> ?dip ;

<urn:p/protocol> ?protocol .

}

}

{

SELECT DISTINCT ?sip2 ?protocol

WHERE {

?flow a <urn:ontology/event_netflow> ;

<urn:p/srcaddr> ?sip2 ;

<urn:p/dstaddr> ?dip ;

<urn:p/protocol> ?protocol .

}

}

FILTER(?sip1 != ?sip2)

BIND(“2″^^xsd:integer AS ?weight) .

}

}

UNION

{

SELECT ?sip1 ?weight ?sip2

WHERE {

{

SELECT DISTINCT ?sip1 ?dport

WHERE {

?flow a <urn:ontology/event_netflow> ;

<urn:p/srcaddr> ?sip1 ;

<urn:p/dstaddr> ?dip ;

<urn:p/dport> ?dport .

}

}

{

SELECT DISTINCT ?sip2 ?dport

WHERE {

?flow a <urn:ontology/event_netflow> ;

<urn:p/srcaddr> ?sip2 ;

<urn:p/dstaddr> ?dip ;

<urn:p/dport> ?dport .

}

}

FILTER(?sip1 != ?sip2)

BIND(“1″^^xsd:integer AS ?weight) .

}

}

}

INVOKE cray:graphAlgorithm.community_detection_LP(2)

PRODUCING ?ip ?community

}

}

}

GROUP BY ?community ?predicate ?property

HAVING (?count > 5)

ORDER BY ?community ?predicate DESC(?count)

 

The results of this query will now let us, for the first time thus far, claim to actually “discover” unknown or hidden relationships. What we did is use a few properties to put IP addresses into communities bases on those similarities. By looking at the remaining properties not considered in the initial community detection, we can actually find other similarities without needing to know to look for them in the first place — which, by any definition, constitutes discovery.

triples

For example, we can see that the three most common destination ports for community 1 are 80, 2000 and 53. Nothing horribly shocking there, but nonetheless, we didn’t have to look for it explicitly. In order to take full advantage of this, it might be useful to look at all the different similarities and follow each one independently, then compare them at a macro level in order to find more complex relationships.

One might respond to this line of exploration and say “Great, so now I can see that there are similarities between groups of IP addresses. So what?” The interesting thing about using community detection that considers multiple properties is that we are now dealing in approximations which are difficult to do with other techniques. Take, for example, K-means clustering, which is another way to separate individuals by similarity: As compared to the community detection algorithm we are using, K-means requires either prior knowledge of the dataset or a brute-force approach whereby all or at least the most likely range of values of K (e.g., the number of clusters) are evaluated. Another issue with using K-means clustering as opposed to community detection is that K-means, in its classic form, allows us to consider no more than three dimensions in a practical manner, whereby any number of dimensions can be considered for community detection. There are certainly other algorithms as well as variations on K-means that can make clustering of multidimensional data easier, but our approach to community detection is certainly a good way to do it.

So, we can cluster multidimensional data without a lot of prior knowledge of the dataset; that is, in and of itself, useful, but you may still ask “So what?” Well, if one’s goal is to work with unfamiliar datasets and find unknown relationships between entities (IP addresses, in our example), then, as we have demonstrated, we have stumbled onto a really good approach to doing that.

A practical application for community detection in the context of our example dataset might be to identify malicious traffic by identifying the combination of features that are most closely associated with malicious activity and using those findings to find other instances of malicious activity. This is a nontrivial task using other more common methods, but with community detection, it becomes significantly easier.

One concrete example I can remember from my work in a national-level security operations center was seeing a co-worker stumble on a connection between malicious domains found in WHOIS data. All of the malicious domains were algorithmically generated and defeated common detection mechanisms by increasing the time to live (TTL) values, using pronounceable domain names (formed from pseudo-randomly selected combinations of dictionary words), as well as a few other tricks. One commonality that my co-worker found was that each domain used variations of the same physical address in its registration (always the same street, but different house number). I always thought it was amazing that my co-worker had found that connection, but it was completely by chance. Not to suggest that my peer was not an extremely talented analyst who knew what datasets to look at, but nonetheless, the discovery was pure chance. I always wondered how I might make that type of discovery without rolling the proverbial dice. Using community would have allowed us to make that discovery very quickly and consistently because those types of similarities would naturally bubble up to the surface.

To sum up this blog series, what we have done is start with a dataset that was unfamiliar and found its structure as well as an initial idea of what type of relationships exist within the data. Once we had that information, we used it to make some useful discoveries within the data by invoking a community detection algorithm to find similarities between member of each discovered cluster. Most importantly, we demonstrated that we can, in fact, discover unknown relationships without having much prior knowledge of where to look. Finally, we made a connection to a real-world problem and proposed a reasonable solution to the problem of finding unknown relationships without relying on chance to find the solution.

So, back to Mr. T: “I pity the fool” that doesn’t at least consider how graph analytics might improve their existing analytics work. Hopefully I have demonstrated a few ways that graph analytics can provide practical answers to practical questions. There are many more ways to use graph analytics, and I encourage you to read more about them, using this as a starting point, rather than a final destination.

Speak Your Mind

Your email address will not be published. Required fields are marked *