Have you ever worked an exercise for a class, or studied a topic, only to pick up one little trick or technique that made the entire activity worth it? In one way, I guess it could be considered missing the point, but any time you wind up taking something useful away from an activity, I think it's worth it.
Index of Coincidence is one of those things for me. In a cryptography class, we were tasked with cracking a message encrypted with a Vigenere cipher. A Vigenere cipher is implemented as a series of Caesar ciphers. A Caesar cipher is your typical grade-school shifting of letters (A becomes C, B becomes D, etc.). With Vigenere, rather than having one shift (2 in the previous sentence's example), we have a series of shifts that we cycle through. The first step of breaking a Vigenere cipher is to figure out how many shifts are in that cycle (the key length). A property of text known as "index of coincidence" is used to determine this.
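To make the two ciphers concrete, here's a minimal sketch in Python. It assumes uppercase A-Z input with no punctuation, and the function names are mine, not from any library:

```python
def caesar(text, shift):
    # classic Caesar cipher: shift every letter by the same fixed amount,
    # wrapping around from Z back to A (assumes uppercase A-Z input)
    return ''.join(chr((ord(c) - ord('A') + shift) % 26 + ord('A')) for c in text)

def vigenere(text, shifts):
    # Vigenere: cycle through a list of shifts, one per letter, so every
    # len(shifts)-th letter is effectively enciphered with its own Caesar cipher
    return ''.join(chr((ord(c) - ord('A') + shifts[i % len(shifts)]) % 26 + ord('A'))
                   for i, c in enumerate(text))

print(caesar('ATTACK', 2))         # every letter shifted by 2: CVVCEM
print(vigenere('ATTACK', [2, 5]))  # shifts alternate 2, 5, 2, 5, ...: CYVFEP
```

Note that in the Vigenere output, the letters at positions 0, 2, 4 form one Caesar cipher and the letters at positions 1, 3, 5 form another, which is exactly why knowing the cycle length lets you break it.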
And it’s so useful for other things! The idea is: take your data, be it alphabetic characters, bytes, whatever, and shift it by a certain amount (wrapping back around from the tail to the head). Then compare it with the original, count how many times the two match, and express that count as a fraction of the entire set of data. For example:
Take your data be it alphabetic characters, bytes, whatever, and shift it
data be it alphabetic characters, bytes, whatever, and shift it Take your
 *  *      *                *                  * ** *
8/73 = 0.109589041
So in this example, the Index of Coincidence (IOC) is about 11 percent. Typically, English text has an IOC of about 6.69 percent, although that figure is measured over larger chunks of text (small samples like this one will be all over the map). Other languages have distinctive IOCs as well.
What’s really cool about this is that it’s a good rule-of-thumb measure of how random or uniform a set of data is. Random data should have a very low IOC, while something completely uniform will tend to be closer to 1.0, or 100%. Why is this cool? Good encryption should produce ciphertext that is almost indistinguishable from random data. Good compression algorithms should recognize the patterns in data and squeeze them out, so their output also ends up looking close to random. So when we’re doing analysis of a piece of malware, for example, if we suspect that the sample is encrypted or compressed in some way, we can test that hypothesis.
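You can sanity-check those two extremes before we even get to the script. For uniformly random bytes, any given byte matches the one a fixed shift away about 1 time in 256 (roughly 0.39 percent), while a file of all-identical bytes matches every single time. A quick sketch using the same match-and-divide logic described above:

```python
import os

def calculate_ioc(data, shift=1):
    # fraction of positions where the data matches itself rotated by `shift`
    n = len(data)
    return sum(data[i] == data[(i + shift) % n] for i in range(n)) / float(n)

rand = os.urandom(100000)    # random bytes: expect roughly 1/256, about 0.0039
flat = b'\x00' * 100000      # completely uniform: every position matches, IOC = 1.0
print(calculate_ioc(rand, 5))
print(calculate_ioc(flat, 5))
```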
This is actually a lot more accurate than eyeballing it. Open the grep binary in a hex editor and, unless you understand machine code, it’s going to look pretty incomprehensible, and, if you didn’t know better, random. Let’s compare it with real random data, using a script I’m about to show you:
$ ls -al grep
-rwxr-xr-x 1 weasel weasel 96176 2007-03-01 19:54 grep
$ dd if=/dev/urandom of=rand bs=96176 count=1
1+0 records in
1+0 records out
96176 bytes (96 kB) copied, 0.116349 seconds, 827 kB/s
$ ./ioc.py grep 5
0.0709116619531
$ ./ioc.py rand 5
0.00363916153718
So here (shifting by five) we see that the grep binary has an IOC of about 7 percent, while the chunk of random data we generated has an IOC of about 0.4 percent. What happens if we gzip the grep binary to compress it?
$ gzip grep
$ ls -al grep.gz
-rwxr-xr-x 1 weasel weasel 47446 2007-03-01 19:54 grep.gz
$ ./ioc.py grep.gz 5
0.00503730556844
Wow! Down to 0.5 percent! How can we use this?
- Reverse engineering binary data formats – See if we’re going to have to go through a layer of compression
- Checking whether a piece of malware is encrypted and using a loader
- Looking at a dump of packet data payloads to see if a protocol is likely encrypted or compressed in some way
- Identifying plain English text (and other languages!) in an automated fashion!
Some things to think about and play with:
- Would it be useful to apply this to opcodes in executable binaries?
- Could you identify potential areas of interest for data carving out of huge image files in forensics/data recovery?
- Build a database of common file types and their usual IOC?
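One way to play with that data-carving idea: slide a window across a file, compute the IOC of each window, and flag regions whose score drops toward random. This is just a sketch of mine, not part of the script below, and the window, step, and shift sizes are arbitrary knobs:

```python
import os

def calculate_ioc(data, shift=1):
    # fraction of positions where the data matches itself rotated by `shift`
    n = len(data)
    return sum(data[i] == data[(i + shift) % n] for i in range(n)) / float(n)

def windowed_ioc(data, window=4096, step=1024, shift=5):
    # compute the IOC over a sliding window; low scores hint at compressed
    # or encrypted regions, higher scores at text, padding, or repetitive data
    return [(off, calculate_ioc(data[off:off + window], shift))
            for off in range(0, len(data) - window + 1, step)]

# a run of zero padding stands out sharply against the random half
blob = os.urandom(8192) + b'\x00' * 8192
for off, score in windowed_ioc(blob):
    print('%8d  %.4f' % (off, score))
```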
So here’s some code. Written in Python, like most of what I put together for quick and dirty situations like this. You can run it from the shell, passing a filename and a number to shift by (in most cases, it shouldn’t really matter what number you pick), and it’ll return the IOC as a floating point number between 0 and 1. You can also import it into your own Python script, or the interactive IPython shell, and pass data to the calculate_ioc function directly. The main caveat is that this reads the entire file into memory in one shot, so you might want to roll your own code if you’re running it on something huge.
#!/usr/bin/env python

import sys

def calculate_ioc(data, shift=1):
    # compare the data against itself rotated by `shift` positions and
    # return the fraction of positions that match
    match = 0
    for i in range(len(data)):
        j = (i + shift) % len(data)
        if data[i] == data[j]:
            match += 1
    return float(match) / len(data)

if __name__ == "__main__":
    if len(sys.argv) < 2 or len(sys.argv) > 3:
        print('Usage: ' + sys.argv[0] + ' file [shift]')
        sys.exit(1)
    try:
        fp = open(sys.argv[1], 'rb')   # binary mode, so bytes come through untouched
    except IOError:
        print('Could not open ' + sys.argv[1])
        sys.exit(1)
    data = fp.read()
    fp.close()
    if len(sys.argv) == 3:
        shift = int(sys.argv[2])
    else:
        shift = 1
    print(calculate_ioc(data, shift))
Enjoy! It’s also available here for your right-click-saving pleasure.
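On that memory caveat: the same count can be kept while reading the file in chunks, since each byte only needs to be compared against the byte a fixed distance behind it. Here's a rough sketch of that approach; the function name is mine, and it assumes the shift is smaller than the total file size:

```python
import io
from collections import deque

def calculate_ioc_stream(fp, shift=1, bufsize=65536):
    # streaming IOC: compare each byte against the one `shift` positions
    # behind it, keeping only a `shift`-byte tail in memory at any time
    # (assumes shift is smaller than the total file size)
    tail = deque(maxlen=shift)
    head = []            # first `shift` bytes, saved for the wraparound step
    match = 0
    total = 0
    while True:
        chunk = fp.read(bufsize)
        if not chunk:
            break
        for b in chunk:
            if len(head) < shift:
                head.append(b)
            if len(tail) == shift and tail[0] == b:
                match += 1
            tail.append(b)
            total += 1
    # wrap around: the final `shift` bytes pair up with the first `shift` bytes
    for a, b in zip(tail, head):
        if a == b:
            match += 1
    return float(match) / total if total else 0.0

print(calculate_ioc_stream(io.BytesIO(b'ABABABAB'), shift=2))  # 1.0
```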
For further reference, the Wikipedia page for Index of Coincidence should be useful, if it hasn’t been taken over by roving bands of cryptology trolls, as well as a little googling.