Arun Tejasvi Chaganty about articles blog thoughts research

Mergesort. Dissected.

Note: All code in this under the WTFPL1. Except possibly for Stepane Delcroix’s adapted code, which is under the MIT/X11 license

Very recently, I decided to spend some time coding those standard algorithms that every programmer should know, but I am woefully ignorant of. The plan was to do an algorithm a day. Unfortunately, yesterday night, I started on the Merge sort algorithm, and I found some ancient bash script I wrote to profile and graph the running time of some other sorts (i.e. the bubble sort). Which led to me spending precious time playing around with gnuplot and various program parameters. Expect graphs. Lots of them.

Some background: I’ve lately developed a foreboding sense of being a ‘code monkey’. While I know how to create a UI using GTK+ (and wx), and how to link up stuff using DBus, I still don’t know a fig about QuickSort (which is next on my list). Being a CS student, this is in my view a crime. So I pulled out the big ’ol Cormen, Rivest, et al.2 and started to peruse it (btw, I still haven’t done my algos class). It took me very little time to get a basic implementation of the merge sort ready.

I was about to get back to my GSoC work 3, when I remembered I had written a bash script I wrote some time last November in my introductory CS lab session (of my own volition). At the time, having already programmed for a while, I managed to complete all my lab sessions very quickly, and wasted my time in the lab doing non-sensical things to my C programs, this being one of them (I still fondly remember a CS lab teaching assistant telling me about heap sort. Unfortunately, I didn’t get it at the time). Nonetheless, I was surprised when I found this script, since I honestly don’t know that much scripting now.

#!/bin/bash 
# Compile over different optimizations 
for i in `seq 0 3`; do 
  make CFLAGS="-g -O$i" LDFLAGS="-pg"; 
  echo "#mainO$i SORT DATA" > mainO$i.dat; 
  echo "#No. of elements     Time taken (ms)" >> mainO$i.dat; 
  for j in `seq 1000 1000 50000`; do 
    ./main $j; 
    gprof -b -a main > mainO$i.prof; 
    t=$(awk '/sort/ {print $2}' mainO$i.prof | head -n 1); 
    echo "$j        $t" >> mainO$i.dat; 
    echo "$j        $t"; 
  done 
  echo "mainO$j"; 
  make clean; 
done 
gnuplot tmpl.p;

The script has been subtly modified to make it more robust and cleaner.

Modifying the script required me to go over all of those bash basics I keep forgetting, and then doing the same for gnuplot. I initially made the merge sort work only for arrays of size 2n, and had a some minor complications extending it to an arbitrary sized array. I don’t know about the optimality of my method, but I essentially split every array into two parts, one that’s 2n sized, it’s remainder, recursively. This ends when the size of the array is 1, which happens always. So I needed a method to find the nearest power of 2.

Luckily, Jeffery Steadfast had already written something about this4. When I first read it, I made some absolutely baseless and stupid comment about how his method works (Method #2 that is). It’s far simpler than I thought. To redeem myself however, I figured out how the very efficient method suggested by Stephane Delcroix works (Method #4). It is by far the most beautiful code snippet I’ve seen, doing a sort of a binary search for the top most bit set. Here is the code:

long nearest_pow (long num) 
{ 
  long i, j; 
  (i = num & 0xFFFF0000) || (i = num); 
  (j = i & 0xFF00FF00) || (j = i); 
  (i = j & 0xF0F0F0F0) || (i = j); 
  (j = i & 0xCCCCCCCC) || (j = i); 
  (i = j & 0xAAAAAAAA) || (i = j); 
  return i; 
}

There are 4 bytes to an int, so 8 hex numbers. Each step divides the into into 2. For example in the first step, it checks if there exists a bit set in the first 2 bytes (4 hex numbers), using & FFFF0000. If so, the last 2 bytes are set to 0. In the other case, the first 2 bytes are anyway 0. Now, you only need to focus on 2 bytes. The mask & FF00FF00 works on both the upper 2 bytes and the lower two bytes at the same time. In effect, you’re applying the mask FF00 on the 2 bytes filtered from the first step. This continues to next step as well, with F0F0F0F0 being F0 in for the single byte you’ve filtered (remember, F is 1/2 a byte long). A in binary is 11001100, and C is 10101010. So at the last step, the uppermost bit survives, and all the lower ones are set to 0. And that’s the nearest power of two (on the lower end).

Back to merge sort. Here are some graphs. (Programs compiled with (-g -O{0,1,2,3} -pg), I’ve iterated from 1e6 to 5e7 in divisions of 1e6. That would be 50 data points. Attempts to profile stuff below about 1e5 fail, since it’s too short. I chose 1e6 because I needed to strike a balance between torturing my computer and my patience.):

// Note, `list_` is a scratch buffer. It's been allocated the same memory as
// list
void sort (long len, int *list, int *list_) 
{ 
  long i = 0, idx1 = 0, idx2, idx;
  idx2 = idx = prev_pow(len); 
  // divide 
  if (len != 1) 
  { 
    sort (idx, list, list_); 
    sort (len - idx, list + idx, list_); 
  } 
  else return; 

  // merge 
  while ((idx1 < idx) && (idx2 < len)) 
    list_[i++] = (list[idx1] < list[idx2])?list[idx1++]:list[idx2++]; 
  if (idx1 < idx)
    memcpy (list_+i, list+idx1, sizeof(int)*(idx - idx1)); 
  else 
    memcpy (list_+i, list+idx2, sizeof(int)*(len - idx2)); 
  memcpy (list, list_, sizeof(int)*(len)); 
}
Mergesort Vanilla
Mergesort Vanilla

At first I thought the spikeiness of the graphs was caused due to load variations on my system, but after looking at how smooth the other graphs were, I don’t know. Perhaps it’s caused by variation due to sensitivity of the initial array (but technically speaking, merge sort is tightly bound on \(O(n * \log n)\), and there shouldn’t be that much variation…). Does anyone have an insight on what might be the matter?

A very popular modification of the merge-sort algorithm is to use an insertion sort for branches smaller than a particular cutoff 5. I, of course, wanted to see whether there was a relation between input size and optimal cutoff. However, on analysis, I found that the cutoff did not help in any way (Note: the cutoff=2 case (i.e. it doesn’t insert sort much) actually takes longer than the vanilla mergesort). Perhaps it is something wrong in my insertion sort implementation, though I don’t see anything that could go wrong (after all, insertion sort is very trivial). All graphs made using the program compiled with (-g -O2 -pg), given increasing cutoffs at fixed array sizes.

//... 
idx2 = idx = prev_pow(len); 
/* insertion-sort */ 
if (len < cutoff) { 
  for (idx1 = 0; idx1< len; idx1++) { 
    i = idx1; 
    for (idx2 = idx1; idx2< len; idx2++) 
      i = (list[idx2] < list[i])?idx2:i; 
    if (i == idx1) 
      continue; 
    idx2 = list[idx1]; 
    list[idx1] = list[i]; 
    list[i] = idx2; 
  } 
  return; 
} 
/* divide */ 
else  
{ 
  sort (idx, list, list_); 
  // ...
}

Something interesting worth noting are the big jumps somewhere between 60 and 80, and 120 and 140, which I’m sure are 64 and 128. This is because that marks a branching point for merge sort. So by having an insert sort at 65, you ‘save’ the merge sort a branch. I don’t see any savings though (btw, 64 is the most often quoted cutoff I’ve seen).

Finally, I decided since I’ve already done the effort, why not plot graphs for Bubble Sort and Insertion Sort (these were the only two sorts I did in the past).

Nothing particularly interesting to see, other than proof that bubble sort is the worst sorting algorithm of all time. Also, I’d like to point out how smooth the graphs are. That’s it for today. Hopefully, more graphs are on their way.


  1. This more for my amusement than anything else. I think it is absolutely unneccessary to license code as trivial as shown in this blog.

  2. Introduction to Algorithms : Cormen, Rivest, et. al I recently found out that a friend of mine studying economics at Dartmouth (Darth Mouth ;-)), had a course taken by one of these authors. Which is just awesome beyond words. Pity he wasn’t doing a solid algorithms course with them though.

  3. For people interested in my GSoC progress, I’m happy to say that the plugin is fundamentally complete, and requires some bug fixing and feature bloating. There are a few changes to the Anjuta source code, and until that’s been approved (which requires me cleaning it up and sending it first), I can’t release it. When I do though, it’d be awesome if you’d try it out. Vim is something that has so many bells and whistles, its hard to make sure it doesn’t fail on somebody else’s configuration.

  4. Jeffery Steadfast Nearest Power of Two

  5. The Algoritmist’s take