2010 May 23

Sorting out sorting

Sorting arrays is one of the simplest things for the human mind, but to put it into code is a little different. We can look at the numbers 7, 4, 4, 12, and 5 and know what order they go into (4, 4, 5, 7, 12), but to a computer the numbers are just numbers and don't really mean anything. They're simply a tool for the program to do something with; as far as the computer cares, the order is irrelevant.

But many, many programs require sorting. Imagine a tax recording program, that has to sort by value. Imagine a spreadsheet program that has to sort cells by date added. Perhaps you have a game that sorts users by score. Even the text you're reading now is stored in MySQL and is sorted by date added (or ID).

No matter how you slice it, programs have a need to sort things, be it numerically or otherwise. But especially for beginner programmers, thinking of how our brain sorts those numbers isn't so easy.

Bogosort

The most ridiculous sorting algorithm is Bogosort, which is described best by Wikipedia: "If bogosort were used to sort a deck of cards, it would consist of checking if the deck were in order, and if it were not, one would throw the deck into the air, pick the cards up at random, and repeat the process until the deck is sorted." Obviously this is not a very efficient algorithm and therefore you would never see it in an actual program save for humorous purposes. (Seen below.)

# Bogosort Implementation
# This does not use a user-defined function, because I didn't take the time to add it.

import random;
deck = range(1,53);
count = 0;
while not all(x <= y for x, y in zip(deck, deck[1:])):
    random.shuffle(deck);
    count += 1;

print "Took", count, "tries!";

Bubblesort

Probably the most elementary algorithm is Bubblesort. It consists of going through the array and swapping elements that are not in order until you reach the end of the array, then starting at the beginning again. This is incredibly inefficient, as you can see:

7, 4, 4, 12, 5
4, 4, 7, 12, 5
4, 4, 7, 5, 12
4, 4, 5, 7, 12

With just four numeric elements, it took us four loops to get it right. Imagine sorting 100,000 elements. Nevertheless, here is the implementation:

# Bubblesort Implementation
# Input requires an array of elements and a callback function, such as the "sort_numeric" function that follows.

def bubblesort(narray, f):
    L = range(0,len(narray)-1);
    while True:
        s = 0;
        for i in L:
            if (f(narray[i],narray[i+1]) == 1):
                s += 1;
                sv = narray[i];
                narray[i] = narray[i+1];
                narray[i+1] = sv;

        if (s == 0): break;

    return narray;

def sort_numeric(a, b):
    if (a > b): return 1;
    elif (a < b): return -1;

    return 0;

print bubblesort([7, 4, 4, 12, 5],globals()[str("sort_numeric")]);

Now we're kind of beyond the idea of inefficient and humorous sorting algorithms. The next up is called Insertion Sort and a similar type of code that I wrote myself, that I called Swap Sort.

Insertion sort

Basically what Insertion does is look through the array for the first element of a sorted array. Once it finds the first element, it adds it to the "sorted" array and removes it from the "unsorted" array. It repeats this process until there are no more items in the unsorted array.

SORTED: [] - UNSORTED: [7, 4, 4, 12, 5]
SORTED: [4] - UNSORTED: [7, 4, 12, 5]
SORTED: [4, 4] - UNSORTED: [7, 12, 5]
SORTED: [4, 4, 5] - UNSORTED: [7, 12]
SORTED: [4, 4, 5, 7] - UNSORTED: [12]
SORTED: [4, 4, 5, 7, 12] - UNSORTED: []

The advantage here is that with less than 10 or so items, there is very little overhead runtime and memory usage. However, if you're using this for sorting a database that can contain a million items, you're probably going to want to upgrade.

# Insertion Sort Implementation
# This is not true Insertion Sort, but a bit of a variation that only uses one copy of the array.
# Input requires an array of elements and a callback function, such as the "sort_numeric" function from the Bubble Sort example.

def insertionsort(narray, f):
    L = range(1,len(narray));
        for i in L:
            v = narray[i];
                j = i-1;
                while (j >= 0 and f(narray[j],v) == 1):
                    narray[j+1] = narray[j];
                        j -= 1;
                    narray[j+1] = v;
            return narray;

print insertionsort([7, 4, 4, 12, 5], globals()[str("sort_numeric")]);

A similar idea of mine is called Swap Sort and works by looping through the array one element at a time, then checking where that element belongs in the array, then putting it there. It's slightly (and I mean slightly) slower than Insertion Sort, but gets the job done. It's a bit long to paste here, so I'll omit it.

Quicksort

I was going to cover Merge Sorting here, but instead I'm going to leave that to last. Here I'll briefly cover Quicksort and it's amazingness.

Quicksort is probably the easiest sorting algorithm to implement in Python if using a custom user function to sort. (If you're just sorting numerically, you can use sorted(array); and boom, done.)

Quicksort starts with a pivot; which is usually the first item in the array but can be any. If then quickly scans the entire array for any items "less" than that, and any items "greater" than that, and places them on the left and right side respectively. It then sorts the left and right hand sides individually using a recursive call.

Using Python's advanced for loop code, you can do this algorithm simply by doing:

# Quick Sort Implementation
# Input requires an array of elements and a callback function, such as the "sort_numeric" function from the Bubble Sort example.

def quicksort(narray, f):
    if (narray == []): return [];
    pivot = narray[0];
    return quicksort([x for x in narray[1:] if f(x,pivot) <= 0],f) + [pivot] + quicksort([x for x in narray if f(x,pivot) > 0],f);

print quicksort([7, 4, 4, 12, 5],globals()[str("sort_numeric")]);

Merge sort

I could cover super advanced algorithms like TimSort (which seems to beat all these suckers substantially) or other neat sorting funtions, as there are no less than 30 of them in common practice out there.

Instead I'll cover Merge Sorting and call it a night.

Merge Sorting works a bit like QuickSort. But instead of using a pivot, it simply divides the array in half, then sorts each of the halves and merges them together. The idea is that each time they're merged, they're sorted; so when you merge [1] and [2], it returns [1,2]. That can be merged with [5,6] (which must be sorted already) to get [1,2,5,6].

[7, 4, 4, 12, 5]
[4] + [4, 7] + [5, 12]
[4] + [4, 5, 12] + [7]
[4, 4, 5, 7, 12]

Basic implementation is simple:

# Merge Sort Implementation
# Input requires an array of elements and a callback function, such as the "sort_numeric" function from the Bubble Sort example.

def mergesort(narray, f):
    if (len(narray) <= 1): return [narray[0]];
    middle = int(len(narray)) / 2;
    left = mergesort(narray[0:middle],f);
    right = mergesort(narray[middle:],f);
    result = [];
    while (len(left) > 0 and len(right) > 0):
        cmpthis = f(left[0],right[0]);
        if (cmpthis <= 0):
            result.append(left[0]);
            left = left[1:];
        else:
            result.append(right[0]);
            right = right[1:];
    if (len(left) > 0): result.extend(left);
    else: result.extend(right);

    return result;

print mergesort([7, 4, 4, 12, 5],globals()[str("sort_numeric")]);

However, if you want to get REALLY amazingly fancy with this, you can use buffers and remove allocations that bog down Python's memory and speed this algorithm up faster than QuickSort. And please, don't ask me how this works, I didn't write this.

# Advanced Merge Sorting algorithm
# Author: StealthEye

def mergesort_internal_merge(input, output, f, begin1, end1, begin2, end2, beginOutput, endOutput):
    for i in range(beginOutput, endOutput):
        if (begin1 >= end1):
            difference = 1;
        elif (begin2 >= end2):
            difference = -1;
        else:
            difference = f(input[begin1], input[begin2]);

        if (difference < 0):
            output[i] = input[begin1];
            begin1 += 1;
        else:
            output[i] = input[begin2];
            begin2 += 1;

def mergesort(data, f):
    buffer = range(len(data));

    dataLength = len(data);
    dataEnd = dataLength;

    halfChunkLength = 1;
    chunkLength = halfChunkLength * 2;
    while halfChunkLength < dataLength:
        chunkBegin = 0;
        chunkMid = halfChunkLength;
        chunkEnd = chunkLength;
        while (chunkEnd < dataEnd):
            mergesort_internal_merge(data, buffer, f, chunkBegin, chunkMid, chunkMid, chunkEnd, chunkBegin, chunkEnd);
            chunkBegin += chunkLength;
            chunkMid += chunkLength;
            chunkEnd += chunkLength;

        if (chunkMid < dataEnd):
            mergesort_internal_merge(data, buffer, f, chunkBegin, chunkMid, chunkMid, dataEnd, chunkBegin, dataEnd);

        temp = data;
        data = buffer;
        buffer = temp;

        halfChunkLength = chunkLength;
        chunkLength *= 2;

    return data;

So as you can see, there's tons of different methods and ways to enhance great sorting algorithms. Were we to put these into C++, they would be even faster as Python is a script; it has to run its code through another code to execute, where as C++ runs natively on the computer.

Be sure to check out the aforementioned Timsort (and more info), which is faster than this Merge Sort algorithm still! Unfortunately, I haven't managed to find its code yet, so if someone does please pass it on! :)

Next Post Previous Post

Eric Kumlin