Courses/CS 2124/Lab Manual/Efficient Sorting

From A-State Computer Science Wiki
Jump to: navigation, search

Introduction

In this lab, you will develop algorithms for sorting data that are more efficient than the primitive Bubble Sort and Selection Sort that you have seen previously.

The Quicksort

The Quicksort algorithm was developed in 1960 by Tony Hoare. It is a divide-and-conquer approach in which the array is repeatedly partitioned into two sub-arrays and then those are themselves sorted by Quicksort. A recursive specification of the algorithm is:

quicksort([in/out]array, left_index, right_index):
    if left_index < right_index:                              // being "unsorted" requires 2 or more values...
        division = partition(array, left_index, right_index)  // smaller values to left of division, larger to right
        quicksort(array, left_index, division-1)              // now quicksort everything left of division
        quicksort(array, division+1, right_index)             // and everything right of division

Where the partition step involves choosing an array element (referred to as the pivot element) and then moving all values that are less than the pivot to the left partition, and all values that are larger to the right partition, with the pivot element separating the two. At this point, the pivot itself is in its final sorted position. The partition algorithm is:

partition([in/out]array, left_index, right_index):
    pivot_index = choose_pivot(a, left_index, right_index)
    pivot_value = a[pivot_index]                 // pivot value for easy reference
    swap(a[pivot_index], a[right_index])         // move pivot to end (out of the way)
    pivot_index = left_index                     // start looking for "correct" location
    for each i from left_index to right_index-1: // by scanning the entire array
        if a[i] < pivot_value:                   // if value smaller than pivot is found
            swap(a[i], a[pivot_index])           // swap it into left partition
            pivot_index++                        // and move the dividing point
    swap(a[pivot_index], a[right_index])         // swap pivot into its correct location
    return pivot_index                           // and return the location of the split

One detail that is not specified in the algorithm is exactly how the choose _pivot step is performed. Any element in the array being partitioned is a possible candidate for the pivot, but some choices are better than others. Ideally, the pivot should be the median value in the array, thus producing a near-perfect halving of the array into the two partitions.

In practice, however, it would be too expensive to compute the median value. A faster option would be to simply choose the first value in the array as the pivot; if the array were already ordered, however, this would lead to either the left or right partition being empty on every partitioning step, giving the worst-case performance ({\textstyle O(n^2)}) for Quicksort.

A reasonable compromise is to choose a pivot value at random from within the array. This has the advantage of being computationally inexpensive, and also avoiding worst-case performance even if the array values are already ordered… Of course the performance can still degrade if the randomly chosen sequence of pivot values are very “unlucky” (meaning in this case that they are always near the extremes of the values, and never near the median).

In quicksort.h, implement the following functions to perform partitioning and sorting with Quicksort recursively as described above:

int partition(int array[], int left, int right) : Perform an in-place partition (using the algorithm shown above) on array within the range {\textstyle [left,right]} (inclusive on both ends). The pivot should be chosen at random from within the range of indices under consideration. The final index of the pivot, which marks the dividing point for the two partitions, should be returned. The partition should place all values smaller than the pivot in the left partition and all values greater than or equal to the pivot in the right partition.

void quicksort(int array[], int left, int right) : Perform the Quicksort algorithm on array, given the array itself and the indices of the left and right ends of the range within the array to be sorted. Note that the range {\textstyle [left,right]} is inclusive on both ends. The array should be sorted in-place, using the recursive Quicksort algorithm presented above.

void quicksort(int array[], int size) : Perform the Quicksort algorithm on array, given the array itself and its size. This function is provided as a user-friendly interface, but should simply call the 3-parameter version internally.

As you work, you should create a main() function containing the necessary code to test your partition() and quicksort() functions.


Labcheckpoint.png FOR IN-LAB CREDIT: Demonstrate your partition() and quicksort() functions.


Heaps

A heap is a special case of a tree in which the parent-child relationship also specifies priority. Heaps come in two varieties:

min-heap : a min-heap is a tree in which any non-leaf node contains a value that is less-than the value of any of its children.
max-heap : a max-heap is a tree in which any non-leaf node contains a value that is greater-than the value of any of its children

The formal definition of a heap is given recursively, in a bottom-up manner, as follows:

* The empty tree is a heap.
  • (for a min-heap) A subtree rooted at node {\textstyle R} is a min-heap if each of {\textstyle R}’s child subtrees are min-heaps and the values of each of {\textstyle R}’s direct descendants (if they exist) are greater than or equal to the value at {\textstyle R}.
  • (for a max-heap) A subtree rooted at node {\textstyle R} is a max-heap if each of {\textstyle R}’s child subtrees are max-heaps and the values of each of {\textstyle R}’s direct descendants (if they exist) are less than or equal to the value at {\textstyle R}.

While the most abstract concept of “heap” doesn’t define the number of children a node may have, it is common and convenient to represent heaps as binary trees. Additionally, heaps are often further constrained as follows:

  • Let the depth {\textstyle D} of the heap be defined as:
    • {\textstyle D=0} for the empty heap,
    • {\textstyle D=1} for the heap with a single element
    • {\textstyle D=} the longest available path from the root to any leaf, plus 1, otherwise.
  • All leaf nodes must be at depth {\textstyle D} or {\textstyle D-1} (meaning the tree must be complete or nearly-complete).
  • All leaf nodes must be in the left-most available locations (when viewing the heap as a tree using the traditional root-at-top diagram).

These constraints allow us to easily map a binary heap into an array by defining the following mathematical relationships:

  • The left child of the array element at index {\textstyle i} is located at index {\textstyle 2i+1} (if a left child exists)
  • The right child of the array element at index {\textstyle i} is located at index {\textstyle 2i+2} (if a right child exists)
  • The parent of the array element at index {\textstyle i} is located at index {\textstyle \lfloor {{i-1}\over 2} \rfloor} (except for the root, which is at index {\textstyle 0} and has no parent)

For simplicity, let the following algorithms be used for the remainder of this discussion:

left(i):                                                        // calculate index of left child
    return 2 * i + 1                                            // which must be bounds-checked before use!

right(i):                                                       // calculate index of right-child
    return 2 * i + 2                                            // which must be bounds-checked before use!

parent(i):                                                      // parent is at floor( (i-1)/2 )
    return truncate( (i - 1) / 2 )                              // this too must be bounds-checked before use!

Basic Heap Operations

A brief description of the two most basic heap operations is given below. Other operations on heaps are possible as well, but for the purposes of developing the Heap Sort algorithm, any additional operations are unnecessary.

Insert

Inserting an item into an array-based heap is a simple matter of adding the item to the end of the array, then restoring the heap property:

max_heap_insert(new_value, [in/out]heap_array, [in/out]heap_size):
    i = heap_size                                               // add at the current end of
    heap_array[i] = new_value                                   // the heap + 1, then
    while i > 0 and heap_array[i] > heap_array[parent(i)]:      // restore heap property
        swap(heap_array[i], heap_array[parent(i)])              // by moving new value up the tree
        i = parent(i)                                           // while it is larger than its parent
    heap_size++                                                 // heap just got larger
Information Icon.png

A Brief Note on Notation: The [in/out] notation in the parameter list above denotes parameters that act as both input and output parameters; in C++ terms, this would mean that they are passed by reference.

Remove

Generally, the only place that is “interesting” for removal from a heap is at the root (the maximal or minimal value). If the heap is contained in an array, the root element is at index 0 (the first item in the array). It would be expensive to truly “remove” it here, since that would mean shifting all of the other elements “left” by one location (a {\textstyle O(n)}, or linear, operation). Instead, we will play a “trick”. By swapping the element at the root (the one we want to “remove”) with the last element in the heap, we can move it “out of the way”. We will then decrease the size of the heap by 1, effectively “removing” the item that we swapped to the end. Then, we just have to move the new root element down the heap until the heap property is restored. This operation is often referred to as sift-down. The algorithm for remove and sift-down is shown below:

heap_remove([in/out]heap_array, [in/out]heap_size):
    result = heap_array[0]                                      // save the result to return later
    swap(heap_array[0], heap_array[heap_size-1])                // swap root to the end 
    heap_size--                                                 // mark the heap as one item smaller
    max_heap_sift_down(heap_array, heap_size)                   // sift down the element just swapped to root
    return result                                               // and return the saved result

max_heap_sift_down([in/out]heap_array, heap_size):              // a simpler interface given array and size
    max_heap_sift_down(heap_array, 0, heap_size-1)              // calls the more general algorithm shown below

max_heap_sift_down([in/out]heap_array, left_index, right_index):
    done = false                                                // assume there is work to be done
    i    = left_index;                                          // starting with the "root" (left-most item)
    while not done and left(i) <= right_index:                  // scanning until done or h[i] has no children
        max_child = left(i)                                     // assume left child is max
        if right(i) <= right_index                              // see if right child exists
           and heap_array[right(i)] > heap_array[left(i)]:      // and is larger than left
            max_child = right(i)                                // if so, mark right child as max_child
        if heap_array[max_child] > heap_array[i]:               // if the max_child is bigger than h[i]
            swap(heap_array[max_child], heap_array[i])          // swap h[i] with its max_child
            i = max_child                                       // and continue scanning from that index
        else:                                                   // otherwise, no swap was made, so
            done = true                                         // the heap property holds, and we can stop

Note: The sift-down algorithm above is overloaded with two interfaces; one is “simple”, requiring only the array and its size. The other is more general, allowing the index of the left and right sides of a range (inclusive on both ends) to be passed as parameters. This more general version will become useful in the bottom-up heapify algorithm developed below.

Implement the algorithms shown above (in the file maxheap.h) so that you can use an array of integers as a max-heap. Test your functions by creating a heap by adding at least 25 random integer values to an array. Display the random values being added to the screen, and then display the heap created from those values. Check that the heap property holds. Now, add a loop to remove all the values from the heap, printing them to the screen as they are removed. When you are satisfied with the functionality of your code, demonstrate it to the lab proctor.


Labcheckpoint.png FOR IN-LAB CREDIT: Demonstrate your heap functions: max_heap_insert() heap_remove() max_heap_sift_down() max_heap_sift_down()


Heapify

Given an array containing unordered data, a heap can be constructed in one of two ways:

top-down : Construct the heap by using the algorithm for insertion into an empty heap. This can be performed in-place (see below).
bottom-up : Construct the heap by first making heaps from all of the subtrees (beginning at the leaves) and working toward the root (see below).

Top-Down Heapify

Given the algorithms shown above, a top-down heapify can be performed in-place simply by “pretending” that the array is empty initially, then “adding” the successive elements to it. The reality is that the array already contains unordered values, but there is nothing in the max_heap_insert() algorithm that would damage any of the data beyond the perceived “end” of the heap:

make_max_heap_top_down([in/out]array, array_size):
    heap_size = 0                                               // pretend that the heap begins empty
    while heap_size < array_size:                               // while all elements are not yet added
        max_heap_insert(array[heap_size], array, heap_size)     // "add" elements (in-place) to the heap

This algorithm almost looks too simple to work, but remember that the logical size of the heap is starting at zero and growing by one on each iteration of the loop, and the max_heap_insert() algorithm modifies the value of its heap_size parameter, thus modifying the value of the loop control variable.

Bottom-Up Heapify

Another way to look at the operation of creating a heap from an existing unordered array is to construct it from the bottom up. What this means is that you start with the leaves, and work your way up the tree hierarchy, enforcing the heap property at each level. This algorithm relies on the recursive formal definition of a heap shown earlier. The algorithm is:

make_max_heap_bottom_up([in/out]array, array_size):
    subroot = parent(array_size-1);                             // start with the last non-leaf node
    while subroot >= 0:                                         // run until we reach the true root
        max_heap_sift_down(array, subroot, array_size-1)        // sift down the new subroot's value
        subroot--                                               // and move left to the next subroot

The bulk of the work here is being performed by the sift_down algorithm; the heap is being built by joining together all of the subtrees and maintaining their heap properties by sifting down the new root each time. While this sounds almost like the exact reverse of the top-down algorithm presented above, there is a subtle difference: In the top-down algorithm, new data is being added at the bottom of the tree, but the number of swaps required to restore the heap property increases with depth. The bottom-up algorithm inserts new values higher in the tree structure, so fewer swaps are necessary to restore the heap property. Because of this, the overall complexity of the top-down algorithm is {\textstyle O(n \cdot \textrm{lg}(n) )}, while the bottom-up algorithm has {\textstyle O(n)} (linear) complexity.

Implement both the top-down and bottom-up heapify algorithms shown above in the file maxheap.h. Test your code by creating an array filled with at least 25 random values, and making two copies of it. Heapify one copy with the top-down algorithm, the other with the bottom-up algorithm, and leave the third unmodified. Print the values to the screen so that there are three columns of output: the original array value, the top-down heapified value, and the bottom-up heapified value. When you are satisfied that your make_max_heap functions are working properly, demonstrate them for your lab proctor.


Labcheckpoint.png FOR IN-LAB CREDIT: Demonstrate your make_max_heap functions.


Heap Sort

The heap sort algorithm is defined in terms of the previous heap operations, and makes clever use of the properties of an array-based heap. The algorithm is as follows:

heap_sort([in/out]array, array_size):
    make_max_heap_bottom_up(array, array_size)                  // begin by making the array a heap
    right_index = array_size - 1                                // mark the index of the right edge
    while right_index > 0:                                      // loop until the right side reaches 0
        swap(array[0], array[right_index])                      // swap the max value to the right side
        right_index--                                           // move the right end marker
        max_heap_sift_down(array, 0, right_index)               // sift down the value swapped to root

This is a very simple algorithm! Once you create the heap using a heapify algorithm, it is simple to repeatedly take the root (which is the current remaining maximum value) and swap it into place at the “end” of the ever-shrinking “working” array… Then a sift-down step will restore the heap property to the remaining array. Repeat this for each of the {\textstyle N} elements in the array.

Closer observation will point out that you have already written the code necessary to do this! The remove algorithm performed the same “swap,shrink,sift-down” operations that are required here! Since the remove algorithm also doesn’t actually harm the array in any way (it simply reduces the logical size, but the removed items are really still there at the end of the physical array) — we can just call it! This simplifies the heap sort to:

heap_sort([in/out]array, array_size):
    make_max_heap_bottom_up(array, array_size)                  // begin by making the array a heap
    elements_remaining = array_size                             // store the initial number of elements
    while elements_remaining > 1:                               // while the unsorted part of the array is non-empty
        heap_remove(array, elements_remaining)                  // "remove" the max (which moves it to the end!)

In this simplification, we are not using the value returned by the heap_remove function; we are instead relying on the side-effect of it’s moving the element being removed to the end of the logical array. This simplification does not reduce the computational complexity of the heap sort algorithm, but it makes the code cleaner and takes full advantage of the DRY (Don’t Repeat Yourself) Principle.

Implement the heap_sort() function in the file heapsort.h. Add code to your main() function to make a third copy of the unordered array of random values you used earlier. Sort this copy of the array, and add output so that your test output now has four columns: the original value, the heap (using top-down heapify), the heap (using bottom-up heapify), and the sorted value. When you are satisfied with the results of your function, demonstrate your program to the lab proctor.


Labsubmitsinglefile.png FOR IN-LAB CREDIT: Zip up these files: All files necessary to build and run your project.
Name the file efficient_sorting.zip and upload to CSCADE.