Order: sorting and order statistics for the PostScript language

anastigmatix.net

This document has a standard, validated CSS2 stylesheet, which your browser does not seem to display properly. In a browser supporting web standards, this table of contents would be fixed at the side of the page for easy reference.

anastigmatix home

Sorting and order statistics

Order: reference

Order dictionary contents

Comparison

Searching

Min and max

Insertion techniques

InsSortStack (deprecated)

“Quick” techniques

Heap techniques

Utilities

Order: Sorting and order statistics in PostScript

Version 0.1 6, which was downloadable for a few hours on 21 September 2005, had a bug in Select. If you have that version please update.

This resource is intended to provide reasonable implementations of several algorithms for searching, sorting and order statistics, generally implemented out of Cormen, Leiserson, and Rivest. The procedures in this resource are all allocation-free, using only the stacks and arrays or strings passed by the caller, and can be applied interchangeably to arrays and strings, treating a string as a compact array of small integers.

Procedures are provided for efficiently finding minimum and maximum, searching sorted data, sorting small data sets by insertion, sorting larger sets or extracting order statistics using quicksort and techniques derived from it, and sorting in worst-case optimal time (though usually more slowly than quicksort) using heaps and their related operations, which are also useful for constructing other interesting structures like priority queues.

Order: reference

Order is a ProcSet resource. To make it available to your own code, include in the setup section of your file:

/net.anastigmatix.Order /ProcSet findresource

The findresource will succeed, leaving a dictionary on the operand stack, if you have made the Order resource file [download] available in any of these ways:

You have included the resource file in the prolog of your own file, before the findresource (which belongs in the setup section)
You downloaded the resource file to your printer's persistent memory, since it was last powered on
You saved the resource file in your printer's (if it has a disk or flash filesystem), or ghostscript's, resource directory under the right name
You use document manager software that recognizes the %%DocumentNeededResources and %%IncludeResource DSC comments, you include these comments at the right position in your file to specify that it needs net.anastigmatix.Order, your document manager software is configured to automatically insert needed resources in files being printed, and you have put the Order resource file where your document manager can find it.

The resource files are in a compact form. That is for efficiency, not to keep you from viewing them; there is a script for that on the resource packaging page.

The Order dictionary may be placed on the lookup stack (with begin) for convenient access to the definitions in it, without the bother of get and exec. The dictionary is read-only, so before creating any definitions, you will want either userdict begin or your own dict begin so that you have a writable dictionary on top of the dictionary stack.

Order dictionary contents

This section describes the contents of the read-only dictionary that is returned by /net.anastigmatix.Order /ProcSet findresource.

Comparison procedure

A comparison function for most procedures in this resource is a procedure that consumes two objects from the stack and returns only -1 if the first object should precede the second (topmost) object, zero if they should be considered equal, and 1 if the first should follow the second. A comparison function should be efficient and well tested; it may be called many times, and the stack is not protected: beneath the two objects it is expected to consume, values internal to one of these procedures may be visible, and problems resulting from a malfunctioning comparison that has the wrong stack effect may be difficult to trace.

In some other packages that may be familiar, comparison functions are expected to return any negative value, zero, or any positive value, respectively. That definition must not be used with procedures in this resource. Comparison results are restricted to -1, 0, or 1 because they may be used directly to index into a choice of procedures for speed.

The deprecated procedure InsSortStack is the only one in this resource taking a comparison function that can safely use any negative or positive value, only because InsSortStack was included before the restriction to -1, 0, 1 was documented. Values other than -1, 0, 1 may work with other procedures in the current version, but are not supported and the behavior must not be relied on.

One predefined comparison function is provided that applies the standard PostScript comparisons for integers, reals, or strings:

PolyCmp: num1|str1 num2|str2 PolyCmp -1|0|1
a simple comparison procedure polymorphic to numbers or strings. Returns -1 if num1|str1 is less than num2|str2, zero if they are equal, otherwise 1.

Searching

When data can be ordered, binary searching becomes possible.

BinarySearch: arr|str key cmp BinarySearch index bool
searches for the value key in the array or string arr|str, assumed already ordered according to the comparison function cmp. If a matching entry is found, return its index and true, otherwise return the index at which key could be inserted—that is, the index of the least value greater than key or the length of arr|str if none such—and false.

Minimum and maximum

FindMin and FindMax determine both the value and the location of the minimum or maximum entry, respectively, in an array or string. MinMaxCmp and MinMax determine the minimum and maximum values simultaneously, but not their locations, in about three fourths the comparisons that would be needed to determine the minimum and maximum separately.

FindMin

arr|str cmp FindMin index value

given array or string arr|str of length at least 1, and cmp a comparison function, leave index and value on the stack such that value is the value at index index in arr|str, and for no value x in arr|str can x value cmp return -1.

FindMax

arr|str cmp FindMax index value

MinMaxCmp

array cmp MinMaxCmp min max

given array of length at least 1, and cmp a comparison function, leave min and max on the stack such that each is a value from array, and for no value x in array can x min cmp or max x cmp return -1.

This procedure differs from MinMaxCmp only in taking a comparison procedure with the usual behavior (returning -1,0,1), where MinMax takes anything that acts like lt.

MinMax

array ltproc MinMax min max

given array of length at least 1, and ltproc a procedure with the semantics of lt, leave min and max on the stack such that each is a value from array, and for no value x in array can x min ltproc or max x ltproc return true.

This procedure differs from MinMaxCmp only in taking a comparison procedure with the behavior of lt instead of the general -1,0,1 comparison procedure used with most procedures in this resource.

Insertion techniques

The procedures in this family put elements in order by inserting each in turn at the proper position among those already ordered. The simplicity of the procedure makes it faster than the more sophisticated methods when sorting a few items, but the time required grows as the square of the number of items, making the technique a poor choice for larger jobs. Internally, QuickSort and Select hand off to InsSort any job below a threshold size.

InsSort: arr|str cmp InsSort arr|str
rearranges the elements of the array or string arr|str in an order satisfying the comparison function cmp.
InsSortStk: any_n-1...any₀ n cmp InsSortStk any_n-1'...any_0' n
rearranges the n items any_n-1...any₀ on the stack according to the comparison function cmp so that the result of any_i any_j cmp cannot be 1 for any i and j unless i < j.
InsStk: any_n-1...any₀ n cmp anew InsStk any_n...any_i anew any_i-1...any₀ n+1 cmp
places the single item anew in position among the n items any_n-1...any₀ already on the stack ordered according to the comparison function cmp as for InsSortStk. The count n, incremented, and the comparison function cmp remain on the stack. This procedure is intended for growing an ordered stack of items one by one.
InsSortStack: any_n-1...any₀ cmp n InsSortStack any_n-1'...any_0' n
Deprecated. This procedure differed from InsSortStk only in the order of arguments (cmp and n were interchanged), and in that it accepted a cmp function that could return arbitrary negative or positive integers and not only -1, 0, or 1—such a function is not supported for other procedures in this resource. For compatibility, InsSortStack is retained as a wrapper that calls InsSortStk with transformed arguments.

“Quick” techniques

A number of interesting algorithms can be built around C. A. R. Hoare's quicksort or its handy partition operation. The sort has an expected complexity of n log n, the best possible for a comparison sort, and is also just plain faster than many algorithms with the same theoretical bound. However, the rare input can be in just the right order to make quicksort take n² time. This implementation uses a (non-randomized) median-of-3 technique to make that event less likely; in particular, input that is already sorted will not slow QuickSort down.

If consistency is more important than high speed, HeapSort is an alternative that will take about 2.5 times as long as QuickSort on typical inputs, but guarantees its n log n time bound for all cases.

Useful procedures that can be built on partition include Select, for obtaining the ith order statistic of n elements—what would be the ith element if they were in order—in expected time proportional to n, which beats the obvious approach of comparison-sorting the elements first. Like QuickSort itself, Select can be tricked by certain unlucky inputs into exceeding the expected time bound, but is careful to make that unlikely.

A conservative version, SelectWC, guarantees linear time even in the worst case; it is packaged separately. As with HeapSort versus QuickSort, you pay for that guarantee in generally longer running times overall, about five times longer than Select on typical inputs. That makes just sorting the array look much more attractive—though if you are concerned about worst-case predictability, you will naturally be comparing SelectWC to HeapSort rather than QuickSort, and in that comparison there can still be an advantage.

Based on Select, OrderStatistics can gather any k distinct order statistics from the same array in one call, in expected time proportional to n log k (it would cost you nk to just call Select k times); it exploits the side effect that Select leaves the array partitioned around the selected element. The benefit increases with array size; to get more than a few order statistics from a small array (under 100, say), it is probably faster to just sort the array and pick out the right elements. From the largest possible PostScript arrays, you can gather over 100 order statistics using OrderStatistics in less than the time to QuickSort the array.

OrderStatistics takes the selection procedure to be used as a parameter, so you can have it use SelectWC instead of Select if you want worst-case predictability. It is also possible that you have an array that you know is already sorted, or that is so small you are better off sorting it, but OrderStatistics is still a convenient API for getting the elements you want. In that case you can have it use SelectOrd, a dummy selection that simply assumes the array is already ordered. If it isn't, of course, you get bogus results.

Partition

arr|str pivot cmp Partition post mid pre

partitions the array or string arr|str around the value pivot, which must be a value that occurs in arr|str. On completion, the initial subrange pre contains only values <= pivot, a possibly empty subrange mid contains only values equal to pivot that have been moved to their correctly ordered positions,and the final subrange post contains only values >= pivot, according to the comparison function cmp.

Because values equal to pivot are allowed anywhere in pre and post, it is never necessary for mid to be nonempty. The partitioning process does not necessarily move elements that equal pivot to their final positions, and it does not necessarily recognize all cases when it has. A nonempty mid is produced only when the algorithm can determine without extra work that at least one value equal to pivot has been put in its final place. What is guaranteed is that Partition will never return the entire original arr|str as pre or post; progress is always made. Unlike some implementations, Partition does not require pivot to be the first element, or any particular element, as long as there is at least one equal value present somewhere in arr|str.

QuickSort

arr|str cmp QuickSort arr|str

rearranges the elements of the array or string arr|str in an order satisfying the comparison function cmp. Below a threshold size, QuickSort hands off to InsSort.

Select

arr|str i cmp Select post mid pre

locates the ith order statistic from the array or string arr|str, that is, the value that would be at index i-1 if arr|str were in order by the comparison function cmp. (Order statistics are 1-based by convention, while PostScript array and string indices are 0-based.) It is returned in the singleton subrange mid. As a side effect, all elements in the initial subrange pre and final subrange post are known to be <= and >= that element, respectively, just as if Partition had been called with that element as pivot. FindMin and FindMax are used to optimize the cases where i = 1 or the length of arr|str.

SelectOrd

arr|str i cmp SelectOrd post mid pre

a trivial version of Select for use when arr|str is known to be already in order by the comparison function cmp. Its most likely use is as the select parameter to OrderStatistics when the input is known to be sorted.

OrderStatistics

arr|str wanted cmp sel OrderStatistics wanted

uses the selection procedure sel to gather from the array or string arr|str all of the order statistics requested in wanted. wanted is an array or string in which each element is an integer i between 1 and the length of arr|str. On completion, wanted is left on the stack with each element replaced by the corresponding order statistic from arr|str, that is, the value that would be at index i-1 in arr|str if ordered by the comparison function cmp.

wanted must be in ascending order and free from duplicates (requesting the same order statistic more than once is not only inefficient, but likely to cause a rangecheck). If wanted is a string, the selected elements of arr|str must be integers between 0 and 255, or a rangecheck or typecheck will result.

OrderStatistics recognizes a special case when wanted of length two requests the first and last order statistic, and uses MinMaxCmp instead of sel. In that case, the values are found without being placed in their proper locations as a side effect, as sel would do. OrderStatistics may use this optimization during recursion as well as on the initial call, so it is in general simply not safe to assume that the values requested have arrived at their proper locations in arr|str when OrderStatistics completes.

Example. Gather the “five-number summary” (minimum, quartiles, and maximum) from an array of 15 observations:

PS>/obs [7 49 73 58 30 72 44 78 23 9 40 65 92 42 87] def
PS>obs [1 4 8 12 15] //PolyCmp //Select OrderStatistics ==
[7 30 49 73 92]

Note that, if the number of observations is even, the definition of the five-number summary is trickier, and involves interpolation. For example, if there were 14 observations, you would request the six order statistics [1 4 7 8 12 14] and average the seventh and eighth. So, a procedure to correctly compute arbitrary quantiles or five-number summaries would still be not completely trivial to write; it would have to use the array length to determine which order statistics were needed, use OrderStatistics to gather them, and then interpolate between adjacent ones as needed.

Heap techniques

Not only HeapSort itself, but the building block Heapify and BuildHeap operations are exported, because they can be used in other useful data structures such as priority queues, and interesting algorithms built on them.

HeapSort, though it runs about 2.5 times as long as QuickSort in most cases, has the advantage that its worst-case time bound is the same as the average case. If worst-case predictability matters more than raw speed, HeapSort can be an attractive option. HeapSort hands off to QuickSort below a certain size, where even QuickSort's worst-case time cannot be much worse than what HeapSort would take anyway.

Because PostScript indexes arrays and strings starting with zero, this resource defines the following heap structure:

Root:	index 0
Left child from index i:	i `1 bitshift 1 add`
Right child from index i:	i `1 add 1 bitshift`
Parent from index i:	i `1 sub -1 bitshift`

Those operations are not exported as named procedures, because they are short and any serious application will inline them anyway.

Heapify: arr|str cmp heapsize i Heapify arr|str cmp heapsize
given array or string arr|str, comparison function cmp, and heapsize such that heapsize-1 is the greatest index at which arr|str contains a heap node, and i such that the left and right children from index i are already the roots of heaps, but the value at i may not satisfy the heap property (that is, it may compare less than its left or right child), moves elements so that the heap property is satisfied at index i. The time bound is given by log heapsize.
BuildHeap: arr|str cmp BuildHeap arr|str cmp heapsize
given array or string arr|str and comparison function cmp, arrange elements so the entire array forms a heap rooted at index zero. The length of arr|str is left on the stack as heapsize. The time bound is given by heapsize.
HeapSort: arr|str cmp HeapSort arr|str
rearranges the elements of the array or string arr|str in an order satisfying the comparison function cmp. Below a threshold size, HeapSort hands off to QuickSort.

Utilities

Hist

data cutoffs bins cmp Hist bins

creates a histogram of the array or string data in the array or string bins, which must have length at least k+1 where k is the length of the array or string cutoffs. Each element of bins is incremented by the number of elements of data less than or equal (by the comparison function cmp) to the corresponding element of cutoffs but not less than or equal to any lower element of cutoffs. Values greater than cutoffs_k-1 are counted in bins_k. The cutoffs must be in ascending order.

If bins is an array, any of its elements 0..k that are not integers will be initialized to 0. No other initialization is done, so bins can be used to accumulate counts from multiple data sets. The largest index of any element of bins ever touched is k.

As PostScript always initializes a freshly allocated string to zeros and a freshly allocated array to nulls (which Hist will replace with zeros), either may be used reliably to begin a new count.

Uniq

collapses adjacent entries in arr|str, when they compare equal by the comparison function cmp, to single entries, retaining the first of each group of matching entries, and returns the possibly shortened subarray/substring arr|str'. Uniq will always collapse adjacent matching entries, but of course this is guaranteed to eliminate all duplicates only if arr|str is initially ordered. If a literal array or string arr|str2 is present, it should be at least as long as arr|str, and a possibly shortened subarray arr|str2' the same length as arr|str' will be returned, each element of which is the number of occurrences in the original input of the corresponding value in arr|str'.

Trichotomize

arr|str i j Trichotomize post mid pre

slices the array or string arr|str into three intervals, returning the interval 0..i-1 as pre, i..j-1 as mid, and j..length-1 as post, where length is the length of arr|str; the form of result returned by Partition and Select.

sload

arr|str sload arr|str₀ ... arr|str_n-1 arr|str

does exactly what aload does, but for strings. Will work for arrays too, but aload is faster.

sstore

int₀ ... int_n-1 arr|str sstore arr|str

does exactly what astore does, but for strings, so the items on the stack must be integers between 0 and 255. Will work for arrays too, and then the items can be anything. But astore is faster.

swap

arr|str i j swap –

swaps the values at indices i and j in the array or string arr|str.

$Id: Order.html,v 1.6 2006/10/15 23:46:12 chap Exp $