adix/nsort


This numeric sort module encapsulates sorting by native numeric keys embedded at some offset inside Nim objects of any size (well, <= 256 bytes for char-keyed obs, but you should "tag sort" objects > 64B anyway). This kind of interface allows low-overhead generality and enables algorithms specialized to number types. Such algorithms are often many times faster than comparison sorts. The algorithm is roughly counting sort for 1-byte keys and, for 2|4|8-byte native type keys, an LSD radix sort with optional transforms from signed|float domains. This implementation has several sadly rare optimizations.

FIRST, total order and order by digit0 is checked in the first histogramming pass to decide if remaining work can be skipped. Hence, if run on an already sorted array, only one read-only pass over the data is done to confirm order. Similarly, items already sorted by digit0 induce no re-ordering write phase. Reverse order is also detected. These non-branching integer comparisons add little work and potentially save a lot.
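The first pass above can be sketched as follows. This is a minimal illustration, not the module's actual code; `firstPass` and its `uint32`-keyed input are hypothetical, and the real implementation uses non-branching integer comparisons where plain boolean `and` appears here.

```nim
proc firstPass(keys: openArray[uint32]):
    tuple[hist: array[256, int], sorted, sortedByD0, reversed: bool] =
  ## One read-only pass: build the digit-0 histogram while tracking whether
  ## the data is already totally sorted, sorted by digit 0, or reverse sorted.
  result.sorted = true; result.sortedByD0 = true; result.reversed = true
  for i in 0 ..< keys.len:
    let d0 = int(keys[i] and 0xFF'u32)
    inc result.hist[d0]
    if i > 0:
      result.sorted     = result.sorted     and keys[i-1] <= keys[i]
      result.sortedByD0 = result.sortedByD0 and
                          int(keys[i-1] and 0xFF'u32) <= d0
      result.reversed   = result.reversed   and keys[i-1] >= keys[i]
```

If `sorted` comes back true, all remaining passes are skipped; if only `sortedByD0` is true, the first re-ordering write phase is skipped.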

SECOND, only 2 counter buffers are ever needed at once - the current pass's write pointers & the next pass's read histogram. Using more wastes limited L1/L2 space and/or shrinks the max histogram size (increasing the number of passes). The buffers for these two "alternate" histograms just toggle across passes. (A several-histogram-at-once counting pass can also achieve no excess re-fetches, but at higher cache usage. Cache use is the same for 2B keys, but only high whole-byte constancy can yield skipped 2nd passes. The best 2B method depends on the keys.)
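A toy LSD radix sort showing the two-buffer toggle might look like this. It is a sketch under simplifying assumptions (fixed 8-bit digits on `uint32` keys, no order-detection or digit-plan logic); `radix2Buf` is a hypothetical name, not the module's API.

```nim
proc radix2Buf(a: var seq[uint32]) =
  ## 4 passes of 8-bit digits; the two histogram buffers toggle by pass parity.
  ## An even number of src/dst swaps leaves the data back in `a`.
  var tmp = newSeq[uint32](a.len)
  var cnt: array[2, array[256, int]]        # the only two counter buffers
  for x in a: inc cnt[0][int(x and 0xFF)]   # digit-0 histogram up front
  var src = addr a
  var dst = addr tmp
  for p in 0 .. 3:
    var ofs: array[256, int]                # prefix sum -> write offsets
    var s = 0
    for d in 0 .. 255: ofs[d] = s; s += cnt[p mod 2][d]
    cnt[p mod 2] = default(array[256, int]) # recycle this buffer for pass p+2
    for x in src[]:
      if p < 3:                             # histogram next digit while scattering
        inc cnt[(p+1) mod 2][int((x shr (8*(p+1))) and 0xFF)]
      let d = int((x shr (8*p)) and 0xFF)
      dst[][ofs[d]] = x
      inc ofs[d]
    swap src, dst
```

Because pass p's scatter loop also histograms digit p+1, each element is fetched only once per pass, yet at no point do more than two 256-counter buffers occupy cache.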

THIRD, histogram details are optimized. For moderate n, prefix-summing histograms into cumulative distribution functions (really, output bin offsets) is a dominant cost. So, this impl does a SIMD parallel prefix sum. This optimization is more effective as more counters fit into vector registers. So, this impl uses the smallest [1248]-byte counter needed for n items & takes care to align the two power-of-2-sized counter buffers to maximize vector use. Last, a time-cost estimate formula & Newton's method are used to decide pass sizes.
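Scalar sketches of the two ideas (the real module vectorizes the scan; `counterBytes` and `toOffsets` are illustrative names, not this module's API):

```nim
proc counterBytes(n: int): int =
  ## Narrowest per-bin counter width that can hold counts up to n;
  ## narrower counters mean more counters per vector register.
  if n < 1 shl 8: 1 elif n < 1 shl 16: 2 elif n <= 0xFFFFFFFF: 4 else: 8

proc toOffsets[N: static int](hist: var array[N, uint16]) =
  ## Exclusive prefix sum in place: hist[d] becomes the output offset
  ## where bin d starts writing (the scan the impl does with SIMD).
  var s = 0'u16
  for d in 0 ..< N:
    let c = hist[d]
    hist[d] = s
    s += c
```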

FOURTH, bits that vary across keys are measured (mask or= (x[i-1] xor x[i])) in the first read pass. Key-wise constant bits are zero in said mask. Since an array MUST be sorted by its constant key bits, smart digit plans can skip them to maybe shrink the pass count. pext32/pext64 from Intel's Advanced Bit Manipulation subset (Haswell 2014) are a cheap way to use the above mask. This optimization does not help bitwise non-constant data (other than the ease of digit formation afforded by pext). However, this impl is structured so there is almost no added cost to access the potentially large savings. Work on digit plans is ongoing. This optimization is novel to my knowledge. It seems too small for an academic paper (maybe optimal digit plans would add enough to discuss) but is likely to have been mentioned offhand already. I'd cite a reference if I could; please cite me if you don't have one. How well this does depends strongly upon the bit entropy of keys. E.g., time(2) outputs, over-allocated 8B ints, 64-bit VM addrs & positive floats spanning 1..2 octaves may be constant above 12..24 LSbits. Some samples may even get sorted in just one pass! I'd bet much real-world f32 data is 2-pass with b0=12.
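The mask measurement, and a portable stand-in for the pext instruction, can be sketched like so. `varyingMask` and `pextSoft` are hypothetical helper names; on BMI2 hardware the loop in `pextSoft` would be a single `pext` instruction.

```nim
proc varyingMask(keys: openArray[uint64]): uint64 =
  ## OR of adjacent XORs: a bit is set iff it varies across the keys.
  for i in 1 ..< keys.len: result = result or (keys[i-1] xor keys[i])

proc pextSoft(x, mask: uint64): uint64 =
  ## Software fallback for BMI2 pext: pack x's bits selected by mask
  ## down into the low bits, skipping key-wise constant bits.
  var m = mask
  var bit = 0
  while m != 0:
    let lsb = m and (0'u64 - m)     # isolate lowest set mask bit
    if (x and lsb) != 0: result = result or (1'u64 shl bit)
    inc bit
    m = m xor lsb
```

Digits are then formed from `pextSoft(key, varyingMask(keys))`, so passes only ever consume bits that can actually re-order the array.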

One undone optimization is using multiple threads for the counting phases. This can boost L3->L1 read throughput by 2-9x (but since scattered writes are much slower than streaming reads, the overall speed-up is likely limited). Another maybe-useful optimization is saving transformed keys in the tmp arrays (but this also needs inverse transforms on the final pass & the current xforms are already cheap compared to L2-or-worse bandwidth costs; so, a 5-20% boost, I'd guess).

Types

XForm = enum
  xfNone, xfRev, xfSgn, xfSgnRev, xfFlt, xfFltRev ## Transforms

Vars

verb = 0
verbosity level

Procs

proc nsort(obs, tmp: pointer; n, sz: int; off: uint8; xf: XForm; b0 = 0): pointer {.
    ...raises: [], tags: [], forbids: [].}
Specialization of power-user interface for uint8-width keys.
proc nsort[O, W](obs: var openArray[O]; off: W; xf = xfNone; b0 = 0)
Radix sort obs by W-width key at in-object offset off, transforming keys according to xf. b0 is the number of bits for the first pass histogram. 0 means use a good default. Uses temporary space equal to the size of obs. Does tiled merge sort for big data.
proc nsort[W: Ui248](obs, tmp: pointer; n, sz: int; off: W; xf: XForm; b0 = 0): pointer
A power-user interface that allows skipping a final copy back to obs. It uses n to decide what sort algo to use. Returns -1 if reversed.

Templates

template nsortBy(x, field: untyped; b0 = 0): untyped

Convenience template around nsort proc to reduce typing. b0 is the number of bits for the first pass histogram. 0 means use a good default.

You can only nsort by one numeric field at a time, but sorts are stable. Do x.nsortBy foo; x.nsortBy bar to do a multi-level sort.

import nsort
var recs = @[(f: 3.0, age: 50), (f: 6.0, age: 30), (f: 5.0, age: 30)]
recs.nsortBy f         # Multi-level sort by `age` then `f`.  Invoke
recs.nsortBy age       # ..in REVERSE order of usual comparison-order.
for r in recs: echo r  # Right output: @[(5.0,30), (6.0,30), (3.0,50)]
template nsortByRaw(x, field: untyped; b0 = 0): untyped
Used by nsortBy for small objects (or directly by users who know what they are doing).
template nsortByTag(x, field: untyped; b0 = 0): untyped
So-called tag sort used by nsortBy for large objects (or directly by users who know what they are doing).