adix/btree

Source   Edit  

I know of no other as-simple/general FOSS B-Tree (in any prog.langs|books). Theory people recurse; DB code clutters w/sync; Other APIs are less general. Correct me if I am wrong/cite this if it be your starting point. Little effort is made to explain these algos in comments as it's impractical to cram a course into this file with no figures. More details & resources are at https://en.wikipedia.org/wiki/B-tree or in Graefe & Kuno 2011.

This module defines a template that defines procs on a pretty general B-Tree. The tree can be positional-only, keyed-only or keyed-ranked, be either set of keyed rows or (key,value)-style, have its nodes allocated in any way (via abstract ptrs & deref, eg. on disk via memfiles, via node pools hanging off another object, or GC'd refs), & manage dup keys either inlined into the structure or handled externally (within the same rank space!). This is >36 (3rk*2kv*3alloc*2du) styles from 1 instantiation harness. The template has many parameters to control all these choices. All but Ob are defaulted:

  • K: Key type; Caller provides getKey(Ob):K; void if positional-only
  • Pos: type for node weight/position augmentation; void if keyed-only
  • ObSize: type for dup chain size if handled externally; void if embedded
  • nodeSize: size in bytes of a node. B-Tree Order m is inferred from this. Our notation is m..2m links split by m-1..2m-1 obs; 2 => a 234-tree; 2*m**(h-1)-1 <= numNodes <= (2m)**h-1 where h=height of tree.
  • Ix: type for object/link indices into node arrays
  • Ln: type for the link array in nodes

Common notation/abbreviations in this code/APiS:

  • s|side: 0|1 bool side of an op. Eg. successor = path.seekAdj(true).
  • Ob|ob: abstract object type being stored or an instance/array of it.
  • Ln|ln: abstract link type being stored or an instance/array of it.
  • allocProcs (btPool|btRef|btFile): allocation style for outer Nodes
  • nodeProcs (btLinearNode|btNestedNode): in-Node organization style

Notable implementation choices:

There is no "Tree" type distinct from the "SubTree/Node" type. Once made a root node never moves. That root address is the only handle needed. "nil" ptrs (in whatever allocation arena is used) are just empty trees. Because linear ordering always has exactly 2 sides, parameterization into s|side often keeps life simple/organized (cf. Mehlhorn DS&Algos|common sense).

Routines are all non-recursive. Instead a Path object is central to the API & we clearly separate cursor manipulation from mutation. This also makes the 3 main styles (ranked-only, keyed-only, keyed-ranked) symmetric & removes recursion overhead (big for tall trees/small m). Each instance can be a multiset/table with "sided" edits (stack/queue) of duplicate key series.

Victim replacement selection in internal deletes biases toward uniform node occupancy rather than minimum node count. The bulk loader to build a minimum height tree from pre-ordered inputs also allows leaving 1 (generalizable to x?) spare slot in each node to speed early inserts later. A property check routine is provided for would-be extenders. There is presently no provision for for concurrent access as the focus is just a good single-threaded tree.

Limitations/future work:

One limitation is that leaf&internal nodes have the same size&representation, wasting ~4..8*m B/leaf. This is <33% waste for Ob >=8..16 bytes. (Post aging, occupancy =~69% anyway.) This cost is spent to avoid complexity of either two node allocation pools with dynamic conversion or different-m orders for leaf & internal nodes. Max data density means wider 4..7 node split-merges (not 2..3) & specializing on key types, anyway; Eg. for string keys not duplicating prefixes in a leaf between ("aab1", "aab2").

This is a work in progress. Algorithmically, we should (maybe) do

  1. Node Structure (eg. RB tree w/1B ptr); moveMem has good const factors, but even DRAM has ideal node sz ~ 70ns*40GB/s = 2800B/(2+4)=~466
  2. Whole tree catenate (helpful for seq style &)
  3. More complex whole tree split
  4. Optimizations for giant dup blocks

Nim TODOs incl: make easy to include inside other generic types, add easy HashSet & Table use in terms of this lower-level core, run-tm ->CT errs, do GC'd&memfiles Ln variants, do distinct int Ln for ovrld/figure out exports.

Procs

proc orderFit(target, wtSz, ixSz, obSz, lnSz: int): int {....raises: [], tags: [],
    forbids: [].}
Source   Edit  

Templates

template btLinearNode(Ob, K, Pos: type; M: untyped; Ix, Ln: type)
Node ops Source   Edit  
template btPool(Ob, Pos: type; nodeSize: int; Ix, Ln: type)
A simple pool allocator. Easily adapted to memfiles | GC'd allocators. Source   Edit  
template defBTree(Ob: type; K: type = void; Pos: type = void;
                  ObSize: type = void; nodeSize: int = 16; Ix: type = int16;
                  Ln: type = uint16; allocProcs = btPool;
                  nodeProcs = btLinearNode)
Source   Edit