Source Edit

LgHisto is an application of BISTs to log-spaced histograms that gives efficient, dynamic quantiles. log-spacing gives high dynamic range with low cost while Fenwick/BIST supports dynamic membership w/operation-balanced perf.

Quantile error is absolute { not relative to q*(1-q) like t-Digests } & easily bounded as <~ 1/2 bin width { about 10^(lg(b/a)/n/2) }. If you need 3 places or your data is clustered within a few orders of magnitude then you can likely just use 1e4 bins & counters will remain L1 cache resident, depending on resource competition. Cache is the main cost Re: speed. Re: space, since 99% of bins are 0 in many cases, simple run-length encoding can improve net/disk transfers.

The way Fenwick BISTs work, the generic parameter C must be a wide enough integer type to hold both elemental bin counts AND grand totals. uint32 is likely enough for many applications, though some might sneak by with uint16 and a few might need uint64. This scales bin size/space cost.

t-Digests are a well marketed competitor using ~10X less space BUT with >>5X slower quantiles of similar accuracy. Actual cost is sensitive to operation mixes. { This may change, but present t-Digest impls, even with trees, linear scan for quantile/CDFs. None even offer "batch" APIs to do N quantiles in one such scan. "Histo B-trees" should allow better scaling for such. } A BIST basis differs from t-Digests in other important ways. First, BISTs are well suited for pop or moving data window operations with strict finite memory, for e.g. translation of full streams to moving quantiles as in Bollinger Band-style smooths. Second, floating point weights for EWMA-like decaying memory are not possible since FP arithmetic kind of breaks BISTs.

Imports

bist, lna

Types

LgHisto[C] = object: Log-spaced histogram with Bist[C] backing Source Edit

Procs

proc `$`[C](s: LgHisto[C]; nonZero = true): string: Formatting operator; Warning: output can be large, esp. if nonZero=false Source Edit
func add[F, C](s: var LgHisto[C]; x: F; w = 1.C): Increment bin for value x by weight w Source Edit
func binAB[F, C](s: LgHisto[C]; x: F): (float, float): Range in data space of the bin containing x; Costs 2 fromIxs. Source Edit
func bist[C](s: LgHisto[C]): Bist[C]: Source Edit
func cdf[F, C](s: LgHisto[C]; x: F): C: Raw count; Leave to caller to multiply by 1/s.bist.count; XXX Interpolate? Source Edit
func fromIx[F, C](s: LgHisto[C]; i: int; offset: F = 0.5): F: Geometric mean of left&right edge log-shifted offset fraction into bin Source Edit
func high[C](s: LgHisto[C]): float: Source Edit
func init[C](s: var LgHisto[C]; a = 1e-16; b = 1e+20; n = 8300): Init histo w/2n+1 log-spaced bins: [-∞..-b; -b..-a; 0; a..<b; b..∞]. Source Edit
func initLgHisto[C](a = 1e-16; b = 1e+20; n = 8300): LgHisto[C]: Get Histo w/2n+1 log-spaced bins: [-inf..<-b; -b..<-a; 0; a..<b; b..inf]. Source Edit
func low[C](s: LgHisto[C]): float: Source Edit
func merge[C](dst: var LgHisto[C]; src: LgHisto[C]): Merge counts from src into dst. Source Edit
func nBin[C](s: LgHisto[C]): int: Source Edit
func overflows[C](s: LgHisto[C]): int: Source Edit
func pop[F, C](s: var LgHisto[C]; x: F; w = 1.C): Alias for add with a negative weight argument Source Edit
func quantile[F, C](s: LgHisto[C]; q: F): F: Basic quantile; XXX Can log-spacing savvy interpolation be more accurate? Source Edit
func space[C](s: LgHisto[C]): int: Estimate space taken up by data structure in bytes Source Edit
func toIx[F, C](s: LgHisto[C]; x: F): int: Find bin index for value x; underflows get [0] & overflows get [2*n]. Source Edit
func underflows[C](s: LgHisto[C]): int: Source Edit

Iterators

iterator bins[C](s: LgHisto[C]): (float, float, C): Yield (lo, hi, count) for each bin covered Source Edit

adix/lghisto

Imports

Types

Procs

Iterators