adix/lghisto

Source   Edit  

LgHisto is an application of BISTs to log-spaced histograms that gives efficient, dynamic quantiles. log-spacing gives high dynamic range with low cost while Fenwick/BIST supports dynamic membership w/operation-balanced perf.

Quantile error is absolute { not relative to q*(1-q) like t-Digests } & easily bounded as <~ 1/2 bin width { about 10^(lg(b/a)/n/2) }. If you need 3 places or your data is clustered within a few orders of magnitude then you can likely just use 1e4 bins & counters will remain L1 cache resident, depending on resource competition. Cache is the main cost Re: speed. Re: space, since 99% of bins are 0 in many cases, simple run-length encoding can improve net/disk transfers.

The way Fenwick BISTs work, the generic parameter C must be a wide enough integer type to hold both elemental bin counts AND grand totals. uint32 is likely enough for many applications, though some might sneak by with uint16 and a few might need uint64. This scales bin size/space cost.

t-Digests are a well marketed competitor using ~10X less space BUT with >>5X slower quantiles of similar accuracy. Actual cost is sensitive to operation mixes. { This may change, but present t-Digest impls, even with trees, linear scan for quantile/CDFs. None even offer "batch" APIs to do N quantiles in one such scan. "Histo B-trees" should allow better scaling for such. } A BIST basis differs from t-Digests in other important ways. First, BISTs are well suited for pop or moving data window operations with strict finite memory, for e.g. translation of full streams to moving quantiles as in Bollinger Band-style smooths. Second, floating point weights for EWMA-like decaying memory are not possible since FP arithmetic kind of breaks BISTs.

Types

LgHisto[C] = object
Log-spaced histogram with Bist[C] backing Source   Edit  

Procs

proc `$`[C](s: LgHisto[C]; nonZero = true): string
Formatting operator; Warning: output can be large, esp. if nonZero=false Source   Edit  
func add[F, C](s: var LgHisto[C]; x: F; w = 1.C)
Increment bin for value x by weight w Source   Edit  
func binAB[F, C](s: LgHisto[C]; x: F): (float, float)
Range in data space of the bin containing x; Costs 2 fromIxs. Source   Edit  
func bist[C](s: LgHisto[C]): Bist[C]
Source   Edit  
func cdf[F, C](s: LgHisto[C]; x: F): C
Raw count; Leave to caller to multiply by 1/s.bist.count; XXX Interpolate? Source   Edit  
func fromIx[F, C](s: LgHisto[C]; i: int; offset: F = 0.5): F
Geometric mean of left&right edge log-shifted offset fraction into bin Source   Edit  
func high[C](s: LgHisto[C]): float
Source   Edit  
func init[C](s: var LgHisto[C]; a = 1e-16; b = 1e+20; n = 8300)
Init histo w/2n+1 log-spaced bins: [-∞..-b; -b..-a; 0; a..<b; b..∞]. Source   Edit  
func initLgHisto[C](a = 1e-16; b = 1e+20; n = 8300): LgHisto[C]
Get Histo w/2n+1 log-spaced bins: [-inf..<-b; -b..<-a; 0; a..<b; b..inf]. Source   Edit  
func low[C](s: LgHisto[C]): float
Source   Edit  
func merge[C](dst: var LgHisto[C]; src: LgHisto[C])
Merge counts from src into dst. Source   Edit  
func nBin[C](s: LgHisto[C]): int
Source   Edit  
func overflows[C](s: LgHisto[C]): int
Source   Edit  
func pop[F, C](s: var LgHisto[C]; x: F; w = 1.C)
Alias for add with a negative weight argument Source   Edit  
func quantile[F, C](s: LgHisto[C]; q: F): F
Basic quantile; XXX Can log-spacing savvy interpolation be more accurate? Source   Edit  
func space[C](s: LgHisto[C]): int
Estimate space taken up by data structure in bytes Source   Edit  
func toIx[F, C](s: LgHisto[C]; x: F): int
Find bin index for value x; underflows get [0] & overflows get [2*n]. Source   Edit  
func underflows[C](s: LgHisto[C]): int
Source   Edit  

Iterators

iterator bins[C](s: LgHisto[C]): (float, float, C)
Yield (lo, hi, count) for each bin covered Source   Edit