adix/lmbist

Source   Edit  

Bist[T] (& clients lghisto, mvstat) already support quantiles over moving data windows. Sometimes one wants recent values to carry more weight in summary statistics, as in a Linearly Weighted Moving Average. In the context of a distribution, one weights by replication - more copies of more recent points vs. earlier & earlier. While one can do this by literal data point repetition, that expands both space & time costs. So, one should prefer virtual replication

  • a histogram putting weight into bins with the same time structure.

A naive implementation decays weight for each point in the window as each point leaves and adds a new point with weight w. This is O(nBin) - no faster than a full rebuild. This CAN be done with no loops longer than lg(nBin), though. The key insight is to "count up forever" adding in new points with weight t+1, but subtract a virtual zero level. If no actual duplicates exist, this 0-level is simply t-w. Each duplicate @various lags (common w/binning, e.g. lghisto) gets 1 "copy" of this virtual 0-level - which "stack up" in a bin. This can be handled with a second Bist tracking only membership. The actual distro is then the linear combination cnt[] - zero*nLag[].

We thus get LMBist[T] which bundles up this dual Bist[T] idea. The API is the same as Bist[T] & EMBist[T]. The bottom of this module has a small test/timing prog showing the differences.

Happy to cite someone, but as far as I can tell, this is a completely novel application of Fenwick BISTs for a fast Linear weight Moving Median filter transform. I certainly came up with it on my own. The linearly weighted moving median itself while mostly obvious from its term and the more famous LWMA is all but unheard of in academic literature regardless of implementation efficiency. Khodakarami, et. al 2019 about Parkinson's is literally THE ONLY match on scholar.google.com as of this writing. Cohen & Strauss were likely somehow aware of the idea, but unaware of the "LWM Average" term when they decided to name this "chordal weighting" in their Cohen,Strauss2003 SIGMOD paper. Due to its simplicity (see adix/embist) & especially terse average form, exponential time kernels SO dominate that other things are usually unattended. See, e.g., Akinshin 2023 "Weighted Quantile Estimators" which does not consider linear time kernels (but is otherwise a very interesting paper). It is true that one needs to keep just one window of data points in a std/deque|ring buffer to expire items, but the same is true of exponential weighting with strict windows, needed for good history/"time breakdown point" behavior in either average or quantile settings. It is only fairly extreme scenarios where just one window exceeds CPU L3 cache, let alone DRAM, though. Please cite this github repo if this code inspires your work.

Types

LMBist[T] = object
Source   Edit  

Procs

proc cdf[T](d: LMBist[T]; i: int): T
Source   Edit  
proc clear[T](d: var LMBist[T])
Source   Edit  
proc count[T](d: LMBist[T]): T
Source   Edit  
proc dec[T](d: var LMBist[T]; i: int; w: T)
Source   Edit  
proc inc[T](d: var LMBist[T]; i: int; w: T)
Source   Edit  
proc init[T](d: var LMBist[T]; len: int)
Source   Edit  
proc initLMBist[T](len: int): LMBist[T]
Source   Edit  
proc invCDF[T](d: LMBist[T]; s: T; s0, s1: var T): int
Source   Edit  
proc invCDF[T](d: LMBist[T]; s: T; s0: var T): int
Source   Edit  
proc len[T](d: LMBist[T]): int
Source   Edit  
proc max[T](d: LMBist[T]): int
Simple wrapper of d.nLag.max. Source   Edit  
proc min[T](d: LMBist[T]): int
Simple wrapper of d.nLag.min. Source   Edit  
proc nCDF[T](d: LMBist[T]): seq[float32]
Source   Edit  
proc nPDF[T](d: LMBist[T]): seq[float32]
Source   Edit  
proc pmf[T](d: LMBist[T]; i: int): T
Source   Edit  
proc quantile[T](d: LMBist[T]; q: float): float
Source   Edit  
proc quantile[T](d: LMBist[T]; q: float; iL, iH: var int): float
Source   Edit  
func space[T](d: LMBist[T]): int
Source   Edit  
proc tot[T](d: LMBist[T]): T
Source   Edit  
proc up[T](d: var LMBist[T])
Simple no-op for LMBist Source   Edit