adix/uniqce

Source   Edit  

The min-count sketch (NOT count-min) idea is to see hash(x) as a U(0,1) & use P(sampleMax<x)=x^n for sample size n. Low art inverts a confidence.ival for h.max to estimate n. Tracking k-most distinct h gives better accuracy and is usually called a KMV sketch. (Intuition is that k-th edge val => average gap between k-1 uniques&averaging cuts noise.) See Bar-Yossef 2002 "Counting Distinct..", Giroire05 "Order statistics & estimating cardinalities" & Ting14 "Streamed approximate counting..".

Types

UniqCe[F] = object
Source   Edit  

Procs

proc initUniqCe[F: SomeFloat](k = 1024): UniqCe[F]
Return initialized UniqCe with tail size k. k=1024 costs 4K|1VM page Source   Edit  
proc jaccard[F: SomeFloat](a: UniqCe[F]; b: UniqCe[F]): float32
Exact Jaccard coefficient (|intersect|/|union|) of the two sample tails. NOTE: jaccard(a,b) * union(a, b).nUnique estimates nUnique(a ^ b). Source   Edit  
proc nUnique[F: SomeFloat](uc: UniqCe[F]): float32
Estimate number of unique elements seen so far. Source   Edit  
proc nUniqueErr[F: SomeFloat](uc: UniqCe[F]): float32
Estimated error on estimate of unique elements seen so far. Source   Edit  
proc push[F: SomeFloat](uc: var UniqCe[F]; h: F)
Incorporate visitation of an element. NOTE: std/hashes.hash(string) OUTPUT ONLY FILLS 32-BITS => push takes x on [0,1] instead of a key|hash() val. Source   Edit  
proc union[F: SomeFloat](a: UniqCe[F]; b: UniqCe[F]): UniqCe[F]
Return merge/union of a & b. Source   Edit  
proc union[F: SomeFloat](a: var UniqCe[F]; b: UniqCe[F])
Push all hashes from b onto a effecting an estimated set union. Source   Edit