cligen/abbrev

This module is about building shell-* wildcard abbreviations for sets of strings. The simplest variant is when the user gives a specific maximum string length and at least one component of the head,tail slices. In that case, abbrev just wraps Nim string slicing and the set of abbreviations may or may not uniquely cover the input string set.

When users leave certain aspects automatic/undefined, various algorithms can be deployed to find optimized answers within various constraints. The likely most important constraint is uniqueness. Can one copy-paste strings into shell REPLs and expect each to expand to exactly 1 item? The same concept applies to non-filename strings, but there is less help from shells to auto-expand. Eg., 10-40 user names on a lightly configured single-user Unix system might expand using "a directory in your brain" (or you can write a small expander script to spot-check anything non-obvious in context).

The simplest automatic setting is a fixed wildcard position protocol such as a specific head or mid-point with binary search quickly finding a smallest max limit which makes abbreviations unique. Visually this results in columns of abbreviations where the wildcard *s line up vertically which seems easier to read and is most similar to an ellipsis (...) separator the NeXT cube file browser used.

As a next level of terminal-space optimization, we can ignore user head,tail specs, instead finding the location that minimizes each width of the string set. This code does that by just trying all possible locations for '*'. This starts to be a lot of brute force work and slow for a computer but remains fast for a human (eg. 400 ms on a directory of 12,000 entries).

The next level of optimization/data compression is to allow the location of the wildcard to vary from string to string. After that, allowing >1 '*' can continue to shorten strings. Each optimization level removes more context making strings harder to read & gets slower to compute. Meanwhile, tabular contexts often decide (after initial lengths are known) padding space to align text visually. That space can be repurposed to partially expand patterns which then eases reading without loss of terminal rows which here we call expandFit.

This algo research area seems neglected, the closest I could find being Minimal Distinguishing Subsequence Patterns by Ji, Bailey & Dong in 2007 in a bioinformatics context.

Types

Abbrev = object
  mx*, hd, tl, sLen, m, h, t: int

Consts

parseAbbrevHelp = """a\*|M,head(M/2),tail(M-hdSep),sep(\*),?chars
a:bestPos -2:pfx -3:sfx -4:mfx -5:1\* -6:2\*
POSITIVE_NUMBER=thatWidth/head/tail"""

Procs

proc abbrev(a: Abbrev; str: string): string {.inline, ...raises: [KeyError],
    tags: [], forbids: [].}
Abbreviate str as str[0..<hd], sep, str[^tl..^1] only if str.len > mx.
proc expandFit(a: var Abbrev; strs: var seq[string];
               ab0, ab1, wids, colWs: var seq[int]; w, jP, m, nr, nc: int) {.
    ...raises: [KeyError], tags: [], forbids: [].}
Expand any a.sep in strs[m*i + jP][ab0[i] ..< ab1[i]], updating colWs[m*(i div nr) + jP] until all seps gone or colWs.sum==w. I.e. colWs include gap to right. Overall table structure is preserved. Early a.sep instances are fully expanded before later instances change.
proc isAbstract(a: Abbrev): bool {.inline, ...raises: [], tags: [], forbids: [].}
proc parseAbbrev(s: string): Abbrev {....raises: [ValueError], tags: [],
                                      forbids: [].}
Parse comma-separated abbreviation spec [mx][,[hd][,[tl][,[sep]]]] s into Abbrev abb. Non-numeric mx => -1 => caller must set mx and call update. Non-numeric|missing hd => mx-sep.len-tl Non-numeric or missing tl => mx-sep.len-hd. Non-num|missing both => hd=(mx-sep.len)/2; tl=mx-sep.len-hd (which gives tl 1 more for odd mx-sep.len). mx < 0 => various locally optimized uniqueAbbrevs.
proc realize(a: var Abbrev; strs: openArray[string]; jobs = 1; jobsN = 150) {.
    ...raises: [ResourceExhaustedError, KeyError], tags: [], forbids: [].}
Semi-efficiently find the smallest max such that strs can be uniquely abbreviated by abbrev(s, mx, hd, tl) for all s in strs.
proc realize[T](a: var Abbrev; tab: Table[T, string]; jobs = 1; jobsN = 150)
Find smallest max s.t. abbrev unique over values of tab.
proc uniqueAbbrevs(a: var Abbrev; strs: openArray[string]; jobs = 1; jobsN = 150): seq[
    string] {....raises: [ResourceExhaustedError], tags: [], forbids: [].}
Return narrowest unique abbrevation set for strs given some number of wildcards (sep, probably *), where both location and number of wildcards can vary from string to string.
proc update(a: var Abbrev) {.inline, ...raises: [], tags: [], forbids: [].}
Call this after any change to a.mx to update derived quantities.