High-performance FZF-style subsequence matching implementation
Algorithm Overview
This implements a single-pass subsequence matcher optimized for interactive completion. Characters from the pattern must appear in the candidate string in the same order, but not necessarily consecutively.
Example Matches
Pattern "mlnd" matches:
- "mailinglists/neomutt-dev" (high score: matches at word boundaries)
- "mailing_list_node" (medium score: some boundaries)
- "my_long_nested_dir" (lower score: scattered matches)
Pattern "inb" matches:
- "INBOX" (highest: start of string + consecutive)
- "Archive/INBOX" (medium: not at root)
- "mailbox/inbox-archive" (lower: gaps between matches)
UTF-8 Support
The matcher is UTF-8 aware but uses ASCII-only case folding:
- Byte-wise matching: Treats strings as sequences of bytes
- ASCII case folding: Only A-Z are folded to a-z
- UTF-8 preservation: Multi-byte UTF-8 sequences are never split
- ASCII boundaries: Only ASCII separators (/.-_) get boundary bonuses
This approach provides:
- ✓ Fast performance (no Unicode decoding overhead)
- ✓ Correct UTF-8 handling (multi-byte sequences preserved)
- ✓ ASCII case-insensitive matching (works for English text)
- ✓ UTF-8 matching works (as byte sequences)
Examples:
- "inbox" matches "INBOX" (ASCII folding)
- "café" matches "Café" (exact bytes, no case folding on é)
- "mail" matches "郵件/mail/box" (ASCII substring matches)
- "郵件" matches "郵件/mail" (exact byte sequence)
Note: Non-ASCII characters are matched case-sensitively as byte sequences. This is intentional for performance and simplicity.
Scoring Rules
The algorithm assigns scores based on multiple factors:
Base Score
- Each matched character: +10 points
Bonuses
- Start of string: +30 points (e.g., "I" in "INBOX")
- After separator ('/', '.', '-', '_'): +15 points
- Prefix match (when prefer_prefix=true): +40 points
- Consecutive matches: +15 points per consecutive char (e.g., "box" all together in "mailbox")
- CamelCase boundary: +10 points (e.g., "M" in "MyMailbox") (Only detects ASCII A-Z boundaries)
Penalties
- Gaps between matches: -2 points per character gap (encourages compact matches)
- Total span: -1 point per character in match span (first to last matched char)
- String length: -length/4 points (slightly favors shorter candidates)
Smart Case Matching
By default, matching is case-insensitive (ASCII only). With smart_case enabled:
- All-lowercase pattern → case-insensitive (ASCII)
- Pattern with uppercase → case-sensitive
Note: Smart case only examines ASCII characters (A-Z).
Performance Characteristics
- Time complexity: O(n) where n = length of candidate string
- Space complexity: O(m) where m = length of pattern (stack only)
- No heap allocation: Uses fixed-size stack array for match positions
- No backtracking: Single forward pass through candidate
- No recursion: Purely iterative
- No UTF-8 decoding: Byte-wise comparison for maximum speed
This makes it suitable for interactive use even with thousands of candidates.
Why These Rules?
The scoring model is optimized for hierarchical paths and structured strings:
- Boundary bonuses help match path components and words
- Consecutive bonuses reward compact substring matches
- Gap penalties discourage scattered character matches
- Span penalties favor matches clustered together
- Length penalties prevent long strings from dominating
This produces intuitive rankings for mailbox paths, commands, and names.