cherry-studio/docs/en/references/fuzzy-search.md
beyondkmp bc9eeb9f30
feat: add fuzzy search for file list with relevance scoring (#12131)
2025-12-30 19:42:56 +08:00


Fuzzy Search for File List

This document describes the fuzzy search implementation for file listing in Cherry Studio.

Overview

The fuzzy search feature lets users find files by typing partial or approximate file names or paths. Files are first filtered with a two-tier strategy (ripgrep glob pre-filtering, with a greedy substring fallback), and the surviving files are then ranked by a subsequence-based relevance score. Most queries are therefore filtered natively by ripgrep, while loosely typed queries can still find a match through the fallback.

Features

  • Ripgrep Glob Pre-filtering: Primary filtering using glob patterns for fast native-level filtering
  • Greedy Substring Matching: Fallback file filtering strategy when ripgrep glob pre-filtering returns no results
  • Subsequence-based Segment Scoring: During scoring, path segments gain additional weight when query characters appear in order
  • Relevance Scoring: Results are sorted by a relevance score derived from multiple factors

Matching Strategies

1. Ripgrep Glob Pre-filtering (Primary)

The query is converted to a glob pattern for ripgrep to do initial filtering:

Query: "updater"
Glob:  "*u*p*d*a*t*e*r*"

This leverages ripgrep's native performance for the initial file filtering.
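The conversion shown above can be sketched as follows. This is a minimal illustration, not the actual `queryToGlobPattern()` from FileStorage.ts: the function name `sketchQueryToGlob` and the exact set of escaped characters are assumptions made for the example.

```typescript
// Hypothetical sketch; the real queryToGlobPattern() may differ.
// Characters assumed to carry special meaning in a glob pattern.
const GLOB_SPECIAL_CHARS = '*?[]{}!\\'

function sketchQueryToGlob(query: string): string {
  const parts = query.split('').map((ch) =>
    // Escape glob metacharacters so they are matched literally.
    GLOB_SPECIAL_CHARS.indexOf(ch) !== -1 ? `\\${ch}` : ch
  )
  // Surround every character with wildcards: "abc" -> "*a*b*c*".
  return parts.length === 0 ? '*' : `*${parts.join('*')}*`
}

console.log(sketchQueryToGlob('updater')) // *u*p*d*a*t*e*r*
```

Because every character is separated by `*`, the pattern accepts any path that contains the query characters in order, which is the same condition a subsequence match checks.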

2. Greedy Substring Matching (Fallback)

When the glob pre-filter returns no results, the system falls back to greedy substring matching. This allows more flexible matching:

Query: "updatercontroller"
File:  "packages/update/src/node/updateController.ts"

Matching process:
1. Find "update" (longest match from start)
2. Remaining "rcontroller" → find "r" then "controller"
3. All parts matched → Success
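The matching process above can be sketched roughly as below. This is not the actual `isGreedySubstringMatch()` code. It assumes that matching is case-insensitive and that each fragment may appear anywhere in the target; the name `sketchGreedyFragments` is hypothetical. The example runs against the file name only.

```typescript
// Hypothetical sketch; not the real isGreedySubstringMatch().
// Assumptions: case-insensitive matching, and fragments may occur
// anywhere in the target (their order in the target is not enforced).
function sketchGreedyFragments(query: string, target: string): string[] | null {
  const q = query.toLowerCase()
  const t = target.toLowerCase()
  const fragments: string[] = []
  let pos = 0
  while (pos < q.length) {
    let matched = ''
    // Try the longest remaining piece of the query first, then shrink it.
    for (let end = q.length; end > pos; end--) {
      const candidate = q.slice(pos, end)
      if (t.indexOf(candidate) !== -1) {
        matched = candidate
        break
      }
    }
    if (matched === '') return null // not even one character could be found
    fragments.push(matched)
    pos += matched.length
  }
  return fragments
}

console.log(sketchGreedyFragments('updatercontroller', 'updateController.ts'))
// [ 'update', 'r', 'controller' ]
```

Returning the fragments rather than a boolean makes it easy to see why a match succeeded, and a ranking step could prefer results with fewer fragments.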

Scoring Algorithm

Results are ranked by a relevance score based on named constants defined in FileStorage.ts:

| Constant | Value | Description |
| --- | --- | --- |
| SCORE_FILENAME_STARTS | 100 | Filename starts with query (highest priority) |
| SCORE_FILENAME_CONTAINS | 80 | Filename contains exact query substring |
| SCORE_SEGMENT_MATCH | 60 | Per path segment that matches query |
| SCORE_WORD_BOUNDARY | 20 | Query matches start of a word |
| SCORE_CONSECUTIVE_CHAR | 15 | Per consecutive character match |
| PATH_LENGTH_PENALTY_FACTOR | 4 | Logarithmic penalty for longer paths |

Scoring Strategy

The scoring prioritizes:

  1. Filename matches (highest): Files where the query appears in the filename are most relevant
  2. Path segment matches: Multiple matching segments indicate stronger relevance
  3. Word boundaries: Matching at word starts (e.g., "upd" matching "update") is preferred
  4. Consecutive matches: Longer consecutive character sequences score higher
  5. Path length: Shorter paths are preferred (logarithmic penalty prevents long paths from dominating)
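To make the priorities above concrete, here is a simplified sketch of how the constants could combine. It is not the actual `getFuzzyMatchScore()`. The helper names, the use of the longest contiguous run for the consecutive bonus, and the `Math.log(length + 1)` form of the penalty are all assumptions.

```typescript
// Constant values as documented for FileStorage.ts.
const SCORE_FILENAME_STARTS = 100
const SCORE_FILENAME_CONTAINS = 80
const SCORE_SEGMENT_MATCH = 60
const SCORE_WORD_BOUNDARY = 20
const SCORE_CONSECUTIVE_CHAR = 15
const PATH_LENGTH_PENALTY_FACTOR = 4

// True when every character of `query` appears in `text`, in order.
function isSubsequenceOf(query: string, text: string): boolean {
  let qi = 0
  for (let ti = 0; ti < text.length && qi < query.length; ti++) {
    if (text[ti] === query[qi]) qi++
  }
  return qi === query.length
}

// Length of the longest piece of `query` that occurs contiguously in `text`.
function longestConsecutiveRun(query: string, text: string): number {
  for (let len = query.length; len > 0; len--) {
    for (let start = 0; start + len <= query.length; start++) {
      if (text.indexOf(query.slice(start, start + len)) !== -1) return len
    }
  }
  return 0
}

// Hypothetical scoring sketch; the real formula may weigh factors differently.
function sketchScore(filePath: string, query: string): number {
  const q = query.toLowerCase()
  if (q === '') return 0
  const lowerPath = filePath.toLowerCase()
  const segments = lowerPath.split('/').filter((s) => s.length > 0)
  const fileName = segments.length > 0 ? segments[segments.length - 1] : ''
  let score = 0

  // Filename bonuses are mutually exclusive (no double counting).
  if (fileName.indexOf(q) === 0) score += SCORE_FILENAME_STARTS
  else if (fileName.indexOf(q) !== -1) score += SCORE_FILENAME_CONTAINS

  // One bonus per path segment that contains the query as a subsequence.
  for (const seg of segments) {
    if (isSubsequenceOf(q, seg)) score += SCORE_SEGMENT_MATCH
  }

  // Word-boundary bonus, granted at most once. Words are split on
  // non-alphanumeric characters and before uppercase letters (camelCase).
  const words = filePath.split(/[^A-Za-z0-9]+|(?=[A-Z])/)
  if (words.some((w) => w.toLowerCase().indexOf(q) === 0)) score += SCORE_WORD_BOUNDARY

  // Longer contiguous matches score higher.
  score += longestConsecutiveRun(q, lowerPath) * SCORE_CONSECUTIVE_CHAR

  // Logarithmic penalty: grows slowly, so long paths cannot dominate.
  score -= PATH_LENGTH_PENALTY_FACTOR * Math.log(filePath.length + 1)
  return score
}
```

With a logarithmic penalty, doubling the path length subtracts only a small constant amount, so the filename and segment bonuses stay the dominant terms.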

Example Scoring

For query updater:

| File | Score Factors |
| --- | --- |
| RCUpdater.js | Short path + filename contains "updater" |
| updateController.ts | Multiple segment matches |
| UpdaterHelper.plist | Long path penalty |

Configuration

DirectoryListOptions

interface DirectoryListOptions {
  recursive?: boolean      // Default: true
  maxDepth?: number        // Default: 10
  includeHidden?: boolean  // Default: false
  includeFiles?: boolean   // Default: true
  includeDirectories?: boolean // Default: true
  maxEntries?: number      // Default: 20
  searchPattern?: string   // Default: '.'
  fuzzy?: boolean          // Default: true
}

Usage

// Basic fuzzy search
const files = await window.api.file.listDirectory(dirPath, {
  searchPattern: 'updater',
  fuzzy: true,
  maxEntries: 20
})

// Disable fuzzy search (exact glob matching).
// A different variable name avoids redeclaring `files` in the same scope.
const exactFiles = await window.api.file.listDirectory(dirPath, {
  searchPattern: 'update',
  fuzzy: false
})

Performance Considerations

  1. Ripgrep Pre-filtering: Most queries are handled by ripgrep's native glob matching, which avoids loading the full file list into memory
  2. Fallback Only When Needed: Greedy substring matching (which loads all files) runs only when glob matching returns no results
  3. Result Limiting: Only the top 20 results are returned by default (see maxEntries)
  4. Excluded Directories: Common large directories are automatically excluded:
    • node_modules
    • .git
    • dist, build
    • .next, .nuxt
    • coverage, .cache

Implementation Details

The implementation is located in src/main/services/FileStorage.ts:

  • queryToGlobPattern(): Converts query to ripgrep glob pattern
  • isFuzzyMatch(): Subsequence matching algorithm
  • isGreedySubstringMatch(): Greedy substring matching fallback
  • getFuzzyMatchScore(): Calculates relevance score
  • listDirectoryWithRipgrep(): Main search orchestration
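The way these pieces might fit together is sketched below. It is a simplified, synchronous model of the two-tier flow, not the actual `listDirectoryWithRipgrep()`: the `TwoTierSteps` interface, the argument order of each step, and `sketchTwoTierSearch` are hypothetical, and the real implementation runs ripgrep asynchronously.

```typescript
// Hypothetical sketch of the two-tier flow. Every step is an injected,
// synchronous stand-in so the control flow can be read in isolation.
interface TwoTierSteps {
  globPrefilter: (query: string) => string[] // stands in for ripgrep --iglob
  listAllFiles: () => string[] // stands in for loading every file
  isFuzzyMatch: (file: string, query: string) => boolean
  isGreedySubstringMatch: (file: string, query: string) => boolean
  getFuzzyMatchScore: (file: string, query: string) => number
  getGreedyMatchScore: (file: string, query: string) => number
}

function sketchTwoTierSearch(query: string, steps: TwoTierSteps, maxEntries = 20): string[] {
  const prefiltered = steps.globPrefilter(query)
  let candidates: string[]
  let score: (file: string, query: string) => number

  if (prefiltered.length > 0) {
    // Tier 1: glob hits, validated by the subsequence check before scoring.
    candidates = prefiltered.filter((f) => steps.isFuzzyMatch(f, query))
    score = steps.getFuzzyMatchScore
  } else {
    // Tier 2: only when the glob found nothing, scan all files greedily.
    candidates = steps.listAllFiles().filter((f) => steps.isGreedySubstringMatch(f, query))
    score = steps.getGreedyMatchScore
  }

  // Rank by descending score and keep the top entries.
  return candidates
    .map((file) => ({ file, score: score(file, query) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, maxEntries)
    .map((entry) => entry.file)
}
```

Passing the steps in as functions keeps the expensive full scan behind an explicit branch, which mirrors the "fallback only when needed" behavior described under Performance Considerations.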