# Fuzzy Search for File List

This document describes the fuzzy search implementation for file listing in Cherry Studio.

## Overview

The fuzzy search feature lets users find files by typing partial or approximate file names or paths. It filters candidates in two tiers (ripgrep glob pre-filtering, with a greedy substring fallback) and then ranks them with subsequence-based scoring, so common queries stay fast while looser queries still return results.

## Features

- **Ripgrep Glob Pre-filtering**: Primary filtering using glob patterns for fast native-level filtering
- **Greedy Substring Matching**: Fallback file filtering strategy when ripgrep glob pre-filtering returns no results
- **Subsequence-based Segment Scoring**: During scoring, path segments gain additional weight when query characters appear in order
- **Relevance Scoring**: Results are sorted by a relevance score derived from multiple factors

## Matching Strategies

### 1. Ripgrep Glob Pre-filtering (Primary)

The query is converted to a glob pattern for ripgrep to do initial filtering:

```
Query: "updater"
Glob: "*u*p*d*a*t*e*r*"
```

This leverages ripgrep's native performance for the initial file filtering.
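
The conversion shown above can be sketched in a few lines. This is a minimal illustration rather than the code in `FileStorage.ts`: the name `sketchQueryToGlob` is hypothetical, and escaping glob metacharacters with a backslash is an assumption about how special characters might be handled.

```typescript
// Hypothetical sketch: put `*` around every character of the query so the
// glob accepts any path that contains those characters in order.
const GLOB_SPECIAL = '*?[]{}!\\'

function sketchQueryToGlob(query: string): string {
  const parts: string[] = []
  for (let i = 0; i < query.length; i++) {
    const ch = query.charAt(i)
    // Assumption: glob metacharacters are escaped with a backslash.
    parts.push(GLOB_SPECIAL.indexOf(ch) !== -1 ? '\\' + ch : ch)
  }
  return '*' + parts.join('*') + '*'
}

console.log(sketchQueryToGlob('updater')) // "*u*p*d*a*t*e*r*"
```

A pattern of this shape would typically be passed to ripgrep through a case-insensitive glob flag such as `--iglob`.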

### 2. Greedy Substring Matching (Fallback)

When the glob pre-filter returns no results, the system falls back to greedy substring matching. This allows more flexible matching:

```
Query: "updatercontroller"
File: "packages/update/src/node/updateController.ts"

Matching process:
1. Find "update" (longest match from start)
2. Remaining "rcontroller" → find "r" then "controller"
3. All parts matched → Success
```
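
A longest-first greedy match of this kind might look like the sketch below. It is an illustration under assumptions, not the project's `isGreedySubstringMatch()`: the function name is hypothetical, matching is assumed to be case-insensitive, and each fragment is assumed to be searched for after the end of the previous one. Because it always takes the longest fragment it can find, the fragments it picks can differ from the walkthrough above (here the second fragment is `rc`, taken from `src`), but the match still succeeds.

```typescript
// Hypothetical sketch: consume the query left to right, each time taking the
// longest prefix of the remaining query that occurs in the path after the
// previous fragment. Returns the fragments used, or null when no match exists.
function sketchGreedySubstringMatch(query: string, filePath: string): string[] | null {
  const q = query.toLowerCase() // assumption: case-insensitive matching
  const p = filePath.toLowerCase()
  const fragments: string[] = []
  let queryPos = 0
  let pathPos = 0

  while (queryPos < q.length) {
    let matched = false
    // Try the longest remaining fragment first, then shrink it.
    for (let end = q.length; end > queryPos; end--) {
      const fragment = q.substring(queryPos, end)
      const index = p.indexOf(fragment, pathPos)
      if (index !== -1) {
        fragments.push(fragment)
        pathPos = index + fragment.length
        queryPos = end
        matched = true
        break
      }
    }
    if (!matched) return null // not even a single character could be placed
  }
  return fragments
}

console.log(
  sketchGreedySubstringMatch('updatercontroller', 'packages/update/src/node/updateController.ts')
) // [ 'update', 'rc', 'ontroller' ]
```

This sketch never backtracks, so a greedy choice early in the path can rule out a match that a different split would have found.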

## Scoring Algorithm

Results are ranked by a relevance score based on named constants defined in `FileStorage.ts`:

| Constant | Value | Description |
|----------|-------|-------------|
| `SCORE_FILENAME_STARTS` | 100 | Filename starts with query (highest priority) |
| `SCORE_FILENAME_CONTAINS` | 80 | Filename contains exact query substring |
| `SCORE_SEGMENT_MATCH` | 60 | Per path segment that matches query |
| `SCORE_WORD_BOUNDARY` | 20 | Query matches start of a word |
| `SCORE_CONSECUTIVE_CHAR` | 15 | Per consecutive character match |
| `PATH_LENGTH_PENALTY_FACTOR` | 4 | Logarithmic penalty for longer paths |

### Scoring Strategy

The scoring prioritizes:
1. **Filename matches** (highest): Files where the query appears in the filename are most relevant
2. **Path segment matches**: Multiple matching segments indicate stronger relevance
3. **Word boundaries**: Matching at word starts (e.g., "upd" matching "update") is preferred
4. **Consecutive matches**: Longer consecutive character sequences score higher
5. **Path length**: Shorter paths are preferred (logarithmic penalty prevents long paths from dominating)
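
To make the interplay of these factors concrete, here is a simplified scoring sketch. It is not `getFuzzyMatchScore()`: the function names are hypothetical, the word-boundary and consecutive-character bonuses are left out for brevity, and the exact penalty formula (natural logarithm of the path length plus one) is an assumption. Only the constant values come from the table above.

```typescript
// Constant values taken from the table above.
const SCORE_FILENAME_STARTS = 100
const SCORE_FILENAME_CONTAINS = 80
const SCORE_SEGMENT_MATCH = 60
const PATH_LENGTH_PENALTY_FACTOR = 4

// True when every character of `query` appears in `text` in the same order.
function isSubsequence(query: string, text: string): boolean {
  let queryPos = 0
  for (let i = 0; i < text.length && queryPos < query.length; i++) {
    if (text.charAt(i) === query.charAt(queryPos)) queryPos++
  }
  return queryPos === query.length
}

// Hypothetical, simplified score: filename bonus + per-segment bonus - length penalty.
function sketchScore(filePath: string, query: string): number {
  const q = query.toLowerCase()
  const lowerPath = filePath.toLowerCase()
  const fileName = lowerPath.substring(lowerPath.lastIndexOf('/') + 1)
  let score = 0

  // The two filename bonuses are mutually exclusive.
  const position = fileName.indexOf(q)
  if (position === 0) score += SCORE_FILENAME_STARTS
  else if (position > 0) score += SCORE_FILENAME_CONTAINS

  // Each path segment that contains the query as a subsequence adds weight.
  lowerPath.split('/').forEach((segment) => {
    if (segment.length > 0 && isSubsequence(q, segment)) score += SCORE_SEGMENT_MATCH
  })

  // Assumed penalty shape: grows with the logarithm of the path length.
  score -= PATH_LENGTH_PENALTY_FACTOR * Math.log(filePath.length + 1)
  return score
}

const ranked = ['a/b/c/d/updater.ts', 'src/RCUpdater.js', 'src/updater.ts'].sort(
  (left, right) => sketchScore(right, 'updater') - sketchScore(left, 'updater')
)
console.log(ranked) // best match first
```

With a logarithmic penalty, doubling the path length costs a fixed amount instead of doubling the penalty, so a strong filename match deep in the tree can still outrank a weak match near the root.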

### Example Scoring

For query `updater`:

| File | Score Factors |
|------|---------------|
| `RCUpdater.js` | Short path + filename contains "updater" |
| `updateController.ts` | Multiple segment matches |
| `UpdaterHelper.plist` | Long path penalty |

## Configuration

### DirectoryListOptions

```typescript
interface DirectoryListOptions {
  recursive?: boolean          // Default: true
  maxDepth?: number            // Default: 10
  includeHidden?: boolean      // Default: false
  includeFiles?: boolean       // Default: true
  includeDirectories?: boolean // Default: true
  maxEntries?: number          // Default: 20
  searchPattern?: string       // Default: '.'
  fuzzy?: boolean              // Default: true
}
```
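
One way the documented defaults could be applied to a partial options object is sketched below. The `DEFAULT_OPTIONS` constant and the `resolveOptions` helper are hypothetical names used for illustration. Only the default values themselves come from the interface comments above.

```typescript
interface DirectoryListOptions {
  recursive?: boolean
  maxDepth?: number
  includeHidden?: boolean
  includeFiles?: boolean
  includeDirectories?: boolean
  maxEntries?: number
  searchPattern?: string
  fuzzy?: boolean
}

// Hypothetical constant holding the defaults documented above.
const DEFAULT_OPTIONS: Required<DirectoryListOptions> = {
  recursive: true,
  maxDepth: 10,
  includeHidden: false,
  includeFiles: true,
  includeDirectories: true,
  maxEntries: 20,
  searchPattern: '.',
  fuzzy: true
}

// Hypothetical helper: caller-supplied fields override the defaults.
function resolveOptions(options: DirectoryListOptions = {}): Required<DirectoryListOptions> {
  return { ...DEFAULT_OPTIONS, ...options }
}

console.log(resolveOptions({ searchPattern: 'updater', maxEntries: 5 }))
```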

## Usage

```typescript
// Basic fuzzy search
const files = await window.api.file.listDirectory(dirPath, {
  searchPattern: 'updater',
  fuzzy: true,
  maxEntries: 20
})

// Disable fuzzy search (exact glob matching)
const exactFiles = await window.api.file.listDirectory(dirPath, {
  searchPattern: 'update',
  fuzzy: false
})
```

## Performance Considerations

1. **Ripgrep Pre-filtering**: Most queries are resolved by ripgrep's native glob matching, so non-matching files are discarded before any scoring runs
2. **Fallback Only When Needed**: Greedy substring matching (which loads all files) only runs when glob matching returns empty results
3. **Result Limiting**: Only the top 20 results are returned by default (see `maxEntries`)
4. **Excluded Directories**: Common large directories are automatically excluded:
   - `node_modules`
   - `.git`
   - `dist`, `build`
   - `.next`, `.nuxt`
   - `coverage`, `.cache`
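
The exclusions and the glob pre-filter could be combined into one ripgrep argument list, as in the sketch below. The function name and the exact exclusion pattern shape (`!**/<dir>/**`) are assumptions, and this document does not show the arguments the project actually builds. The flags themselves (`--files`, `--max-depth`, `--iglob`, `--glob`) are standard ripgrep options.

```typescript
// Directory names taken from the list above.
const EXCLUDED_DIRECTORIES = ['node_modules', '.git', 'dist', 'build', '.next', '.nuxt', 'coverage', '.cache']

// Hypothetical sketch of an argument list for `rg`.
function sketchRipgrepArgs(globPattern: string, maxDepth: number): string[] {
  // `--files` lists file paths instead of searching file contents;
  // `--iglob` applies the pattern case-insensitively.
  const args = ['--files', '--max-depth', String(maxDepth), '--iglob', globPattern]

  // Exclusions are added last: when several globs match the same path,
  // ripgrep gives precedence to the one given later on the command line.
  EXCLUDED_DIRECTORIES.forEach((dir) => {
    args.push('--glob', '!**/' + dir + '/**')
  })
  return args
}

console.log(sketchRipgrepArgs('*u*p*d*a*t*e*r*', 10).join(' '))
```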

## Implementation Details

The implementation is located in `src/main/services/FileStorage.ts`:

- `queryToGlobPattern()`: Converts query to ripgrep glob pattern
- `isFuzzyMatch()`: Subsequence matching algorithm
- `isGreedySubstringMatch()`: Greedy substring matching fallback
- `getFuzzyMatchScore()`: Calculates relevance score
- `listDirectoryWithRipgrep()`: Main search orchestration
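
How these pieces might fit together is sketched below as a pure function with injected dependencies. It is a hypothetical outline of the two-tier flow described in this document, not the body of `listDirectoryWithRipgrep()`. The `SearchDeps` interface and all of its member names are invented for illustration, and the sketch uses a single scoring function for both tiers.

```typescript
// Hypothetical dependencies, injected so the flow can run without ripgrep.
interface SearchDeps {
  globFilter: (query: string) => string[] // tier 1: paths that passed the glob pre-filter
  listAll: () => string[] // tier 2 input: every candidate path
  greedyMatch: (query: string, filePath: string) => boolean
  score: (filePath: string, query: string) => number
}

function sketchSearch(query: string, maxEntries: number, deps: SearchDeps): string[] {
  // Tier 1: glob pre-filter.
  let candidates = deps.globFilter(query)

  // Tier 2: fall back to greedy substring matching only when tier 1 is empty.
  if (candidates.length === 0) {
    candidates = deps.listAll().filter((filePath) => deps.greedyMatch(query, filePath))
  }

  // Rank by relevance (highest first) and keep the top results.
  return candidates
    .map((filePath) => ({ filePath, score: deps.score(filePath, query) }))
    .sort((left, right) => right.score - left.score)
    .slice(0, maxEntries)
    .map((entry) => entry.filePath)
}
```

Passing the filters and the scorer in as functions keeps the orchestration testable with in-memory stubs.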