cherry-studio/docs/zh/references/fuzzy-search.md
beyondkmp bc9eeb9f30
feat: add fuzzy search for file list with relevance scoring (#12131)
* feat: add fuzzy search for file list with relevance scoring

- Add fuzzy option to DirectoryListOptions (default: true)
- Implement isFuzzyMatch for subsequence matching
- Add getFuzzyMatchScore for relevance-based sorting
- Remove searchByContent method (content-based search)
- Increase maxDepth to 10 and maxEntries to 20

* perf: optimize fuzzy search with ripgrep glob pre-filtering

- Add queryToGlobPattern to convert query to glob pattern
- Use ripgrep --iglob for initial filtering instead of loading all files
- Reduces memory footprint and improves performance for large directories

* feat: add greedy substring match fallback for fuzzy search

- Add isGreedySubstringMatch for flexible matching
- Fallback to greedy match when glob pre-filter returns empty
- Allows 'updatercontroller' to match 'updateController.ts'

* fix: improve greedy substring match algorithm

- Search from longest to shortest substring for better matching
- Fix issue where 'updatercontroller' couldn't match 'updateController'

* docs: add fuzzy search documentation (en/zh)

* refactor: extract MAX_ENTRIES_PER_SEARCH constant

* refactor: use logarithmic scaling for path length penalty

- Replace linear penalty (0.8 * length) with logarithmic scaling
- Prevents long paths from dominating the score
- Add PATH_LENGTH_PENALTY_FACTOR constant with explanation

* refactor: extract scoring constants with documentation

- Add named constants for scoring factors (SCORE_SEGMENT_MATCH, etc.)
- Update en/zh documentation with scoring strategy explanation

* refactor: move PATH_LENGTH_PENALTY_FACTOR to class level constant

* refactor: extract buildRipgrepBaseArgs helper method

- Reduce code duplication for ripgrep argument building
- Consolidate directory exclusion patterns and depth handling

* refactor: rename MAX_ENTRIES_PER_SEARCH to MAX_SEARCH_RESULTS

* fix: escape ! character in glob pattern for negation support

* fix: avoid duplicate scoring for filename starts and contains

* docs: clarify fuzzy search filtering and scoring strategies

* fix: limit word boundary bonus to single match

* fix: add dedicated scoring for greedy substring match

- Add getGreedyMatchScore function that rewards fewer fragments and tighter matches
- Add isFuzzyMatch validation before scoring in fuzzy glob path
- Use greedy scoring for fallback path to properly rank longest matches first

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>

---------

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
2025-12-30 19:42:56 +08:00

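# Fuzzy Search for File List

This document describes the fuzzy search implementation for the file list in Cherry Studio.

## Overview

The fuzzy search feature lets users find files by typing a partial or approximate file name or path. It uses a two-tier file filtering strategy (ripgrep glob pre-filtering with a greedy substring match fallback), combined with subsequence-based scoring, to balance performance and flexibility.

## Features

- **Ripgrep glob pre-filtering**: the primary filtering strategy, which uses glob patterns for fast, native-level filtering
- **Greedy substring matching**: the fallback file filtering strategy, used when the ripgrep glob pre-filter returns no results
- **Subsequence-based segment scoring**: during scoring, path segments gain extra weight when the query characters appear in them in order
- **Relevance scoring**: results are sorted by a multi-factor relevance score
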
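## Matching Strategies

### 1. Ripgrep Glob Pre-filtering (Primary)

The query is converted into a glob pattern that ripgrep uses for the initial filtering pass:

```
Query: "updater"
Glob:  "*u*p*d*a*t*e*r*"
```

This delegates the initial file filtering to ripgrep's native matcher.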
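
The conversion above can be sketched as follows. This is a minimal illustration, not the code in `FileStorage.ts`: the function name `queryToGlobPatternSketch`, the set of escaped characters, and the use of a backslash for escaping are all assumptions.

```typescript
// Hypothetical sketch of a query-to-glob conversion; the real
// queryToGlobPattern() may escape a different set of characters.
const GLOB_SPECIAL = /[*?[\]{}!\\]/

function queryToGlobPatternSketch(query: string): string {
  const parts: string[] = []
  for (const ch of query) {
    // Escape glob metacharacters so they are matched literally.
    parts.push(GLOB_SPECIAL.test(ch) ? `\\${ch}` : ch)
  }
  // An empty query matches everything.
  if (parts.length === 0) return '*'
  // Wrap every character in wildcards: "abc" becomes "*a*b*c*".
  return `*${parts.join('*')}*`
}
```

Placing a `*` between every character makes the glob accept any path in which the query characters appear in order, which is the same condition a subsequence match checks.
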
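### 2. Greedy Substring Matching (Fallback)

When the glob pre-filter returns no results, the system falls back to greedy substring matching, which allows more flexible matches:

```
Query: "updatercontroller"
File:  "packages/update/src/node/updateController.ts"
Matching process:
1. Find "update" (the longest match from the start of the query)
2. Remaining "rcontroller" → find "r", then "controller"
3. All fragments matched → success
```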
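
The matching process above can be approximated with the sketch below. It assumes case-insensitive matching and that each fragment may be found anywhere in the target string. The function name `greedyFragmentsSketch` is hypothetical, and the real `isGreedySubstringMatch()` may add ordering or minimum-length constraints.

```typescript
// Hypothetical sketch: split the query into the longest fragments that
// occur in the target, trying long fragments before short ones.
// Returns the fragments on success, or null if some character never matches.
function greedyFragmentsSketch(target: string, query: string): string[] | null {
  const haystack = target.toLowerCase()
  const needle = query.toLowerCase()
  const fragments: string[] = []
  let pos = 0
  while (pos < needle.length) {
    let matchedLength = 0
    // Try the longest remaining fragment first, then shrink it.
    for (let len = needle.length - pos; len >= 1; len--) {
      if (haystack.includes(needle.slice(pos, pos + len))) {
        matchedLength = len
        break
      }
    }
    if (matchedLength === 0) return null
    fragments.push(needle.slice(pos, pos + matchedLength))
    pos += matchedLength
  }
  return fragments
}
```

Run against the file name `updateController.ts` with the query `updatercontroller`, the sketch yields the fragments `update`, `r`, and `controller`, matching the trace above. Run against the full path, the second fragment becomes `rc` (found in `src`), so the exact fragments depend on which string the matcher is given.
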
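## Scoring Algorithm

Results are ranked by a relevance score computed from named constants defined in `FileStorage.ts`:

| Constant | Value | Description |
|----------|-------|-------------|
| `SCORE_FILENAME_STARTS` | 100 | File name starts with the query (highest priority) |
| `SCORE_FILENAME_CONTAINS` | 80 | File name contains the exact query substring |
| `SCORE_SEGMENT_MATCH` | 60 | Awarded for each path segment that matches the query |
| `SCORE_WORD_BOUNDARY` | 20 | Query matches the start of a word |
| `SCORE_CONSECUTIVE_CHAR` | 15 | Awarded for each consecutive character match |
| `PATH_LENGTH_PENALTY_FACTOR` | 4 | Logarithmic penalty for longer paths |
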
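### Scoring Strategy

Scoring priorities, from highest to lowest:

1. **File name match** (highest): files whose name contains the query are the most relevant
2. **Path segment match**: multiple matching segments indicate stronger relevance
3. **Word boundary**: matches at the start of a word (for example, "upd" matching "update") are preferred
4. **Consecutive match**: longer runs of consecutive matching characters score higher
5. **Path length**: shorter paths are preferred (the logarithmic penalty keeps long paths from dominating the score)
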
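
To make the interplay of these factors concrete, here is a simplified sketch. The constant values come from the table above, but how the factors are combined is an assumption: the helper names, the word-splitting rule, and the penalty formula `log(length + 1) * factor` are illustrative and may differ from `getFuzzyMatchScore()`.

```typescript
const SCORE_FILENAME_STARTS = 100
const SCORE_FILENAME_CONTAINS = 80
const SCORE_SEGMENT_MATCH = 60
const SCORE_WORD_BOUNDARY = 20
const SCORE_CONSECUTIVE_CHAR = 15
const PATH_LENGTH_PENALTY_FACTOR = 4

// True when every query character appears in text, in order.
function isSubsequence(text: string, query: string): boolean {
  let qi = 0
  for (let ti = 0; ti < text.length && qi < query.length; ti++) {
    if (text[ti] === query[qi]) qi++
  }
  return qi === query.length
}

// Length of the longest run of characters shared by text and query.
function longestCommonRun(text: string, query: string): number {
  let best = 0
  for (let i = 0; i < text.length; i++) {
    for (let j = 0; j < query.length; j++) {
      let k = 0
      while (i + k < text.length && j + k < query.length && text[i + k] === query[j + k]) k++
      if (k > best) best = k
    }
  }
  return best
}

// Hypothetical scoring function combining the five factors.
function scoreSketch(filePath: string, query: string): number {
  const path = filePath.toLowerCase()
  const q = query.toLowerCase()
  const segments = path.split(/[\\/]/).filter((s) => s.length > 0)
  const fileName = segments[segments.length - 1] ?? ''
  let score = 0
  // "Starts with" and "contains" are mutually exclusive, so a file is not scored twice.
  if (fileName.startsWith(q)) score += SCORE_FILENAME_STARTS
  else if (fileName.includes(q)) score += SCORE_FILENAME_CONTAINS
  // Each path segment containing the query as a subsequence adds weight.
  for (const segment of segments) {
    if (isSubsequence(segment, q)) score += SCORE_SEGMENT_MATCH
  }
  // The word boundary bonus is granted at most once.
  if (fileName.split(/[^a-z0-9]+/).some((word) => word.startsWith(q))) score += SCORE_WORD_BOUNDARY
  score += longestCommonRun(fileName, q) * SCORE_CONSECUTIVE_CHAR
  // Logarithmic penalty (assumed form): grows slowly with path length.
  score -= Math.log(path.length + 1) * PATH_LENGTH_PENALTY_FACTOR
  return score
}
```

Because the penalty grows with the logarithm of the path length, doubling a path's length costs only about 2.8 points with a factor of 4, far less than any single match bonus.
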
### Scoring Examples

For the query `updater`:

| File | Scoring factors |
|------|-----------------|
| `RCUpdater.js` | Short path + file name contains "updater" |
| `updateController.ts` | Multiple path segment matches |
| `UpdaterHelper.plist` | Long path penalty |

## Configuration

### DirectoryListOptions

```typescript
interface DirectoryListOptions {
  recursive?: boolean          // default: true
  maxDepth?: number            // default: 10
  includeHidden?: boolean      // default: false
  includeFiles?: boolean       // default: true
  includeDirectories?: boolean // default: true
  maxEntries?: number          // default: 20
  searchPattern?: string       // default: '.'
  fuzzy?: boolean              // default: true
}
```
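
One way to apply the documented defaults is to merge caller options over a defaults object, as in the sketch below. `DEFAULT_OPTIONS_SKETCH` and `resolveOptionsSketch` are hypothetical names, and the actual service may resolve defaults differently.

```typescript
interface DirectoryListOptions {
  recursive?: boolean
  maxDepth?: number
  includeHidden?: boolean
  includeFiles?: boolean
  includeDirectories?: boolean
  maxEntries?: number
  searchPattern?: string
  fuzzy?: boolean
}

// Default values as documented in the interface comments above.
const DEFAULT_OPTIONS_SKETCH: Required<DirectoryListOptions> = {
  recursive: true,
  maxDepth: 10,
  includeHidden: false,
  includeFiles: true,
  includeDirectories: true,
  maxEntries: 20,
  searchPattern: '.',
  fuzzy: true
}

// Caller-supplied fields win; omitted fields fall back to the defaults.
// Note: a field explicitly set to undefined would override its default.
function resolveOptionsSketch(options: DirectoryListOptions = {}): Required<DirectoryListOptions> {
  return { ...DEFAULT_OPTIONS_SKETCH, ...options }
}
```
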
## Usage

```typescript
// Basic fuzzy search
const files = await window.api.file.listDirectory(dirPath, {
  searchPattern: 'updater',
  fuzzy: true,
  maxEntries: 20
})

// Disable fuzzy search (exact glob matching)
const exactFiles = await window.api.file.listDirectory(dirPath, {
  searchPattern: 'update',
  fuzzy: false
})
```
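## Performance Considerations

1. **Ripgrep pre-filtering**: most queries are handled by ripgrep's native glob matching, which avoids loading the full file list into memory
2. **Fallback only when needed**: greedy substring matching, which loads all files, runs only when the glob match returns no results
3. **Result limit**: only the top 20 results are returned by default
4. **Excluded directories**: common large directories are excluded automatically:
   - `node_modules`
   - `.git`
   - `dist`, `build`
   - `.next`, `.nuxt`
   - `coverage`, `.cache`
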
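
The pre-filter and directory exclusions could translate into ripgrep arguments roughly as sketched below. The flags `--files`, `--max-depth`, `--hidden`, `--glob`, and `--iglob` are real ripgrep options. The function name, the exact exclusion globs, and the argument order are assumptions, not the contents of `buildRipgrepBaseArgs`.

```typescript
// Directories listed above as excluded.
const EXCLUDED_DIRS_SKETCH = [
  'node_modules', '.git', 'dist', 'build', '.next', '.nuxt', 'coverage', '.cache'
]

// Hypothetical argument builder for a file-listing ripgrep invocation.
function buildRipgrepArgsSketch(
  dir: string,
  globPattern: string,
  maxDepth = 10,
  includeHidden = false
): string[] {
  // --files lists file paths instead of searching file contents.
  const args = ['--files', '--max-depth', String(maxDepth)]
  if (includeHidden) args.push('--hidden')
  // Case-insensitive include glob produced from the query.
  args.push('--iglob', globPattern)
  // Exclusions come after the include glob; ripgrep documents that when
  // several globs match, the one given later takes precedence.
  for (const name of EXCLUDED_DIRS_SKETCH) {
    args.push('--glob', `!**/${name}/**`)
  }
  args.push(dir)
  return args
}
```

The resulting array can be passed to a process spawner such as Node's `child_process.spawn('rg', args)`.
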
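## Implementation Details

The implementation lives in `src/main/services/FileStorage.ts`:

- `queryToGlobPattern()`: converts the query into a ripgrep glob pattern
- `isFuzzyMatch()`: subsequence matching algorithm
- `isGreedySubstringMatch()`: greedy substring matching fallback
- `getFuzzyMatchScore()`: computes the relevance score
- `listDirectoryWithRipgrep()`: orchestrates the overall search
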
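
The way these pieces fit together can be sketched as a two-tier flow: score the glob pre-filter results if there are any, otherwise fall back to greedy matching over the full listing. The sketch is synchronous and takes its collaborators as injected callbacks. All names here are hypothetical, and the real `listDirectoryWithRipgrep()` is presumably asynchronous.

```typescript
// A scorer returns null when the path does not match the query at all.
type ScorerSketch = (path: string, query: string) => number | null

interface SearchDepsSketch {
  globFilter: (query: string) => string[] // stands in for the ripgrep --iglob pass
  listAll: () => string[]                 // stands in for loading every file
  fuzzyScore: ScorerSketch                // validates with a subsequence match, then scores
  greedyScore: ScorerSketch               // dedicated scoring for the fallback path
}

// Score, drop non-matches, sort by descending score, and truncate.
function rankSketch(paths: string[], query: string, score: ScorerSketch, limit: number): string[] {
  const scored: { path: string; score: number }[] = []
  for (const path of paths) {
    const value = score(path, query)
    if (value !== null) scored.push({ path, score: value })
  }
  scored.sort((a, b) => b.score - a.score)
  return scored.slice(0, limit).map((entry) => entry.path)
}

function searchSketch(query: string, deps: SearchDepsSketch, maxResults = 20): string[] {
  const prefiltered = deps.globFilter(query)
  if (prefiltered.length > 0) {
    return rankSketch(prefiltered, query, deps.fuzzyScore, maxResults)
  }
  // Fallback: only now pay the cost of loading the full file list.
  return rankSketch(deps.listAll(), query, deps.greedyScore, maxResults)
}
```

Keeping the fallback behind the empty-result check means the expensive full listing is skipped for most queries.
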