mirror of
https://github.com/CherryHQ/cherry-studio.git
synced 2026-01-05 12:29:44 +08:00
# Fuzzy Search for the File List

This document describes the fuzzy search implementation for the file list in Cherry Studio.

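## Overview

The fuzzy search feature lets users find files by typing partial or approximate file names or paths. It uses a two-tier file filtering strategy (ripgrep glob pre-filtering, with greedy substring matching as a fallback), combined with subsequence-based scoring, to balance performance and flexibility.
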
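## Features

- **Ripgrep glob pre-filtering**: the primary filtering strategy, which uses glob patterns for fast, native-speed filtering
- **Greedy substring matching**: the fallback file filtering strategy, used when the ripgrep glob pre-filter returns no results
- **Subsequence-based segment scoring**: during scoring, a path segment receives extra weight when the query characters appear in it in order
- **Relevance scoring**: results are sorted by a multi-factor relevance score
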
## Matching Strategies

### 1. Ripgrep Glob Pre-filtering (Primary)

The query is converted into a glob pattern that ripgrep uses for initial filtering:

```
Query: "updater"
Glob:  "*u*p*d*a*t*e*r*"
```

This delegates the initial file filtering to ripgrep and benefits from its native performance.

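The conversion above can be sketched as follows. This is a minimal illustration, not the actual `queryToGlobPattern()` from `FileStorage.ts`: the function name `queryToGlobPatternSketch`, the backslash escaping of glob metacharacters, and the handling of an empty query are all assumptions made for the example.

```typescript
// Hypothetical sketch; the real queryToGlobPattern() may differ in its details.
function queryToGlobPatternSketch(query: string): string {
  // Assumption: an empty query should match everything.
  if (query.length === 0) return '*'

  const escaped = query.split('').map((ch) =>
    // Assumption: backslash-escape characters that have a special meaning in globs.
    /[*?[\]{}!\\]/.test(ch) ? `\\${ch}` : ch
  )

  // Put a wildcard before, between, and after every character so that the
  // query matches any name containing its characters in order.
  return `*${escaped.join('*')}*`
}

console.log(queryToGlobPatternSketch('updater')) // *u*p*d*a*t*e*r*
```

Wrapping every character in `*` turns the glob into a subsequence test, which is why a query such as `updater` can still match a name like `RCUpdater.js`.
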
### 2. Greedy Substring Matching (Fallback)

When the glob pre-filter returns no results, the system falls back to greedy substring matching, which allows more flexible matches:

```
Query: "updatercontroller"
File:  "packages/update/src/node/updateController.ts"

Matching process:
1. Find "update" (the longest match from the start of the query)
2. Remaining "rcontroller" → find "r", then "controller"
3. All parts matched → success
```

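One way to express this longest-first matching is sketched below. It is an illustration under assumptions rather than the actual `isGreedySubstringMatch()`: the name `isGreedySubstringMatchSketch`, the case-insensitive comparison, and the left-to-right search order are chosen for the example, and the fragments it picks may differ from those listed in the walkthrough above.

```typescript
// Hypothetical sketch of a greedy, longest-first substring match.
function isGreedySubstringMatchSketch(filePath: string, query: string): boolean {
  const haystack = filePath.toLowerCase() // assumption: matching ignores case
  const needle = query.toLowerCase()

  let queryIndex = 0 // first query character that is not matched yet
  let searchFrom = 0 // position in the path where the next fragment may start

  while (queryIndex < needle.length) {
    let matched = false

    // Try the longest remaining fragment first, then shorten it one character at a time.
    for (let len = needle.length - queryIndex; len >= 1; len--) {
      const fragment = needle.slice(queryIndex, queryIndex + len)
      const foundAt = haystack.indexOf(fragment, searchFrom)
      if (foundAt !== -1) {
        queryIndex += len
        searchFrom = foundAt + len
        matched = true
        break
      }
    }

    // Not even a single character of the remainder could be found.
    if (!matched) return false
  }

  return true
}

console.log(
  isGreedySubstringMatchSketch('packages/update/src/node/updateController.ts', 'updatercontroller')
) // true
```
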
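## Scoring Algorithm

Results are ranked by a relevance score computed from named constants defined in `FileStorage.ts`:

| Constant | Value | Description |
|----------|-------|-------------|
| `SCORE_FILENAME_STARTS` | 100 | The file name starts with the query (highest priority) |
| `SCORE_FILENAME_CONTAINS` | 80 | The file name contains the exact query substring |
| `SCORE_SEGMENT_MATCH` | 60 | Awarded for each path segment that matches the query |
| `SCORE_WORD_BOUNDARY` | 20 | The query matches at the start of a word |
| `SCORE_CONSECUTIVE_CHAR` | 15 | Awarded for each consecutive character match |
| `PATH_LENGTH_PENALTY_FACTOR` | 4 | Logarithmic penalty for longer paths |
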
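### Scoring Strategy

Scoring priorities, from highest to lowest:

1. **File name match** (highest): files whose name contains the query are the most relevant
2. **Path segment match**: multiple matching segments indicate stronger relevance
3. **Word boundary**: matches at the start of a word (for example, "upd" matching "update") are preferred
4. **Consecutive match**: longer runs of consecutive matching characters score higher
5. **Path length**: shorter paths are preferred (the logarithmic penalty keeps long paths from dominating the score)
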
### Scoring Example

For the query `updater`:

| File | Scoring factors |
|------|-----------------|
| `RCUpdater.js` | Short path + file name contains "updater" |
| `updateController.ts` | Multiple path segments match |
| `UpdaterHelper.plist` | Long path penalty |

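The sketch below shows how some of these factors might combine. The constant values are taken from the table above, but the formula itself is an assumption: `sketchScore` and `isSubsequence` are hypothetical names, the word-boundary and consecutive-character bonuses are left out for brevity, and the use of the natural logarithm for the path length penalty is a guess, not the actual `getFuzzyMatchScore()` logic.

```typescript
const SCORE_FILENAME_STARTS = 100
const SCORE_FILENAME_CONTAINS = 80
const SCORE_SEGMENT_MATCH = 60
const PATH_LENGTH_PENALTY_FACTOR = 4

// True when every character of `query` appears in `text` in the same order.
function isSubsequence(text: string, query: string): boolean {
  let matched = 0
  for (const ch of text) {
    if (matched < query.length && ch === query[matched]) matched++
  }
  return matched === query.length
}

// Hypothetical, simplified scoring function (assumed formula).
function sketchScore(filePath: string, query: string): number {
  const path = filePath.toLowerCase()
  const q = query.toLowerCase()
  const segments = path.split('/').filter((segment) => segment.length > 0)
  const fileName = segments.length > 0 ? segments[segments.length - 1] : ''

  let score = 0

  // "Starts with" and "contains" are mutually exclusive to avoid scoring twice.
  if (fileName.startsWith(q)) score += SCORE_FILENAME_STARTS
  else if (fileName.includes(q)) score += SCORE_FILENAME_CONTAINS

  // Each segment that contains the query as a subsequence adds a bonus.
  for (const segment of segments) {
    if (isSubsequence(segment, q)) score += SCORE_SEGMENT_MATCH
  }

  // Assumption: the penalty grows with the natural log of the path length.
  score -= PATH_LENGTH_PENALTY_FACTOR * Math.log(path.length + 1)
  return score
}

const ranked = ['updateController.ts', 'RCUpdater.js', 'updater.ts'].sort(
  (a, b) => sketchScore(b, 'updater') - sketchScore(a, 'updater')
)
console.log(ranked) // highest score first
```

Because the penalty is logarithmic, doubling the path length subtracts a constant amount instead of a proportional one, so a strong file name match deep in the tree can still outrank a weak match near the root.
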
## Configuration

### DirectoryListOptions

```typescript
interface DirectoryListOptions {
  recursive?: boolean // default: true
  maxDepth?: number // default: 10
  includeHidden?: boolean // default: false
  includeFiles?: boolean // default: true
  includeDirectories?: boolean // default: true
  maxEntries?: number // default: 20
  searchPattern?: string // default: '.'
  fuzzy?: boolean // default: true
}
```

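As an illustration of how these defaults could be applied, the sketch below merges caller-supplied options over a defaults object. The default values come from the interface comments above; `DEFAULT_DIRECTORY_LIST_OPTIONS` and `resolveDirectoryListOptions` are hypothetical names, and the actual code may resolve defaults differently.

```typescript
interface DirectoryListOptions {
  recursive?: boolean
  maxDepth?: number
  includeHidden?: boolean
  includeFiles?: boolean
  includeDirectories?: boolean
  maxEntries?: number
  searchPattern?: string
  fuzzy?: boolean
}

// Hypothetical defaults object mirroring the documented default values.
const DEFAULT_DIRECTORY_LIST_OPTIONS: Required<DirectoryListOptions> = {
  recursive: true,
  maxDepth: 10,
  includeHidden: false,
  includeFiles: true,
  includeDirectories: true,
  maxEntries: 20,
  searchPattern: '.',
  fuzzy: true
}

// Caller-supplied fields override the defaults; omitted fields keep them.
function resolveDirectoryListOptions(options: DirectoryListOptions = {}): Required<DirectoryListOptions> {
  return { ...DEFAULT_DIRECTORY_LIST_OPTIONS, ...options }
}

const resolved = resolveDirectoryListOptions({ searchPattern: 'updater', maxEntries: 5 })
console.log(resolved.fuzzy, resolved.maxDepth, resolved.maxEntries) // true 10 5
```
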
## Usage

```typescript
// Basic fuzzy search
const files = await window.api.file.listDirectory(dirPath, {
  searchPattern: 'updater',
  fuzzy: true,
  maxEntries: 20
})

// Disable fuzzy search (exact glob matching).
// A separate variable name avoids redeclaring `files` in the same scope.
const exactFiles = await window.api.file.listDirectory(dirPath, {
  searchPattern: 'update',
  fuzzy: false
})
```

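## Performance Considerations

1. **Ripgrep pre-filtering**: most queries are handled by ripgrep's native glob matching, which is very fast
2. **Fallback only when needed**: greedy substring matching, which loads all files, runs only when glob matching returns no results
3. **Result limit**: only the top 20 results are returned by default
4. **Excluded directories**: common large directories are excluded automatically:
   - `node_modules`
   - `.git`
   - `dist`, `build`
   - `.next`, `.nuxt`
   - `coverage`, `.cache`
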
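To make the exclusion and pre-filtering points concrete, the sketch below builds a ripgrep argument list. The flags `--files`, `--max-depth`, `--hidden`, `--iglob`, and `--glob` are real ripgrep options, but the helper name `buildRipgrepArgsSketch`, the exact exclusion patterns, and the argument order are assumptions and may not match what `FileStorage.ts` does.

```typescript
// Directory names taken from the exclusion list above.
const EXCLUDED_DIRECTORIES = ['node_modules', '.git', 'dist', 'build', '.next', '.nuxt', 'coverage', '.cache']

// Hypothetical helper: assembles arguments for a `rg --files` listing.
function buildRipgrepArgsSketch(globPattern: string, maxDepth: number, includeHidden: boolean): string[] {
  // --files lists candidate file paths instead of searching file contents.
  const args = ['--files', '--max-depth', String(maxDepth)]

  // ripgrep skips hidden files and directories unless --hidden is passed.
  if (includeHidden) args.push('--hidden')

  // Case-insensitive glob produced from the query (the pre-filter).
  args.push('--iglob', globPattern)

  // A glob starting with "!" excludes matching paths. Assumption: exclusions
  // are placed last so that they take precedence over the query glob.
  for (const dir of EXCLUDED_DIRECTORIES) {
    args.push('--glob', `!**/${dir}/**`)
  }

  return args
}

console.log(buildRipgrepArgsSketch('*u*p*d*a*t*e*r*', 10, false).join(' '))
```
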
## Implementation Details

The implementation lives in `src/main/services/FileStorage.ts`:

- `queryToGlobPattern()`: converts the query into a ripgrep glob pattern
- `isFuzzyMatch()`: subsequence matching algorithm
- `isGreedySubstringMatch()`: greedy substring matching fallback
- `getFuzzyMatchScore()`: computes the relevance score
- `listDirectoryWithRipgrep()`: orchestrates the overall search