Desktop search: recoll or AnyTxt? I recommend recoll

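Everything is an essential desktop file search tool, but it cannot search file contents. After Google Desktop was discontinued I spent a long time looking for a replacement, and even found an open-source Lucene-based solution, but it would have required modifying it and writing my own front end. Later I learned of recoll, an open-source content search engine for Linux; on a closer look it was exactly what I wanted. However, recoll had to be ported to Windows before I could use it. A port existed but was not freely available, so I set the matter aside for a while. Later someone found a usable Windows build; I downloaded it right away, and it is indeed very good!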

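There is also another content search program, billed as a companion to Everything and written by a Chinese developer: AnyTxt. I installed it as well for comparison.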


1. recoll

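recoll is a desktop search engine project maintained by Jean-Francois Dockes and other developers, and it is open-source software under the GPLv2. The project dates from around September 11, 2012. The version released in May 2020 is 1.27.0, so it still appears to be an active project. The author has developed a Windows port (https://www.lesbonscomptes.com/recoll/pages/recoll-windows.html), but the porting work seems to have been laborious, and the author asks for a donation before download (5, 10 or 20 euros is enough). A Windows build of 1.26.3 can currently be found online, and that is the version I downloaded. The author uses Qt as the cross-platform GUI toolkit, and the program looks no different from a native Windows application.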


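Recoll works like other search engines: first it has to index the files on the machine. You can configure directories to exclude from indexing, such as windows, and then you wait for indexing to finish. As I recall, indexing took about 1-2 days, which is acceptable, and the resulting index is about 20 GB. As far as I can tell the index is not updated automatically and has to be refreshed by hand. recoll automatically indexes files inside compressed archives, which is convenient.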
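To make the "index first, skip some directories" idea concrete, here is a minimal Python sketch of a toy indexer. It is not how recoll is implemented (recoll is a C++ program with its own index format); the function name `build_index`, the `skipped_dirs` parameter and the plain-text-only restriction are assumptions made purely for illustration.

```python
import os
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"\w+")


def build_index(root, skipped_dirs=("Windows",)):
    """Walk `root`, skip excluded directory names, and map each word
    to the set of files that contain it (a toy inverted index)."""
    skipped = {name.lower() for name in skipped_dirs}
    index = defaultdict(set)
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune excluded directories in place so os.walk never descends into them.
        dirnames[:] = [d for d in dirnames if d.lower() not in skipped]
        for name in filenames:
            if not name.lower().endswith(".txt"):
                continue  # toy example: plain text files only
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as fh:
                    text = fh.read()
            except OSError:
                continue  # unreadable file: skip it rather than abort the run
            for word in TOKEN_RE.findall(text.lower()):
                index[word].add(path)
    return index
```

Calling `build_index(r"D:\docs")` would return a dictionary-like object in which `index["recoll"]` is the set of text files containing that word. A real indexer also stores positions, handles many file formats and persists the index to disk, which is where the long indexing time and the large index come from.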


Searching is fast. Results can be paged, and each one is shown as a snippet plus a link. recoll's documentation is also very thorough: https://www.lesbonscomptes.com/recoll/usermanual/
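The "snippet plus link" display can be sketched in a few lines of Python. This only illustrates the idea of showing some context around the first match; it is not recoll's actual snippet algorithm, and the name `make_snippet` and the `width` parameter are hypothetical.

```python
def make_snippet(text, term, width=30):
    """Return up to `width` characters of context on each side of the
    first case-insensitive match of `term`, or None if there is no match."""
    pos = text.lower().find(term.lower())
    if pos < 0:
        return None
    start = max(0, pos - width)
    end = min(len(text), pos + len(term) + width)
    # Mark truncation so the reader can tell the snippet is an excerpt.
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix
```

A result list would then print the snippet followed by the path of the matching file, so you can judge a hit without opening the document.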

This search engine instantly turned up a lot of long-buried information for me. Highly recommended.

2. AnyTxt

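AnyTxt is freeware (https://anytxt.net/) and is not open source. The author appears to be Chinese; the project dates from around May 23, 2019 and is known to be actively updated. The latest version is 1.2.201. AnyTxt's GUI is also built on Qt. It indexes automatically in the background, possibly with a strategy of indexing only common file types. (Directories can also be excluded from indexing, but the setting is made per file extension rather than globally, which is less convenient than recoll.)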


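Search results do not show snippets, which is not very friendly. The index is stored in an SQLite database, and as a consequence results are displayed very slowly, with a noticeable wait for sorting; it cannot match recoll here. Because it uses a database, the index may be relatively smaller: I checked and it was about 2 GB. That is not necessarily a point in its favor, though, since some data was not indexed at all.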
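I have not seen AnyTxt's database schema, so the following is only a guess at what "an index stored in SQLite" could look like, written with Python's standard `sqlite3` module. The table name `postings`, its columns, and the functions `build_db` and `search` are all hypothetical. The point is that ranking happens in an `ORDER BY` over the matching rows, a sorting step that takes longer as the number of hits grows.

```python
import re
import sqlite3


def build_db(docs):
    """Build a toy inverted index in an in-memory SQLite database.
    `docs` maps a file path to its text."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE postings (term TEXT, path TEXT, hits INTEGER)")
    conn.execute("CREATE INDEX idx_term ON postings (term)")
    for path, text in docs.items():
        counts = {}
        for word in re.findall(r"\w+", text.lower()):
            counts[word] = counts.get(word, 0) + 1
        conn.executemany(
            "INSERT INTO postings VALUES (?, ?, ?)",
            [(word, path, n) for word, n in counts.items()],
        )
    conn.commit()
    return conn


def search(conn, term):
    """Return (path, hits) rows for `term`, most frequent first."""
    rows = conn.execute(
        "SELECT path, hits FROM postings WHERE term = ? "
        "ORDER BY hits DESC, path",
        (term.lower(),),
    )
    return rows.fetchall()
```

For example, with `conn = build_db({"a.txt": "recoll recoll index", "b.txt": "recoll anytxt"})`, the call `search(conn, "recoll")` lists `a.txt` before `b.txt` because it contains the word twice.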


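If you like simple, no-configuration applications in the style of Everything, AnyTxt is the better choice. If you prefer a system with more elaborate settings, support for more document types, and better ranking, and you also have plenty of storage space, I recommend recoll.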

 

Related introductions and download links

https://www.lesbonscomptes.com/recoll/index.html

https://www.lesbonscomptes.com/recoll/pages/recoll-windows.html

https://gitter.im/anytxt/community

https://anytxt.net/

https://sourceforge.net/projects/anytxt/

 

(1) Document types supported by recoll

File types indexed natively
  • text
  • html
  • maildir, mh, and mailbox (Mozilla, Thunderbird and Evolution mail ok). Evolution note: be sure to remove .cache from the skippedNames list in the GUI Indexing preferences/Local Parameters pane if you want to index local copies of Imap mail. Outlook archives are processed with an external helper, see further.
  • gaim and purple log files.
  • Scribus files.
  • Man pages (needs groff).
  • Mimehtml web archive format (this is based on the mail filter, which introduces some mild weirdness, but is still usable).

All the following need Python3 (or Python2 for older Recoll versions):

  • Dia diagrams.
  • Excel and Powerpoint files (pre-open-xml).
  • Tar archives. Tar file indexing is disabled by default (because tar archives don’t typically contain the kind of documents that people search for), you will need to enable it explicitly, e.g., with the following in your $HOME/.recoll/mimeconf file:
[index]
application/x-tar = execm rcltar
  • Zip archives.
  • Konqueror webarchive format (uses the tarfile Python standard library module).
File types indexed with external helpers
The XML ones

Recoll 1.26 and later process XML internally, by using the libxml2 and libxslt C++ libraries. Quite a few formats also need the unzip command.

Recoll 1.25 used python3-lxml. Versions from 1.22 to 1.24 used python-libxslt and python-libxml2. Versions older than 1.22 needed the xsltproc command.

  • OpenOffice files.
  • Microsoft Office Open XML files.
  • Abiword files.
  • Kword files.
  • Fb2 ebooks.
  • SVG files.
  • Gnumeric files.
  • Okular annotations files.
Other formats

The following need miscellaneous helper programs to extract the document text.

  • PDF needs the pdftotext command, which comes with poppler. The package name is quite often poppler-utils. Note: the older pdftotext command which comes with xpdf is not compatible with Recoll. PDF has its own section further, with details about OCR support and opening documents at the right page.
  • Microsoft Word is processed with antiword, which is not maintained much, but keeps working. I maintain a very slightly improved antiword version, it can extract a little extra data in some cases. It is also useful to have wvWare installed as Recoll can use it as a fallback for some files which antiword does not handle.
  • RTF files with unrtf. Note that up to version 0.21.3, unrtf mostly does not work with non western-european character sets. Many serious problems (crashes with serious security implications and infinite loops) were fixed in unrtf 0.21.8, so you really want to use this or a newer release. Building unrtf from source is quick and easy.
  • CHM (Microsoft help) files with Python pychm and chmlib. Recoll 1.25 and later bundle a Python3 version of the CHM package, (this is necessary because the original package was not ported to Python3).
  • EPUB files with Python and the epub module, which is packaged on Fedora, but not Debian. The packaged version by the original author (0.5.2) is old and suffers from a lot of bitrot, so Recoll now bundles an unpackaged version, updated by Arthur Darcet.
  • Microsoft Outlook .pst and .ost files are processed with libpff. We use a slightly modified version (to provide streaming output), stored in this repository
  • Hancom office Hanword .hwp format for Korean text processing, using the pyhwp Python module. See the module page. Use pip3 install pyhwp to install on Linux. This will be bundled with Recoll Windows future versions (1.26.6 and later). On Debian, you also probably want to install the fonts-nanum package, which is not part of the default install.
  • Wordperfect is processed with the wpd2html command from libwpd package. On some distributions, the command may come with a package named libwpd-tools or such, not the base libwpd package.
  • djvu with DjVuLibre.
  • Audio: Recoll releases 1.14 and later use a Python script based on the mutagen package to extract tags for all audio types.
  • Images tags are extracted with perl and exiftool.
  • GNU info files are processed with Python and the info command.
  • Lyx files need Lyx to be installed.
  • Rar archives with the Python rarfile module and the unrar utility. The Python module is packaged as python3-rarfile by both Fedora and Debian. Note that the free version of unrar (unrar-free) fails for many files with the message “Failed the read enough data”.
  • 7zip archives with the pylzma module.
  • iCalendar(.ics) files with the icalendar module.
  • Mozilla calendar data. See the Howto about this.
  • Postscript with the ghostscript, ps2pdf command, and pdftotext from poppler.
  • TeX with untex. If there is no untex package for your distribution, this site stores a source package, as untex has no obvious home. Will also work with detex if this is installed.
  • DVI with catdvi.
  • Midi karaoke files are processed with the Python midi module, and some help from chardet. There is probably a python-chardet package for your distribution, but you will quite probably need to build the midi package. This is easy but see the notes here. Recoll 1.24 and later bundle the midi decoding module (modified and ported to python3), and just need the standard Python ‘six’ module and chardet.
  • MediaWiki dump files: Thomas Levine has written a handler for these, you will find it here: rclmwdump.

 

(2) Types supported by AnyTxt

Formats Supported
  • Plain Text Format (txt, cpp, html, etc.)
  • Microsoft Outlook (eml)
  • Microsoft Word (doc, docx)
  • Microsoft Excel (xls, xlsx)
  • Microsoft PowerPoint (ppt, pptx)
  • Portable Document Format (pdf) (beta)
  • More Document Types are coming
More Features
  • Microsoft Office (doc, xls, ppt) Supported
  • Microsoft Office 2007 (docx, xlsx, pptx, docm, xlsm, docm) Supported
  • PDF Supported(Beta)
  • Non-English document Supported
  • Full Text Search
  • Real Time Search (Beta)
  • SSD Optimization
  • Fast Index
  • Fast Search
