[Hacker News repost] RAGFlow is an open-source RAG engine based on OCR and document parsing

Title: RAGFlow is an open-source RAG engine based on OCR and document parsing

Text:

Url: https://github.com/infiniflow/ragflow

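RAGFlow is an open-source RAG (Retrieval-Augmented Generation) engine based on deep document understanding. It offers a streamlined RAG workflow for businesses of any scale, combining LLMs (Large Language Models) to provide truthful question answering, backed by well-founded citations from data in various complex formats.

"Quality in, quality out"
Knowledge extraction, based on deep document understanding, from unstructured data with complicated formats.
Finds the "needle" in a "data haystack" of unlimited tokens.
🍱 Template-based chunking
Intelligent and explainable.
Plenty of template options to choose from.
🌱 Grounded citations with reduced hallucinations
Visualization of text chunks to allow human intervention.
Quick view of the key references, with traceable citations to support grounded answers.
🍔 Support for heterogeneous data sources
Supports Word, slides, Excel, txt, images, scanned copies, structured data, web pages, and more.
🛀 Streamlined RAG workflow
Streamlined RAG orchestration tailored to both individuals and large businesses.
Configurable LLMs as well as embedding models.
Multiple recall paired with fused re-ranking.
Intuitive APIs designed for seamless integration with business systems.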
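The post does not explain how "multiple recall paired with fused re-ranking" works. As a rough illustration of the general idea only, the sketch below merges two ranked result lists with reciprocal rank fusion (RRF), a common fusion method. Whether RAGFlow actually uses RRF is an assumption that this post does not confirm, and the function name, the constant k=60, and the sample document IDs are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. NOTE: RRF is a stand-in technique chosen for
    illustration; it is not confirmed to be what RAGFlow does.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score (highest first); break ties by ID for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from two recall paths (e.g. keyword and vector search).
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, which is the intuition behind combining several recall paths before a final ranking.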
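CPU >= 2 cores
RAM >= 8 GB
Docker
If you have not installed Docker on your local machine (Windows, Mac, or Linux), see Install Docker Engine.
Ensure vm.max_map_count > 65535:
To check the value of vm.max_map_count:
$ sysctl vm.max_map_count
If it is not greater than 65535, set vm.max_map_count to a value greater than 65535:
# In this case, we set it to 262144:
$ sudo sysctl -w vm.max_map_count=262144
This change is reset after a system reboot. To make it permanent, add or update the vm.max_map_count value in /etc/sysctl.conf accordingly:
vm.max_map_count=262144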
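The vm.max_map_count check described above can also be scripted. The following is a minimal sketch, assuming a Linux host where the live value is exposed at /proc/sys/vm/max_map_count; the helper names are hypothetical and are not part of RAGFlow.

```python
from pathlib import Path

# The threshold the instructions above require the value to exceed.
REQUIRED_MINIMUM = 65535


def needs_raise(current, minimum=REQUIRED_MINIMUM):
    """Return True when vm.max_map_count is not greater than the minimum."""
    return current <= minimum


def read_max_map_count(path="/proc/sys/vm/max_map_count"):
    """Read the live value on Linux; return None if it cannot be read."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


current = read_max_map_count()
if current is None:
    print("vm.max_map_count is not readable here (not a Linux host?)")
elif needs_raise(current):
    print(f"vm.max_map_count={current} is too low; raise it above {REQUIRED_MINIMUM}")
else:
    print(f"vm.max_map_count={current} is sufficient")
```

The script only reports the state; changing the value still requires the sysctl command shown above.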
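Clone the repo:
$ git clone https://github.com/infiniflow/ragflow.git
Start up the server using the pre-built Docker images:
$ cd ragflow/docker
$ docker-compose up -d
The core image is about 15 GB in size and may take a while to load.
Check the server status once the server is up and running:
$ docker logs -f ragflow-server
The following output confirms a successful launch of the system: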
    ____                 ______ __
   / __ \ ____ _ ____ _ / ____// /____  _      __
  / /_/ // __ `// __ `// /_   / // __ \| | /| / /
 / _, _// /_/ // /_/ // __/  / // /_/ /| |/ |/ /
/_/ |_| \__,_/ \__, //_/    /_/ \____/ |__/|__/
              /____/
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:9380
 * Running on http://172.22.0.5:9380
 INFO:werkzeug:Press CTRL+C to quit
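If you want to confirm the launch from a script rather than by eye, one option is to scan the log text for the addresses the server reports. This is a minimal sketch that assumes the log has already been captured as a string (for example from the docker logs command above); the function name and the sample text are illustrative only.

```python
import re

# Matches http(s) URLs made of a host and an optional port, e.g. http://127.0.0.1:9380
URL_PATTERN = re.compile(r"https?://[\w.\-]+(?::\d+)?")


def served_urls(log_text):
    """Return every http(s) URL found in the server log, in order of appearance."""
    return URL_PATTERN.findall(log_text)


# Sample text modeled on the startup output shown above.
sample_log = """\
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:9380
 * Running on http://172.22.0.5:9380
"""
urls = served_urls(sample_log)
print(urls)
```

An empty result would suggest the server has not finished starting yet.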
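In your web browser, enter the IP address of your server as prompted and log in to RAGFlow.
In the given scenario, you only need to enter http://172.22.0.5 (without the port number), because the default HTTP serving port 80 can be omitted when using the default configuration.
In service_conf.yaml, select the default LLM factory for users and update the corresponding API_KEY field.
See ./docs/llm_api_key_setup.md for more information.
The show is now on!
For system configuration, you need to manage the following files:
.env: Keeps the fundamental settings for the system, such as SVR_HTTP_PORT, MYSQL_PASSWORD, and MINIO_PASSWORD.
service_conf.yaml: Configures the back-end services.
docker-compose.yml: The system relies on docker-compose.yml to start up.
You must ensure that changes to the .env file are consistent with the contents of the service_conf.yaml file.
The ./docker/README file describes the environment settings and service configurations in detail, and you must ensure that every environment setting listed in ./docker/README matches the corresponding configuration in service_conf.yaml.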
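Keeping .env and service_conf.yaml in sync by hand is error-prone, so a small consistency check may help. The sketch below is only an illustration: it assumes service_conf.yaml has already been loaded into a dictionary (for example with a YAML library), and the key paths such as ("mysql", "password") are hypothetical, since this post does not show the actual layout of the file.

```python
def parse_env(text):
    """Parse KEY=VALUE lines from .env-style text, skipping blanks and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def find_mismatches(env, service_conf, mapping):
    """List (env_key, env_value, conf_value) tuples where the two files disagree.

    `mapping` pairs an .env key with a tuple path into the service_conf dict.
    """
    mismatches = []
    for env_key, path in mapping.items():
        conf_value = service_conf
        for part in path:
            conf_value = conf_value.get(part) if isinstance(conf_value, dict) else None
        if str(conf_value) != env.get(env_key):
            mismatches.append((env_key, env.get(env_key), conf_value))
    return mismatches


# Hypothetical contents; the real files and their key layout may differ.
env = parse_env("MYSQL_PASSWORD=secret1\nMINIO_PASSWORD=secret2\n")
service_conf = {"mysql": {"password": "secret1"}, "minio": {"password": "changed"}}
mapping = {
    "MYSQL_PASSWORD": ("mysql", "password"),
    "MINIO_PASSWORD": ("minio", "password"),
}
mismatches = find_mismatches(env, service_conf, mapping)
print(mismatches)
```

Any tuple in the result points to a setting that was changed in one file but not the other.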
To update the default HTTP serving port (80), go to docker-compose.yml and change 80:80 to

Post by: marban

Comments:

bgun: I know this is just an open source project but it’s a good example of why you might want to consult a woman before naming things.

mpeg: Took me some time to figure out how to run it, but the layout recogniser model hosted on huggingface is pretty good! It correctly identifies tables that even paid models like the AWS Textract Document Analysis API fail to, for instance tables with one column, which often confuse AWS even if they have a clear header and are labelled "Table" in the text. I would however love to know broadly what kind of documents it was trained on, as my results could be pure luck; hard to say without a proper benchmark. Very nice layout recognition, although I can't quite comment on the RAG performance itself. I think some of the architecture decisions are odd: it mixes a bunch of different PDF parsers, for example, which will all result in different quality, and it's not clear to me which one it defaults to, as it seems to be different in different places in the code (the simple parser defaults to pypdf2, which is not a great option).

constantinum: Document processing is getting better and better with new tools leveraging LLMs.
If anyone is interested in exploring this space, try another similar tool, LLMWhisperer (https://llmwhisperer.unstract.com/). It is a part of Unstract, an open-source document processing tool (https://github.com/Zipstack/unstract).

gardenfelder: It seems to be limited to certain LLM servers, one of which is OpenAI, and the list does not include e.g. Mistral or popular OSS LLMs. I wonder if that will change, eventually. Discord channels are named in Chinese, though there are English posts.

zzleeper: I'm partly sad at the approach this and other engines take: reimplement each part (PDF parser, etc.) in a way where they are pretty much useless except in their specific engine. If instead we had a PDF() class that did what RAGFlow is doing (dealing with all the different trade-offs of the different Python PDF engines such as pdfplumber), then we could easily adapt it and improve it, and it could be useful for other projects as well.

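As a rough illustration of the reusable PDF() class that zzleeper's comment asks for, the sketch below shows one possible shape: a small facade with pluggable backends. It is not code from RAGFlow. The class layout, the registry, and the "fake" backend are all hypothetical, and the pdfplumber backend assumes that library is installed.

```python
class PDF:
    """Hypothetical facade that hides which parsing backend is in use."""

    # Registry of backend name -> callable(path) returning a list of page texts.
    _backends = {}

    @classmethod
    def register(cls, name, extractor):
        cls._backends[name] = extractor

    def __init__(self, path, backend):
        if backend not in self._backends:
            raise ValueError(f"unknown backend: {backend!r}")
        self.path = path
        self.backend = backend

    def pages(self):
        """Return the text of each page using the selected backend."""
        return self._backends[self.backend](self.path)


def extract_with_pdfplumber(path):
    # Imported lazily so the facade itself works without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages]


PDF.register("pdfplumber", extract_with_pdfplumber)

# A stand-in backend so the sketch runs without any PDF library or file.
PDF.register("fake", lambda path: [f"text of {path}, page 1"])
print(PDF("report.pdf", backend="fake").pages())
```

Other projects could register their own extractors against the same interface, which is the kind of reuse the comment is after.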