[Hacker News repost] Show HN: FastGraphRAG – Better RAG using good old PageRank
-
Title: Show HN: FastGraphRAG – Better RAG using good old PageRank
Text: Hey there HN! We're Antonio, Luca, and Yuhang, and we're excited to introduce Fast GraphRAG, an open-source RAG approach that leverages knowledge graphs and the 25-year-old PageRank algorithm for better information retrieval and reasoning.

Building a good RAG pipeline these days takes a lot of manual optimization. Most engineers intuitively start from naive RAG: throw everything in a vector database and hope that semantic search is powerful enough. This can work for use cases where accuracy isn't too important and hallucinations are tolerable, but it doesn't work for more difficult queries that involve multi-hop reasoning or more advanced domain understanding. It is also nearly impossible to debug.

To address these limitations, many engineers find themselves adding extra layers like agent-based preprocessing, custom embeddings, reranking mechanisms, and hybrid search strategies. Much like the early days of machine learning, when we manually crafted feature vectors to squeeze out marginal gains, building an effective RAG system often becomes an exercise in crafting engineering "hacks."

Earlier this year, Microsoft seeded the idea of using knowledge graphs for RAG and published GraphRAG, i.e. RAG with knowledge graphs. We believe there is incredible potential in this idea, but existing implementations are naive in the way they create and explore the graph. That's why we developed Fast GraphRAG with a new algorithmic approach using good old PageRank.

There are two main challenges when building a reliable RAG system:

(1) Data noise: Real-world data is often messy. Customer support tickets, chat logs, and other conversational data can include a lot of irrelevant information. If you push noisy data into a vector database, you're likely to get noisy results.

(2) Domain specialization: For complex use cases, a RAG system must understand the domain-specific context. This requires creating representations that capture not just the words but the deeper relationships and structures within the data.

Our solution builds on these insights by incorporating knowledge graphs into the RAG pipeline. Knowledge graphs store entities and their relationships, and can help structure data in a way that enables more accurate and context-aware information retrieval. 12 years ago Google announced the Knowledge Graph we all know about [1]. It was a pioneering move. Now we have LLMs, meaning that people can finally do RAG on their own data with tools that can be as powerful as Google's original idea.

Before we built this, Antonio was at Amazon, while Luca and Yuhang were finishing their PhDs at Oxford. We had been thinking about this problem for years, and we always loved the parallel between PageRank and human memory [2]. We believe that searching for memories is incredibly similar to searching the web.

Here's how it works:

- Entity and relationship extraction: Fast GraphRAG uses LLMs to extract entities and their relationships from your data and stores them in a graph format [3].

- Query processing: When you make a query, Fast GraphRAG starts by finding the most relevant entities using vector search, then runs a personalized PageRank algorithm to determine the most important "memories" or pieces of information related to the query [4].

- Incremental updates: Unlike other graph-based RAG systems, Fast GraphRAG natively supports incremental data insertions. This means you can continuously add new data without reprocessing the entire graph.

- Faster: These design choices make our algorithm faster and more affordable to run than other graph-based RAG systems, because we eliminate the need for communities and clustering.

Suppose you're analyzing a book and want to focus on character interactions, locations, and significant events:

from fast_graphrag import GraphRAG

DOMAIN = "Analyze this story and identify the characters. Focus on how they interact with each other, the locations they explore, and their relationships."

EXAMPLE_QUERIES = [
    "What is the significance of Christmas Eve in A Christmas Carol?",
    "How does the setting of Victorian London contribute to the story's themes?",
    "Describe the chain of events that leads to Scrooge's transformation.",
    "How does Dickens use the different spirits (Past, Present, and Future) to guide Scrooge?",
    "Why does Dickens choose to divide the story into 'staves' rather than chapters?",
]

ENTITY_TYPES = ["Character", "Animal", "Place", "Object", "Activity", "Event"]

grag = GraphRAG(
    working_dir="./book_example",
    domain=DOMAIN,
    example_queries="\n".join(EXAMPLE_QUERIES),
    entity_types=ENTITY_TYPES,
)

with open("./book.txt") as f:
    grag.insert(f.read())

print(grag.query("Who is Scrooge?").response)
This code creates a domain-specific knowledge graph based on your data, example queries, and specified entity types. Then you can query it in plain English while it automatically handles all the data fetching, entity extraction, co-reference resolution, memory election, and so on. When you add new data, locking and checkpointing are handled for you as well.

This is the kind of infrastructure that GenAI apps need to handle large-scale real-world data. Our goal is to give you this infrastructure so that you can focus on what's important: building great apps for your users without having to manually engineer a retrieval pipeline. In the managed service, we also have a suite of UI tools for you to explore and debug your knowledge graph.

We have a free hosted solution with up to 100 monthly requests. When you're ready to grow, we have paid plans that scale with you. And of course you can self-host our open-source engine.

Give us a spin today at https://circlemind.co and see our code at https://github.com/circlemind-ai/fast-graphrag

We'd love feedback :)

[1] https://blog.google/products/search/introducing-knowledge-graph-things-not/

[2] Griffiths, T. L., Steyvers, M., & Firl, A. (2007). Google and the Mind: Predicting Fluency with PageRank. Psychological Science, 18(12), 1069–1076. http://www.jstor.org/stable/40064705

[3] Similarly to Microsoft's GraphRAG: https://github.com/microsoft/graphrag

[4] Similarly to OSU's HippoRAG: https://github.com/OSU-NLP-Group/HippoRAG

https://vhs.charm.sh/vhs-4fCicgsbsc7UX0pemOcsMp.gif
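For readers curious about the query step described above (seed the most relevant entities via vector search, then rank the rest of the graph), here is a minimal personalized-PageRank sketch in plain Python. This is illustrative only, a toy graph and a power-iteration loop under assumed names, not Fast GraphRAG's actual implementation:

```python
# Minimal personalized PageRank by power iteration (illustrative sketch).
# `graph` is an adjacency dict; `seeds` are the entities matched by
# vector search for the query, which receive all of the restart mass.

def personalized_pagerank(graph, seeds, damping=0.85, iters=50):
    nodes = list(graph)
    # Restart distribution concentrated on the query's seed entities.
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: (1 - damping) * restart[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if not out:
                continue
            # Each node spreads its damped rank evenly over its out-edges.
            share = damping * rank[n] / len(out)
            for m in out:
                nxt[m] += share
        rank = nxt
    return rank

# Toy knowledge graph: Scrooge is linked to most entities, so he should
# dominate when the query seeds on "scrooge".
g = {
    "scrooge": ["marley", "ghost_past", "cratchit"],
    "marley": ["scrooge"],
    "ghost_past": ["scrooge"],
    "cratchit": ["scrooge", "tiny_tim"],
    "tiny_tim": ["cratchit"],
}
scores = personalized_pagerank(g, seeds={"scrooge"})
top = max(scores, key=scores.get)
```

The returned distribution ranks every node by its relevance to the seeds, which is what lets graph neighborhoods surface "memories" that pure vector search would miss.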
Url: https://github.com/circlemind-ai/fast-graphrag
Post by: liukidar
Comments:
LASR: So I've done a ton of work in this area. A few learnings I've collected:

1. Lexical search with BM25 alone gives you very relevant results if you can do some work during ingestion time with an LLM.

2. Embeddings work well only when the size of the query is roughly on the same order as what you're actually storing in the embedding store.

3. Hypothetical answer generation from a query using an LLM, and then using that hypothetical answer to query for embeddings, works really well.

So combining all 3 learnings, we landed on a knowledge decomposition and extraction step very similar to yours. But we stick a metaprompter in to essentially auto-generate the domain / entity types.

LLMs are naively bad at identifying the correct level of granularity for the decomposed knowledge. One trick we found is to ask the LLM to output a mermaid.js mindmap to hierarchically break down the input into a tree. At the end of that output, ask the LLM to state which level is the appropriate root for a knowledge node.

Then the node is used to generate questions that could be answered from the knowledge contained in this node. We then index the text of these questions and also embed them.

You can directly match the user's query against these questions using purely BM25 and get good outputs. But a hybrid approach works even better, though not by that much.

Not using LLMs at query time also means we can hierarchically walk down from the root into deeper and deeper nodes, using the embedding similarity as a cost function for the traversal.
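To make learning #1 concrete, here is a minimal self-contained sketch of BM25 (Okapi) scoring. A real system would use a proper engine (Lucene, Elasticsearch, etc.); the documents and query below are made-up examples:

```python
import math

# Minimal BM25 (Okapi) scorer: term-frequency saturation (k1) plus
# document-length normalization (b), weighted by inverse document frequency.

def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    # Document frequency per term.
    df = {}
    for d in tokenized:
        for t in set(d):
            df[t] = df.get(t, 0) + 1
    scores = []
    for d in tokenized:
        s = 0.0
        for t in query.lower().split():
            if t not in df:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            tf = d.count(t)
            s += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "scrooge meets the ghost of christmas past",
    "bob cratchit carves the christmas goose",
    "london fog settles over the counting house",
]
scores = bm25_scores("ghost of christmas past", docs)
best = scores.index(max(scores))
```

In the scheme described above, the "documents" would be the LLM-generated questions per knowledge node rather than raw text chunks.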
jillesvangurp: Cool idea. IMHO traditional information retrieval is the way to go with RAG. Vector search is nice but also slow and expensive, and people seem to use it as magic pixie dust. It works nicely for unstructured data but not necessarily that well for structured data.

And unless tuned very well, vector search is not actually a whole lot better than a good old well-tuned query. Putting everything together, the practice of turning structured data into unstructured data just so you can do vector search or prompt engineering on it, which I've seen teams do, feels a bit backwards. It kind of works, but there are probably smarter ways to get the same results. Graph RAG is essentially about making use of the structure of data. Whether that's through SQL joins or by querying some graph database doesn't really matter much.

There is probably some value in teaching LLMs how to query as well, or letting them interface with existing search/query APIs. And you can compensate for poor ranking with larger context sizes and simply fetch a few hundred or even more results with multiple queries. Scaling that is going to be a lot faster and cheaper than scaling vector search.
michelpp: PageRank is one of several interesting centrality metrics that could be applied to a graph to influence RAG on structural data. Another one is Triangle Centrality, which counts triangles around nodes to figure out their centrality, based on the concept that triangles close relationships into a strong bond, whereas open bonds dilute centrality by drawing weight away from the center:

https://arxiv.org/abs/2105.00110

The paper shows high efficiency compared to other centralities like PageRank. However, in some research using the GraphBLAS, my coauthors and I found that TC was slower on a variety of sparse graphs than our sparse formulation of PR for graphs up to 1.8 billion edges, but TC appears to scale better as graphs get larger and is likely more efficient in the trillion-edge realm.

https://fossies.org/linux/SuiteSparse/GraphBLAS/Doc/The_GraphBLAS_in_Julia_and_Python_the_PageRank_and_Triangle_Centralities.pdf
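To illustrate the triangle-counting idea, here is a naive sketch that counts the triangles each node participates in as a rough centrality proxy. Note this is not the paper's full Triangle Centrality formula (which also weights neighbors' triangle participation), and the graph is a made-up example:

```python
from itertools import combinations

# Count, for each node, how many triangles it belongs to. `adj` is an
# undirected graph given as a symmetric adjacency dict of sets.

def triangle_counts(adj):
    counts = {v: 0 for v in adj}
    for v, nbrs in adj.items():
        # A triangle through v exists for every pair of v's neighbors
        # that are themselves adjacent.
        for a, b in combinations(sorted(nbrs), 2):
            if b in adj[a]:
                counts[v] += 1
    return counts

# a-b-c form a closed triangle; d hangs off b in an open wedge,
# so d contributes no triangle and gets no centrality.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b"},
    "d": {"b"},
}
tc = triangle_counts(adj)
```

This O(sum of deg^2) loop is fine for a sketch; the GraphBLAS formulations referenced above do the equivalent work as sparse matrix products.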
AIorNot: This is very cool, I signed up and uploaded a few docs (PDFs) to the dashboard.

Our use case: we have been looking at farming out this work (analyzing compliance documents, i.e. manufacturing paperwork) for our AI startup, but we need to understand the potential scale this can operate under and the cost model for it to be useful to us.

We will have about 300K PDF documents per client and expect about a 10% change in that document set, month to month. Any GraphRAG system has to handle documents at scale. We can use S3 as an ingestion mechanism but have to understand the cost and processing time needed for the system to be ready to use during:

1. initial loading
2. regular updates - how do we delete data from the system, for example?

Cool framework btw.
bionhoward: Since when does "good old PageRank" demand an OpenAI API key?

"You may not: use Output to develop models that compete with OpenAI" => they're gonna learn from you and you can't learn from them.

Glad we're all so cool with the long-term economic downfall of natural humans. Our grandkids might not be so glad about it!