【Hacker News Repost】More Agents Is All You Need: LLM performance scales with the number of agents
-
Title: More Agents Is All You Need: LLMs performance scales with the number of agents
Text:
Url: https://arxiv.org/abs/2402.05120
Title: More Agents Is All You Need
Submitted: 3 Feb 2024
Publication date: not provided
Top image link: none
Text: Via a simple sampling-and-voting method, we find that the performance of large language models (LLMs) scales with the number of agents instantiated. This method is orthogonal to existing, more complicated methods that further enhance LLMs, and the degree of enhancement correlates with task difficulty. We conduct comprehensive experiments on a wide range of LLM benchmarks to verify our finding and to study the properties that facilitate its occurrence. Our code is publicly available at: https://anonymous.4open.science/r/more_agent_is_all_you_need
Submission history: from Deheng Ye [view email] [v1] Sat, 3 Feb 2024 05:55:24 UTC (2,521 KB)
Post by: TaurenHunter
Comments:
phire: I'm not sure people in these comments are reading this paper correctly.

This seems to essentially disprove the whole idea of multi-agent setups like Chain-of-Thought and LLM-Debate. Because this paper introduces their alternative method, which simply runs the same query multiple times on the same LLM, without any context shared across queries. And then they run a similarity algorithm on the answers and pick the most common answer. (Which makes sense to me. If an LLM is giving you a mixture of "hallucinations" and correct answers, the correct answers will be similar to each other and the hallucinations will hopefully be chaotic.)

And this simple algorithm performs just as well as (and sometimes better than) all the other multi-agent algorithms.

This suggests that the other multi-agent schemes with their clever prompts aren't really doing anything special; their improved results come mostly from the fact that the LLM is run multiple times, and that the prompt asks the LLM to pick the best answer.
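To make the mechanism concrete, here is a minimal sketch of the sampling-and-voting idea phire describes, assuming a hypothetical query_llm function that returns one sampled answer per call. The paper's actual implementation differs in detail (for free-form tasks it reportedly picks the answer most similar to the others, e.g. via BLEU, rather than an exact-match vote), so treat this as an illustration rather than a reproduction:

```python
from collections import Counter

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for one sampled LLM call (temperature > 0)."""
    raise NotImplementedError

def sample_and_vote(prompt: str, n_agents: int = 10) -> str:
    """Run the same query n_agents times with no shared context,
    then return the most common answer.

    Exact-match voting is the simplest case; for free-form answers
    one would instead pick the answer with the highest total
    similarity to all the others.
    """
    answers = [query_llm(prompt) for _ in range(n_agents)]
    # Correct answers tend to cluster, while hallucinations scatter,
    # so the most frequent answer is usually the reliable one.
    return Counter(answers).most_common(1)[0][0]
```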
infogulch: This seems related to an interesting recent ACM ByteCast podcast episode with Edward Chang, an Adjunct Professor in the Department of Computer Science at Stanford University. [1] (Note there is a transcript if you don't want to listen.)

The approach he uses is to arrange for multiple LLMs to dialogue with each other about a discussion topic, where the human acts as a moderator instead of the question/answer format that LLMs commonly take today. They find that the final answer that multiple LLMs come to in dialogue results in a huge improvement in both precision and accuracy for the same resources.

[1]: https://learning.acm.org/bytecast/ep50-edward-y-chang
kromem: Finally. I've been saying that we need to stop focusing on a single agent getting everything right and instead layer agents for about 16 months now, but it's great to have a paper to point to.

It's interesting that the diminishing returns for tasks flatten out rapidly around the same size as the ideal human meeting sizes: https://www.researchgate.net/figure/18-Optimal-Meeting-Sizes-1_tbl6_3889414

If this was done at more granular steps of agent quantity, I'm curious just how closely it would match those numbers.

I'd also really love to see the eventual follow-up where we see how much more performance can be obtained when the agents are each fine-tuned towards slightly different aims. I'd expect there'd even be a performance lift from just having the agents each set at different temperature levels.

Very happy to see the research community starting to step in this direction!
nicklecompte: One frustration I've had with all this mixture-of-experts research:

Randomized Algorithms 101 - or basic stochastic reasoning - suggests that if the temperature parameter is > 0, querying an LLM N times and picking the majority result (perhaps with an N+1th query to the LLM) will generally result in better performance than asking it once and choosing that result.

It seems plausible to me that the gains can be further improved with a specialized mixture of different LLMs (which could then be run at temp = 0), or by finding better ways to break tasks into subtasks as this paper suggests. But AFAICT nobody has done anything to actually quantify these hypothetical gains versus the dumb randomized algorithm approach! In particular there might be voting strategies or mixtures - even specific models - where MoE/etc is strictly worse than naive repetition.

I am a concerned citizen w.r.t. LLMs rather than a researcher, so I might be missing something. It just seems odd that LLM researchers forgot the first chapter of Motwani/Raghavan.
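The amplification nicklecompte points to is easy to quantify under idealized assumptions: if each independent query is correct with probability p > 0.5, the chance that a majority of N queries is correct is a binomial tail that climbs toward 1 as N grows. A minimal sketch (assuming i.i.d. queries, which real LLM samples only approximate):

```python
from math import comb

def majority_correct_prob(p: float, n: int) -> float:
    """Probability that a strict majority of n independent queries
    is correct, when each query is correct with probability p
    (use odd n to avoid ties)."""
    k_min = n // 2 + 1  # smallest count of correct answers that wins the vote
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# A sampler that is right 70% of the time, majority-voted:
for n in (1, 5, 15):
    print(n, round(majority_correct_prob(0.7, n), 4))
# -> 0.7 at n=1, ~0.837 at n=5, ~0.95 at n=15
```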
trash_cat: Is this not an incredibly expensive/unsustainable method? I agree with the sentiment that MoE is the way to go as the newer models will probably see diminishing returns. But the compute for a single prompt will suddenly increase 7-15 fold?