[Hacker News Repost] Mixture-of-Depths: Dynamically allocating compute in transformers
-
Title: Mixture-of-Depths: Dynamically allocating compute in transformers
Text:
Url: https://arxiv.org/abs/2404.02258
Title: Mixture-of-Depths: Dynamically allocating compute in transformer-based language models

Abstract: Transformer-based language models spread FLOPs uniformly across input sequences. In this paper we show that transformers can instead learn to dynamically allocate FLOPs (compute) to specific positions in a sequence, optimising the allocation along the sequence for different layers across the model depth. Our method enforces a total compute budget by capping the number of tokens ($k$) that can participate in the self-attention and MLP computations at a given layer. The tokens to be processed are determined by the network using a top-$k$ routing mechanism. Since $k$ is defined ahead of time, this simple procedure uses a static computation graph with known tensor sizes, unlike other conditional-computation techniques. Nevertheless, because the identities of the $k$ tokens are fluid, the method can spend FLOPs non-uniformly across the time and model-depth dimensions. Total compute expenditure is therefore entirely predictable, but dynamic and context-sensitive at the token level. Models trained this way not only learn to allocate compute dynamically, they do so efficiently: they match the performance of baselines trained with equivalent FLOPs and wall-clock time, while requiring fewer FLOPs per forward pass and stepping upwards of 50% faster during post-training sampling.

Submission history: From Adam Santoro [view email] [v1] Tue, 2 Apr 2024 19:28:11 UTC (1,763 KB)
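The routing scheme in the abstract is simple enough to sketch. Below is a minimal, illustrative PyTorch version of a single Mixture-of-Depths block, written from the abstract alone: the per-token scalar router, the name `MoDBlock`, the sigmoid weighting of the block output, and the omission of causal masking are my assumptions, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Illustrative Mixture-of-Depths block: only the top-k tokens (by router
    score) pass through attention + MLP; the rest ride the residual stream."""

    def __init__(self, d_model: int, n_heads: int, capacity: int):
        super().__init__()
        self.capacity = capacity                     # k, fixed ahead of time
        self.router = nn.Linear(d_model, 1)          # scalar score per token
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        k = min(self.capacity, T)
        scores = self.router(x).squeeze(-1)          # (B, T)
        topk = scores.topk(k, dim=-1).indices
        topk, _ = topk.sort(dim=-1)                  # keep selected tokens in order
        idx = topk.unsqueeze(-1).expand(-1, -1, D)
        sel = x.gather(1, idx)                       # (B, k, D) -- static shape

        h = self.norm1(sel)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        h = sel + attn_out
        h = h + self.mlp(self.norm2(h))

        # Scale the block's contribution by the router score so the router
        # receives gradients; unselected tokens are left untouched (a pure skip).
        w = torch.sigmoid(scores.gather(1, topk)).unsqueeze(-1)
        out = x.clone()
        out.scatter_(1, idx, sel + w * (h - sel))
        return out

# Usage: only 16 of the 128 tokens per sequence receive attention/MLP compute.
block = MoDBlock(d_model=64, n_heads=4, capacity=16)
y = block(torch.randn(2, 128, 64))
```

Because $k$ is fixed in advance, the gathered tensor always has shape (B, k, D), which is what the abstract means by a static computation graph with known tensor sizes.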
Post by: milliondreams
Comments:
whimsicalism: I think more complicated routing is absolutely going to become more common.

Specifically, I think at some point we are going to move to recursive routing, i.e. passing back through a set of experts again. In the future, 'chain-of-thought' will happen internally to the model, recursively.
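Neither the paper nor the comment spells out a design, but as a rough sketch of what per-token "recursive routing" could look like (a shared expert block that tokens may re-enter, in the spirit of Universal Transformer / adaptive-computation-time ideas; every name here is hypothetical):

```python
import torch
import torch.nn as nn

class RecurrentExpertStack(nn.Module):
    """Hypothetical recursive routing: tokens may re-enter a shared expert block
    several times, with a halting head deciding per token when to stop."""

    def __init__(self, d_model: int, max_passes: int = 4):
        super().__init__()
        self.expert = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model),
        )
        self.halt = nn.Linear(d_model, 1)            # per-token "stop here" score
        self.max_passes = max_passes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        active = torch.ones(x.shape[:-1], dtype=torch.bool, device=x.device)
        for _ in range(self.max_passes):
            if not active.any():
                break
            delta = self.expert(x)                   # same weights on every pass
            x = torch.where(active.unsqueeze(-1), x + delta, x)
            p_stop = torch.sigmoid(self.halt(x)).squeeze(-1)
            active = active & (p_stop < 0.5)         # these tokens want more compute
        return x
```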
panqueca: Simplified Intro Version:

Imagine you have a smart assistant that can understand and process the words you say to it. Usually, this assistant pays equal attention to every word you say, no matter how important or unimportant each word is to the overall meaning of your message.

Now, imagine that we found a way to teach the assistant to be smarter about how it uses its "brain power." Instead of giving equal attention to every word, the assistant learns to focus more on the words that are most important for understanding what you mean. It can even adjust this focus on the fly, paying more attention to different words depending on the context of your message.

To make sure the assistant doesn't get overwhelmed, we also set a limit on how much total "brain power" it can use at any given time. It's like giving the assistant a budget and saying, "You can only spend your brain power on a certain number of words at a time." The assistant then has to decide which words are most important to focus on.

Even with this limit, the assistant is still flexible in how it uses its brain power. It might spend more on certain words and less on others, depending on what you're saying. This means that while we always know the total amount of brain power the assistant is using, it can adapt to different situations and prioritize what's most important.

When we teach the assistant using this method, it not only learns to focus its attention intelligently but also does so very efficiently. It can understand you just as well as an assistant that pays equal attention to every word, but it uses less brain power overall. This makes the assistant much faster at responding to you and processing new information.
mattmcdonagh: I wrote up a bit about it here, from what I could piece together: https://lifeinthesingularity.com/p/googles-breakthroughs-in-ai-design
rughouse: It’s very similar to Mixture of Experts, but instead of routing tokens to multiple experts, you "deploy to a single expert which can be dynamically skipped".
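In code, that framing looks roughly like the following (deliberately naive: the MoE branch evaluates every expert densely for clarity, and the function and parameter names are made up rather than taken from either paper):

```python
import torch
import torch.nn as nn

def moe_layer(x, experts: nn.ModuleList, router: nn.Linear):
    """Top-1 MoE: every token is processed; the router only picks *which* expert."""
    choice = router(x).argmax(dim=-1)                        # (B, T) expert index
    out = torch.zeros_like(x)
    for i, expert in enumerate(experts):
        mask = (choice == i).unsqueeze(-1)
        out = torch.where(mask, expert(x), out)              # dense eval, for clarity
    return out

def mod_layer(x, expert: nn.Module, router: nn.Linear, capacity: int):
    """MoD: a single expert; tokens outside the top-k are skipped (identity)."""
    scores = router(x).squeeze(-1)                           # (B, T)
    keep = scores.topk(min(capacity, x.shape[1]), dim=-1).indices
    mask = torch.zeros_like(scores, dtype=torch.bool).scatter_(1, keep, True)
    return torch.where(mask.unsqueeze(-1), x + expert(x), x)
```

The practical difference: in MoE the router chooses among experts, so every token costs an expert evaluation; in MoD the router chooses between "spend compute" and "skip", which is how tokens end up with different effective depths.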
modeless: It's a start but it's disappointing that half the layers still have to process every token. It seems like we ought to be able to get to 90% or even 99% savings when these models currently allocate the same compute for outputting "the" as they do for outputting the first digit of the answer of a complicated math problem.
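A rough back-of-the-envelope for that ceiling, treating per-layer FLOPs as proportional to the number of tokens processed and ignoring attention's quadratic term (the 12.5% capacity below is just an illustrative value):

```python
def relative_flops(capacity: float, dense_layer_fraction: float = 0.5) -> float:
    """Approximate FLOPs relative to a dense model when `dense_layer_fraction`
    of the layers process every token and the rest process only `capacity`."""
    return dense_layer_fraction + (1.0 - dense_layer_fraction) * capacity

print(relative_flops(0.125))                           # ~0.56 -> about 44% savings
print(relative_flops(0.125, dense_layer_fraction=0))   # 0.125 if every layer routed
```

With every other layer dense, savings are capped well below the 90-99% the comment hopes for, no matter how aggressive the capacity on the routed layers.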