[Hacker News repost] DenseFormer: Enhancing Information Flow in Transformers

hackernews

Title: DenseFormer: Enhancing Information Flow in Transformers


Text:

Url: https://arxiv.org/abs/2402.02622

Title: DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging
Submitted: 4 Feb 2024 (v1); last revised 21 Mar 2024 (this version, v2)
Text:

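Abstract: The transformer architecture by Vaswani et al. (2017) is now ubiquitous across application domains, from natural language processing to speech processing and image understanding. We propose DenseFormer, a simple modification to the standard architecture that improves the perplexity of the model without increasing its size, adding only a few thousand parameters for large-scale models in the 100B-parameter range. Our approach relies on an additional averaging step after each transformer block, which computes a weighted average of the current and past representations; we refer to this operation as Depth-Weighted-Average (DWA). The learned DWA weights exhibit coherent patterns of information flow, revealing strong and structured reuse of activations from distant layers. Experiments demonstrate that DenseFormer is more data efficient, reaching the same perplexity as much deeper transformer models, and that for the same perplexity these new models outperform transformer baselines in memory efficiency and inference time.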
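The averaging step described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the names `dwa_forward`, `identity_dwa_weights`, and `count_dwa_weights` are hypothetical, the "blocks" are toy functions rather than real transformer blocks, and the indexing (one weight for the embedded input plus one per block output so far) and the identity initialization are assumptions made for the sketch.

```python
from typing import Callable, List

Vector = List[float]
Block = Callable[[Vector], Vector]


def identity_dwa_weights(num_blocks: int) -> List[List[float]]:
    """Assumed initialization: each DWA step puts weight 1 on the newest
    block output and 0 on everything earlier, so the model starts out
    behaving like a plain stack of blocks."""
    # Step i (0-indexed) sees the embedded input plus i + 1 block outputs.
    return [[0.0] * (i + 1) + [1.0] for i in range(num_blocks)]


def dwa_forward(x0: Vector, blocks: List[Block],
                alpha: List[List[float]]) -> Vector:
    """Run the blocks, applying a depth-weighted average after each one."""
    history = [x0]   # embedded input and every raw block output so far
    current = x0     # what the next block receives
    for block, weights in zip(blocks, alpha):
        history.append(block(current))
        # Weighted combination of the current and all past representations.
        current = [
            sum(w * h[k] for w, h in zip(weights, history))
            for k in range(len(x0))
        ]
    return current


def count_dwa_weights(num_blocks: int) -> int:
    """Number of scalar DWA weights for a given depth (in this sketch)."""
    return sum(i + 2 for i in range(num_blocks))


if __name__ == "__main__":
    toy_blocks = [lambda v: [2 * x for x in v], lambda v: [x + 1 for x in v]]
    print(dwa_forward([1.0, 2.0], toy_blocks, identity_dwa_weights(2)))
    print(count_dwa_weights(48))
```

Under this indexing the number of extra weights depends only on depth, growing as d(d+3)/2 for d blocks, which is why the overhead stays in the thousands even when the blocks themselves are very large. The history list is also where the additional memory goes, since every earlier block output must be kept around for later averages.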

Submission history: from Amirkeivan Mohtashami
[v1] Sun, 4 Feb 2024 21:44:09 UTC (1,154 KB)
[v2] Thu, 21 Mar 2024 10:57:40 UTC (4,007 KB)

Post by: tipsytoad

Comments:

valine: The architecture changes are very straightforward. Model merging has shown that pre-trained transformer layers are very robust. I’ll bet it’s possible to fine-tune a pre-trained model like Mistral to use this architecture. That would enable someone to test it with more parameters without training a whole new base model.


tbalsam: This is a very interesting idea. With DenseNets there are oftentimes some terrible memory gotchas that have gotten me over the past 7-8 years or so, so a part of me is sorta leaning back, waiting for some memory-usage shoe to drop that isn't specified in the paper (even with the activation patterns!)

However, maybe this is not the case. I have a bit of a history of messing with residuals in neural networks, and seeing more work on it is good. Fast-training networks are of course a mild obsession of mine as well, and very useful to the field. Here's hoping it pans out as a motif; curious to see where it goes.


p1esk: This method has only been tested on tiny models (<1B) and a tiny dataset (17B tokens). It’s not clear if it scales.


sp332: Even better is the result on page 7 that perplexity drops faster by wall-clock time. Even if you're getting fewer iterations per hour of rented GPU time, you're still coming out ahead in model performance.


ml_basics: Cool paper. Really interesting to see how even quite straightforward architectural modifications haven't all been exhausted yet, despite all the resources being poured into LLMs.
