【Hacker News Repost】Differential Transformer
-
Title: Differential Transformer
Text:
Url: https://arxiv.org/abs/2410.05258
## Post by: weirdcat

### Comments:

**Imnimo**: I feel like I'm missing a key insight here. I understand the problem that regular softmax attention struggles to approach assigning zero attention to irrelevant stuff. And I get that having this subtraction formula makes it possible to assign exactly (or nearly) zero attention weight without having crazy outlier activations. But it seems like it also makes it very easy to have negative attention weight (which is equivalent to having positive attention weight on the negation of your value vectors). Intuitively, it just feels like a difficult balancing act to keep all the stuff you don't care about so close to zero.

But Figure 1 clearly shows that it works, so I don't doubt that it is in fact possible. I'm just struggling to build a picture of how exactly the network accomplishes this.

**aDyslecticCrow**: Very clever. I like this kind of nitty-gritty detail work, and the change is small enough to be adapted easily by others. Bravo!

I'm a little concerned about the last sentence of the introduction to Section 2, "Differential Transformer". It mentions using improvements from previous papers, but in the grammatical context it's unclear whether those improvements are added to both the normal Transformer baseline and their DIFF Transformer. If not, that would sully the comparisons. It's the "main difference" wording in the previous sentence that raised a flag for me.

Of course, a good-faith researcher would know this and may not feel the need to clarify. But you can never be too careful with some published research in this field.

**msoad**: Like most things in this new world of Machine Learning, I'm really confused about why this works.

The analogy to noise-cancelling headphones is helpful, but in that case we clearly know which is signal and which is noise. Here, if we knew that, why would we even bother with the noise-cancelling step?

**islewis**:

> Differential attention takes the difference between two softmax attention functions to eliminate attention noise

If I understand correctly, this architecture trades twice as much attention memory for either a higher-quality model or fewer parameters at similar quality.

> According to the fitted curves, 6.8B-size DIFF Transformer achieves a validation loss comparable to 11B-size Transformer, requiring only 62.2% of parameters

This raises a few questions for me:

- Would having only ~60% of the parameters offset the doubled space for attention, leaving a memory profile similar to a traditional Transformer's?
- Does that tradeoff change noticeably between training and inference?
**iandanforth**: The key bit I didn't understand at first was what happens if the two groups of attention learn the same thing: because their attention masks are subtracted from one another, if they both output similar values the attention across the board will drop to zero, and this will lead to high loss. So the only way to reduce loss is for them to learn to attend to different things. One of the simplest strategies they could learn (and this paper claims that they do) is for one group to focus on relevant context and the other on irrelevant context. Thus one group learns the noise and the other the signal (it's not this cut and dried, but it's a useful simplification for understanding, IMO).
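To make the subtraction the commenters are reasoning about concrete, below is a minimal, single-head sketch of differential attention as the abstract describes it: two independently parameterized softmax attention maps are computed, and the second, scaled by a scalar λ, is subtracted from the first. This is only an illustration, not the paper's reference implementation: the multi-head grouping, per-head normalization, and the learnable reparameterization of λ are omitted, and all weight names, shapes, and the fixed λ value here are assumptions for the sake of a runnable toy example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def diff_attention(X, Wq1, Wk1, Wq2, Wk2, Wv, lam=0.8):
    """Single-head differential attention on a (seq_len, d_model) input.

    Two independent softmax attention maps are computed and the second,
    scaled by lam, is subtracted from the first. In the paper lam is
    learnable; it is held fixed here for illustration.
    """
    d_head = Wq1.shape[1]
    Q1, K1 = X @ Wq1, X @ Wk1          # queries/keys for the first map
    Q2, K2 = X @ Wq2, X @ Wk2          # queries/keys for the second map
    V = X @ Wv

    A1 = softmax(Q1 @ K1.T / np.sqrt(d_head))   # ordinary softmax attention
    A2 = softmax(Q2 @ K2.T / np.sqrt(d_head))   # the map being subtracted
    A = A1 - lam * A2                  # each row now sums to 1 - lam;
                                       # individual entries can hit ~0 or go negative
    return A @ V, A

# Toy usage with random weights: 4 tokens, model width 8, head width 4.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq1, Wk1, Wq2, Wk2 = (rng.normal(size=(8, 4)) for _ in range(4))
Wv = rng.normal(size=(8, 8))
out, A = diff_attention(X, Wq1, Wk1, Wq2, Wk2, Wv)
print(np.round(A, 2))
```

In this toy form, if λ is close to 1 and the two maps ever coincide, the differential weights collapse toward zero and the output vanishes, which is the pressure iandanforth describes pushing the two maps to specialize; and because the result of the subtraction is not renormalized, individual weights can indeed go slightly negative, as Imnimo notes.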