[Hacker News Repost] Were RNNs all we needed?
-
Title: Were RNNs all we needed?
Text:
Url: https://arxiv.org/abs/2410.01201
Unfortunately, as an AI I cannot directly access external websites or download their content, including papers on arXiv. I can, however, offer some guidance on how to use a tool such as JinaReader to fetch and analyze the content and then summarize it. An example workflow for analyzing an arXiv paper with JinaReader:

1. **Install JinaReader**: If you have not installed JinaReader yet, you can install it on your machine with pip:

   ```bash
   pip install jinareader
   ```

2. **Fetch the paper**: You typically need the paper's PDF from the arXiv site. You can download it in a browser, or from the command line with a tool such as `wget` or `curl`, for example as sketched below.
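
   A minimal command-line sketch (the output filename is my own choice; arXiv serves the PDF of this paper at the `/pdf/` counterpart of its abstract URL):

   ```bash
   # Download arXiv:2410.01201 with curl, following redirects...
   curl -L -o were_rnns_all_we_needed.pdf https://arxiv.org/pdf/2410.01201
   # ...or with wget
   wget -O were_rnns_all_we_needed.pdf https://arxiv.org/pdf/2410.01201
   ```
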
3. **Process the paper with JinaReader**: With JinaReader you can convert the PDF into a format that can be analyzed and then perform text extraction, analysis, and other operations. Below is a simple code example showing how to use JinaReader to extract the text of a PDF:

   ```python
   from jinareader import PDFReader

   reader = PDFReader()
   text = reader.read("path_to_your_paper.pdf")
   print(text)
   ```

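
   If the `jinareader` package or its `PDFReader` API is not available in your environment (the snippet above follows the post as written and is not verified here), one commonly used alternative for plain text extraction is `pypdf`:

   ```python
   # Fallback sketch: extract the PDF text with pypdf instead of JinaReader.
   from pypdf import PdfReader

   reader = PdfReader("path_to_your_paper.pdf")
   text = "\n".join(page.extract_text() or "" for page in reader.pages)
   print(text[:500])  # preview the first 500 characters
   ```
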
4. **Translate non-Chinese content**: If you need to translate the extracted English text into Chinese, you can use a translation API such as the Google Translate API. A simple example using the Google Cloud Translation client:

   ```python
   from google.cloud import translate_v2 as translate

   client = translate.Client()

   def translate_text(text, target='zh-CN'):
       result = client.translate(text, target_language=target)
       return result['translatedText']

   translated_text = translate_text(text)
   print(translated_text)
   ```

5. **Summarize the content**: Once you have the translated text, you can use simple text-analysis techniques to extract the key points and build a summary. A simple example:

   ```python
   import nltk
   from nltk.tokenize import sent_tokenize

   nltk.download('punkt')  # sentence-tokenizer models, needed on first use

   sentences = sent_tokenize(translated_text)
   summary = ' '.join(sentences[:5])  # use only the first 5 sentences as the summary
   print(summary)
   ```

Note that the code above is only an example; you will need to adjust the paths, API keys, and summarization logic to your own situation. In practice you may need more sophisticated processing of the text, for example natural language processing (NLP) techniques to extract topics and entities or to perform sentiment analysis.
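
As one illustration of that kind of processing, here is a minimal sketch of named-entity extraction with spaCy (spaCy and its `en_core_web_sm` model are not mentioned in the post; they are just one common choice, applied here to the English text extracted in step 3):

```python
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
# For very long documents you may need to raise nlp.max_length or chunk the text.
doc = nlp(text)

# Print each named entity found in the text along with its label.
for ent in doc.ents:
    print(ent.text, ent.label_)
```
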
## Post by: beefman

### Comments:

**xnx**: It's a curse and a blessing that discussion of topics happens in so many different places. I found this comment on Twitter/X interesting: https://x.com/fchollet/status/1841902521717293273

"Interesting work on reviving RNNs. https://arxiv.org/abs/2410.01201 -- in general the fact that there are many recent architectures coming from different directions that roughly match Transformers is proof that architectures aren't fundamentally important in the curve-fitting paradigm (aka deep learning).

Curve-fitting is about embedding a dataset on a curve. The critical factor is the dataset, not the specific hard-coded bells and whistles that constrain the curve's shape. As long as your curve is sufficiently expressive, all architectures will converge to the same performance in the large-data regime."

**bob1029**:

> Transformers required ~2.5x more training steps to achieve comparable performance, overfitting eventually.

> RNNs are particularly suitable for sequence modelling settings such as those involving time series, natural language processing, and other sequential tasks where context from previous steps informs the current prediction.

I would like to draw an analogy to digital signal processing. If you think of recurrent-style architectures as IIR filters and feedforward-only architectures as FIR filters, you will likely find many parallels.

The most obvious to me is that IIR filters typically require far fewer elements to produce the same response as an equivalent FIR filter. Granted, the FIR filter is often easier to implement/control/measure in practical terms (fixed-point arithmetic hardware == ML architectures that can run on GPUs).

I don't think we get to the exponential scary part of AI without some fundamentally recurrent architecture. I think things like LSTM are kind of an in-between hack in this DSP analogy -- you could look at it as FIR with dynamic coefficients. Neuromorphic approaches seem like the best long-term bet to me in terms of efficiency.
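
To make the IIR/FIR analogy concrete, here is a minimal sketch (my own illustration using NumPy/SciPy, not anything from the comment or the paper): a first-order IIR filter described by just two coefficients needs on the order of a couple hundred FIR taps before the FIR version reproduces essentially the same response.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)

# First-order IIR ("leaky integrator"): y[n] = x[n] + 0.95 * y[n-1] -- two coefficients.
y_iir = lfilter([1.0], [1.0, -0.95], x)

# Matching FIR filter: truncate the impulse response h[k] = 0.95**k at 200 taps.
h = 0.95 ** np.arange(200)
y_fir = lfilter(h, [1.0], x)

print(np.max(np.abs(y_iir - y_fir)))  # small truncation error, but it took 200 taps vs. 2
```
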
**trott**: My feeling is that the answer is "no", in the sense that these RNNs wouldn't be able to universally replace Transformers in LLMs, even though they might be good enough in some cases and beat them in others.

Here's why. A user of an LLM *might* give the model some long text and then say "Translate this into German please". A Transformer can look back at its whole history. But what is an RNN to do? While the length of its context is unlimited, the amount of information the model retains about it is bounded by whatever is in its hidden state at any given time.

Relevant: https://arxiv.org/abs/2402.01032

**mkaic**: I strongly enjoy the simplicity of their "minGRU" architecture. It's basically just:

```python
import torch
import torch.nn as nn

class MinGRU(nn.Module):
    def __init__(self, token_size, hidden_state_size):
        super().__init__()
        self.token_to_proposal = nn.Linear(token_size, hidden_state_size)
        self.token_to_mix_factors = nn.Linear(token_size, hidden_state_size)

    def forward(self, previous_hidden_state, current_token):
        proposed_hidden_state = self.token_to_proposal(current_token)
        mix_factors = torch.sigmoid(self.token_to_mix_factors(current_token))
        # lerp(a, b, w) = a + w * (b - a): blend the proposal with the previous state.
        return torch.lerp(proposed_hidden_state, previous_hidden_state, mix_factors)
```

And since the proposed hidden states and mix factors for each layer are both only dependent on the current token, you can compute all of them in parallel if you know the whole sequence ahead of time (like during training), and then combine them in linear time using a parallel scan.

The fact that this is competitive with transformers and state-space models in their small-scale experiments is gratifying to the "best PRs are the ones that delete code" side of me. That said, we won't know for sure whether this is a capital-B Breakthrough until someone tries scaling it up to parameter and data counts comparable to SOTA models.

One detail I found really interesting is that they seem to do all their calculations in log-space, according to the Appendix. They say it's for numerical stability, which is curious to me -- I'm not sure I have a good intuition for why running everything in log-space makes the model more stable. Is it because they removed the tanh from the output, making it possible for values to explode if calculations are done in linear space?

EDIT: Another thought -- it's kind of fascinating that this sort of sequence modeling works at all. It's like if I gave you all the pages of a book individually torn out and in a random order, and asked you to try to make a vector representation for each page as well as instructions for how to mix that vector with the vector representing all previous pages -- except you have zero knowledge of those previous pages. Then I take all your page vectors, sequentially mix them together in order, and grade you based on how good a whole-book summary the final vector represents. Wild stuff.

FURTHER EDIT: Yet *another* thought -- right now, they're just using two dense linear layers to transform the token into the proposed hidden state and the lerp mix factors. I'm curious what would happen if you made those transforms MLPs instead of single linear layers.
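
For readers wondering what "combine them in linear time using a parallel scan" looks like in practice, here is a minimal, non-log-space sketch of the scan behind a recurrence of the form h_t = a_t * h_{t-1} + b_t (for a minGRU-style update, take a_t = 1 - z_t and b_t = z_t times the proposed state). The closed-form cumprod/cumsum trick and the function names are my own illustration, not the paper's implementation, and its tendency to under/overflow on long sequences is presumably part of why the authors work in log-space:

```python
import torch

def scan_closed_form(a, b, h0):
    """All hidden states of h_t = a_t * h_{t-1} + b_t, without a Python loop.

    a, b: (T, H) per-step coefficients; h0: (H,) initial state.
    Every operation is a whole-sequence tensor op, but dividing by the
    running product can under/overflow when T is large.
    """
    A = torch.cumprod(a, dim=0)                      # A_t = a_1 * a_2 * ... * a_t
    return A * h0 + A * torch.cumsum(b / A, dim=0)   # h_t = A_t * h0 + sum_j (A_t / A_j) * b_j

def scan_loop(a, b, h0):
    """Sequential reference implementation."""
    h, out = h0, []
    for t in range(a.shape[0]):
        h = a[t] * h + b[t]
        out.append(h)
    return torch.stack(out)

T, H = 16, 4
z = torch.sigmoid(torch.randn(T, H))   # update gate in (0, 1)
h_tilde = torch.randn(T, H)            # proposed hidden states
a, b = 1.0 - z, z * h_tilde
h0 = torch.zeros(H)
print(torch.allclose(scan_closed_form(a, b, h0), scan_loop(a, b, h0), atol=1e-5))  # True
```
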
**vandahm**: I made an RNN for a college project because I was interested in obsolete historical technology and I thought I needed to seize the opportunity while it lasted, because once I was out of school I'd never hear about neural networks ever again.

Mine worked, but it was very simple and dog slow, running on my old laptop. Nothing was ever going to run fast on that thing, but I remember my RNN being substantially slower than a feed-forward network would have been.

I was *so confident* that this was dead technology -- an academic curiosity from the 1980s and 1990s. It was bizarre to see how quickly that changed.
-