[Hacker News Repost] Show HN: Llama 3.2 Interpretability with Sparse Autoencoders
-
Title: Show HN: Llama 3.2 Interpretability with Sparse Autoencoders
Text: I spent a lot of time and money on this rather big side project of mine that attempts to replicate the mechanistic interpretability research on proprietary LLMs that was quite popular this year and produced great research papers by Anthropic [1], OpenAI [2] and Deepmind [3].

I am quite proud of this project, and since I consider myself the target audience for HackerNews, I thought that maybe some of you would appreciate this open research replication as well. Happy to answer any questions or face any feedback.

Cheers

[1] https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
[2] https://arxiv.org/abs/2406.04093
[3] https://arxiv.org/abs/2408.05147
Url: https://github.com/PaulPauls/llama3_interpretability_sae
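For readers unfamiliar with the technique, the following is a minimal sketch of the core idea the project replicates: train a sparse autoencoder to reconstruct a model's residual-stream activations through a sparse bottleneck. The dimensions, TopK activation, and one-step training loop are illustrative assumptions, not the repository's actual implementation.

    # Minimal sparse-autoencoder sketch (PyTorch). Illustrative only -- the
    # hyperparameters, TopK sparsity, and training step are assumptions, not
    # the repository's actual code.
    import torch
    import torch.nn as nn


    class SparseAutoencoder(nn.Module):
        def __init__(self, d_model: int = 2048, d_latent: int = 16384, k: int = 64):
            super().__init__()
            self.k = k                                   # number of active latents per token
            self.encoder = nn.Linear(d_model, d_latent)
            self.decoder = nn.Linear(d_latent, d_model)

        def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
            pre = self.encoder(x)
            # TopK activation: keep only the k largest pre-activations per token.
            topk = torch.topk(pre, self.k, dim=-1)
            latents = torch.zeros_like(pre).scatter_(-1, topk.indices, torch.relu(topk.values))
            recon = self.decoder(latents)
            return recon, latents


    # One hypothetical training step on cached residual-stream activations.
    sae = SparseAutoencoder()
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
    acts = torch.randn(4096, 2048)                       # stand-in for captured Llama activations
    recon, latents = sae(acts)
    loss = ((recon - acts) ** 2).mean()                  # plain MSE reconstruction loss
    opt.zero_grad()
    loss.backward()
    opt.step()

The interpretability payoff comes afterwards, when individual latents are inspected, labeled, and tested for how specifically they fire on human-recognizable concepts.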
Post by: PaulPauls
Comments:
foundry27: For anyone who hasn't seen this before, mechanistic interpretability solves a very common problem with LLMs: when you ask a model to explain itself, you're playing a game of rhetoric where the model tries to "convince" you of a reason for what it did by generating a plausible-sounding answer based on patterns in its training data. But unlike most trends of benchmark numbers getting better as models improve, more powerful models often score worse on tests designed to self-detect "untruthfulness" because they have stronger rhetoric, and are therefore more compelling at justifying lies after the fact. The objective is coherence, not truth.

Rhetoric isn't reasoning. True explainability, like what overfitted Sparse Autoencoders claim they offer, basically results in the causal sequence of "thoughts" the model goes through as it produces an answer. It's the same way you may have a bunch of ephemeral thoughts in different directions while you think about anything.
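To make that contrast concrete, here is a hedged sketch of what "reading off" internal features rather than asking for a verbal justification might look like. It assumes a trained SparseAutoencoder like the one sketched above, and the feature labels are purely hypothetical.

    # Hypothetical illustration of the difference foundry27 describes: instead of
    # asking the model for a verbal explanation, inspect which sparse features
    # were active when it produced a token. Feature labels below are made up.
    import torch

    feature_labels = {101: "citations/references", 2048: "medical terminology",
                      7777: "negation", 9001: "Python code"}

    def top_features(sae, hidden_state: torch.Tensor, n: int = 5):
        """Return the n most active SAE latents for one residual-stream vector."""
        with torch.no_grad():
            _, latents = sae(hidden_state.unsqueeze(0))  # reuse the SAE sketched above
        values, indices = torch.topk(latents.squeeze(0), n)
        return [(int(i), float(v), feature_labels.get(int(i), "<unlabeled>"))
                for i, v in zip(indices, values)]

    # e.g. top_features(sae, acts[0]) -> [(9001, 3.2, "Python code"), ...]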
monkeycantype: Thank you for posting this PaulPauls,

can I please ask a wacky question that I have about mech. interp.?

We know that when we use a metric to measure humans, we game the metric. I wonder about future AI gaming mech. interp.

For simplicity let me propose a 2D matrix to encode tokens, and hope my understanding of neural networks is good enough for this to be a valid question. I understand that higher dimensions have a lot more 'space', so accidental collisions are easily avoided, but 2D makes my example easier to explain.

If we had the following tokens mapped in 2D space

    Apple  1a
    Pear   1b
    Donkey 2a
    Horse  2b

it becomes impossible to tell whether the neurons 1, 2, a, b all being excited represents apple and horse, or donkey and pear. I wonder if a future, vastly more competent AI overseeing its own training might use this form of ambiguity as a means to create a model in which the weights are selected so that the potential for encoding collisions exists, making it possible to deceive a mech. interp. observer as to the true thoughts of the model, essentially enabling the AI to think in euphemisms?
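A concrete version of the collision the comment describes, under the assumption that each token excites one "row" neuron (1 or 2) and one "column" neuron (a or b) and that simultaneous tokens add linearly:

    # Worked version of monkeycantype's 2-D example: superposing two tokens by
    # summing their activations can make distinct pairs indistinguishable.
    import numpy as np

    # neuron order: [1, 2, a, b]
    tokens = {
        "apple":  np.array([1, 0, 1, 0]),   # 1a
        "pear":   np.array([1, 0, 0, 1]),   # 1b
        "donkey": np.array([0, 1, 1, 0]),   # 2a
        "horse":  np.array([0, 1, 0, 1]),   # 2b
    }

    apple_horse = tokens["apple"] + tokens["horse"]   # -> [1, 1, 1, 1]
    donkey_pear = tokens["donkey"] + tokens["pear"]   # -> [1, 1, 1, 1]
    print(np.array_equal(apple_horse, donkey_pear))   # True: the patterns collide

Both pairs light up all four neurons equally, which is exactly the superposition ambiguity the question points at; whether a model could deliberately steer its own weights to exploit it is the open part of the question.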
jwuphysics: Incredible, well-documented work -- this is an amazing effort!

Two things that caught my eye were (i) your loss curves and (ii) the assessment of dead latents. Our team also studied SAEs -- trained to reconstruct dense embeddings of paper abstracts rather than individual tokens [1]. We observed a power-law scaling of the lower bound of loss curves, even when we varied the sparsity level and the dimensionality of the SAE latent space. We also were able to totally mitigate dead latents with an auxiliary loss, and we saw smooth sinusoidal patterns throughout training iterations. Not sure if these were due to the specific application we performed (over paper abstract embeddings) or if they represent more general phenomena.

[1] https://arxiv.org/abs/2408.00657
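For readers wondering what "mitigating dead latents with an auxiliary loss" can look like in practice, here is a generic AuxK-style sketch: latents that have stopped firing are asked to reconstruct the residual error of the main reconstruction, so they keep receiving gradient. This illustrates the general idea only; it is not the implementation used in either project, and the function name and arguments are assumptions.

    # Hedged sketch of an auxiliary loss that revives "dead" latents.
    import torch

    def aux_dead_latent_loss(acts, recon, pre_acts, fires_per_latent, decoder, k_aux=256):
        """acts/recon: (batch, d_model); pre_acts: (batch, d_latent) encoder pre-activations;
        fires_per_latent: (d_latent,) running count of recent activations; decoder: latent->model map."""
        dead = fires_per_latent == 0                         # latents that never fired recently
        if dead.sum() == 0:
            return torch.tensor(0.0, device=acts.device)
        residual = acts - recon                              # error the main SAE failed to explain
        masked = pre_acts.masked_fill(~dead, float("-inf"))  # only consider dead latents
        k = min(k_aux, int(dead.sum()))
        topk = torch.topk(masked, k, dim=-1)
        aux_latents = torch.zeros_like(pre_acts).scatter_(-1, topk.indices, torch.relu(topk.values))
        aux_recon = decoder(aux_latents)                     # dead latents try to explain the residual
        return ((aux_recon - residual) ** 2).mean()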
curious_cat_163: Hey - Thanks for sharing!

Will take a closer look later, but if you are hanging around now, it might be worth asking this now. I read this blog post recently:

https://adamkarvonen.github.io/machine_learning/2024/06/11/sae-intuitions.html

And the author talks about challenges with evaluating SAEs. I wonder how you tackled that and where to look inside your repo for understanding your approach around that, if possible.

Thanks again!
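The evaluation difficulty the linked post discusses is that there is no ground truth for what a "good feature" is, so practitioners fall back on proxy metrics. Below is a sketch of two common ones, L0 sparsity and fraction of variance unexplained; the function is illustrative and not taken from the repository.

    # Two common proxy metrics for judging an SAE: average L0 (how sparse the
    # latents are) and the fraction of activation variance the reconstruction
    # fails to explain. A sketch, not the repo's evaluation code.
    import torch

    def sae_proxy_metrics(acts: torch.Tensor, recon: torch.Tensor, latents: torch.Tensor):
        l0 = (latents != 0).float().sum(dim=-1).mean()           # avg active latents per token
        mse = ((recon - acts) ** 2).sum(dim=-1).mean()
        total_var = ((acts - acts.mean(dim=0)) ** 2).sum(dim=-1).mean()
        frac_var_unexplained = mse / total_var                   # 0 = perfect reconstruction
        return {"L0": l0.item(), "FVU": frac_var_unexplained.item()}

A third common check, splicing the SAE reconstruction back into the model's forward pass and measuring how much the language-modeling loss degrades, requires the full model and is omitted here.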
Eliezer: This seems like decent alignment-positive work on a glance, though I haven't checked full details yet. I probably can't make it happen, but how much would someone need to pay you to make up your time, expense, and risk?