[Hacker News Repost] Show HN: Speeding up LLM inference 2x (possibly)
-
Title: Show HN: Speeding up LLM inference 2x times (possibly)
Text: Here's a project I've been working on for the last few months.

It's a new (I think) algorithm that allows you to adjust smoothly - and in real time - how many calculations you'd like to do during inference of an LLM model.

It seems that it's possible to do just 20-25% of the weight multiplications instead of all of them, and still get good inference results.

I implemented it to run on M1/M2/M3 GPUs. The mmul approximation itself can be pushed to run 2x faster before the quality of the output collapses.

The inference speed is only a bit faster than Llama.cpp's, because the rest of the implementation could be better, but with further development I think it can be a new method to speed up inference - in addition to quantization.

You could call it ad-hoc model distillation :)

You can change the speed / accuracy of a model at will, in real time.

Oh, and as a side effect, the data format also lets you choose how much of the model you want to load into memory. You can decide to skip, say, 10-20-40% of the least important weights.

It's implemented for Mistral, and it was also tested briefly on Mixtral and Llama. It's FP16 for now, but Q8 is in the works.

The algorithm is described here, and the implementation is open source.

https://kolinko.github.io/effort/

I know these are bold claims, but I hope they survive the scrutiny :)
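To make the idea concrete, here is a minimal numpy sketch of the general approach as I read the description above - score every product by the magnitude of the weight times the magnitude of the input entry, and only perform the top fraction of them. The function name `approx_matvec` and the scoring rule are my own assumptions for illustration; the actual project is a different, GPU-side implementation for Apple silicon.

```python
import numpy as np

def approx_matvec(W, x, effort=0.25):
    """Hypothetical sketch: approximate y = W @ x by doing only the
    multiplications expected to matter most (scored by |W_ij| * |x_j|)
    and skipping the rest. Illustration only, not the Effort implementation."""
    scores = np.abs(W) * np.abs(x)                 # estimated contribution of each product
    k = max(1, int(effort * W.shape[1]))           # how many products to keep per output row
    top = np.argpartition(-scores, k - 1, axis=1)[:, :k]   # k largest contributions per row
    rows = np.arange(W.shape[0])[:, None]
    return np.sum(W[rows, top] * x[top], axis=1)

# usage: y stays close to W @ x while doing ~25% of the multiplications
W = np.random.randn(64, 128).astype(np.float32)
x = np.random.randn(128).astype(np.float32)
y = approx_matvec(W, x, effort=0.25)
print(np.linalg.norm(y - W @ x) / np.linalg.norm(W @ x))  # relative error
```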
Url: https://asciinema.org/a/piP22yYwcaohu5cA2gyuv1W61
The linked page is an asciinema recording demonstrating the "Effort Engine", captured on macOS with an xterm-256color terminal and the zsh shell; the recording has 22,589 views. It gives a quick preview of the Effort algorithm, with a link to more details: [Effort algorithm details](https://kolinko.github.io/effort/).
Post by: kolinko
Comments:
spencerchubb: I love this line in the GPU implementation section.

"Readers fresh to GPU programming may ask now - how does it work?

Readers experienced with GPU programming may ask - how the hell does it work?"
brrrrrm: I've been trying to think about how you'd amp up the batch size here. It's a bit tricky since the memory access would be way higher, but I think you can actually still save on compute by chunking things up in a clever way to utilize the tensor cores.
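A rough way to see the memory-access concern: with input-dependent selection, each batch element wants a different subset of weights, so the set of weights that has to be fetched grows with the batch size even though per-token compute stays roughly constant. A small numpy sketch (the column-norm scoring and the helper name are my own assumptions, not the project's scheme):

```python
import numpy as np

def touched_columns(W, X, effort=0.25):
    """Hypothetical sketch: which weight columns does a batch need if each
    activation row selects its own top `effort` fraction of columns?"""
    k = max(1, int(effort * W.shape[1]))
    col_norm = np.linalg.norm(W, axis=0)           # rough importance of each column
    scores = np.abs(X) * col_norm                  # (batch, in_dim) contribution estimate
    top = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    return np.unique(top)                          # union of columns across the batch

W = np.random.randn(256, 512).astype(np.float32)
for batch in (1, 4, 16):
    X = np.random.randn(batch, 512).astype(np.float32)
    print(batch, len(touched_columns(W, X, effort=0.25)), "of", W.shape[1], "columns fetched")
```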
hatthew: This seems similar to semi-structured (aka 2:4) sparsity; it may be worth explicitly comparing. As far as I can tell by skimming, your technique:

- is optimized for Apple silicon
- ~2x speed at 75% sparsity
- dynamic, depends on input, applied at runtime
- can choose amount of sparsity

And 2:4 semi-structured sparsity:

- is optimized for GPUs with sparse tensor cores (Nvidia Ampere and beyond)
- ~2x speed at 50% sparsity
- static, applied to the model at rest
- probably worse results than your technique at 50% sparsity

The interesting comparison I'd want to see is semi-structured sparsity results (50% sparsity, 2x speedup) vs your results at 75% sparsity (2x speedup).
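For reference, a minimal numpy illustration of what static 2:4 semi-structured pruning does (magnitude-based pruning assumed, tensor-core details omitted): in every group of 4 consecutive weights, the 2 smallest-magnitude ones are zeroed once, offline, regardless of the input.

```python
import numpy as np

def prune_2_4(W):
    """Hypothetical sketch of 2:4 semi-structured pruning: zero the 2
    smallest-magnitude weights in every group of 4. Applied once, at rest,
    unlike the input-dependent selection discussed above."""
    groups = W.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]   # 2 smallest per group of 4
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(W.shape)

W = np.random.randn(8, 16).astype(np.float32)
Ws = prune_2_4(W)
assert ((Ws.reshape(-1, 4) != 0).sum(axis=1) <= 2).all()   # at most 2 nonzeros per 4
```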
marmaduke: Having used CSR, it's not surprising, and some newer formats might have more mechanical sympathy, like block ELL, since they avoid uncoalesced reads / gathers, though the code is trickier.
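For readers who haven't met ELL-style layouts, here is a small numpy sketch of the idea (plain ELL rather than block ELL, and the helper names are my own): every row stores a fixed number of (column, value) slots, padded with zeros, so reads are regular and easy to coalesce.

```python
import numpy as np

def to_ell(W, width):
    """Hypothetical sketch of an ELL-style layout: each row keeps exactly
    `width` (column_index, value) slots, zero-padded, so every row is read
    with the same regular pattern (block ELL further groups these into tiles)."""
    cols = np.zeros((W.shape[0], width), dtype=np.int32)
    vals = np.zeros((W.shape[0], width), dtype=W.dtype)
    for i, row in enumerate(W):
        nz = np.flatnonzero(row)[:width]
        cols[i, :len(nz)] = nz
        vals[i, :len(nz)] = row[nz]
    return cols, vals

def ell_matvec(cols, vals, x):
    return np.sum(vals * x[cols], axis=1)   # uniform-width rows, no ragged gathers

W = np.where(np.random.rand(8, 16) < 0.25, np.random.randn(8, 16), 0.0)
width = int((W != 0).sum(axis=1).max())      # widest row determines the padding
cols, vals = to_ell(W, width)
x = np.random.randn(16)
print(np.allclose(ell_matvec(cols, vals, x), W @ x))
```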
bigcat12345678: """
So instead, let's flip the matrix, sort the elements row-wise, and revisit the multiplications from that direction.

This is called a Compressed Sparse Row (CSR) format by the smart people. To do the multiplication now, we take, say, the 1 from the vector, multiply it by 256, and add it into the output vector at the 3rd row. And so on.

Now, let's see what happens if we truncate the last column - the one with the lowest values.
"""

How does CSR work with reduced-number multiplication?
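As I read the quoted passage, once each row of the flipped matrix is stored largest-first, CSR-style, "reducing the multiplications" just means stopping the walk over each row early, since the trailing entries are the smallest ones. A small numpy sketch of that reading (the exact cutoff rule and helper names are my assumptions, not the article's scheme):

```python
import numpy as np

def build_sorted_rows(W):
    """Hypothetical sketch of the quoted idea: for each input dimension j
    (a row of the flipped/transposed matrix), store (output_row, weight)
    pairs sorted by descending |weight|, CSR-style."""
    WT = W.T                                           # flip the matrix
    order = np.argsort(-np.abs(WT), axis=1)            # largest magnitudes first
    return order, np.take_along_axis(WT, order, axis=1)

def truncated_matvec(W, x, keep=0.75):
    """Multiply using only the first `keep` fraction of each sorted row,
    i.e. truncate the trailing columns that hold the smallest weights."""
    rows_idx, rows_val = build_sorted_rows(W)
    k = max(1, int(keep * W.shape[0]))                 # entries kept per input dimension
    out = np.zeros(W.shape[0], dtype=W.dtype)
    for j, xj in enumerate(x):                         # one sorted row per input entry
        np.add.at(out, rows_idx[j, :k], xj * rows_val[j, :k])
    return out

W = np.random.randn(64, 32).astype(np.float32)
x = np.random.randn(32).astype(np.float32)
print(np.linalg.norm(truncated_matvec(W, x, keep=1.0) - W @ x))   # ~0 with no truncation
```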