[Hacker News Repost] Show HN: Speeding up LLM inference 2x (possibly)
-
Title: Show HN: Speeding up LLM inference 2x times (possibly)
Text: Here's a project I've been working on for the last few months.

It's a new (I think) algorithm that allows you to adjust smoothly - and in real time - how many calculations you'd like to do during inference of an LLM model.

It seems that it's possible to do just 20-25% of the weight multiplications instead of all of them, and still get good inference results.

I implemented it to run on M1/M2/M3 GPUs. The mmul approximation itself can be pushed to run 2x faster before the quality of the output collapses.

The inference speed is only a bit faster than Llama.cpp's, because the rest of the implementation could be better, but with further development I think it can be a new method to speed up inference - in addition to quantization.

You could call it ad-hoc model distillation :)

You can change the speed / accuracy of a model at will, in real time.

Oh, and as a side effect, the data format also lets you choose how much of the model you want to load into memory. You can decide to skip, say, 10-20-40% of the least important weights.

It's implemented for Mistral, and it was also tested briefly on Mixtral and Llama. It's FP16 for now, but Q8 is in the works.

The algorithm is described here, and the implementation is open source.

https://kolinko.github.io/effort/

I know these are bold claims, but I hope they survive the scrutiny :)
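To make the idea concrete, here is a minimal numpy sketch of the general approach as I read the description above - score every product by the magnitude of the weight times the magnitude of the input entry, and only perform the top fraction of them. The function name `approx_matvec` and the scoring rule are my own assumptions for illustration; the actual project is a different, GPU-side implementation for Apple silicon.

```python
import numpy as np

def approx_matvec(W, x, effort=0.25):
    """Hypothetical sketch: approximate y = W @ x by doing only the
    multiplications expected to matter most (scored by |W_ij| * |x_j|)
    and skipping the rest. Illustration only, not the Effort implementation."""
    scores = np.abs(W) * np.abs(x)                 # estimated contribution of each product
    k = max(1, int(effort * W.shape[1]))           # how many products to keep per output row
    top = np.argpartition(-scores, k - 1, axis=1)[:, :k]   # k largest contributions per row
    rows = np.arange(W.shape[0])[:, None]
    return np.sum(W[rows, top] * x[top], axis=1)

# usage: y stays close to W @ x while doing ~25% of the multiplications
W = np.random.randn(64, 128).astype(np.float32)
x = np.random.randn(128).astype(np.float32)
y = approx_matvec(W, x, effort=0.25)
print(np.linalg.norm(y - W @ x) / np.linalg.norm(W @ x))  # relative error
```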
Url: https://asciinema.org/a/piP22yYwcaohu5cA2gyuv1W61
The linked page is an asciinema recording demonstrating the "Effort Engine", captured on macOS with an xterm-256color terminal and the zsh shell; the recording has 22,589 views. It gives a quick preview of the Effort algorithm, with a link to more details: [Effort algorithm details](https://kolinko.github.io/effort/).
Post by: kolinko
Comments:
spencerchubb: I love this line in the GPU implementation section.

"Readers fresh to GPU programming may ask now - how does it work?

Readers experienced with GPU programming may ask - how the hell does it work?"
brrrrrm: I've been trying to think about how you'd amp up the batch size here. It's a bit tricky since the memory access would be way higher, but I think you can actually still save on compute by chunking things up in a clever way to utilize the tensor cores.
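A rough way to see the memory-access concern: with input-dependent selection, each batch element wants a different subset of weights, so the set of weights that has to be fetched grows with the batch size even though per-token compute stays roughly constant. A small numpy sketch (the column-norm scoring and the helper name are my own assumptions, not the project's scheme):

```python
import numpy as np

def touched_columns(W, X, effort=0.25):
    """Hypothetical sketch: which weight columns does a batch need if each
    activation row selects its own top `effort` fraction of columns?"""
    k = max(1, int(effort * W.shape[1]))
    col_norm = np.linalg.norm(W, axis=0)           # rough importance of each column
    scores = np.abs(X) * col_norm                  # (batch, in_dim) contribution estimate
    top = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    return np.unique(top)                          # union of columns across the batch

W = np.random.randn(256, 512).astype(np.float32)
for batch in (1, 4, 16):
    X = np.random.randn(batch, 512).astype(np.float32)
    print(batch, len(touched_columns(W, X, effort=0.25)), "of", W.shape[1], "columns fetched")
```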
hatthew: This seems similar to semi-structured (aka 2:4) sparsity; it may be worth explicitly comparing. As far as I can tell by skimming, your technique:

- is optimized for Apple silicon
- ~2x speed at 75% sparsity
- dynamic, depends on input, applied at runtime
- can choose amount of sparsity

And 2:4 semi-structured sparsity:

- is optimized for GPUs with sparse tensor cores (Nvidia Ampere and beyond)
- ~2x speed at 50% sparsity
- static, applied to the model at rest
- probably worse results than your technique at 50% sparsity

The interesting comparison I'd want to see is semi-structured sparsity results (50% sparsity, 2x speedup) vs your results at 75% sparsity (2x speedup).
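For reference, a minimal numpy illustration of what static 2:4 semi-structured pruning does (magnitude-based pruning assumed, tensor-core details omitted): in every group of 4 consecutive weights, the 2 smallest-magnitude ones are zeroed once, offline, regardless of the input.

```python
import numpy as np

def prune_2_4(W):
    """Hypothetical sketch of 2:4 semi-structured pruning: zero the 2
    smallest-magnitude weights in every group of 4. Applied once, at rest,
    unlike the input-dependent selection discussed above."""
    groups = W.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]   # 2 smallest per group of 4
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(W.shape)

W = np.random.randn(8, 16).astype(np.float32)
Ws = prune_2_4(W)
assert ((Ws.reshape(-1, 4) != 0).sum(axis=1) <= 2).all()   # at most 2 nonzeros per 4
```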
marmaduke: Having used CSR, it's not surprising, and some newer formats might have more mechanical sympathy, like block ELL, since they avoid uncoalesced reads / gathers, though the code is trickier.
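For readers who haven't met ELL-style layouts, here is a small numpy sketch of the idea (plain ELL rather than block ELL, and the helper names are my own): every row stores a fixed number of (column, value) slots, padded with zeros, so reads are regular and easy to coalesce.

```python
import numpy as np

def to_ell(W, width):
    """Hypothetical sketch of an ELL-style layout: each row keeps exactly
    `width` (column_index, value) slots, zero-padded, so every row is read
    with the same regular pattern (block ELL further groups these into tiles)."""
    cols = np.zeros((W.shape[0], width), dtype=np.int32)
    vals = np.zeros((W.shape[0], width), dtype=W.dtype)
    for i, row in enumerate(W):
        nz = np.flatnonzero(row)[:width]
        cols[i, :len(nz)] = nz
        vals[i, :len(nz)] = row[nz]
    return cols, vals

def ell_matvec(cols, vals, x):
    return np.sum(vals * x[cols], axis=1)   # uniform-width rows, no ragged gathers

W = np.where(np.random.rand(8, 16) < 0.25, np.random.randn(8, 16), 0.0)
width = int((W != 0).sum(axis=1).max())      # widest row determines the padding
cols, vals = to_ell(W, width)
x = np.random.randn(16)
print(np.allclose(ell_matvec(cols, vals, x), W @ x))
```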
bigcat12345678: """
So instead, let's flip the matrix, sort the elements row-wise, and revisit the multiplications from that direction.

This is called a Compressed Sparse Row (CSR) format by the smart people. To do the multiplication now, we take, say, the 1 from the vector, multiply it by 256, and add it into the output vector at the 3rd row. And so on.

Now, let's see what happens if we truncate the last column - the one with the lowest values.
"""

How does CSR work with reduced-number multiplication?
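As I read the quoted passage, once each row of the flipped matrix is stored largest-first, CSR-style, "reducing the multiplications" just means stopping the walk over each row early, since the trailing entries are the smallest ones. A small numpy sketch of that reading (the exact cutoff rule and helper names are my assumptions, not the article's scheme):

```python
import numpy as np

def build_sorted_rows(W):
    """Hypothetical sketch of the quoted idea: for each input dimension j
    (a row of the flipped/transposed matrix), store (output_row, weight)
    pairs sorted by descending |weight|, CSR-style."""
    WT = W.T                                           # flip the matrix
    order = np.argsort(-np.abs(WT), axis=1)            # largest magnitudes first
    return order, np.take_along_axis(WT, order, axis=1)

def truncated_matvec(W, x, keep=0.75):
    """Multiply using only the first `keep` fraction of each sorted row,
    i.e. truncate the trailing columns that hold the smallest weights."""
    rows_idx, rows_val = build_sorted_rows(W)
    k = max(1, int(keep * W.shape[0]))                 # entries kept per input dimension
    out = np.zeros(W.shape[0], dtype=W.dtype)
    for j, xj in enumerate(x):                         # one sorted row per input entry
        np.add.at(out, rows_idx[j, :k], xj * rows_val[j, :k])
    return out

W = np.random.randn(64, 32).astype(np.float32)
x = np.random.randn(32).astype(np.float32)
print(np.linalg.norm(truncated_matvec(W, x, keep=1.0) - W @ x))   # ~0 with no truncation
```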