[Hacker News repost] LLaMA Now Goes Faster on CPUs

hackernews

Title: LLaMA Now Goes Faster on CPUs

Text:

Url: https://justine.lol/matmul/

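llamafile, a project for running LLaMA-style language models locally, has gained 84 new matrix multiplication kernels that make it faster than llama.cpp. The gains are largest on ARMv8.2+, Intel, and AVX512 machines, where prompt evaluation is 30% to 500% faster. Justine, the project's author, began working on it with Mozilla in November 2023 and has kept improving its core technology to give users a better experience. The post shows the performance gains across several classes of hardware: enterprise, hobbyist, gaming, and professional.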

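The author describes testing a set of eight RAM sticks on their Threadripper machine and comparing it with their Mac Studio and Intel machine. The Threadripper beat the Mac Studio on memory speed but was slower on disk speed and overall system performance. The author also discusses their work optimizing matrix multiplication on CPUs, shares the source code, and explains the approaches they tried, including using BLAS libraries and unrolling loops to exploit instruction-level parallelism. The author's changes were merged into the llama.cpp project, yielding significant speedups in token generation and prompt processing.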
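The loop-unrolling idea mentioned above can be sketched on a dot product. This is a minimal illustration, not code from llamafile or llama.cpp; the function name `dot_unrolled` and the unroll factor of four are assumptions chosen for the example. Splitting the sum across independent accumulators shortens the dependency chain, so the CPU can overlap several multiply-adds.

```cpp
#include <cstddef>

// Hypothetical helper (not from llamafile): dot product unrolled by four.
// Each accumulator forms its own dependency chain, so the additions for
// s0..s3 do not have to wait on one another.
float dot_unrolled(const float* x, const float* y, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i] * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    // Handle the leftover elements when n is not a multiple of four.
    float tail = 0.0f;
    for (; i < n; ++i) {
        tail += x[i] * y[i];
    }
    return (s0 + s1) + (s2 + s3) + tail;
}
```

Because the partial sums are combined in a different order than a plain loop would use, floating-point results can differ slightly from the non-unrolled version.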

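The post describes a C++ implementation of a matrix multiplication function that uses vectorized outer products and parallelizes with OpenMP. The function is tuned for 512x512 matrix multiplication and reaches 810 gigaflops on an Alderlake i9-14900K CPU with 6400 MT/s RAM. The implementation uses AVX instructions and is designed to work with the llama.cpp threading model, which resembles that of a GPU. The code includes a new kernel framework and can export a C API to GGML, reaching 790 gigaflops without the latency disadvantage of traditional BLAS libraries. The post also gives instructions for benchmarking llamafile on Linux.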
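To make the outer-product idea concrete, here is a small sketch, assuming row-major `float` matrices and dimensions that are multiples of the tile size. It is not the author's kernel: the name `tiled_matmul`, the 4x4 tile, and the portable scalar inner loops (left to the compiler to vectorize instead of hand-written AVX) are all assumptions made for illustration. Each 4x4 tile of the output stays in local accumulators and is updated by one small outer product per step of the shared dimension.

```cpp
#include <cassert>
#include <cstddef>

using idx = std::ptrdiff_t;
constexpr idx kTile = 4;  // assumed tile size, chosen only for the example

// Hypothetical helper (not from llamafile): C = A * B, all row-major.
// A is m x k, B is k x n, C is m x n; m and n must be multiples of kTile.
void tiled_matmul(const float* A, const float* B, float* C,
                  idx m, idx n, idx k) {
    assert(m % kTile == 0 && n % kTile == 0);

    // Tiles of C are independent, so rows of tiles can run on separate
    // threads. Without -fopenmp the pragma is ignored and this runs serially.
    #pragma omp parallel for
    for (idx i = 0; i < m; i += kTile) {
        for (idx j = 0; j < n; j += kTile) {
            float acc[kTile][kTile] = {};  // one output tile, zero-initialized
            for (idx l = 0; l < k; ++l) {
                // Outer product of a 4-element column slice of A
                // with a 4-element row slice of B.
                for (idx a = 0; a < kTile; ++a) {
                    const float a_val = A[(i + a) * k + l];
                    for (idx b = 0; b < kTile; ++b) {
                        acc[a][b] += a_val * B[l * n + j + b];
                    }
                }
            }
            // Write the finished tile back to C.
            for (idx a = 0; a < kTile; ++a) {
                for (idx b = 0; b < kTile; ++b) {
                    C[(i + a) * n + j + b] = acc[a][b];
                }
            }
        }
    }
}
```

For scale, and assuming the usual count of 2n³ floating-point operations for an n x n multiply, a 512x512 product is about 268 million operations, so 810 gigaflops corresponds to roughly a third of a millisecond per multiply.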

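The post walks through building llama.cpp, a C++ project, for Apple Silicon devices, with specific commands for running it in CPU mode. The author explains that they have been working at the assembly level, using Emacs to push a button and see the assembly for the C++ code they are working on. They also mention learning to write math kernels by watching other people develop CUDA kernels. The author thanks those who helped develop the llamafile project and names their funding sources, including Mozilla, GitHub sponsors, and Patreon sponsors. They also invite readers to join the Mozilla AI Discord and thank their supporters for the chance to serve the community with high-quality math kernels.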

Post by: lawrencechen

Comments:

ajtulloch: - https://www.cs.utexas.edu/users/flame/laff/pfhp/index.html (e.g. here https://www.cs.utexas.edu/users/flame/laff/pfhp/week2-blocking-for-registers.html) - https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0 might be of interest

bottlepalm: I think it's a good idea for everyone to download and be able to run a LLM locally, even if you have the minimum of requirements. As a pseudo-backup of a large chunk of human knowledge.

kiratp: It's fascinating to me that coming up on a year since Sapphire Rapids has been available in the public cloud, developers are still targeting AVX512 when they should be targeting VNNI and AMX. https://github.com/ggerganov/llama.cpp/issues/2555

pama: Super nice story on the matmul optimization that gave 810 gflops for 512x512. Thanks for the write up and the contributions to llama.cpp and the community more broadly.

1-6: Question is, how much of an improvement has it gotten to over a GPU or ASIC?
