[Hacker News Repost] Towards 1-bit Machine Learning Models
-
Title: Towards 1-bit Machine Learning Models
Text:
Url: https://mobiusml.github.io/1bit_blog/
This blog post focuses on how fine-tuning can improve the output quality of smaller pre-trained models under extreme low-bit quantization. Directly quantizing a model like Llama2-7B to 1-bit yields unsatisfactory results, but output quality improves markedly once the quantized model is fine-tuned. The authors also discuss how matrix multiplication can be reformulated to exploit low-bit weights, and how large models can be fine-tuned with low-rank adapters (LoRA/QLoRA).

Reported experimental results include:
- On the wikitext benchmark, the fine-tuned 1-bit quantized model outperforms the non-fine-tuned Quip# 2-bit model.
- At 2-bit, the proposed method already beats the existing Quip# quantization, and fine-tuning further lowers perplexity, improving language-modeling performance.
- For chat models, the 1-bit quantized model performs poorly on some benchmarks, but fine-tuning improves it substantially, bringing it closer to the full-precision model.

The post also opens a new discussion: choosing between quantized models and small models. The authors argue that while training small models saves compute and training time, quantization techniques such as HQQ+ can deliver better quality with a comparably small memory footprint.

In short, the post demonstrates that fine-tuning can substantially recover output quality under extreme low-bit quantization, and it calls for software and hardware that can take full advantage of this approach.
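For a concrete picture of the recipe summarized above (frozen low-bit weights plus a small trainable low-rank adapter), here is a minimal PyTorch sketch. It is not the authors' implementation; the class name, grouping scheme, and hyperparameters are illustrative assumptions, and it assumes the weight count divides evenly by the group size.

```python
import torch
import torch.nn as nn

class BinaryLinearWithLoRA(nn.Module):
    """Minimal sketch of the HQQ+-style idea: a frozen 1-bit (sign) weight
    matrix with per-group scale/shift meta-data, plus a trainable low-rank
    (LoRA) correction. Not the authors' code."""

    def __init__(self, weight_fp16: torch.Tensor, rank: int = 16, group_size: int = 8):
        super().__init__()
        out_features, in_features = weight_fp16.shape
        self.shape = (out_features, in_features)
        # Group-wise 1-bit quantization: each group of `group_size` weights keeps
        # only its signs, plus a scale and a shift (the "meta-data").
        w = weight_fp16.reshape(-1, group_size)
        shift = w.mean(dim=1, keepdim=True)
        scale = (w - shift).abs().mean(dim=1, keepdim=True)
        self.register_buffer("signs", torch.sign(w - shift))  # frozen 1-bit codes
        self.register_buffer("scale", scale)
        self.register_buffer("shift", shift)
        # Trainable low-rank adapter: effective weight = dequant(W) + B @ A.
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))

    def dequant(self) -> torch.Tensor:
        # Reverse the quantization: sign * scale + shift, then restore the shape.
        return (self.signs * self.scale + self.shift).reshape(self.shape)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen quantized path plus the trainable low-rank correction.
        return x @ self.dequant().t() + (x @ self.A.t()) @ self.B.t()
```

In this sketch only A and B are parameters while the quantized tensors are registered as buffers, mirroring the general LoRA/QLoRA pattern the post builds on: the base weights stay frozen and only the adapter is trained.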
Post by: homarp
Comments:
vladf: Really strong binary results. So strong it was fishy. I hope someone can explain my confusion below.

> We compared the performance of the Llama2-7B model in three configurations: FP16 (full precision), HQQ (without fine-tuning), and HQQ+ (with adapter layers) using a group-size of 8.

Interesting, what is "group-size of 8"?

From their HQQ post (https://mobiusml.github.io/hqq_blog/), it's the block size at which they add scales (presumably 16-bit) and shifts (in that post, it's 8-bit).

So for every 8 binary weights we have a 16-bit scale and 8-bit shift?

> Fine-tuning with Low-Rank Adapters

They say they inline the shift into the LoRA but how can you do this, block-wise, without increasing your LoRA rank by num-blocks (they claim to only use 1 additional rank)?

Then, the reported 7B sizes, in GB:

> 13.5 (fp16) 1.76 (HQQ 1-bit) 1.85 (HQQ+ 1-bit) 2.72 (quip# 2-bit)

those numbers would make sense if it was actually 1 bit. But if you include the overhead of 16-bit scales (and why is the shift inlineable into lora? still unexplained) it'd be more like 3-bit.

From their HF page:

> This version offloads the meta-data to the CPU, so only the binary weights and the low-rank adapters are stored in the GPU memory.

Interesting, so we have to go back to CPU to rescale? Is this how they counted GB? This should have been clearly caveated in the table. I also am amazed they got latency lower than quip if they pingpong to CPU.
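For concreteness, this is the back-of-the-envelope accounting vladf is doing. The per-group figures (16-bit scale, 8-bit shift, group-size 8) come from the comment itself, not from the post, so treat them as assumptions.

```python
# Effective bits per weight when 1-bit weights carry group-wise meta-data.
# Assumes vladf's reading: group-size 8, a 16-bit scale, and an 8-bit shift.
group_size = 8

weight_bits = 1.0
scale_bits_per_weight = 16 / group_size   # 2.0 extra bits per weight
shift_bits_per_weight = 8 / group_size    # 1.0 extra bit per weight

print(weight_bits + scale_bits_per_weight + shift_bits_per_weight)  # 4.0 bits/weight with all meta-data
print(weight_bits + scale_bits_per_weight)                          # 3.0 bits/weight if the shift is folded into the LoRA
print(weight_bits)                                                   # 1.0 bit/weight if meta-data is off-loaded to CPU
```

Whether the reported GB column counts the off-loaded meta-data or only the binary weights on the GPU is exactly the caveat vladf is asking about.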
londons_explore: I believe the future is 1 bit models - for both training and inference.

When people make custom silicon for 1 bit models, they'll find that it is sooooo much more power and silicon-space efficient to do 1 bit math than 16 bit floating point - like 100x or more.

That extra model size will vastly overshadow any worse performance of the models.
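The efficiency claim above is usually illustrated with the XNOR/popcount trick: a dot product between two {-1, +1} vectors packed into machine words needs no multipliers at all. The snippet below is a generic textbook sketch, not anything from the linked post or from actual 1-bit silicon.

```python
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two length-n {-1, +1} vectors, each packed into n bits
    (bit i = 1 encodes +1 for element i, bit i = 0 encodes -1)."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask        # XNOR: 1 wherever the signs match
    matches = bin(agree).count("1")          # popcount
    return 2 * matches - n                   # (#matches) - (#mismatches)

# a = [+1, -1, +1, +1] and b = [+1, +1, -1, +1], read from the least-significant bit:
# dot product = 1 - 1 - 1 + 1 = 0
print(binary_dot(0b1101, 0b1011, 4))         # prints 0
```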
mmoskal: It seems the trick here is they first quantize it to 1- or 2-bit, and then they fine-tune the quantization bias parameters (the parameters that dequantize from 1-2 to 16 bit) via LoRA. Then they have specialized kernels to do matrix multiplication at the bit level.

Also, the 2-bit model seems much better than the 1-bit model - they use [-1, 0, 1, 2] - I wonder if '2' is needed in light of the 1.58b paper (which claims -1 is def. needed).
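A small sketch of the dequantization step mmoskal describes, with the 2-bit levels [-1, 0, 1, 2] recovered from unsigned codes via a zero-point of 1 and a per-group scale. The tensor shapes and names here are made up for illustration, not taken from HQQ's kernels.

```python
import torch

# 2-bit codes in {0, 1, 2, 3}, stored for (say) 4 groups of 8 weights each.
codes = torch.randint(0, 4, (4, 8), dtype=torch.uint8)

# Per-group dequantization parameters. A zero-point of 1 maps the codes onto
# the levels [-1, 0, 1, 2]; per mmoskal's reading, these are the parameters
# that get fine-tuned while the codes themselves stay frozen.
zero_point = torch.full((4, 1), 1.0)
scale = torch.rand(4, 1) * 0.1

levels = codes.float() - zero_point    # values drawn from [-1, 0, 1, 2]
weights = scale * levels               # back to floating-point weights
```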
danielhanchen: Highly respect HQQ team's work - same accuracy as GPTQ / AWQ and with no activation aware tuning calibration part - ie no more 3 hour calibration runs! A big fan of the 4bit ones especially, and the 3, 2, and now 1bit ones!

Also super cool idea of 1bit needing some calibration like AWQ - no calibration data shows very bad results, but with some LoRA adapters and finetuning, a great recovery of performance is possible.

Planning to add support for this inside Unsloth to make all low bit finetuning 2x faster and save tonnes of VRAM!
WithinReason: 1-bit weights have been a thing since at least 2016: https://arxiv.org/abs/1606.06160