[Hacker News Repost] Towards 1-bit Machine Learning Models
-
Title: Towards 1-bit Machine Learning Models
Text:
Url: https://mobiusml.github.io/1bit_blog/
This blog post focuses on how fine-tuning can improve the output quality of smaller pre-trained models under extreme low-bit quantization. Directly quantizing a model like Llama2-7B to 1-bit yields unsatisfactory results, but output quality improves markedly once the quantized model is fine-tuned. The authors also discuss how matrix multiplication can be reformulated to exploit low-bit weights, and how large models can be fine-tuned with low-rank adapters (LoRA/QLoRA).

Reported experimental results include:
- On the wikitext benchmark, the fine-tuned 1-bit quantized model outperforms the non-fine-tuned Quip# 2-bit model.
- At 2-bit, the proposed method already beats the existing Quip# quantization, and fine-tuning further lowers perplexity, improving language-modeling performance.
- For chat models, the 1-bit quantized model performs poorly on some benchmarks, but fine-tuning improves it substantially, bringing it closer to the full-precision model.

The post also opens a new discussion: choosing between quantized models and small models. The authors argue that while training small models saves compute and training time, quantization techniques such as HQQ+ can deliver better quality with a comparably small memory footprint.

In short, the post demonstrates that fine-tuning can substantially recover output quality under extreme low-bit quantization, and it calls for software and hardware that can take full advantage of this approach.
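For a concrete picture of the recipe summarized above (frozen low-bit weights plus a small trainable low-rank adapter), here is a minimal PyTorch sketch. It is not the authors' implementation; the class name, grouping scheme, and hyperparameters are illustrative assumptions, and it assumes the weight count divides evenly by the group size.

```python
import torch
import torch.nn as nn

class BinaryLinearWithLoRA(nn.Module):
    """Minimal sketch of the HQQ+-style idea: a frozen 1-bit (sign) weight
    matrix with per-group scale/shift meta-data, plus a trainable low-rank
    (LoRA) correction. Not the authors' code."""

    def __init__(self, weight_fp16: torch.Tensor, rank: int = 16, group_size: int = 8):
        super().__init__()
        out_features, in_features = weight_fp16.shape
        self.shape = (out_features, in_features)
        # Group-wise 1-bit quantization: each group of `group_size` weights keeps
        # only its signs, plus a scale and a shift (the "meta-data").
        w = weight_fp16.reshape(-1, group_size)
        shift = w.mean(dim=1, keepdim=True)
        scale = (w - shift).abs().mean(dim=1, keepdim=True)
        self.register_buffer("signs", torch.sign(w - shift))  # frozen 1-bit codes
        self.register_buffer("scale", scale)
        self.register_buffer("shift", shift)
        # Trainable low-rank adapter: effective weight = dequant(W) + B @ A.
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))

    def dequant(self) -> torch.Tensor:
        # Reverse the quantization: sign * scale + shift, then restore the shape.
        return (self.signs * self.scale + self.shift).reshape(self.shape)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen quantized path plus the trainable low-rank correction.
        return x @ self.dequant().t() + (x @ self.A.t()) @ self.B.t()
```

In this sketch only A and B are parameters while the quantized tensors are registered as buffers, mirroring the general LoRA/QLoRA pattern the post builds on: the base weights stay frozen and only the adapter is trained.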
Post by: homarp
Comments:
vladf: Really strong binary results. So strong it was fishy. I hope someone can explain my confusion below.

> We compared the performance of the Llama2-7B model in three configurations: FP16 (full precision), HQQ (without fine-tuning), and HQQ+ (with adapter layers) using a group-size of 8.

Interesting, what is "group-size of 8"?

From their HQQ post (https://mobiusml.github.io/hqq_blog/), it's the block size at which they add scales (presumably 16-bit) and shifts (in that post, it's 8-bit).

So for every 8 binary weights we have a 16-bit scale and 8-bit shift?

> Fine-tuning with Low-Rank Adapters

They say they inline the shift into the LoRA but how can you do this, block-wise, without increasing your LoRA rank by num-blocks (they claim to only use 1 additional rank)?

Then, the reported 7B sizes, in GB:

> 13.5 (fp16) 1.76 (HQQ 1-bit) 1.85 (HQQ+ 1-bit) 2.72 (quip# 2-bit)

those numbers would make sense if it was actually 1 bit. But if you include the overhead of 16-bit scales (and why is the shift inlineable into lora? still unexplained) it'd be more like 3-bit.

From their HF page:

> This version offloads the meta-data to the CPU, so only the binary weights and the low-rank adapters are stored in the GPU memory.

Interesting, so we have to go back to CPU to rescale? Is this how they counted GB? This should have been clearly caveated in the table. I also am amazed they got latency lower than quip if they pingpong to CPU.
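For concreteness, this is the back-of-the-envelope accounting vladf is doing. The per-group figures (16-bit scale, 8-bit shift, group-size 8) come from the comment itself, not from the post, so treat them as assumptions.

```python
# Effective bits per weight when 1-bit weights carry group-wise meta-data.
# Assumes vladf's reading: group-size 8, a 16-bit scale, and an 8-bit shift.
group_size = 8

weight_bits = 1.0
scale_bits_per_weight = 16 / group_size   # 2.0 extra bits per weight
shift_bits_per_weight = 8 / group_size    # 1.0 extra bit per weight

print(weight_bits + scale_bits_per_weight + shift_bits_per_weight)  # 4.0 bits/weight with all meta-data
print(weight_bits + scale_bits_per_weight)                          # 3.0 bits/weight if the shift is folded into the LoRA
print(weight_bits)                                                   # 1.0 bit/weight if meta-data is off-loaded to CPU
```

Whether the reported GB column counts the off-loaded meta-data or only the binary weights on the GPU is exactly the caveat vladf is asking about.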
londons_explore: I believe the future is 1 bit models - for both training and inference.

When people make custom silicon for 1 bit models, they'll find that it is sooooo much more power and silicon-space efficient to do 1 bit math than 16 bit floating point - like 100x or more.

That extra model size will vastly overshadow any worse performance of the models.
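The efficiency claim above is usually illustrated with the XNOR/popcount trick: a dot product between two {-1, +1} vectors packed into machine words needs no multipliers at all. The snippet below is a generic textbook sketch, not anything from the linked post or from actual 1-bit silicon.

```python
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Dot product of two length-n {-1, +1} vectors, each packed into n bits
    (bit i = 1 encodes +1 for element i, bit i = 0 encodes -1)."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask        # XNOR: 1 wherever the signs match
    matches = bin(agree).count("1")          # popcount
    return 2 * matches - n                   # (#matches) - (#mismatches)

# a = [+1, -1, +1, +1] and b = [+1, +1, -1, +1], read from the least-significant bit:
# dot product = 1 - 1 - 1 + 1 = 0
print(binary_dot(0b1101, 0b1011, 4))         # prints 0
```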
mmoskal: It seems the trick here is they first quantize it to 1- or 2-bit, and then they fine-tune the quantization bias parameters (the parameters that dequantize from 1-2 to 16 bit) via LoRA. Then they have specialized kernels to do matrix multiplication at the bit level.

Also, the 2-bit model seems much better than the 1-bit model - they use [-1, 0, 1, 2] - I wonder if '2' is needed in light of the 1.58b paper (which claims -1 is def. needed).
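A small sketch of the dequantization step mmoskal describes, with the 2-bit levels [-1, 0, 1, 2] recovered from unsigned codes via a zero-point of 1 and a per-group scale. The tensor shapes and names here are made up for illustration, not taken from HQQ's kernels.

```python
import torch

# 2-bit codes in {0, 1, 2, 3}, stored for (say) 4 groups of 8 weights each.
codes = torch.randint(0, 4, (4, 8), dtype=torch.uint8)

# Per-group dequantization parameters. A zero-point of 1 maps the codes onto
# the levels [-1, 0, 1, 2]; per mmoskal's reading, these are the parameters
# that get fine-tuned while the codes themselves stay frozen.
zero_point = torch.full((4, 1), 1.0)
scale = torch.rand(4, 1) * 0.1

levels = codes.float() - zero_point    # values drawn from [-1, 0, 1, 2]
weights = scale * levels               # back to floating-point weights
```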
danielhanchen: Highly respect HQQ team's work - same accuracy as GPTQ / AWQ and with no activation aware tuning calibration part - ie no more 3 hour calibration runs! A big fan of the 4bit ones especially, and the 3, 2, and now 1bit ones!

Also super cool idea of 1bit needing some calibration like AWQ - no calibration data shows very bad results, but with some LoRA adapters and finetuning, a great recovery of performance is possible.

Planning to add support for this inside Unsloth to make all low bit finetuning 2x faster and save tonnes of VRAM!
WithinReason: 1-bit weights have been a thing since at least 2016: https://arxiv.org/abs/1606.06160