[Hacker News Repost] Quantized Llama models with increased speed and a reduced memory footprint
-
Title: Quantized Llama models with increased speed and a reduced memory footprint
Text:
Url: https://ai.meta.com/blog/meta-llama-quantized-lightweight-models/?_fb_noscript=1
Unfortunately, the page content could not be accessed or processed directly, so JinaReader or similar tools could not be used to scrape and analyze the linked article. Some general guidance instead:

1. **Fetching content**: to scrape a web page with JinaReader or another tool, install the corresponding library, send an HTTP request to the target URL, and parse the returned HTML.
2. **Analyzing content**: a scraped page may contain text, images, video, and other media. Text can be processed with natural language processing (NLP) techniques such as keyword extraction, topic modelling, or sentiment analysis.
3. **Translating content**: non-Chinese content can be translated into Chinese with a translation service (e.g. the Google Translate API).

A basic example of scraping a page with Python and the BeautifulSoup library:

```python
import requests
from bs4 import BeautifulSoup

# Target page
url = "https://ai.meta.com/blog/meta-llama-quantized-lightweight-models/?_fb_noscript=1"

# Send the HTTP request and raise on an error status
response = requests.get(url, timeout=30)
response.raise_for_status()

# Parse the HTML content
soup = BeautifulSoup(response.text, "html.parser")

# Print the extracted text
print(soup.get_text())

# To translate non-Chinese content, a translation API can be used.
# Note: the lines below are illustrative only; a working API setup is required.
# from googletrans import Translator
# translator = Translator()
# print(translator.translate(soup.get_text(), src="auto", dest="zh-cn").text)
```

Note that the `requests` and `bs4` packages must be installed to run the code above, and `googletrans` as well if you want to translate non-Chinese content into Chinese. If you have specific code requirements or want to know more about using JinaReader, please provide more details.
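If you would rather avoid third-party dependencies, the standard library's `html.parser` can handle basic text extraction as well. A minimal sketch (the HTML string below is a stand-in for `response.text`, not the actual blog page):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # >0 while inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


# Stand-in HTML; in practice this would come from the HTTP response.
html = """<html><head><title>Quantized Llama models</title>
<script>var tracking = 1;</script></head>
<body><h1>Quantized Llama models</h1><p>Faster, smaller models.</p></body></html>"""

parser = TextExtractor()
parser.feed(html)
text = " ".join(parser.parts)
print(text)
```

This trades BeautifulSoup's robustness on malformed markup for a zero-dependency setup; for messy real-world pages, `bs4` remains the safer choice.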
## Post by: egnehots

### Comments:

**tveita**: So SpinQuant learns a rotation for activations and weights that, to my understanding, "smears" the outlier weights out so you don't get extreme values in any one weight.

Random anecdote warning - in the old days, before vector search became AI and everyone and their dog offered a vector database, I had a task that required nearest-neighbour search over a decent number of high-dimensional vectors.

I tried quantizing them to bit vectors in an index and scanning through it to get an initial set of candidates. Performance was actually quite decent - reading through RAM linearly is fast! But the selectivity wasn't great.

Somewhere along the way I found this paper [1] that iteratively finds a rotation to apply before quantization to reduce the quantization error. Very similar goal to SpinQuant, but focused on bit quantization only.

As it turns out, the "random rotation" baseline they benchmark against worked great for my use case, so I never tried implementing the fancier algorithm. But it's a pretty rare day at work that "apply a random rotation matrix to a 128-dimensional vector" is the solution to my problem.

[1] [https://ieeexplore.ieee.org/abstract/document/6296665](https://ieeexplore.ieee.org/abstract/document/6296665) / [https://slazebni.cs.illinois.edu/publications/ITQ.pdf](https://slazebni.cs.illinois.edu/publications/ITQ.pdf)

**Evidlo**: Why don't they actually say what the size of the model is in GB? That, and average inference times on common hardware, is what I'm curious about.

**nisten**: It's pretty interesting that the new SpinQuant method did not manage to be better than good old NF4-bit QLoRA training (Tim Dettmers really cooked with that one). Really appreciate that Meta published both results + model quants and didn't just make some BS claim about a new SOTA quant like most other bigger companies would've done.

**theanonymousone**: May I ask if anyone has successfully used 1B and 3B models in production and if yes, in what use cases? I seem to be failing even in seemingly simpler tasks such as word translation or zero-shot classification. For example, they seem to not care about instructions to only write a response and no explanation, thus making it impossible to use them in a pipeline :/

**formalsystem**: Hi, I'm Mark. I work on torchao, which was used for the quantization-aware training and ARM kernels in this blog. If you have any questions about quantization or performance more generally, feel free to let me know!
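The rotate-then-quantize trick tveita describes can be sketched in a few lines of NumPy. This is an illustrative toy, not ITQ or SpinQuant itself: a random orthogonal matrix is drawn via QR decomposition of a Gaussian matrix, synthetic vectors with a few outlier coordinates are sign-quantized to bit vectors, and the (normalized) quantization error is compared with and without the rotation. All dimensions and data here are made up for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128  # dimensionality, as in tveita's anecdote

# Random orthogonal rotation: QR decomposition of a Gaussian matrix,
# with column signs fixed so the rotation is uniformly distributed.
A = rng.standard_normal((d, d))
Q, R = np.linalg.qr(A)
Q *= np.sign(np.diag(R))

# Toy data: vectors whose first few "outlier" coordinates dominate.
X = rng.standard_normal((1000, d))
X[:, :4] *= 25.0


def bit_quantize(V):
    """1-bit quantization: keep only the sign of each coordinate."""
    return np.where(V >= 0, 1.0, -1.0)


def quant_error(V):
    """Mean squared error between unit-normalized vectors and their
    unit-normalized sign-quantized versions."""
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    B = bit_quantize(V)
    Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
    return np.mean(np.sum((Vn - Bn) ** 2, axis=1))


err_plain = quant_error(X)        # quantize the raw vectors
err_rotated = quant_error(X @ Q)  # rotate first, then quantize

print(f"error without rotation: {err_plain:.3f}")
print(f"error with rotation:    {err_rotated:.3f}")
```

Sign quantization spreads each vector's mass evenly over all coordinates, so vectors dominated by a few outlier dimensions quantize poorly; a random rotation smears that mass across all dimensions first, which is the same intuition SpinQuant applies to outlier weights.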