[Hacker News Repost] Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference
-
Title: Llama 3.1 405B now runs at 969 tokens/s on Cerebras Inference
Text:
Url: https://cerebras.ai/blog/llama-405b-inference
Unfortunately, as an AI model I cannot directly access external links or run web-crawling tasks, so I cannot use JinaReader to fetch and analyze the content of the link provided here. However, here is some guidance if you want to use JinaReader or a similar tool to fetch and analyze the page yourself:

1. **Install JinaReader**: First, make sure JinaReader is installed. This usually means installing the `jina` package with pip:

```bash
pip install jina
```
-
2. **Fetch the web content**: Use JinaReader to fetch the page. The original snippet sent a `fetch` request through the `jina` Python `Client`, which will not work as written; a minimal sketch that goes through Jina's hosted Reader endpoint instead (prefix the target URL with `https://r.jina.ai/` to get back an LLM-friendly text rendering) looks like this:

```python
import requests

# The article to fetch
url = "https://cerebras.ai/blog/llama-405b-inference"

# Prefixing the URL with the hosted Reader endpoint returns a plain-text /
# markdown rendering of the page suitable for downstream NLP processing
response = requests.get(f"https://r.jina.ai/{url}", timeout=30)
response.raise_for_status()

page_text = response.text
print(page_text[:500])  # preview the first 500 characters
```
-
3. **Analyze the content**: Once the page has been fetched, you can use natural-language-processing (NLP) techniques to analyze the text. Below is a simple frequency-based extractive summary using Python's `nltk` library (the original snippet scored sentences with per-sentence frequency distributions in a way that does not run; this version builds one document-wide frequency table and keeps the three highest-scoring sentences):

```python
from heapq import nlargest

import nltk
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time downloads of the tokenizer models and stopword list
# (newer NLTK releases may also require nltk.download('punkt_tab'))
nltk.download('punkt')
nltk.download('stopwords')

text = page_text  # the page text fetched in the previous step

# Split the text into sentences
sentences = sent_tokenize(text)

# Build a frequency distribution over lower-cased, non-stopword tokens
stop_words = set(stopwords.words('english'))
words = [w.lower() for w in word_tokenize(text)
         if w.isalpha() and w.lower() not in stop_words]
word_freq = FreqDist(words)

# Score a sentence by the total frequency of its words
def sentence_score(sentence):
    return sum(word_freq.get(w.lower(), 0) for w in word_tokenize(sentence))

# Keep the three highest-scoring sentences as the summary
best_sentences = nlargest(3, sentences, key=sentence_score)
summary = ' '.join(best_sentences)
print(summary)
```
-
4. **Translate non-Chinese content**: If the fetched content is not in Chinese, you may need a machine-translation service to render it in Chinese. Commonly used options include the Google Cloud Translation API and the Microsoft Translator Text API; one way to wire this in is sketched below.
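As a sketch of this last step (assuming the `google-cloud-translate` package with its v2 `Client` and valid Google Cloud credentials, neither of which the original post specifies), the summary produced in step 3 could be translated like this:

```python
# Sketch only: assumes `pip install google-cloud-translate` and that
# GOOGLE_APPLICATION_CREDENTIALS points at a valid service-account key.
from google.cloud import translate_v2 as translate

client = translate.Client()

# Translate the summary produced in the previous step into Chinese
result = client.translate(summary, target_language="zh-CN")
print(result["translatedText"])
```

Since cloud translation services bill per character, translating only the extracted summary rather than the full page keeps the cost of this step down.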
Note that the code above is only illustrative and may need adjustments for your environment and data. If you need non-Chinese content in Chinese, integrate a translation API and translate the fetched text before analyzing it.
## Post by: benchmarkist

### Comments:

**zackangelo**: This is astonishingly fast. I'm struggling to get over 100 tok/s on my own Llama 3.1 70B implementation on an 8x H100 cluster.

I'm curious how they're doing it. Obviously the standard bag of tricks (e.g., speculative decoding, flash attention) won't get you close. It seems like at a minimum you'd have to do multi-node inference and maybe some kind of sparse attention mechanism?

**kuprel**: I wonder if Cerebras could generate decent-quality video in real time.

**danpalmer**: I'm not sure if they're comparing apples to apples on the latency here. There are roughly three parts to the latency: the *throughput* of the context/prompt, the time spent queueing for hardware access, and the other standard API overheads (network, etc).

From what I understand, several, maybe all, of the comparison services are not based on provisioned capacity, which means that the measurements include the queue time. For LLMs this can be significant. The Cerebras number on the other hand almost certainly doesn't have some unbounded amount of queue time included, as I expect they had guaranteed hardware access.

The throughput here is amazing, but to get that throughput *at a good latency* for end-users means over-provisioning, and it's unclear what queueing will do to this. Additionally, does that latency depend on the machine being ready with the model, or does that include loading the model if necessary? If using a fine-tuned model, does this change the latency?

I'm sure it's a clear win for batch workloads where you can keep Cerebras machines running at 100% utilisation and get 1k tokens/s constantly.

**bargle0**: Their hardware is cool and bizarre. It has to be seen in person to be believed. It reminds me of the old days when supercomputers were weird.

**LASR**: With what you can do with current-gen models, along with RAG, multi-agent & code interpreters, the wall is very much model latency, and not accuracy any more.

There are so many interactive experiences that could be made possible at this level of token throughput from 405B class models.
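To make the latency decomposition in danpalmer's comment concrete, here is a minimal back-of-the-envelope sketch that splits end-to-end request latency into queueing, prompt (prefill) processing, output generation, and fixed API overhead. Every number in it is a hypothetical placeholder except the 969 tokens/s decode rate from the headline; none of these are measured values from Cerebras or any other provider.

```python
# Hypothetical illustration of the latency decomposition described above.
# None of these numbers are real measurements.

def end_to_end_latency(queue_s, prompt_tokens, prefill_tok_per_s,
                       output_tokens, decode_tok_per_s, overhead_s):
    """Total latency = queueing + prefill + decode + fixed API/network overhead."""
    prefill_s = prompt_tokens / prefill_tok_per_s   # time to process the context/prompt
    decode_s = output_tokens / decode_tok_per_s     # time to generate the output
    return queue_s + prefill_s + decode_s + overhead_s

# Example: a 2,000-token prompt and a 500-token answer at 969 output tokens/s.
# Queue time, prefill throughput, and overhead are made-up values.
latency = end_to_end_latency(
    queue_s=0.0,              # provisioned capacity: no queueing
    prompt_tokens=2000,
    prefill_tok_per_s=20000,  # hypothetical prefill throughput
    output_tokens=500,
    decode_tok_per_s=969,     # the decode rate quoted in the headline
    overhead_s=0.1,           # hypothetical network/API overhead
)
print(f"{latency:.2f} s end-to-end")  # about 0.72 s with these made-up numbers
```

The point of the decomposition is that a very high decode rate only dominates end-to-end latency once queueing and prefill time are small, which is exactly danpalmer's concern when comparing provisioned and non-provisioned services.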
-