[Hacker News repost] 100x defect tolerance: How we solved the yield problem

hackernews

Title: 100x defect tolerance: How we solved the yield problem

Text:

Url: https://cerebras.ai/blog/100x-defect-tolerance-how-cerebras-solved-the-yield-problem

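As an AI, I cannot access external links directly to retrieve their content. However, I can offer a possible summary based on the link and title you provided.

Title: "100x Fault Tolerance: How Cerebras Solved the Yield Problem"

Possible summary:

Cerebras Systems is a company focused on developing large-scale compute chips, and its latest blog post discusses how it solved the yield problem in semiconductor manufacturing through technical innovation. The post notes that as conventional semiconductor manufacturing pushes transistor density and performance higher, it also brings higher defect rates, which directly hurt chip yield. Cerebras proposes an approach it calls "100x fault tolerance," which greatly improves the chip's fault tolerance by building redundancy and error detection/correction mechanisms into the chip design.

Some possible key points:

1. **Background**: As the semiconductor industry pursues higher performance and higher density, it faces falling yields. The main reason is that as transistors shrink, defects in the manufacturing process become harder to control.

2. **Cerebras's solution**: Cerebras developed a new chip architecture that provides up to 100x fault tolerance without affecting performance. This is achieved mainly through:
   - **Redundant design**: Adding extra transistors and circuits to the chip so that defective parts can be replaced or bypassed when an error is detected.
   - **Error detection and correction**: Using advanced error detection and correction techniques to fix errors during chip operation without sacrificing performance.

3. **Results**: With these innovations, Cerebras's chips achieve a significant yield improvement, which lowers production cost and makes the product more competitive.

4. **Outlook**: Cerebras's fault-tolerance technology could transform the wider semiconductor industry and pave the way for higher-performance, more reliable computing devices.

Please note that this summary is based only on the title and on assumptions about the content. For accurate information, visit the link above and read the original article.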

Post by: jwan584

Comments:

ChuckMcM: I think this is an important step, but it skips over that 'fault tolerant routing architecture' means you're spending die space on routes vs transistors. This is exactly analogous to using bits in your storage for error correcting vs storing data.
That said, I think they do a great job of exploiting this technique to create a "larger"[1] chip. And like storage it benefits from every core being the same and from not needing to get to every core directly (pin limiting).
In the early 2000s I was looking at a wafer scale startup that had the same idea but they were applying it to an FPGA architecture rather than a set of tensor units for LLMs. Nearly the exact same pitch, "we don't have to have all of our GLUs[2] work because the built in routing only uses the ones that are qualified." Xilinx was still aggressively suing people who put SERDES ports on FPGAs so they were pin limited overall but the idea is sound.
While I continue to believe that many people are going to collectively lose trillions of dollars ultimately pursuing "AI" at this stage, I appreciate that the amount of money people are willing to put at risk here allows folks to try these "out of the box" kinds of ideas.
[1] It is physically more cores on a single die but the overall system is likely smaller, given the integration here.
[2] "Generic Logic Unit" which was kind of an extended LUT with some block RAM and register support.

ajb: So they massively reduce the area lost to defects per wafer, from 361 to 2.2 square mm. But from the figures in this blog, this is massively outweighed by the fact that they only get 46222 sq mm useable area out of the wafer, as opposed to 56247 that the H100 gets - because they are using a single square die instead of filling the circular wafer with smaller square dies, they lose 10,025 sq mm! Not sure how that's a win. Unless the rest of the wafer is useable for some other customer?
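
The comparison in this comment can be checked with a short Python sketch. All four figures are the ones quoted in the comment; the variable names are illustrative, and the sketch takes no position on whether the usable-area figures are already net of defect losses.

```python
# Figures quoted in the comment above, all in square millimetres.
H100_DEFECT_LOSS_MM2 = 361.0   # area lost to defects per wafer, H100
WSE_DEFECT_LOSS_MM2 = 2.2      # area lost to defects per wafer, Cerebras
H100_USABLE_MM2 = 56247.0      # usable area per wafer, H100
WSE_USABLE_MM2 = 46222.0       # usable area per wafer, Cerebras

# Area recovered by better defect tolerance.
defect_area_saved = H100_DEFECT_LOSS_MM2 - WSE_DEFECT_LOSS_MM2

# Area given up by cutting one square die out of a round wafer.
usable_area_gap = H100_USABLE_MM2 - WSE_USABLE_MM2

# How many times larger the gap is than the saving.
gap_to_savings_ratio = usable_area_gap / defect_area_saved

print(f"defect area saved: {defect_area_saved:.1f} mm^2")
print(f"usable area gap:   {usable_area_gap:.0f} mm^2")
print(f"gap / savings:     {gap_to_savings_ratio:.1f}x")
```

On these figures the usable-area gap is roughly 28 times the defect area saved, which is the imbalance the comment points at.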

NickHoff: Neat. What about power density?
An H100 has a TDP of 700 watts (for the SXM5 version). With a die size of 814 mm^2, that's 0.86 W/mm^2. If the Cerebras chip has the same power density, that means a Cerebras TDP of 39.8 kW.
That's a lot. Let's say you cover the whole die area of the chip with water 1 cm deep. How long would it take to boil the water starting from room temperature (20 degrees C)?
amount of water = (die area of 46225 mm^2) * (1 cm deep) * (density of water) = 462 grams
energy needed = (specific heat of water) * (80 kelvin difference) * (462 grams) = 154 kJ
time = 154 kJ / 39.8 kW = 3.9 seconds
This thing will boil (!) a centimeter of water in 4 seconds. A typical consumer water cooler radiator would reduce the temperature of the coolant water by only 10-15 C relative to ambient, and wouldn't like it (I presume) if you pass in boiling water. To use water cooling you'd need some extreme flow rate and a big rack of radiators, right? I don't really know. I'm not even sure if that would work. How do you cool a chip at this power density?
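
The back-of-the-envelope calculation in this comment can be reproduced with a minimal Python sketch. The TDP, die sizes, water depth and temperature rise come from the comment; the specific heat of water (about 4.186 J per gram per kelvin) and its density (1 g per cm^3) are standard values that the comment uses but does not state. The function name is illustrative. The result is the time to bring the water to its boiling point, ignoring latent heat and any heat loss.

```python
H100_TDP_W = 700.0             # SXM5 version, from the comment
H100_DIE_MM2 = 814.0           # from the comment
WAFER_DIE_MM2 = 46225.0        # Cerebras die area used in the comment
WATER_DEPTH_MM = 10.0          # 1 cm of water over the die
DELTA_T_K = 80.0               # 20 C up to 100 C

# Standard physical constants (assumed; not stated in the comment).
WATER_DENSITY_G_PER_MM3 = 0.001
SPECIFIC_HEAT_J_PER_G_K = 4.186


def time_to_boiling_point():
    """Return (power in W, water mass in g, energy in J, time in s)."""
    power_density_w_per_mm2 = H100_TDP_W / H100_DIE_MM2
    power_w = power_density_w_per_mm2 * WAFER_DIE_MM2
    mass_g = WAFER_DIE_MM2 * WATER_DEPTH_MM * WATER_DENSITY_G_PER_MM3
    energy_j = SPECIFIC_HEAT_J_PER_G_K * DELTA_T_K * mass_g
    return power_w, mass_g, energy_j, energy_j / power_w


power_w, mass_g, energy_j, seconds = time_to_boiling_point()
print(f"power:  {power_w / 1000:.1f} kW")
print(f"water:  {mass_g:.0f} g")
print(f"energy: {energy_j / 1000:.0f} kJ")
print(f"time:   {seconds:.1f} s")
```

The sketch gives about 39.8 kW and about 3.9 seconds, matching the figures used in the final step of the comment.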

highfrequency: To summarize: localize defect contamination to a very small unit size, by making the cores tiny and redundant. Analogous to a conglomerate wrapping each business vertical in a limited liability veil so that lawsuits and bankruptcy do not bring down the whole company. The smaller the subsidiaries, the less defect contamination but also the less scope for frictionless resource and information sharing.
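
The effect of core size on defect losses can be illustrated with a minimal sketch, assuming a simple Poisson defect model: defects land independently and uniformly, and one defect disables exactly the core it lands in. The defect density and the two core areas below are hypothetical values chosen for illustration and do not come from the thread; only the total die area is taken from the comments.

```python
import math


def expected_area_lost(total_area_mm2, core_area_mm2, defects_per_mm2):
    """Expected silicon area disabled by defects under a Poisson model.

    A core is lost if at least one defect lands in it, which happens with
    probability 1 - exp(-defect_density * core_area).
    """
    p_core_bad = 1.0 - math.exp(-defects_per_mm2 * core_area_mm2)
    return total_area_mm2 * p_core_bad


TOTAL_AREA_MM2 = 46225.0       # die area quoted in the thread
DEFECTS_PER_MM2 = 0.001        # hypothetical defect density
LARGE_CORE_MM2 = 6.0           # hypothetical large core
SMALL_CORE_MM2 = 0.05          # hypothetical tiny core

lost_large = expected_area_lost(TOTAL_AREA_MM2, LARGE_CORE_MM2, DEFECTS_PER_MM2)
lost_small = expected_area_lost(TOTAL_AREA_MM2, SMALL_CORE_MM2, DEFECTS_PER_MM2)

print(f"area lost with large cores: {lost_large:.1f} mm^2")
print(f"area lost with small cores: {lost_small:.1f} mm^2")
print(f"ratio: {lost_large / lost_small:.0f}x")
```

When the defect density times the core area is small, the expected loss scales almost linearly with core area, so shrinking the core by some factor shrinks the loss by about the same factor. The model says nothing about the cost side of the trade-off that the comment also raises.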

bee_rider: > Second, a cluster of defects could overwhelm fault tolerant areas and disable the whole chip.
That’s an interesting point. In architecture class (which was basic and abstract so I’m sure Cerebras is doing something much more clever), we learned that defects cluster, but this is a good thing. A bunch of defects clustering on one core takes out the core, a bunch of defects not clustering could take out… a bunch of cores, maybe rendering the whole chip useless.
I wonder why they don’t like clustering. I could imagine in a network of little cores, maybe enough defects clustered on the network could… sort of overwhelm it, maybe?
Also I wonder how much they benefit from being on one giant wafer. It is definitely cool as hell. But could chiplets eat away at their advantage?
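
The textbook half of this argument, that clustered defects disable fewer cores than the same number of scattered defects, can be illustrated with a small Monte Carlo sketch. Everything here is a hypothetical toy model: the grid size, defect count and cluster spread are made-up values, a core counts as lost if any defect lands in it, and the sketch does not model the scenario quoted from the article, in which a cluster overwhelms the fault-tolerant fabric itself.

```python
import random

GRID = 100            # hypothetical 100 x 100 grid of identical cores
NUM_DEFECTS = 50      # hypothetical number of defects on the wafer
CLUSTER_SIGMA = 1.5   # hypothetical cluster spread, in core widths


def cores_hit_uniform(rng):
    """Count distinct cores hit when defects are scattered uniformly."""
    hits = {(rng.randrange(GRID), rng.randrange(GRID)) for _ in range(NUM_DEFECTS)}
    return len(hits)


def cores_hit_clustered(rng):
    """Count distinct cores hit when defects form one Gaussian cluster."""
    centre = GRID / 2
    hits = set()
    for _ in range(NUM_DEFECTS):
        # Clamp to the grid so every defect lands on a real core.
        x = min(GRID - 1, max(0, int(rng.gauss(centre, CLUSTER_SIGMA))))
        y = min(GRID - 1, max(0, int(rng.gauss(centre, CLUSTER_SIGMA))))
        hits.add((x, y))
    return len(hits)


rng = random.Random(0)  # fixed seed so the run is repeatable
uniform_hits = cores_hit_uniform(rng)
clustered_hits = cores_hit_clustered(rng)

print(f"cores lost, scattered defects: {uniform_hits}")
print(f"cores lost, clustered defects: {clustered_hits}")
```

In this toy model several clustered defects tend to fall on the same core, so fewer distinct cores are lost. The flip side raised in the comment is that all of those losses sit in one neighbourhood, which is where local spare capacity would be stressed.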
