[Hacker News repost] JetMoE: Reaching LLaMA2 performance with $100k
-
Title: JetMoE: Reaching LLaMA2 performance with 0.1M dollars
Text:
Url: https://research.myshell.ai/jetmoe
Key takeaways:
- JetMoE-8B was trained for less than $0.1 million, yet it outperforms LLaMA2-7B from Meta AI, which has multi-billion-dollar training resources. Training an LLM can be far cheaper than people generally think.
- JetMoE-8B is fully open and academia-friendly:
  - It is trained only on public datasets and the code is open-sourced; no proprietary resources are needed.
  - It can be fine-tuned with very limited compute (e.g., a consumer-grade GPU) that most labs can afford.
- JetMoE-8B activates only 2.2B parameters during inference, which sharply reduces compute cost. Compared with models of similar inference compute, such as Gemma-2B, JetMoE-8B consistently performs better.
- Training used a 96×H100 GPU cluster for two weeks and cost about $0.08 million.

GitHub: https://github.com/myshell-ai/JetMoE
HuggingFace: https://huggingface.co/jetmoe/jetmoe-8b
Chat Demo on Lepton AI: https://www.lepton.ai/playground/chat?model=jetmoe-8b-chat

Authors: The project is contributed by Yikang Shen, Zhen Guo, Tianle Cai, and Zengyi Qin. For technical questions, contact Yikang Shen; for media and collaboration inquiries, contact Zengyi Qin.

Collaboration: If you have great ideas but need more resources (GPUs, data, funding, etc.), feel free to contact Zengyi Qin. We welcome collaboration and actively support high-quality open-source projects.

Benchmarks: We use the same evaluation methodology as the Open LLM Leaderboard. For the MBPP code benchmark, we use the same evaluation methodology as the LLaMA2 and DeepSeek-MoE papers. Results:

Model           | Active Params | Training Tokens | MBPP | Open LLM Leaderboard Avg | ARC  | Hellaswag | MMLU | TruthfulQA | WinoGrande | GSM8K
Gemma-2B        | 2B            | 2T              | 28.0 | 46.4                     | 48.4 | 71.8      | 41.8 | 33.1       | 66.3       | 16.9
DeepseekMoE-16B | 2.8B          | 2T              | 34.0 | 51.1                     | 53.2 | 79.8      | 46.3 | 36.1       | 73.7       | 17.3
LLaMA2-7B       | 7B            | 2T              | 20.8 | 51.0                     | 53.1 | 78.6      | 46.9 | 38.8       | 74.0       | 14.5
LLaMA-13B       | 13B           | 1T              | 22.0 | 51.4                     | 56.2 | 80.9      | 47.7 | 39.5       | 76.2       | 7.6
JetMoE-8B       | 2.2B          | 1.25T           | 34.2 | 53.0                     | 48.7 | 80.5      | 49.2 | 41.7       | 70.2       | 27.8

Model            | MT-Bench Score
GPT-4            | 9.014
GPT-3.5-turbo    | 7.995
Claude-v1        | 7.923
JetMoE-8B-chat   | 6.681
Llama-2-13b-chat | 6.650
Vicuna-13b-v1.3  | 6.413
Wizardlm-13b     | 6.353
Llama-2-7b-chat  | 6.269

Surprisingly, despite its much lower training cost and compute, JetMoE-8B even outperforms LLaMA2-7B, LLaMA-13B, and DeepseekMoE-16B. Compared with models of similar training and inference compute, such as Gemma-2B, JetMoE-8B achieves better performance.

Model details: JetMoE uses a sparsely activated architecture inspired by ModuleFormer.
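Since the checkpoints are published on HuggingFace, a minimal sketch of loading the base model with the transformers library might look like the following; the dtype, device_map, and trust_remote_code settings are assumptions on my part, not instructions from the JetMoE team:

    # Minimal sketch: load the released base checkpoint from HuggingFace.
    # Assumptions (not from the announcement): bfloat16 weights, `accelerate`
    # installed for device_map="auto", and trust_remote_code=True in case the
    # JetMoE architecture is not yet built into your transformers version.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "jetmoe/jetmoe-8b"  # repo linked in the announcement

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
    )

    prompt = "JetMoE is a sparsely activated language model that"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))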
Post by: gyre007
Comments:
lolinder: > JetMoE-8B is trained with less than $0.1 million cost but outperforms LLaMA2-7B from Meta AI, who has multi-billion-dollar training resources. LLM training can be much cheaper than people generally thought.

They want you to read this as "we spent $100k compared to Meta's spending billions", but that's not actually what this says. It says that they spent $100k and Meta *has the resources* to spend billions if they wanted to.

We don't know what Facebook spent on training LLaMA 2, but they say that it took them 184,320 A100-80GB GPU-hours to train the 7B model [0]. AWS charges $14.46/hour for an instance that has 8 of those [1], which amounts to $1.81/GPU/hr.

At that rate, and assuming they paid something resembling AWS's list price, LLaMA 2 7B cost ~$333k. That's more than $100k, but not by orders of magnitude, and it's likely that Facebook wasn't paying the full price AWS is charging today.

[0] https://github.com/meta-llama/llama/blob/main/MODEL_CARD.md#hardware-and-software
[1] https://aws.amazon.com/ec2/instance-types/p4/
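A quick sketch of that arithmetic, using only the figures quoted in the comment (AWS list price; Meta's actual rate is unknown):

    # Estimate LLaMA 2 7B training cost at AWS list prices, per the comment above.
    gpu_hours = 184_320            # A100-80GB GPU-hours from Meta's model card [0]
    p4d_hourly = 14.46             # USD/hour for an 8-GPU p4d instance [1]
    per_gpu_hour = p4d_hourly / 8  # ~ $1.81 per GPU-hour
    print(f"~${gpu_hours * per_gpu_hour:,.0f}")  # ~ $333,000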
plufz: You’ve been in tech for too long when 1 million USD is your smallest unit.
antimatter15: It looks like Llama 2 7B took 184,320 A100-80GB GPU-hours to train [1]. This one says it used a 96×H100 GPU cluster for 2 weeks, i.e. 32,256 GPU-hours. That's 17.5% of the number of hours, but H100s are faster than A100s [2] and FP16/bfloat16 performance is ~3x better.

If they had tried to replicate Llama 2 identically with their hardware setup, it'd cost a little bit less than twice their MoE model.

[1] https://github.com/meta-llama/llama/blob/main/MODEL_CARD.md#hardware-and-software
[2] https://blog.ori.co/choosing-between-nvidia-h100-vs-a100-performance-and-costs-considerations
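The same comparison as a small sketch (the ~3x H100-vs-A100 bf16 factor is the comment's rough assumption, not a measured value):

    # Compare the two training runs described above.
    llama2_a100_hours = 184_320        # from Meta's model card [1]
    jetmoe_h100_hours = 96 * 24 * 14   # 96 H100s for two weeks = 32,256 GPU-hours
    print(f"{jetmoe_h100_hours / llama2_a100_hours:.1%}")  # ~17.5% of the GPU-hours

    # Assuming ~3x bf16 throughput per H100 vs A100 [2], replicating LLaMA 2 7B
    # on the JetMoE cluster would take roughly:
    print(f"{(llama2_a100_hours / 3) / jetmoe_h100_hours:.1f}x the JetMoE run")  # ~1.9x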
vertis: It might be equivalent to LLaMA2 but it's still not capable of even simple reasoning:

> If two apples cost 2 dollars and 1 apple costs 1.20, what is the discount you're getting

> To calculate the discount, you need to compare the original price of the apples to the price after the discount.

> The original price for two apples is: 2 apples * $1.20/apple = $2.40

> The price for one apple after the discount is: 1 apple * $2.00/apple = $2.00

> Now, let's calculate the discount for one apple:
> Original price for one apple - Discounted price for one apple = Discount amount
> $2.40 - $2.00 = $0.40

> The discount for one apple is $0.40. Since you're buying two apples, you'll get the discount on both:
> Discount amount for two apples = 2 * $0.40 = $0.80

> So, the discount you're getting for two apples is $0.80.
kleiba: I've been out of academia for a bit, but in my day 100k USD would *not* have been considered academia-friendly in my neck of the woods...