[Hacker News repost] JetMoE: Reaching LLaMA2 performance with $100k
-
Title: JetMoE: Reaching LLaMA2 performance with 0.1M dollars
Text:
Url: https://research.myshell.ai/jetmoe
Key takeaways:
- JetMoE-8B was trained for less than $0.1 million, yet it outperforms LLaMA2-7B from Meta AI, which has multi-billion-dollar training resources. Training an LLM can be far cheaper than people generally think.
- JetMoE-8B is fully open and academia-friendly:
  - It is trained only on public datasets and the code is open-sourced; no proprietary resources are needed.
  - It can be fine-tuned with very limited compute (e.g., a consumer-grade GPU) that most labs can afford.
- JetMoE-8B activates only 2.2B parameters during inference, which sharply reduces compute cost. Compared with models of similar inference compute, such as Gemma-2B, JetMoE-8B consistently performs better.
- Training used a 96×H100 GPU cluster for two weeks and cost about $0.08 million.

GitHub: https://github.com/myshell-ai/JetMoE
HuggingFace: https://huggingface.co/jetmoe/jetmoe-8b
Chat Demo on Lepton AI: https://www.lepton.ai/playground/chat?model=jetmoe-8b-chat

Authors: The project is contributed by Yikang Shen, Zhen Guo, Tianle Cai, and Zengyi Qin. For technical questions, contact Yikang Shen; for media and collaboration inquiries, contact Zengyi Qin.

Collaboration: If you have great ideas but need more resources (GPUs, data, funding, etc.), feel free to contact Zengyi Qin. We welcome collaboration and actively support high-quality open-source projects.

Benchmarks: We use the same evaluation methodology as the Open LLM Leaderboard. For the MBPP code benchmark, we use the same evaluation methodology as the LLaMA2 and DeepSeek-MoE papers. Results:

Model           | Active Params | Training Tokens | MBPP | Open LLM Leaderboard Avg | ARC  | Hellaswag | MMLU | TruthfulQA | WinoGrande | GSM8K
Gemma-2B        | 2B            | 2T              | 28.0 | 46.4                     | 48.4 | 71.8      | 41.8 | 33.1       | 66.3       | 16.9
DeepseekMoE-16B | 2.8B          | 2T              | 34.0 | 51.1                     | 53.2 | 79.8      | 46.3 | 36.1       | 73.7       | 17.3
LLaMA2-7B       | 7B            | 2T              | 20.8 | 51.0                     | 53.1 | 78.6      | 46.9 | 38.8       | 74.0       | 14.5
LLaMA-13B       | 13B           | 1T              | 22.0 | 51.4                     | 56.2 | 80.9      | 47.7 | 39.5       | 76.2       | 7.6
JetMoE-8B       | 2.2B          | 1.25T           | 34.2 | 53.0                     | 48.7 | 80.5      | 49.2 | 41.7       | 70.2       | 27.8

Model            | MT-Bench Score
GPT-4            | 9.014
GPT-3.5-turbo    | 7.995
Claude-v1        | 7.923
JetMoE-8B-chat   | 6.681
Llama-2-13b-chat | 6.650
Vicuna-13b-v1.3  | 6.413
Wizardlm-13b     | 6.353
Llama-2-7b-chat  | 6.269

Surprisingly, despite its much lower training cost and compute, JetMoE-8B even outperforms LLaMA2-7B, LLaMA-13B, and DeepseekMoE-16B. Compared with models of similar training and inference compute, such as Gemma-2B, JetMoE-8B achieves better performance.

Model details: JetMoE uses a sparsely activated architecture inspired by ModuleFormer.
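Since the checkpoints are published on HuggingFace, a minimal sketch of loading the base model with the transformers library might look like the following; the dtype, device_map, and trust_remote_code settings are assumptions on my part, not instructions from the JetMoE team:

    # Minimal sketch: load the released base checkpoint from HuggingFace.
    # Assumptions (not from the announcement): bfloat16 weights, `accelerate`
    # installed for device_map="auto", and trust_remote_code=True in case the
    # JetMoE architecture is not yet built into your transformers version.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "jetmoe/jetmoe-8b"  # repo linked in the announcement

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        trust_remote_code=True,
    )

    prompt = "JetMoE is a sparsely activated language model that"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))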
Post by: gyre007
Comments:
lolinder: > JetMoE-8B is trained with less than $0.1 million cost but outperforms LLaMA2-7B from Meta AI, who has multi-billion-dollar training resources. LLM training can be much cheaper than people generally thought.

They want you to read this as "we spent $100k compared to Meta's spending billions", but that's not actually what this says. It says that they spent $100k and Meta *has the resources* to spend billions if they wanted to.

We don't know what Facebook spent on training LLaMA 2, but they say that it took them 184,320 A100-80GB GPU-hours to train the 7B model [0]. AWS charges $14.46/hour for an instance that has 8 of those [1], which amounts to $1.81/GPU/hr.

At that rate, and assuming they paid something resembling AWS's list price, LLaMA 2 7B cost ~$333k. That's more than $100k, but not by orders of magnitude, and it's likely that Facebook wasn't paying the full price AWS is charging today.

[0] https://github.com/meta-llama/llama/blob/main/MODEL_CARD.md#hardware-and-software
[1] https://aws.amazon.com/ec2/instance-types/p4/
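A quick sketch of that arithmetic, using only the figures quoted in the comment (AWS list price; Meta's actual rate is unknown):

    # Estimate LLaMA 2 7B training cost at AWS list prices, per the comment above.
    gpu_hours = 184_320            # A100-80GB GPU-hours from Meta's model card [0]
    p4d_hourly = 14.46             # USD/hour for an 8-GPU p4d instance [1]
    per_gpu_hour = p4d_hourly / 8  # ~ $1.81 per GPU-hour
    print(f"~${gpu_hours * per_gpu_hour:,.0f}")  # ~ $333,000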
plufz: You’ve been in tech for too long when 1 million USD is your smallest unit.
antimatter15: It looks like Llama 2 7B took 184,320 A100-80GB GPU-hours to train [1]. This one says it used a 96×H100 GPU cluster for 2 weeks, i.e. 32,256 GPU-hours. That's 17.5% of the number of hours, but H100s are faster than A100s [2] and FP16/bfloat16 performance is ~3x better.

If they had tried to replicate Llama 2 identically with their hardware setup, it'd cost a little bit less than twice their MoE model.

[1] https://github.com/meta-llama/llama/blob/main/MODEL_CARD.md#hardware-and-software
[2] https://blog.ori.co/choosing-between-nvidia-h100-vs-a100-performance-and-costs-considerations
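The same comparison as a small sketch (the ~3x H100-vs-A100 bf16 factor is the comment's rough assumption, not a measured value):

    # Compare the two training runs described above.
    llama2_a100_hours = 184_320        # from Meta's model card [1]
    jetmoe_h100_hours = 96 * 24 * 14   # 96 H100s for two weeks = 32,256 GPU-hours
    print(f"{jetmoe_h100_hours / llama2_a100_hours:.1%}")  # ~17.5% of the GPU-hours

    # Assuming ~3x bf16 throughput per H100 vs A100 [2], replicating LLaMA 2 7B
    # on the JetMoE cluster would take roughly:
    print(f"{(llama2_a100_hours / 3) / jetmoe_h100_hours:.1f}x the JetMoE run")  # ~1.9x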
vertis: It might be equivalent to LLaMA2 but it's still not capable of even simple reasoning:

> If two apples cost 2 dollars and 1 apple costs 1.20, what is the discount you're getting

> To calculate the discount, you need to compare the original price of the apples to the price after the discount.

> The original price for two apples is: 2 apples * $1.20/apple = $2.40

> The price for one apple after the discount is: 1 apple * $2.00/apple = $2.00

> Now, let's calculate the discount for one apple:
> Original price for one apple - Discounted price for one apple = Discount amount
> $2.40 - $2.00 = $0.40

> The discount for one apple is $0.40. Since you're buying two apples, you'll get the discount on both:
> Discount amount for two apples = 2 * $0.40 = $0.80

> So, the discount you're getting for two apples is $0.80.
kleiba: I've been out of academia for a bit, but in my day 100k USD would *not* have been considered academia-friendly in my neck of the woods...