[Hacker News Repost] Phi-3 Technical Report
-
Title: Phi-3 Technical Report
Text:
Url: https://arxiv.org/abs/2404.14219
This paper introduces phi-3-mini, a language model with 3.8 billion parameters trained on 3.3 trillion tokens. According to both academic benchmarks and internal testing, its overall performance rivals models such as Mixtral 8x7B and GPT-3.5 (for example, 69% on MMLU and 8.38 on MT-bench), despite being small enough to be deployed on a phone. The innovation lies entirely in the training dataset, a scaled-up version of the one used for phi-2, composed of heavily filtered web data and synthetic data. The model is further aligned for robustness, safety, and chat format. The paper also provides some initial parameter-scaling results with 7B and 14B models, called phi-3-small and phi-3-medium, trained on 4.8T tokens; both are significantly more capable than phi-3-mini (75% and 78% on MMLU, and 8.7 and 8.9 on MT-bench, respectively).
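Since the abstract stresses that phi-3-mini is small enough to run locally, here is a minimal sketch (not from the report) of loading it with Hugging Face transformers. The checkpoint name "microsoft/Phi-3-mini-4k-instruct" and the loading options are assumptions for illustration and may differ from the checkpoint Microsoft actually publishes.

```python
# Hedged sketch: run an assumed phi-3-mini checkpoint locally with transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # 3.8B parameters fits on fairly modest hardware
    device_map="auto",
    trust_remote_code=True,       # Phi-3 shipped with custom model code at release
)

# The model is chat-aligned, so format the prompt with the chat template.
messages = [{"role": "user", "content": "Summarize the Phi-3 technical report in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```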
Post by: varunvummadi
Comments:
oersted: Incredible, rivals Llama 3 8B with 3.8B parameters after less than a week of release.
And on LMSYS English, Llama 3 8B is on par with GPT-4 (not GPT-4-Turbo), as well as Mistral-Large.
Source: https://chat.lmsys.org/?leaderboard (select English in the dropdown)
So we now have an open-source LLM approximately equivalent in quality to GPT-4 that can run on phones? Kinda? Wild.
(I'm sure there's a lot of nuance to it, for one these benchmarks are not so hard to game, we'll see how the dust settles, but still...)
Phi-3-mini 3.8b: 71.2
Phi-3-small 7b: 74.9
Phi-3-medium 14b: 78.2
Phi-2 2.7b: 58.8
Mistral 7b: 61.0
Gemma 7b: 62.0
Llama-3-In 8b: 68.0
Mixtral 8x7b: 69.9
GPT-3.5 1106: 75.3
(these are averages across all tasks for each model, but looking at individual scores shows a similar picture)
brcmthrowaway: If I was Apple I'd be quaking in my boots. They are getting too far behind to ever catch up. Nokia in 2010 vibes.
simonw: I'm getting a bit skeptical of MMLU at this point. As far as I can tell it's a set of multiple choice questions that hasn't been updated since 2020. We have to trust the model providers not to deliberately or accidentally train on it for those scores to be useful.
blackoil: Has anyone used these/similar with fine tune and RAG? How is the performance over a narrow domain for simple queries? Is it good enough for say an informational chat bot?
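To make blackoil's question concrete, below is a minimal, hedged sketch of the RAG pattern with a small chat model: embed a handful of domain documents, retrieve the closest ones for a query, and stuff them into the prompt. The checkpoint names (microsoft/Phi-3-mini-4k-instruct and sentence-transformers/all-MiniLM-L6-v2) are assumptions for illustration, not recommendations from the thread, and the chat-style pipeline output format assumes a recent transformers version.

```python
# Hedged sketch of a narrow-domain RAG loop around an assumed small chat model.
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# Toy "narrow domain" corpus standing in for real documents.
docs = [
    "Our support line is open 9am-5pm on weekdays.",
    "Refunds are issued within 14 days of purchase.",
    "Premium accounts include priority email support.",
]

embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")  # assumed embedder
doc_vecs = embedder.encode(docs, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents whose embeddings are closest to the query."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q                      # cosine similarity (vectors are normalized)
    return [docs[i] for i in np.argsort(-scores)[:k]]

# Assumed model ID; at release Phi-3 required trust_remote_code=True.
chat = pipeline("text-generation", model="microsoft/Phi-3-mini-4k-instruct",
                trust_remote_code=True)

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))
    messages = [
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
    ]
    out = chat(messages, max_new_tokens=100, do_sample=False)
    # In recent transformers versions, the last turn of generated_text is the assistant reply.
    return out[0]["generated_text"][-1]["content"]

print(answer("How long do refunds take?"))
```

Whether a 3.8B model is "good enough" for an informational chatbot then comes down to how well it sticks to the retrieved context, which is exactly the kind of narrow-domain evaluation the comment is asking about.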
hackerlight: Less tokens than Llama 3 (3.3T vs 15T) yet better outcome. No doubt more information dense training data. The interesting thing is the use of synthetic data which they don't talk about.