[Hacker News Repost] Stability.ai – Introducing Stable Video 3D
-
Title: Stability.ai – Introducing Stable Video 3D
Text:
From: https://news.ycombinator.com/item?id=39749312
Url: https://stability.ai/news/introducing-stable-video-3d
Title: Introducing Stable Video 3D: Quality Novel View Synthesis and 3D Generation from Single Images – Stability AI
Author: Anel Islamovic
Published: 18 Mar
Text: SV3D takes a single object image as input and outputs novel multi-views of that object. We can then use these novel views and SV3D to generate 3D meshes. When we released Stable Video Diffusion, we highlighted the versatility of our video model across a variety of applications. Building on that foundation, we are excited to release Stable Video 3D. This new model advances the state of 3D technology, delivering greatly improved quality and novel views compared to the previously released Stable Zero123, and outperforming other open-source alternatives such as Zero123-XL. This release features two variants:
SV3D_u: This variant generates orbital videos from a single image input, without camera conditioning.
SV3D_p: Extending the capabilities of SV3D_u, this variant accepts both single images and orbital views, allowing the creation of 3D video along specified camera paths.
Stable Video 3D is available now for commercial use with a Stability AI Membership. For non-commercial use, you can download the model weights on Hugging Face and view our research paper here.
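For non-commercial experimentation, fetching the weights is a one-liner with the huggingface_hub client. A minimal sketch; the repo id and filename below are assumptions based on Stability AI's usual naming, so check the actual model card:

```python
# Sketch: download the SV3D_u weights from Hugging Face for local use.
# The repo id and filename are assumptions, not confirmed by this post;
# consult the official model card for the real paths and license gating.
from huggingface_hub import hf_hub_download

weights_path = hf_hub_download(
    repo_id="stabilityai/sv3d",      # assumed repo id
    filename="sv3d_u.safetensors",   # assumed filename for the SV3D_u variant
)
print(f"SV3D_u weights saved to: {weights_path}")
```

At the time of this thread, sampling reportedly went through Stability's own sample script rather than a packaged pipeline (see kouteiheika's comment below), so expect to clone their generative-models repository and point its script at these weights.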
Advantages of Video Diffusion
By adapting our Stable Video Diffusion image-to-video diffusion model with the addition of camera path conditioning, Stable Video 3D is able to generate multi-view videos of an object. Compared to the image diffusion model used in Stable Zero123, video diffusion models offer major advantages in generalization and view consistency. In addition, we propose improved 3D optimization that leverages Stable Video 3D's powerful ability to generate arbitrary orbits around an object. By further implementing these techniques together with disentangled illumination optimization and a new masked score distillation sampling loss function, Stable Video 3D can reliably output quality 3D meshes from single image inputs.
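For readers unfamiliar with score distillation sampling (SDS), here is a rough sketch of what a masked variant of the loss could look like, in illustrative notation that is not taken from the paper:

$$
\nabla_\theta \mathcal{L}_{\text{masked-SDS}} = \mathbb{E}_{t,\,\epsilon,\,\pi}\!\left[\, w(t)\; M_\pi \odot \big(\hat{\epsilon}_\phi(x_t;\, \pi,\, t) - \epsilon\big)\, \frac{\partial x}{\partial \theta} \,\right]
$$

Here $\theta$ parameterizes the 3D representation being optimized, $x$ is a rendering from camera pose $\pi$, $x_t$ is its noised version at timestep $t$, $\hat{\epsilon}_\phi$ is the diffusion model's noise prediction, $w(t)$ is a timestep weighting, and $M_\pi$ is a per-pose mask restricting the distillation signal to reliable regions. The exact formulation used by SV3D is in the technical report linked below.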
See the detailed technical report here for more on the Stable Video 3D model and experimental comparisons.
Novel View Generation
Stable Video 3D brings significant advances to 3D generation, particularly in novel view synthesis (NVS). Unlike previous approaches, which often struggle with limited perspectives and inconsistent outputs, Stable Video 3D delivers coherent views from any given angle with strong generalization. This capability not only improves pose controllability but also ensures consistent object appearance across multiple views, further improving key aspects of realistic and accurate 3D generation.
Post by: ed
Comments:
extheat: At 8x86B, looks like the largest open model yet by far. Would be interesting to hear how many tokens it's been trained on. Especially important for higher param models in order to efficiently utilize all those parameters.
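For context on why token count matters at this scale, a back-of-envelope sketch using the common "Chinchilla" heuristic of roughly 20 training tokens per parameter (the 314B figure is Grok-1's reported total parameter count; both numbers are illustrative, not from this thread):

```python
# Rough compute-optimal token estimate for a model of Grok-1's size,
# using the Chinchilla-style ~20 tokens-per-parameter rule of thumb.
total_params = 314e9       # Grok-1's reported total parameters (8x86B MoE)
tokens_per_param = 20      # heuristic from the Chinchilla scaling study
optimal_tokens = total_params * tokens_per_param
print(f"~{optimal_tokens / 1e12:.1f}T tokens")  # ~6.3T tokens
```

For a sparse MoE only a fraction of those parameters are active per token, so the effective ratio is debatable, but the implication of a multi-trillion-token corpus stands.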
ilaksh: Has anyone outside of x.ai actually done inference with this model yet? And if so, have they provided details of the hardware? What type of AWS instance or whatever?

I think you can rent like an 8 x A100 or 8 x H100 and it's "affordable" to play around with for at least a few minutes. But you would need to know exactly how to set up the GPU cluster.

Because I doubt it's as simple as just 'python run.py' to get it going.
nasir: I'd be very curious to see how it performs, especially on inputs that are blocked by other models. Seems like Grok will differentiate itself from other OS models from a censorship and alignment perspective.
simonw: "Base model trained on a large amount of text data, not fine-tuned for any particular task."<p>Presumably the version they've been previewing on Twitter is an instruction-tuned model which behaves quite differently from these raw weights.
simonw: ";基于大量文本数据训练的基础模型,不针对任何特定任务进行微调"<p> 据推测,他们的版本;我在推特上预览了一个经过指令调整的模型,它的行为与这些原始权重截然不同。
nylonstrung: For what reason would you want to use this instead of open source alternatives like Mistral?
jjcm: I think it's smart to start trying things here. This has infinite flaws with it, but from a business and learnings standpoint it's a step toward the right direction. Over time we're going to both learn and decide what is and isn't important to designate as "AI" - Google's approach here at least breaks this into rules of what "AI" things are important to label:

• Makes a real person appear to say or do something they didn't say or do
• Alters footage of a real event or place
• Generates a realistic-looking scene that didn't actually occur

At the very least this will test each of these hypotheses, which we'll learn from and iterate on. I am curious to see the legal arguments that will inevitably kick up from each of these - is color correction altering footage of a real event or place? They explicitly say it isn't in the wider description, but what about beauty filters? If I have 16 video angles, and use photogrammetry / gaussian splatting / AI to generate a 17th, is that a realistic-looking scene that didn't actually occur? Do I need to have actually captured the photons themselves if I can be 99% sure my predictions of them are accurate?

So many flaws, but all early steps have flaws. At least it is a step.
summerlight: Looks like there is a huge gray area that they need to figure out in practice. From https://support.google.com/youtube/answer/14328491#:

Examples of content that creators don't have to disclose:

- Someone riding a unicorn through a fantastical world
- Green screen used to depict someone floating in space
- Color adjustment or lighting filters
- Special effects filters, like adding background blur or vintage effects
- Production assistance, like using generative AI tools to create or improve a video outline, script, thumbnail, title, or infographic
- Caption creation
- Video sharpening, upscaling or repair and voice or audio repair
- Idea generation

Examples of content that creators need to disclose:

- Synthetically generating music (including music generated using Creator Music)
- Voice cloning someone else's voice to use it for voiceover
- Synthetically generating extra footage of a real place, like a video of a surfer in Maui for a promotional travel video
- Synthetically generating a realistic video of a match between two real professional tennis players
- Making it appear as if someone gave advice that they did not actually give
- Digitally altering audio to make it sound as if a popular singer missed a note in their live performance
- Showing a realistic depiction of a tornado or other weather events moving toward a real city that didn't actually happen
- Making it appear as if hospital workers turned away sick or wounded patients
- Depicting a public figure stealing something they did not steal, or admitting to stealing something when they did not make that admission
- Making it look like a real person has been arrested or imprisoned
the_duke: They don't bother to mention it, but this is actually to comply with the new EU AI act.

> Providers will also have to ensure that AI-generated content is identifiable. Besides, AI-generated text published with the purpose to inform the public on matters of public interest must be labelled as artificially generated. This also applies to audio and video content constituting deep fakes

https://digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai

Some discussion here: https://news.ycombinator.com/item?id=39746669
yoavz: Most interesting example to me: "Digitally altering audio to make it sound as if a popular singer missed a note in their live performance".

This seems oddly specific to the inverse of what happened with Alicia Keys at the recent Super Bowl. As Robert Komaniecki pointed out on X [1], Alicia Keys hit a "sour note" which was silently edited by the NFL to fix it.

[1] https://twitter.com/Komaniecki_R/status/1757074365102084464
sigmoid10: >Some examples of content that require disclosure include: [...] Generating realistic scenes: Showing a realistic depiction of fictional major events, like a tornado moving toward a real town.<p>This sounds like every thumbnail on youtube these days. It's good that this is not limited to AI, but it also means this will be a nightmare to police.
sigmoid10: >;一些需要披露的内容示例包括:[…]生成真实场景:显示虚构重大事件的真实描述,如龙卷风向真实城镇移动<p> 这听起来像是最近youtube上的每一个缩略图。它;这不仅限于人工智能,这很好,但也意味着这将是警方的噩梦。
thrdbndndn: The emphasis here is Single Image, but can this model generate with multiple images too?

We know that a single image of an object physically can't cover all the sides of it, so it's all guesswork in AI. This is totally fine for certain scenarios, but in lots of other cases it's trivial to have multiple images of the same object, and if that offers higher fidelity, it's totally worth it.

I'm aware there are many algorithms or AI models that already do that. I'm asking about Stability's one specifically because if they have impressive single-image results, surely their multi-image results would also be much better than state-of-the-art?
kouteiheika: Just tried to run this using their sample script on my 4090 (which has 24GB of VRAM). It ran for a little over 1 minute and crashed with an out-of-memory error. I tried both SV3D_u and SV3D_p models.

[edit] Managed to generate by tweaking the script to generate fewer frames simultaneously. 19.5GB peak VRAM usage, 1 min 25 secs to generate at 225 watts. [/edit]
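The tweak described amounts to decoding the generated latent frames in smaller chunks so the VAE decoder never holds all frames' activations at once. A minimal sketch of the idea; vae_decode and the chunk size are stand-ins for whatever the sample script actually exposes (its frame-decoding batch argument), not its real API:

```python
import torch

def decode_in_chunks(vae_decode, latents: torch.Tensor, chunk: int = 5) -> torch.Tensor:
    """Decode latents of shape (frames, C, H, W) a few frames at a time
    to cap peak VRAM, trading a little speed for memory headroom."""
    frames = []
    for start in range(0, latents.shape[0], chunk):
        with torch.no_grad():
            frames.append(vae_decode(latents[start:start + chunk]))
        torch.cuda.empty_cache()  # free intermediate buffers between chunks
    return torch.cat(frames, dim=0)
```

Smaller chunks lower the memory peak at the cost of more decoder invocations; per the comment above, a reduced chunk size brings peak usage under a 24GB card's limit.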
nbzso: Billions poured into technology with minimal use-case application.
What is the direct implication of this tech?
Porn on demand?
Filligree: If the animations shown are representative, then the mesh output may very well be good enough to use in a 3d printer.

Looking forward to experimenting with this.
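If the meshes do hold up, getting from SV3D's output to a printable file is short. A sketch assuming the model emits a glTF/OBJ-style mesh file (the filename is hypothetical) and using the trimesh library:

```python
# Convert a generated mesh to STL for slicing/printing.
# "sv3d_output.glb" is a hypothetical output filename.
import trimesh

mesh = trimesh.load("sv3d_output.glb", force="mesh")  # collapse scene to one mesh
if not mesh.is_watertight:
    print("Warning: mesh is not watertight; repair before printing.")
mesh.export("sv3d_output.stl")
```

Watertightness is the usual sticking point for generated meshes headed to a printer, hence the check.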
ionwake: I'm sorry for the dumb lazy question. But would the input require more than one image? Is there a demo url to test this? I think it might just be time to buy a 3d printer.

EDIT> Does "single image inputs" mean more than one image?