【Hacker News Repost】Show HN: A real-time AI video agent with under 1 second of latency
-
Title: Show HN: A real time AI video agent with under 1 second of latency
Text: Hey, it’s Hassaan & Quinn, co-founders of Tavus, an AI research company and developer platform for video APIs. We’ve been building AI video models for ‘digital twins’ or ‘avatars’ since 2020.

We’re sharing some of the challenges we faced building an AI video interface that can hold realistic conversations with a human, including getting it to under 1 second of latency.

To try it, talk to Hassaan’s digital twin: https://www.hassaanraza.com, or to our "demo twin" Carter: https://www.tavus.io

We built this because, until now, we’ve had to adapt communication to the limits of technology. But what if we could interact naturally with a computer? Conversational video makes it possible, and we think it’ll eventually be a key human-computer interface.

To make conversational video effective, it has to have really low latency and conversational awareness. A fast-paced conversation between friends has ~250 ms between utterances, but if you’re talking about something more complex or with someone new, there is additional “thinking” time. So, less than 1000 ms of latency makes the conversation feel pretty realistic, and that became our target.

Our architecture decisions had to balance three things: latency, scale, and cost. Getting all three was a huge challenge.

The first lesson learned was that to make it low-latency, we had to build it from the ground up. We went from a team that cared about seconds to a team that counts every millisecond. We also had to support thousands of conversations happening all at once without getting destroyed on compute costs.

For example, during early development, each conversation had to run on an individual H100 in order to fit all components and model weights into GPU memory just to run our Phoenix-1 model faster than 30 fps. This was unscalable and expensive.

We developed a new model, Phoenix-2, with a number of improvements, including inference speed. We switched from a NeRF-based backbone to Gaussian Splatting for a multitude of reasons, one being the requirement that we could generate frames faster than realtime, at 70+ fps, on lower-end hardware.
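For a concrete sense of those frame-rate targets, here is a quick back-of-the-envelope frame-time calculation. It uses only the 30 fps and 70 fps figures quoted above; nothing else is implied about Tavus's pipeline:

```python
# Back-of-the-envelope frame-time budgets for the rendering model.
# "Faster than realtime" means each frame must be generated in less
# time than it is displayed for.

def frame_budget_ms(fps: float) -> float:
    """Maximum time available to generate one frame, in milliseconds."""
    return 1000.0 / fps

# Phoenix-1 target on an H100: faster than 30 fps -> ~33 ms per frame.
print(f"30 fps -> {frame_budget_ms(30):.1f} ms per frame")

# Phoenix-2 target on lower-end hardware: 70+ fps -> under ~14.3 ms per frame.
print(f"70 fps -> {frame_budget_ms(70):.1f} ms per frame")
```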
We exceeded that frame-rate target and focused on optimizing memory and core usage on the GPU so that lower-end hardware could run it all. We did other things to save time and cost, like using streaming instead of batching, parallelizing processes, etc. But those are stories for another day.

We still had to lower the utterance-to-utterance time to hit our goal of under a second of latency. This meant each component (vision, ASR, LLM, TTS, video generation) had to be hyper-optimized.

The worst offender was the LLM. It didn’t matter how fast the tokens per second (t/s) were; it was the time-to-first-token (TTFT) that really made the difference. That meant services like Groq were actually too slow: they had high t/s, but slow TTFT. Most providers were too slow. (A toy comparison is sketched after this post.)

The next worst offender was actually detecting when someone stopped speaking. This is hard. Basic solutions use time after silence to ‘determine’ when someone has stopped talking, but that adds latency. If you tune it to be too short, the AI agent will talk over you. Too long, and it’ll take a while to respond. The model had to be dedicated to accurately detecting end-of-turn based on conversation signals, and to speculating on inputs to get a head start. (A sketch of the naive silence-timeout approach also follows below.)

We went from 3-5 seconds to under 1 second (and as fast as 600 ms) with these architectural optimizations, while running on lower-end hardware.

All this allowed us to ship with less than 1 second of latency, which we believe is the fastest out there. We have a bunch of customers, including Delphi, a professional coach and expert cloning platform. They have users whose conversations with digital twins span from minutes, to one hour, to even four hours (!), which is mind-blowing, even to us.

Thanks for reading! Let us know what you think and what you would build. If you want to play around with our APIs after seeing the demo, you can sign up for free from our website: https://www.tavus.io.
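To illustrate why time-to-first-token dominates perceived latency in a streaming pipeline, here is a toy Python comparison. The provider numbers are invented for the example (they are not measurements of Groq or anyone else); the point is only that higher throughput does not help if the first token arrives late:

```python
# Illustrative only: in a streaming pipeline, downstream stages (TTS, video)
# can start as soon as the first tokens arrive, so user-perceived delay is
# driven by time-to-first-token (TTFT), not tokens-per-second throughput.

def time_to_first_sentence_ms(ttft_ms: float, tokens_per_s: float,
                              sentence_tokens: int = 15) -> float:
    """Rough time until enough tokens exist for TTS to start speaking."""
    return ttft_ms + 1000.0 * sentence_tokens / tokens_per_s

# Hypothetical providers (numbers made up for illustration):
high_tps_slow_start = time_to_first_sentence_ms(ttft_ms=500, tokens_per_s=300)
low_tps_fast_start = time_to_first_sentence_ms(ttft_ms=150, tokens_per_s=80)

print(f"high t/s, slow TTFT : {high_tps_slow_start:.0f} ms")  # ~550 ms
print(f"lower t/s, fast TTFT: {low_tps_fast_start:.0f} ms")   # ~338 ms
```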
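The post contrasts its end-of-turn model with the basic silence-timeout heuristic. A minimal sketch of that naive approach is below; the threshold and frame size are hypothetical, and the takeaway is that whatever timeout you pick is added directly to every response:

```python
import time

# Naive end-of-turn detection: declare the user's turn over after a fixed
# stretch of silence. The timeout is pure added latency -- too short and the
# agent talks over people, too long and it feels sluggish.

SILENCE_TIMEOUT_S = 0.7  # hypothetical threshold; every reply waits at least this long

def wait_for_end_of_turn(is_speech_frame, frame_s: float = 0.02) -> None:
    """Block until audio has been silent for SILENCE_TIMEOUT_S.

    `is_speech_frame()` is a stand-in for a voice-activity detector that
    returns True while the user is speaking in the current `frame_s` window.
    """
    silent_for = 0.0
    while silent_for < SILENCE_TIMEOUT_S:
        time.sleep(frame_s)  # consume one audio frame's worth of time
        silent_for = 0.0 if is_speech_frame() else silent_for + frame_s
```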
Url:
Post by: hassaanr
Comments:
d2049: Is anyone else thinking that it might not be a good idea to give away your voice and face to a startup that is making digital clones of people?
causal: 1) Your website, and the dialup sounds, might be my favorite thing about all of this. I also like the cowboy hat.
2) Maybe it's just degrading under load, but I didn't think either chat experience was very good. Both avatars interrupted themselves a lot, and the chat felt more like a jumbled mess of half-thoughts than anything.
3) The image recognition is pretty good though, when I could get one of the avatars to slow down long enough to identify something I was holding.
Anyway, great progress, and thanks for sharing so much detail about the specific hurdles you've faced. I'm sure it'll get much better.
pookeh: I joined while in the bathroom, where the camera was facing upwards at the hanging towel on the wall… and it said “looks like you got a cozy bathroom here.”
You have to be kidding me.
karolist: Felt like talking to a person; I couldn't bring myself to treat it like a piece of code, that's how real it felt. I wanted to be polite and diplomatic, and caught myself thinking about "how I look to this person". It got me thinking about the conscious effort we put in when we talk with people, and how sloppy and relaxed we can be when interacting with algorithms.
For a little example, when searching Google I default to the minimal set of keywords required to get the result, instead of typing full sentences. I'm sort of afraid this technology will train people to behave like that when video chatting with virtual assistants, and that attitude will bleed into real-life interactions in society.
dools: Pretty cool, except Digital Hasaan has lots of trouble with my correcting the pronunciation of my name, and looks and sounds like he is trying to seduce me.