[Hacker News Repost] When imperfect systems are good: Bluesky's lossy timelines
-
Title: When imperfect systems are good: Bluesky's lossy timelines
Text:
Url: https://jazco.dev/2025/02/19/imperfection/
Post by: cyndunlop
Comments:
ChuckMcM: As a systems enthusiast I enjoy articles like this. It is really easy to get into the mindset of "this must be perfect".

In the Blekko search engine back end we built an index that was 'eventually consistent', which allowed updates to propagate to the user-facing index more quickly, at the expense that two users doing the exact same query would get slightly different results. If they kept doing those same queries they would eventually get the exact same results.

Systems like this bring in a lot of control systems theory because they have the potential to oscillate if there is positive feedback (and in search engines that positive feedback comes from the ranker, which looks at which link you clicked and gives it a higher weight), and it is important that they not go crazy. Some of the most interesting, and most subtle, algorithm work was done keeping that system "critically damped" so that it would converge quickly.

Reading this description of how users' timelines are sharded, with the same sorts of feedback loops (in this case 'likes' or 'reposts'), this sounds like a pretty interesting problem space to explore.
pornel: I wonder why timelines aren't implemented as a hybrid gather-scatter strategy, choosing per account based on popularity (a combination of fan-out to followers and a lazy fetch of popular followed accounts when a follower's timeline is served).

When you have a celebrity account, instead of fanning out every message to millions of followers' timelines, it would be cheaper to do nothing when the celebrity posts, and later, when serving each follower's timeline, fetch the celebrity's posts and merge them into the timeline. When millions of followers do that, it will be a cheap read-only fetch from a hot cache.
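pornel's hybrid could be sketched as a toy in-memory model like the one below. This is not Bluesky's (or Twitter's) actual implementation; the `FANOUT_LIMIT` cutoff, the data structures, and all names are hypothetical, chosen only to illustrate the write-path/read-path split:

```python
import heapq
from collections import defaultdict

FANOUT_LIMIT = 10_000  # hypothetical cutoff: accounts above this are "celebrities"

posts = defaultdict(list)          # author -> [(ts, text)] the author's own posts
timelines = defaultdict(list)      # user -> [(ts, text)] materialized by fan-out
followers = defaultdict(set)       # author -> set of follower ids
follows_celebs = defaultdict(set)  # user -> celebrity accounts they follow

def is_celebrity(author):
    return len(followers[author]) > FANOUT_LIMIT

def publish(author, ts, text):
    posts[author].append((ts, text))
    if is_celebrity(author):
        return  # do nothing now; followers pull these posts lazily at read time
    for f in followers[author]:
        timelines[f].append((ts, text))  # classic fan-out on write

def read_timeline(user, limit=50):
    # gather: merge the materialized timeline with a lazy fetch of celebrity posts
    lazy = [p for c in follows_celebs[user] for p in posts[c]]
    return heapq.nlargest(limit, timelines[user] + lazy)  # newest first
```

The point of the split is that the celebrity's post list is read-mostly and identical for every follower, so in a real system that fetch would hit a hot cache instead of millions of per-user timeline writes.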
spoaceman7777: Hmm. Twitter/X appears to do this at quite a low number, as the "Following" tab is incredibly lossy (some users are permanently missing) at only 1,200 followed accounts.

It's *insanely* frustrating.

Hopefully you're adjusting the lossiness weighting and cut-off by whether a user is active at any particular time? Because otherwise, applying this rule with the cap set too low is a very bad UX in my experience x_x
rakoo: Ok, I'm curious: since this strategy sacrifices consistency, has anyone thought about something that is not full fan-out on reads or on writes?

Let's imagine something like this: instead of writing to every follower's timeline, each post is written once per shard containing at least one follower. This caps the fan-out at write time to hundreds of shards. At read time, getting the content for a given user reads that hot slice and filters for the accounts they actually follow. It definitely has more load, but:

- the read is still colocated inside the shard, so latency remains low

- for mega-followers the page will not see older entries anyway

There are of course other considerations, but I'm curious what the load for something like that would look like (and I don't have the data nor infrastructure to test it).
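rakoo's write-once-per-shard idea can be sketched roughly as below. This is a toy model under my own assumptions, not anything from the post: the shard count, the CRC-based shard function, and all names are hypothetical. The key property is that `publish` does at most `NUM_SHARDS` writes regardless of follower count, while `read_timeline` pays for the filtering:

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 256  # hypothetical shard count

def shard_of(user):
    # deterministic user -> shard mapping
    return zlib.crc32(user.encode()) % NUM_SHARDS

shard_feed = defaultdict(list)  # shard id -> [(ts, author, text)]
followers = defaultdict(set)    # author -> follower ids
following = defaultdict(set)    # user -> authors they follow

def publish(author, ts, text):
    # write once per shard that holds at least one follower,
    # instead of once per follower: fan-out capped at NUM_SHARDS
    for s in {shard_of(f) for f in followers[author]}:
        shard_feed[s].append((ts, author, text))

def read_timeline(user, limit=50):
    # colocated read: scan this user's own shard, then keep only
    # entries from authors the user actually follows
    entries = [(ts, a, t) for ts, a, t in shard_feed[shard_of(user)]
               if a in following[user]]
    entries.sort(reverse=True)
    return entries[:limit]
```

The extra read load rakoo mentions shows up in the filter step: the shard slice contains posts for every follower colocated on that shard, so each read scans entries it will discard.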
jadbox: So, let's say I follow 4k people, as in the example, and have a 50% drop rate. It seems a bit weird that if all (4k - 1) of the accounts I follow end up posting nothing in a day, I STILL have a 50% chance that I won't see the 1 account that does post that day. It seems to me that the algorithm should consider my feed's age (or the post freshness of the accounts I follow). Am I overthinking?
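The 4k-follows / 50%-drop arithmetic in this comment is consistent with a drop rule of roughly the following shape. This is only my guess at such a rule, not the article's actual formula; the `MAX_FOLLOWS` threshold and the function are hypothetical. Note that, exactly as jadbox observes, it depends only on the follow count and ignores feed age and post freshness entirely:

```python
import random

MAX_FOLLOWS = 2000  # hypothetical cap before lossiness kicks in

def drop_probability(num_follows, max_follows=MAX_FOLLOWS):
    # hypothetical rule: no loss up to the cap, then the drop chance
    # grows with how far past the cap the user is
    if num_follows <= max_follows:
        return 0.0
    return 1.0 - max_follows / num_follows

def should_deliver(num_follows, rng=random):
    # per-post coin flip at fan-out time; freshness plays no role here,
    # which is the behavior jadbox is questioning
    return rng.random() >= drop_probability(num_follows)
```

Under this rule a user following 4,000 accounts gets `1 - 2000/4000 = 0.5`, i.e. the 50% drop rate in the example, applied to every post independently, even the only post of the day.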