[Hacker News repost] Embeddings are a good starting point for the AI-curious app developer
-
Title: Embeddings are a good starting point for the AI curious app developer
Text:
Url: https://bawolf.substack.com/p/embeddings-are-a-good-starting-point
The post is titled "Embeddings are a good starting point for the AI curious app developer," and the author shares their experience using vector embeddings in app development. The post describes vector embeddings as a data representation that compresses a huge amount of human knowledge; they turn features that once required a dedicated project into something a product engineer can build on their own.

The post highlights the role of embeddings in search and recommendation systems, because they are good at measuring similarity against arbitrary input. The technique even works across languages such as French or Japanese. The author also walks through the tools and libraries they chose, including the Postgres extension pgvector, TypeScript with drizzle-orm, and OpenAI's text embedding models.

The author describes how the icon data is encoded into embeddings and how cosine similarity is used to measure how close a search query's embedding is to each icon's embedding. They also show a simple ranking algorithm that combines embedding search with user click data to improve search results.

The post further covers how to combine a vector database with other databases and how to integrate embeddings into an application. The author stresses that embeddings simplify the development process, and lists alternatives and resources for readers who want to explore and build their own projects.

Overall, the post gives AI-curious app developers a solid starting point, showing how to apply vector embeddings in a real project and how to pick suitable tools and libraries along the way.
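The post itself works in TypeScript with drizzle-orm; purely to illustrate the query pattern described above, here is a minimal Python sketch. The table name icons, the vector(1536) column, and the DATABASE_URL environment variable are assumptions for the example, not details from the post; only pgvector's cosine-distance operator and OpenAI's text embedding model are taken from it.

    # Minimal sketch: embedding-based icon search with pgvector.
    # Assumed schema: icons(name text, embedding vector(1536)), pgvector extension enabled.
    import os
    import psycopg2
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def embed(text: str) -> list[float]:
        # text-embedding-3-small returns a 1536-dimensional vector
        resp = client.embeddings.create(model="text-embedding-3-small", input=text)
        return resp.data[0].embedding

    def search_icons(query: str, limit: int = 10):
        vec = embed(query)
        conn = psycopg2.connect(os.environ["DATABASE_URL"])
        with conn, conn.cursor() as cur:
            # <=> is pgvector's cosine-distance operator; smaller means more similar
            cur.execute(
                "SELECT name, embedding <=> %s::vector AS distance "
                "FROM icons ORDER BY distance LIMIT %s",
                (str(vec), limit),
            )
            return cur.fetchall()

    print(search_icons("a gear for settings"))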
Post by: bryantwolf
Comments:
suprgeek: Great project and excellent initiative to learn about embeddings.
Two possible avenues to explore more.
Your system backend could be thought of as being composed of two parts:
|Icons -> Embedder -> |PGVector| -> Retriever -> Display Result|

1. In the embedder part, try out different embedding models and/or vector dimensions to see whether Recall@K and Precision@K improve for your data set (icons). Models make a surprising amount of difference to the quality of the results. Try the MTEB Leaderboard for ideas on which models to explore.

2. In the information retriever part you can try a couple of approaches:

a. After you retrieve from PGVector, see if you can use a reranker like Cohere to get better results: https://cohere.com/blog/rerank

b. You could try a "fusion ranking" similar to the one you do, but structured so that 50% of the weight goes to a plain old keyword search over the metadata and 50% to the embedding-based search (see the sketch after this comment).

Finally, something more interesting to noodle on: what if the embeddings were based on the icon images themselves, and the model knew how to search for textual descriptions in that latent space?
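A rough sketch of the "fusion ranking" suggested in 2b. Everything here is hypothetical: it assumes you already have a keyword score (say, from Postgres full-text search) and an embedding similarity for each candidate, both normalized to the 0..1 range.

    # Hypothetical fusion ranking: blend keyword and embedding scores 50/50.
    def fuse(keyword_hits: dict[str, float], embedding_hits: dict[str, float],
             w_keyword: float = 0.5, w_embedding: float = 0.5) -> list[tuple[str, float]]:
        # keyword_hits / embedding_hits map icon id -> score normalized to [0, 1]
        ids = set(keyword_hits) | set(embedding_hits)
        fused = {
            i: w_keyword * keyword_hits.get(i, 0.0) + w_embedding * embedding_hits.get(i, 0.0)
            for i in ids
        }
        return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

    # Example: "home" matches on both signals, "house" only via embeddings.
    print(fuse({"home": 0.9, "gear": 0.2}, {"home": 0.8, "house": 0.7}))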
benreesman: Without getting into any big debates about whether or not RAG is medium-term interesting or whatever, you can 'pip install sentence-transformers faiss' and just immediately start having fun. I recommend using straightforward cosine similarity to just crush the NYT's recommender as a fun project for two reasons: there's an API and plenty of corpus, and it's like, whoa, that's better than the New York Times.

He's trying to sell a SaaS product (Pinecone), but he's doing it the right way: it's ok to be an influencer if you know what you're talking about.

James Briggs has great stuff on this: https://youtube.com/@jamesbriggs
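A minimal version of the sentence-transformers + FAISS setup benreesman mentions (note the pip package is published as faiss-cpu / faiss-gpu). The model name and the toy corpus here are just placeholders:

    # Cosine similarity over a toy corpus with sentence-transformers + FAISS.
    import faiss
    import numpy as np
    from sentence_transformers import SentenceTransformer

    corpus = [
        "Fed leaves interest rates unchanged",
        "New study links sleep to memory consolidation",
        "Local team wins championship in overtime",
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")         # small general-purpose model
    emb = model.encode(corpus, normalize_embeddings=True)   # unit vectors: dot product == cosine

    index = faiss.IndexFlatIP(emb.shape[1])                 # exact inner-product index
    index.add(np.asarray(emb, dtype="float32"))

    query = model.encode(["why do we dream?"], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(query, dtype="float32"), 2)
    for score, i in zip(scores[0], ids[0]):
        print(f"{score:.3f}  {corpus[i]}")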
thisiszilff: One straightforward way to get started is to understand embeddings without any AI/deep learning magic. Just pick a vocabulary of words (say, some 50k words), pick a unique index between 0 and 49,999 for each word, and then produce an embedding by adding +1 at a word's index each time that word occurs in a text. Then normalize the embedding so it adds up to one.

Presto -- embeddings! And you can use cosine similarity with them and all that good stuff, and the results aren't totally terrible.

The rest of "embeddings" builds on top of this basic strategy (smaller vectors, filtering out words/tokens that occur frequently enough that they don't signify similarity, handling synonyms or words that are related to one another, etc.). But stripping out the deep learning bits really does make it easier to understand.
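In code, thisiszilff's recipe is only a few lines. The five-word vocabulary below is a toy stand-in for the ~50k-word one:

    # Count-based "embeddings": one dimension per vocabulary word, counts normalized to sum to 1.
    import math

    vocab = {"the": 0, "cat": 1, "dog": 2, "sat": 3, "mat": 4}   # toy stand-in for ~50k words

    def embed(text: str) -> list[float]:
        vec = [0.0] * len(vocab)
        for word in text.lower().split():
            if word in vocab:
                vec[vocab[word]] += 1.0          # +1 at that word's index per occurrence
        total = sum(vec)
        return [v / total for v in vec] if total else vec   # normalize so it adds up to one

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    print(cosine(embed("the cat sat on the mat"), embed("the dog sat on the mat")))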
voxelc4L: It begs the question though, doesn't it...? Embeddings require a neural network or some reasonable facsimile to produce the embedding in the first place. Compression to a vector (a semantic space of some sort) still needs to happen – and that's the crux of the understanding/meaning. To just say "embeddings are cool let's use them" is ignoring the core problem of semantics/meaning/information-in-context etc. Knowing where an embedding came from is pretty damn important.

Embeddings live a very biased existence. They are the product of a network (or some algorithm) that was trained (or built) with specific data (and/or code) and assume particular biases intrinsically (network structure/algorithm) or extrinsically (e.g., data used to train a network) which they impose on the translation of data into some n-dimensional space. Any engineered solution always lives with such limitations, but with the advent of more and more sophisticated methods for the generation of them, I feel like it's becoming more about the result than the process. This strikes me as problematic on a global scale... might be fine for local problems but could be not-so-great in an ever changing world.
mrkeen: Given

    not because they're sufficiently advanced technology indistinguishable from magic, but the opposite.
    Unlike LLMs, working with embeddings feels like regular deterministic code.

    Creating embeddings

I was hoping for a bit more than:

    They're a bit of a black box. Next, we chose an embedding model. OpenAI's embedding models will probably work just fine.