[Hacker News repost] Show HN: Visprex – Open-source, in-browser data visualisation tool for CSV files

hackernews

Title: Show HN: Visprex – Open-source, in-browser data visualisation tool for CSV files


Text: Hello HN. I've always found writing data visualisation scripts boring and repetitive in data science workflows earlier in my career, so I built this tool to automate it. The available methods are based on my experience in econometrics where histograms and scatterplots were the starting points to check data distributions.

The link is to the documentation and the app is freely available at https://visprex.com, and if you're curious about the implementation it's open source at https://github.com/visprex/visprex. I'd appreciate any comments and feedback!



Url: https://docs.visprex.com/


        
## Post by: kengoa
        
### Comments: 
        
**paddy_m**: Nice work!

Do you have any plans for data cleaning?

I am working on a somewhat similar open source project. I intend to add heuristic data cleaning. With the UI I want to be able to toggle between different strategies quickly: strip characters from a column to treat it as numeric if less than 2% or 5% of values have a character, fill NA with the mean, interpret dates in different formats, drop if the date doesn't parse. The idea being that if it's really quick to change between different strategies, you can create more opinionated strategies to get to the right answer faster.

Happy to collaborate and talk tables with anyone who's interested.
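The switchable cleaning strategies described in this comment could look roughly like the sketch below. It is not code from Visprex or from the commenter's project: the strategy names, the `clean_numeric` and `parse_dates` helpers, and the date formats are all assumptions made for illustration, using only the Python standard library.

```python
import re
from datetime import datetime
from statistics import mean

# Hypothetical strategy names mapped to the share of "dirty" cells tolerated.
STRATEGIES = {"strict": 0.02, "lenient": 0.05}
NUMERIC_RE = re.compile(r"^-?\d+(\.\d+)?$")


def clean_numeric(values, strategy="lenient"):
    """Strip stray characters if few cells have them, then fill gaps with the mean.

    Returns None when the column is too messy to treat as numeric.
    """
    threshold = STRATEGIES[strategy]
    present = [v for v in values if v not in ("", None)]
    dirty = [v for v in present if not NUMERIC_RE.match(v.strip())]
    if not present or len(dirty) / len(present) > threshold:
        return None
    parsed = []
    for v in values:
        # Keep only digits, the decimal point, and the minus sign.
        stripped = re.sub(r"[^0-9.\-]", "", v or "")
        parsed.append(float(stripped) if NUMERIC_RE.match(stripped) else None)
    known = [p for p in parsed if p is not None]
    if not known:
        return None
    fill = mean(known)  # "fill NA with the mean"
    return [fill if p is None else p for p in parsed]


def parse_dates(values, formats=("%Y-%m-%d", "%d/%m/%Y")):
    """Try each assumed format in turn; drop values that match none of them."""
    parsed = []
    for v in values:
        for fmt in formats:
            try:
                parsed.append(datetime.strptime(v, fmt))
                break
            except ValueError:
                continue
    return parsed
```

Switching `strategy` from `"strict"` to `"lenient"` changes only the tolerance threshold, which is what makes it cheap to compare strategies on the same column.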
            
**parsimo2010**: I like this a lot - I am going to show it to my students!

They seem to hate learning R, and while this doesn't prevent them from having to build a model, this will speed up the exploration steps.
            
**teddyh**: I loaded a CSV with one date/time column and one numerical column. I then selected “Scatter Plot”, but got the message “Not enough numerical columns found. Load a CSV file with at least 2 numerical columns in the Datasets tab.” I would have thought that a date/time column would count?
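One way a date/time column could count as numerical is to map each timestamp to seconds since the Unix epoch before checking column types. The sketch below illustrates that idea only. It says nothing about how Visprex actually detects column types, and the `as_number` and `plottable_columns` helpers and the timestamp format are assumptions.

```python
from datetime import datetime, timezone


def as_number(cell, fmt="%Y-%m-%d %H:%M:%S"):
    """Return the cell as a float, treating timestamps as UTC epoch seconds."""
    try:
        return float(cell)
    except ValueError:
        pass
    try:
        parsed = datetime.strptime(cell, fmt).replace(tzinfo=timezone.utc)
    except ValueError:
        return None  # neither a number nor a timestamp in the assumed format
    return parsed.timestamp()


def plottable_columns(header, rows):
    """Names of columns in which every cell converts to a number."""
    return [
        name
        for i, name in enumerate(header)
        if all(as_number(row[i]) is not None for row in rows)
    ]
```

With this check, a CSV holding one timestamp column and one numerical column would yield two plottable columns, enough for a scatter plot.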
            
**rrr_oh_man**: Very cool stuff!

Maybe bar / beeswarm charts would be useful?

I was missing the possibility to show differences by category, e.g. mpg by make in the cars dataset.
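The per-category comparison requested here amounts to grouping rows by a categorical column and aggregating a numerical one, which a bar chart could then display. A minimal sketch follows, assuming rows are dictionaries as produced by `csv.DictReader`. The `mean_by_category` helper is hypothetical and is not a Visprex feature.

```python
from collections import defaultdict
from statistics import mean


def mean_by_category(rows, category_key, value_key):
    """Group rows by a categorical column and average a numerical column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[category_key]].append(float(row[value_key]))
    return {category: mean(values) for category, values in groups.items()}
```

Calling `mean_by_category(rows, "make", "mpg")` on a cars table would return one average mpg per make, ready to feed into a bar chart.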
            