[Hacker News repost] Monolith – CLI tool for saving complete web pages as a single HTML file
-
Title: Monolith – CLI tool for saving complete web pages as a single HTML file
Text:
Url: https://github.com/Y2Z/monolith
Post by: iscream26
Comments:
simonw: Well this is fun... from the README here I learned I can do this on macOS:<p><pre><code> /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
   --headless --incognito --dump-dom https://github.com > /tmp/github.html
</code></pre>
And get an HTML file for a page after the JavaScript has been executed.<p>Wrote up a TIL about this with more details: <a href="https://til.simonwillison.net/chrome/headless" rel="nofollow">https://til.simonwillison.net/chrome/headless</a><p>My own <a href="https://shot-scraper.datasette.io/" rel="nofollow">https://shot-scraper.datasette.io/</a> tool (which uses headless Playwright Chromium under the hood) has a command for this too:<p><pre><code> shot-scraper html https://github.com/ > /tmp/github.html
</code></pre>
But it's neat that you can do it with just Google Chrome installed and nothing else.
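For scripting the headless Chrome command from simonw's comment, a minimal Python sketch might look like the following. It only reuses the three flags quoted in that comment (`--headless`, `--incognito`, `--dump-dom`); the function names `build_dump_dom_command` and `dump_dom` are hypothetical, and the default path assumes the macOS install location shown above.

```python
from __future__ import annotations

import subprocess

# Assumption: the macOS install location quoted in the comment above.
CHROME = "/Applications/Google Chrome.app/Contents/MacOS/Google Chrome"


def build_dump_dom_command(url: str, chrome_path: str = CHROME) -> list[str]:
    """Return the argv list that dumps a page's DOM after JavaScript has run."""
    # An argv list (rather than a shell string) means the space in the
    # application path needs no backslash escaping.
    return [chrome_path, "--headless", "--incognito", "--dump-dom", url]


def dump_dom(url: str, out_path: str, chrome_path: str = CHROME) -> None:
    """Run Chrome and write the serialized DOM it prints on stdout to out_path."""
    result = subprocess.run(
        build_dump_dom_command(url, chrome_path),
        capture_output=True,
        check=True,  # raise CalledProcessError if Chrome exits with an error
    )
    with open(out_path, "wb") as f:
        f.write(result.stdout)


# Example call (requires Chrome to be installed at CHROME):
# dump_dom("https://github.com", "/tmp/github.html")
```

Writing the captured stdout to a file plays the role of the `>` redirect in the shell version.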
andai: I always ship single file pages whenever possible. My original reasoning for this was that you should be able to press view source and see everything. (It follows that pages should be reasonably small and readable.)<p>An unexpected side effect is that they are self contained. You can download pages, drag them onto a browser to use them offline, or reupload them.<p>I used to author the whole HTML file at once, but lately I am fond of TypeScript, and made a simple build system to let me write games in TS and have them built to one HTML file. (The sprites are base64 encoded.)<p>On that note, it seems (there is a proposal) that browsers will eventually get support for TypeScript syntax, at which point I won't need a compiler / build step anymore. (Sadly they won't do type checking, but hey... baby steps!)
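The single-file approach andai describes (script plus base64-encoded sprites in one HTML file) could be sketched roughly as below. This is not andai's build system, which is not shown in the thread; `to_data_uri` and `build_page` are hypothetical names, and the sketch assumes the script has already been compiled to plain JavaScript.

```python
import base64
import html


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data: URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_page(title: str, script: str, sprites: dict) -> str:
    """Bundle a script and its sprites into one self-contained HTML document.

    `sprites` maps an element id to a (raw_bytes, mime_type) pair.
    """
    imgs = "\n".join(
        f'<img id="{html.escape(name, quote=True)}" '
        f'src="{to_data_uri(data, mime)}" hidden>'
        for name, (data, mime) in sprites.items()
    )
    # Caveat: a script containing the literal text "</script>" would close
    # the tag early; this sketch does not guard against that.
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{html.escape(title)}</title></head>\n"
        f"<body>\n{imgs}\n<script>\n{script}\n</script>\n</body></html>\n"
    )
```

The script can then look sprites up by element id, and the resulting file works when opened straight from disk because nothing is fetched over the network.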
lopkeny12ko: How does this compare to SingleFile?<p><a href="https://www.npmjs.com/package/single-file-cli" rel="nofollow">https://www.npmjs.com/package/single-file-cli</a>
al_borland: I use read-it-later type services a lot, and save more than I read. On many occasions I've gone back to finally read things and find that the pages no longer exist. I'm thinking moving to some kind of offline archival version would be a better option.
jchook: Hm, very interesting, especially for bookmarking/archiving.<p>I'm curious, why not use the MHTML standard for this?<p>- AFAIK data URIs have practical length limits that vary per browser. MHTML would enable bundling larger files such as video.<p>- MHTML would avoid transforming meaningful relative URLs into opaque data URIs in the HTML attributes.<p>- MHTML is supported by most major browsers in some way (either natively in Chrome or with an extension in Safari, etc).<p>- MIME defines a standard for putting pure binary data into document parts, so it could avoid the 33% size inflation from base64 encoding. That said, I do not know if the "binary" Content-Transfer-Encoding is widely supported.
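The 33% figure in jchook's comment follows from base64 mapping every 3 input bytes to 4 output characters. A small Python check, using only the standard `base64` module (the helper name `base64_overhead` is made up for this illustration):

```python
import base64


def base64_overhead(n_bytes: int) -> float:
    """Fractional size increase when n_bytes of binary data are base64-encoded."""
    encoded_len = len(base64.b64encode(bytes(n_bytes)))
    return encoded_len / n_bytes - 1


payload = bytes(3000)  # 3000 zero bytes stand in for an image or video chunk

plain = base64.b64encode(payload)      # unwrapped, as used in data: URIs
wrapped = base64.encodebytes(payload)  # MIME-style, newline after every 76 chars

print(len(payload), len(plain), len(wrapped))
print(f"overhead without line breaks: {base64_overhead(3000):.1%}")
```

The line-wrapped form used in MIME bodies comes out slightly larger than the unwrapped form because of the added newlines, so the overhead there is a little above one third.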