[Hacker News repost] Linux Crisis Tools

hackernews

Title: Linux Crisis Tools

Text:

Url: https://www.brendangregg.com/blog/2024-03-24/linux-crisis-tools.html

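Linux Crisis Tools
March 24, 2024
When you have an outage caused by a performance issue, you don't want to lose precious time installing the tools needed to diagnose it. Here is a list of "crisis tools" I recommend installing on your Linux servers by default (if they aren't already), along with the (Ubuntu) package names they come from: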
Package | Provides | Notes
procps | ps(1), vmstat(8), uptime(1), top(1) | basic stats
util-linux | dmesg(1), lsblk(1), lscpu(1) | system log, device info
sysstat | iostat(1), mpstat(1), pidstat(1), sar(1) | device stats
iproute2 | ip(8), ss(8), nstat(8), tc(8) | preferred net tools
numactl | numastat(8) | NUMA stats
tcpdump | tcpdump(8) | network sniffer
linux-tools-common, linux-tools-$(uname -r) | perf(1), turbostat(8) | profiler and PMU stats
bpfcc-tools (bcc) | opensnoop(8), execsnoop(8), runqlat(8), softirqs(8), hardirqs(8), ext4slower(8), ext4dist(8), biotop(8), biosnoop(8), biolatency(8), tcptop(8), tcplife(8), trace(8), argdist(8), funccount(8), profile(8), etc. | canned eBPF tools [1]
bpftrace | bpftrace, basic versions of opensnoop(8), execsnoop(8), runqlat(8), biosnoop(8), etc. | eBPF scripting [1]
trace-cmd | trace-cmd(1) | Ftrace CLI
nicstat | nicstat(1) | net device stats
ethtool | ethtool(8) | net device info
tiptop | tiptop(1) | PMU/PMC top
cpuid | cpuid(1) | CPU details
msr-tools | rdmsr(8), wrmsr(8) | CPU digging
(This is based on Table 4.1 "Linux Crisis Tools" in SysPerf 2.)
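Some longer notes: [1] bcc and bpftrace have many overlapping tools: the bcc ones are more capable (e.g., command-line options), while the bpftrace ones can be edited on the fly. That does not mean one is better or faster than the other: they emit the same BPF bytecode and are equally fast once running. Also note that bcc is evolving and migrating tools from Python to libbpf C (with CO-RE and BTF), but we haven't reworked the package yet. In the future "bpfcc-tools" should be replaced by a much smaller "libbpf-tools" package that contains just the tool binaries.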
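Because the same tool can ship in both a bcc and a bpftrace flavor, a small wrapper can pick whichever one is present. The following is a minimal sketch, not part of the original post: first_available is a hypothetical helper name, and the candidate command names (opensnoop-bpfcc, opensnoop.bt, opensnoop) are assumptions about how distributions name the packaged tools.

```shell
#!/bin/sh
# Sketch: pick whichever packaged variant of a tool is installed.
# first_available is a hypothetical helper; the candidate command names
# used below are assumptions about distribution packaging.

first_available() {
    # Print the first argument that resolves to a command on PATH.
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Prefer the bcc build, then the bpftrace script, then an unsuffixed name.
if tool=$(first_available opensnoop-bpfcc opensnoop.bt opensnoop); then
    echo "opensnoop variant found: $tool"
else
    echo "no opensnoop variant found" >&2
fi
```

The order of the candidates encodes the preference; swapping them would favor the editable bpftrace script over the bcc build.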
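This list is a minimum. Some servers have accelerators, and you will want their analysis tools installed as well: e.g., on Intel GPU servers, the intel-gpu-tools package; on NVIDIA, nvidia-smi. Debugging tools such as gdb(1) can also be pre-installed for immediate use in a crisis.
Essential analysis tools like these don't change often, so this list may only need updating every few years. If you think I'm missing a package that is important today, please let me know (e.g., in the comments).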
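One way to act on the list is to check a server ahead of time for commands that are absent. The sketch below is an illustration under assumptions rather than anything from the original post: missing_packages and CRISIS_TOOLS are made-up names, each package is represented by a single command from the table, and the bcc entry assumes the Ubuntu convention of a "-bpfcc" suffix on tool names.

```shell
#!/bin/sh
# Sketch: report which crisis-tool packages appear to be missing,
# judged by one representative command per package.
# missing_packages and CRISIS_TOOLS are hypothetical names.

CRISIS_TOOLS="
vmstat:procps
dmesg:util-linux
iostat:sysstat
ss:iproute2
numastat:numactl
tcpdump:tcpdump
perf:linux-tools-common
opensnoop-bpfcc:bpfcc-tools
bpftrace:bpftrace
trace-cmd:trace-cmd
nicstat:nicstat
ethtool:ethtool
tiptop:tiptop
cpuid:cpuid
rdmsr:msr-tools
"

missing_packages() {
    # $1: whitespace-separated list of command:package pairs.
    # Prints the package name for every command not found on PATH.
    for pair in $1; do
        cmd=${pair%%:*}
        pkg=${pair#*:}
        if ! command -v "$cmd" >/dev/null 2>&1; then
            printf '%s\n' "$pkg"
        fi
    done
}

missing=$(missing_packages "$CRISIS_TOOLS")
if [ -n "$missing" ]; then
    echo "Missing crisis-tool packages:"
    echo "$missing"
else
    echo "All checked crisis tools are present."
fi
```

Checking commands rather than package metadata keeps the sketch distribution-neutral, at the cost of precision: finding perf on PATH, for example, does not prove that the kernel-specific linux-tools-$(uname -r) package is installed.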
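The main downside of adding these packages is their on-disk size. On cloud instances, adding Mbytes to the base server image can add seconds, or fractions of a second, to instance deployment time. Fortunately the packages I've listed are mostly quite small (and bcc will get smaller) and should cost little in size and time. I have seen this size concern prevent debuginfo (totaling around 1 Gbyte) from being included by default.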
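The size cost can be estimated on a Debian or Ubuntu system by summing the Installed-Size field that dpkg records (in KiB) for each package. This is a rough sketch under assumptions: sum_kib is a hypothetical helper, the four packages queried are only a sample from the table, and the figure is dpkg's own estimate rather than a measurement of image growth.

```shell
#!/bin/sh
# Sketch: estimate the on-disk footprint of some crisis-tool packages.
# sum_kib is a hypothetical helper name; the package sample is arbitrary.

sum_kib() {
    # Sum the first column of stdin (sizes in KiB) and print the total.
    awk '{ total += $1 } END { printf "%d\n", total }'
}

if command -v dpkg-query >/dev/null 2>&1; then
    # ${Installed-Size} is dpkg's estimate in KiB. Packages that are not
    # installed produce an error on stderr, which is discarded here.
    total=$(dpkg-query -W -f='${Installed-Size}\n' \
        procps sysstat iproute2 tcpdump 2>/dev/null | sum_kib)
    echo "Installed size of the sampled packages: ${total} KiB"
fi
```

On systems without dpkg-query the script prints nothing; the helper itself works on any column of numbers piped into it.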

Can't I just install them later when needed?

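Many problems can occur when you try to install software during a production crisis. I'll step through a made-up example that combines some of the things I've learned the hard way:

4:00pm: Alert! Your company's site goes down. No wait, some people say it's still up. Is it up? It's up, but too slow to be usable.

4:01pm: You look at your monitoring dashboards and a group of backend servers is abnormal. Is that high disk I/O? What's causing it?

4:02pm: You SSH to one server to dig deeper, but the SSH login takes forever.

4:03pm: You get a login prompt and type "iostat -xz 1" for basic disk stats. There is a long pause, and then "Command 'iostat' not found...Try: sudo apt install sysstat". Ugh. Since the system is so slow,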

Post by: samber

Comments:

FridgeSeal: This is a handy list.

> 4:07pm The package install has failed as it can't resolve the repositories. Something is wrong with the /etc/apt configuration…

Cloud definitely has downsides, and isn't a fit for all scenarios, but in my experience it's great for situations like this. Instead of messing around trying to repair it, simply kill the machine, or take it out of the pool. Get a new one. New machine and app likely comes up clean. Incident resolves. Dig into machine off the hot path.

zer00eyz: The only thing I would add is nmap. Network connectivity issues aren't always apparent in some apps.

mmh0000: I was surprised that strace wasn't on that list. That's usually one of my first go-to tools. It's so great, especially when programs return useless or wrong error messages.

reilly3000: In such a crisis if installing tools is impossible, you can run many utils via Docker, such as:

Build a container with a one-liner:

    docker build -t tcpdump - <<EOF
    FROM ubuntu
    RUN apt-get update && apt-get install -y tcpdump
    CMD tcpdump -i eth0
    EOF

Run attached to the host network:

    docker run -dP --net=host moremagic/docker-netstat

Run system tools attached to read host processes:

    for sysstat_tool in iostat sar vmstat mpstat pidstat; do
      alias "sysstat-${sysstat_tool}=docker run --rm -it -v /proc:/proc --privileged --net host --pid host ghcr.io/krishjainx/sysstat-docker:main /usr/bin/${sysstat_tool}"
    done
    unset -v sysstat_tool

Sure, yum install is preferred, but so long as docker is available this is a viable alternative if you can manage the extra mapping needed. It probably wouldn't work with a rootless/podman setup.

randomgiy3142: I use zfsbootmenu with hrmpf (https://github.com/leahneukirchen/hrmpf). You can see the list of packages here (https://github.com/leahneukirchen/hrmpf/blob/master/hrmpf.packages). I usually build images based off this so they're all there; otherwise you'll need to ssh into zfsbootmenu and load the separate 2 GB distro. This is for a home server, though if I had a startup I'd probably set up a "cloud setup" and throw a bunch of servers somewhere. A lot of times, for internal projects and even non-production client research, having your own cluster is a lot cheaper and easier than paying for a cloud provider. It also gets around the cases where you can't run k8s and need bare metal. I've advised some clients on this setup, with contingencies in case of catastrophic failure and, more importantly, tests of those contingencies, but this is more so that you don't have developers doing nothing, not to prevent overnight outages. It's a lot cheaper than cloud solutions for non-critical projects, and while larger companies will look at the numbers closely if something happens and devs can't work for an hour, the advantage of a startup is that devs will find a way to be productive locally, or you simply have them take the afternoon off (neither has happened).

I imagine the problems described happen on big-iron-type hardware clusters that are extremely expensive and where spare capacity isn't possible. I might be wrong, but especially with (sigh) AI setups with extremely expensive $30k GPUs and crazy bandwidth between planes, which you buy from IBM for crazy prices (the hardware vendor being on the line so quickly was a hint), you're way past the commodity server cloud model. I have no idea what could go wrong with such equipment, where nearly every piece of hardware is close to custom built, but I'm glad I don't have to deal with that. Debugging on that kind of hardware, which only a few huge pharma or research companies use, has to come down to really strange things.
