公司有个服务偶发CPU100%问题,某天正好工作时间爆出,运维就把内存dump了下来,因此得以分析这个神奇事故。dump文件有38G,本来以为我一定要个64GB的内存才能开这dump文件,然而windbg/vs神奇的并不需要这样,于是我们得以简单的用开发机调试这个文件。
首先同前文一样把符号文件全加载进来。发现当前进程有50个线程,图中第一个线程的堆栈说明他执行了ServiceBase.Run(ServicesToRun),线程在wait状态。
50个线程就能让8核16线程的服务器把cpu跑满说明肯定有哪些线程死锁住了。这时就有神奇的 !runaway 命令查出线程的运行时间。
我们还有更强的的 !runaway 7
0:000> !runaway 7 User Mode Time Thread Time 10:1348 7 days 1:42:58.088 9:14b4 7 days 0:46:16.252 37:89c 0 days 2:58:47.190 36:2240 0 days 2:58:20.888 38:678 0 days 2:57:42.715 35:1fb0 0 days 2:57:36.022 34:20f8 0 days 2:55:21.097 33:111c 0 days 2:55:18.632 32:fe4 0 days 2:54:04.875 31:17ac 0 days 2:53:51.662 30:4d8 0 days 2:53:24.798 29:1960 0 days 2:53:01.757 28:d6c 0 days 2:52:35.907 27:1b90 0 days 2:52:28.326 26:21ec 0 days 2:51:53.678 25:1840 0 days 2:51:05.692 24:1744 0 days 2:48:14.107 23:1fd0 0 days 2:44:59.854 22:20b8 0 days 2:38:26.997 21:1d3c 0 days 2:33:27.693 20:16b4 0 days 2:28:57.921 19:f50 0 days 2:25:40.642 18:1f70 0 days 2:21:12.461 17:c2c 0 days 2:19:18.315 16:1ee0 0 days 2:18:34.712 12:21a8 0 days 2:18:19.705 15:2300 0 days 2:17:47.631 14:750 0 days 2:17:33.154 13:21b4 0 days 2:16:35.730 11:290 0 days 2:15:20.788 7:1ff4 0 days 0:40:45.736 6:1f30 0 days 0:03:49.539 2:207c 0 days 0:00:40.997 39:141c 0 days 0:00:39.015 47:e8c 0 days 0:00:37.908 41:2064 0 days 0:00:22.854 5:1024 0 days 0:00:09.094 44:1a04 0 days 0:00:05.787 43:1f90 0 days 0:00:05.210 46:1278 0 days 0:00:03.759 45:1d48 0 days 0:00:03.588 49:1570 0 days 0:00:01.716 48:18f0 0 days 0:00:01.123 8:6d4 0 days 0:00:00.093 4:164c 0 days 0:00:00.031 0:c68 0 days 0:00:00.015 42:1bb8 0 days 0:00:00.000 40:1ea8 0 days 0:00:00.000 3:1628 0 days 0:00:00.000 1:1ed8 0 days 0:00:00.000 Kernel Mode Time Thread Time 10:1348 0 days 7:58:29.737 9:14b4 0 days 0:54:50.107 20:16b4 0 days 0:05:09.069 22:20b8 0 days 0:04:46.308 21:1d3c 0 days 0:04:18.197 19:f50 0 days 0:03:36.841 18:1f70 0 days 0:03:36.513 23:1fd0 0 days 0:02:47.966 24:1744 0 days 0:02:29.402 34:20f8 0 days 0:01:57.094 17:c2c 0 days 0:01:41.166 7:1ff4 0 days 0:01:33.787 32:fe4 0 days 0:01:33.382 37:89c 0 days 0:00:58.328 36:2240 0 days 0:00:54.319 35:1fb0 0 days 0:00:44.881 14:750 0 days 0:00:44.382 16:1ee0 0 days 0:00:42.385 38:678 0 days 0:00:40.794 33:111c 0 days 0:00:39.047 15:2300 0 days 0:00:38.017 12:21a8 0 days 0:00:34.663 26:21ec 0 days 0:00:33.212 30:4d8 0 days 0:00:31.465 11:290 0 days 0:00:30.420 25:1840 0 days 0:00:27.268 29:1960 0 days 0:00:21.762 27:1b90 0 days 0:00:19.468 13:21b4 0 days 0:00:16.411 6:1f30 0 days 0:00:15.678 28:d6c 0 days 0:00:14.898 31:17ac 0 days 0:00:13.041 2:207c 0 days 0:00:05.974 5:1024 0 days 0:00:01.778 39:141c 0 days 0:00:00.982 41:2064 0 days 0:00:00.483 44:1a04 0 days 0:00:00.109 46:1278 0 days 0:00:00.046 45:1d48 0 days 0:00:00.031 43:1f90 0 days 0:00:00.031 48:18f0 0 days 0:00:00.015 47:e8c 0 days 0:00:00.015 8:6d4 0 days 0:00:00.015 49:1570 0 days 0:00:00.000 42:1bb8 0 days 0:00:00.000 40:1ea8 0 days 0:00:00.000 4:164c 0 days 0:00:00.000 3:1628 0 days 0:00:00.000 1:1ed8 0 days 0:00:00.000 0:c68 0 days 0:00:00.000 Elapsed Time Thread Time 3:1628 17 days 6:04:49.400 2:207c 17 days 6:04:49.400 1:1ed8 17 days 6:04:49.400 0:c68 17 days 6:04:49.400 4:164c 17 days 6:04:49.384 5:1024 17 days 6:04:49.072 6:1f30 17 days 6:04:48.682 7:1ff4 17 days 6:04:48.651 8:6d4 17 days 6:04:48.589 9:14b4 7 days 12:14:33.593 10:1348 7 days 11:59:45.561 11:290 0 days 3:46:11.525 12:21a8 0 days 3:14:44.234 13:21b4 0 days 3:14:07.683 14:750 0 days 3:06:01.040 15:2300 0 days 3:05:54.862 16:1ee0 0 days 3:05:52.803 17:c2c 0 days 3:05:47.655 18:1f70 0 days 3:05:22.430 19:f50 0 days 3:05:13.678 20:16b4 0 days 3:05:11.619 21:1d3c 0 days 3:05:06.471 22:20b8 0 days 3:05:04.927 23:1fd0 0 days 3:04:55.660 24:1744 0 days 3:04:54.631 25:1840 0 days 3:04:47.423 26:21ec 0 days 3:04:44.335 27:1b90 0 days 3:04:43.305 28:d6c 0 days 3:04:42.790 29:1960 0 days 3:04:42.275 30:4d8 0 days 3:04:28.376 31:17ac 0 days 3:04:22.713 32:fe4 0 days 3:04:18.080 33:111c 0 days 3:04:16.520 34:20f8 0 days 3:03:55.912 35:1fb0 0 days 3:03:40.983 36:2240 0 days 3:03:40.468 37:89c 0 days 3:03:36.350 38:678 0 days 3:03:00.267 39:141c 0 days 1:21:22.949 40:1ea8 0 days 0:25:36.603 41:2064 0 days 0:20:15.913 42:1bb8 0 days 0:05:49.706 43:1f90 0 days 0:03:59.367 44:1a04 0 days 0:03:58.852 45:1d48 0 days 0:02:11.197 46:1278 0 days 0:02:09.652 47:e8c 0 days 0:00:45.833 48:18f0 0 days 0:00:35.366 49:1570 0 days 0:00:31.653
于是发现这个9和10号进程tmd不仅运行超过了7天,占用cpu也超过了7天,cpu使用率((User Mode Time + Kernel Mode Time)/ Elapsed Time)超过99%(犯人就是你了
在线程子窗口双击9号线程,切换到这个线程,输入 !clrstack 查看托管堆栈。
发现好像卡在这个字典的 insert 上了,又查了几个线程都是卡在这个字典上。
最后查看代码发现这里是一个序列化配置文件数据到内存的方法,每次调用接口时会检查内存中的配置是否过期,过期后会重新读取配置。这里存在一个设计问题和一个代码问题。设计问题是更新配置不应该在调用时同步更新配置,代码问题是更新一个共享的数据时没有加锁。(前人挖的坑真的大
不过一个接口调用卡了7天也是神奇(
这个程序问题是解决了。我们还可以探索一下windbg。
本来想用 kd 和内存查看栈上的数据,但是根本不知道堆栈地址上的这个是啥类型的参数还是函数地址。但是我发现了 !DumpStackObjects(查看clr堆栈对象)和 !DumpObj(查看clr对象)命令。
(!DumpObj 不能直接传堆栈地址,需要对象地址
参考资料:SOS 调试扩展
关于如何用 kd 和 memory 查看栈上对象。先用 kd 列出所有堆栈地址,挑选这个 00000000`2228bf00 地址(用 !DumpStackObjects作弊了),可以在 memory 窗口输入这个地址看到这个地址存的数据是 0000008`d42f53e8,内存数据是小端排序,这是一个内存地址,继续跳转到 0000008`d42f53e8 可以看到这大概是一个字符串。