windbg调试一个CPU100%的.net进程

公司有个服务偶发CPU100%问题,某天正好工作时间爆出,运维就把内存dump了下来,因此得以分析这个神奇事故。dump文件有38G,本来以为我一定要个64GB的内存才能开这dump文件,然而windbg/vs神奇的并不需要这样,于是我们得以简单的用开发机调试这个文件。

首先同前文一样把符号文件全加载进来。发现当前进程有50个线程,图中第一个线程的堆栈说明他执行了ServiceBase.Run(ServicesToRun),线程在wait状态。

50个线程就能让8核16线程的服务器把cpu跑满说明肯定有哪些线程死锁住了。这时就有神奇的 !runaway 命令查出线程的运行时间。

我们还有更强的的 !runaway 7

0:000> !runaway 7
 User Mode Time
  Thread       Time
  10:1348      7 days 1:42:58.088
   9:14b4      7 days 0:46:16.252
  37:89c       0 days 2:58:47.190
  36:2240      0 days 2:58:20.888
  38:678       0 days 2:57:42.715
  35:1fb0      0 days 2:57:36.022
  34:20f8      0 days 2:55:21.097
  33:111c      0 days 2:55:18.632
  32:fe4       0 days 2:54:04.875
  31:17ac      0 days 2:53:51.662
  30:4d8       0 days 2:53:24.798
  29:1960      0 days 2:53:01.757
  28:d6c       0 days 2:52:35.907
  27:1b90      0 days 2:52:28.326
  26:21ec      0 days 2:51:53.678
  25:1840      0 days 2:51:05.692
  24:1744      0 days 2:48:14.107
  23:1fd0      0 days 2:44:59.854
  22:20b8      0 days 2:38:26.997
  21:1d3c      0 days 2:33:27.693
  20:16b4      0 days 2:28:57.921
  19:f50       0 days 2:25:40.642
  18:1f70      0 days 2:21:12.461
  17:c2c       0 days 2:19:18.315
  16:1ee0      0 days 2:18:34.712
  12:21a8      0 days 2:18:19.705
  15:2300      0 days 2:17:47.631
  14:750       0 days 2:17:33.154
  13:21b4      0 days 2:16:35.730
  11:290       0 days 2:15:20.788
   7:1ff4      0 days 0:40:45.736
   6:1f30      0 days 0:03:49.539
   2:207c      0 days 0:00:40.997
  39:141c      0 days 0:00:39.015
  47:e8c       0 days 0:00:37.908
  41:2064      0 days 0:00:22.854
   5:1024      0 days 0:00:09.094
  44:1a04      0 days 0:00:05.787
  43:1f90      0 days 0:00:05.210
  46:1278      0 days 0:00:03.759
  45:1d48      0 days 0:00:03.588
  49:1570      0 days 0:00:01.716
  48:18f0      0 days 0:00:01.123
   8:6d4       0 days 0:00:00.093
   4:164c      0 days 0:00:00.031
   0:c68       0 days 0:00:00.015
  42:1bb8      0 days 0:00:00.000
  40:1ea8      0 days 0:00:00.000
   3:1628      0 days 0:00:00.000
   1:1ed8      0 days 0:00:00.000
 Kernel Mode Time
  Thread       Time
  10:1348      0 days 7:58:29.737
   9:14b4      0 days 0:54:50.107
  20:16b4      0 days 0:05:09.069
  22:20b8      0 days 0:04:46.308
  21:1d3c      0 days 0:04:18.197
  19:f50       0 days 0:03:36.841
  18:1f70      0 days 0:03:36.513
  23:1fd0      0 days 0:02:47.966
  24:1744      0 days 0:02:29.402
  34:20f8      0 days 0:01:57.094
  17:c2c       0 days 0:01:41.166
   7:1ff4      0 days 0:01:33.787
  32:fe4       0 days 0:01:33.382
  37:89c       0 days 0:00:58.328
  36:2240      0 days 0:00:54.319
  35:1fb0      0 days 0:00:44.881
  14:750       0 days 0:00:44.382
  16:1ee0      0 days 0:00:42.385
  38:678       0 days 0:00:40.794
  33:111c      0 days 0:00:39.047
  15:2300      0 days 0:00:38.017
  12:21a8      0 days 0:00:34.663
  26:21ec      0 days 0:00:33.212
  30:4d8       0 days 0:00:31.465
  11:290       0 days 0:00:30.420
  25:1840      0 days 0:00:27.268
  29:1960      0 days 0:00:21.762
  27:1b90      0 days 0:00:19.468
  13:21b4      0 days 0:00:16.411
   6:1f30      0 days 0:00:15.678
  28:d6c       0 days 0:00:14.898
  31:17ac      0 days 0:00:13.041
   2:207c      0 days 0:00:05.974
   5:1024      0 days 0:00:01.778
  39:141c      0 days 0:00:00.982
  41:2064      0 days 0:00:00.483
  44:1a04      0 days 0:00:00.109
  46:1278      0 days 0:00:00.046
  45:1d48      0 days 0:00:00.031
  43:1f90      0 days 0:00:00.031
  48:18f0      0 days 0:00:00.015
  47:e8c       0 days 0:00:00.015
   8:6d4       0 days 0:00:00.015
  49:1570      0 days 0:00:00.000
  42:1bb8      0 days 0:00:00.000
  40:1ea8      0 days 0:00:00.000
   4:164c      0 days 0:00:00.000
   3:1628      0 days 0:00:00.000
   1:1ed8      0 days 0:00:00.000
   0:c68       0 days 0:00:00.000
 Elapsed Time
  Thread       Time
   3:1628      17 days 6:04:49.400
   2:207c      17 days 6:04:49.400
   1:1ed8      17 days 6:04:49.400
   0:c68       17 days 6:04:49.400
   4:164c      17 days 6:04:49.384
   5:1024      17 days 6:04:49.072
   6:1f30      17 days 6:04:48.682
   7:1ff4      17 days 6:04:48.651
   8:6d4       17 days 6:04:48.589
   9:14b4      7 days 12:14:33.593
  10:1348      7 days 11:59:45.561
  11:290       0 days 3:46:11.525
  12:21a8      0 days 3:14:44.234
  13:21b4      0 days 3:14:07.683
  14:750       0 days 3:06:01.040
  15:2300      0 days 3:05:54.862
  16:1ee0      0 days 3:05:52.803
  17:c2c       0 days 3:05:47.655
  18:1f70      0 days 3:05:22.430
  19:f50       0 days 3:05:13.678
  20:16b4      0 days 3:05:11.619
  21:1d3c      0 days 3:05:06.471
  22:20b8      0 days 3:05:04.927
  23:1fd0      0 days 3:04:55.660
  24:1744      0 days 3:04:54.631
  25:1840      0 days 3:04:47.423
  26:21ec      0 days 3:04:44.335
  27:1b90      0 days 3:04:43.305
  28:d6c       0 days 3:04:42.790
  29:1960      0 days 3:04:42.275
  30:4d8       0 days 3:04:28.376
  31:17ac      0 days 3:04:22.713
  32:fe4       0 days 3:04:18.080
  33:111c      0 days 3:04:16.520
  34:20f8      0 days 3:03:55.912
  35:1fb0      0 days 3:03:40.983
  36:2240      0 days 3:03:40.468
  37:89c       0 days 3:03:36.350
  38:678       0 days 3:03:00.267
  39:141c      0 days 1:21:22.949
  40:1ea8      0 days 0:25:36.603
  41:2064      0 days 0:20:15.913
  42:1bb8      0 days 0:05:49.706
  43:1f90      0 days 0:03:59.367
  44:1a04      0 days 0:03:58.852
  45:1d48      0 days 0:02:11.197
  46:1278      0 days 0:02:09.652
  47:e8c       0 days 0:00:45.833
  48:18f0      0 days 0:00:35.366
  49:1570      0 days 0:00:31.653

于是发现这个9和10号进程tmd不仅运行超过了7天,占用cpu也超过了7天,cpu使用率((User Mode Time + Kernel Mode Time)/ Elapsed Time)超过99%(犯人就是你了

在线程子窗口双击9号线程,切换到这个线程,输入 !clrstack 查看托管堆栈。

发现好像卡在这个字典的 insert 上了,又查了几个线程都是卡在这个字典上。

最后查看代码发现这里是一个序列化配置文件数据到内存的方法,每次调用接口时会检查内存中的配置是否过期,过期后会重新读取配置。这里存在一个设计问题和一个代码问题。设计问题是更新配置不应该在调用时同步更新配置,代码问题是更新一个共享的数据时没有加锁。(前人挖的坑真的大

不过一个接口调用卡了7天也是神奇(

这个程序问题是解决了。我们还可以探索一下windbg。

本来想用 kd 和内存查看栈上的数据,但是根本不知道堆栈地址上的这个是啥类型的参数还是函数地址。但是我发现了 !DumpStackObjects(查看clr堆栈对象)和 !DumpObj(查看clr对象)命令。

(!DumpObj 不能直接传堆栈地址,需要对象地址

参考资料:SOS 调试扩展

 

关于如何用 kd 和 memory 查看栈上对象。先用 kd 列出所有堆栈地址,挑选这个 00000000`2228bf00 地址(用 !DumpStackObjects作弊了),可以在 memory 窗口输入这个地址看到这个地址存的数据是 0000008`d42f53e8,内存数据是小端排序,这是一个内存地址,继续跳转到 0000008`d42f53e8 可以看到这大概是一个字符串。

发表回复

您的电子邮箱地址不会被公开。 必填项已用*标注