一:背景
1. 讲故事
这个月初,有位敌人wx上找到我,说他的api过一段时间后,就会呈现只有申请,没有响应的状况,截图如下:
从敌人的形容中看样子程序是被什么货色卡住了,这种卡死的问题解决起来绝对简略,接下来我就用 windbg 给大家剖析一下。
二:Windbg 剖析
1. Request 申请正在干嘛?
既然敌人说 api 有 request 无 response,那怎么去验证敌人的话对不对呢? 咱们都晓得 .NET 用 HttpContext
来示意一个申请,话中有话就是能够去抓 HttpContext 下的时长属性,Netext 中有一个 !whttp
命令能够帮忙咱们。
0:000> !whttpHttpContext Thread Time Out Running Status Verb Url000000563bf803b0 42 00:01:50 00:01:24 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x-HN000000563bf84660 -- 00:01:50 Finished 200 GET http://xxx.com:30003/000000563c4a0470 51 00:01:50 00:00:12 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2C00000056bbf63590 30 00:01:50 00:02:41 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx-B2C00000056bc82a038 -- 00:01:50 Finished 200 GET http://localhost:30003/00000056bc84a3e8 44 00:01:50 00:00:51 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x00000056bc8671c8 46 00:01:50 00:00:45 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx-B2C000000573bf44698 35 00:01:50 00:02:39 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x000000573bf483c0 33 00:01:50 00:02:41 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x-HN000000573bf97e80 40 00:01:50 00:02:32 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=ZJB2C000000573c583b08 -- 00:01:50 Finished 200 GET http://localhost:30003/000000573c589ec8 -- 00:01:50 Finished 200 GET http://xxx.com:30003/Wms/xxx/xxx/xxx000000573c760e28 -- 00:01:50 Finished 200 POST http://xxx.com:30003/Wms/xxx/xxx/xxx000000573c95f990 48 00:01:50 00:00:31 200 POST http://xxx.com:30003/Wms/Common/xxx?xxx=xxx&xxx=x-HN00000057bbf4f8e8 31 00:01:50 00:02:12 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x00000057bc080340 50 00:01:50 00:00:19 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x000000583c4aee80 43 00:01:50 00:01:11 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2B000000583c4d0c50 53 00:01:50 00:00:01 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2B00000058bbf8f1a0 34 00:01:50 00:02:22 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2B000000593bfe1758 41 00:01:50 00:01:22 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx2C000000593c892160 -- 00:01:50 Finished 200 GET http://xxx.com:30003/Wms/xxx/xxx/xxxJob000000593ca813b0 45 00:01:50 00:00:30 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx-HN000000593caa45d8 -- 00:01:50 Finished 200 GET http://xxx.com:30003/00000059bc1ad808 32 00:01:50 00:01:45 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=xxx-B2C00000059bc1c3d70 36 00:01:50 00:01:29 200 POST http://xxx.com:30003/Wms/xxx/xxx?xxx=xxx&xxx=x25 HttpContext object(s) found matching criteria
从 Running
列能够看到大多申请都曾经达到1分钟以上,这也验证了敌人所说的卡死问题,依照教训,能够取 Running
列中最大的 httpContext 所在的线程,也就是下面的 30
和 33
号线程, 看看它们都在干什么?
2. 探索 Running 最长线程
接下来切到 30
和 33
号线程,看看它们的线程栈。
0:000> ~30sntdll!NtWaitForSingleObject+0xa:00007ffd`b81f024a c3 ret0:030> !clrstack OS Thread Id: 0x29d0 (30) Child SP IP Call Site0000005acc3ac590 00007ffdb81f024a [PrestubMethodFrame: 0000005acc3ac590] xxx.xxx.RedisConnectionHelp.get_Instance()0000005acc3ac850 00007ffd4dd78911 xxx.xxx.RedisCache..ctor(Int32, System.String)0000005acc3ac8c0 00007ffd4dd78038 xxx.xxx.CacheByRedis.HashGet[[System.__Canon, mscorlib]](System.String, System.String, Int32)0000005acc3ac968 00007ffdabef1f7c [StubHelperFrame: 0000005acc3ac968] 0000005acc3ac9c0 00007ffd4dd77f18 xxx.xxx.Cache.xxx.GetCacheNotAreaDataEntity[[System.__Canon, mscorlib]](System.String, System.String, System.String)...0:030> ~33sntdll!NtWaitForMultipleObjects+0xa:00007ffd`b81f07ba c3 ret0:033> !clrstack OS Thread Id: 0x3ad4 (33) Child SP IP Call Site0000005accabae90 00007ffdb81f07ba [GCFrame: 0000005accabae90] 0000005accabafb8 00007ffdb81f07ba [HelperMethodFrame_1OBJ: 0000005accabafb8] System.Threading.Monitor.ObjWait(Boolean, Int32, System.Object)0000005accabb0d0 00007ffdaac60d64 System.Threading.ManualResetEventSlim.Wait(Int32, System.Threading.CancellationToken)0000005accabb160 00007ffdaac5b4bb System.Threading.Tasks.Task.SpinThenBlockingWait(Int32, System.Threading.CancellationToken)0000005accabb1d0 00007ffdab5a01d1 System.Threading.Tasks.Task.InternalWait(Int32, System.Threading.CancellationToken)0000005accabb2a0 00007ffdab59cfa7 System.Threading.Tasks.Task`1[[System.__Canon, mscorlib]].GetResultxxx(Boolean)0000005accabb2e0 00007ffd4d8d338f xxx.Config.xxx.Config`1[[System.__Canon, mscorlib]].GetConfig(xxx.Config.Model.ConfigListener, System.Func`2<xxx.Config.Request.GetConfigRequest,System.Threading.Tasks.Task`1<System.String>>)0000005accabb340 00007ffd4d8d2f40 xxx.Config.xxx.Config`1[[System.__Canon, mscorlib]].get_Item(System.String, System.String)0000005accabb3c0 00007ffd4dd78f7f xxx.Util.BaseConfig.get_GetRedisConn()0000005accabb440 00007ffd4dd78e9c xxx.xxx.RedisConnectionHelp.GetConnectionString()0000005accabb4a0 00007ffd4dd789cb xxx.xxx.RedisConnectionHelp..cctor()0000005accabb940 00007ffdabef6953 [GCFrame: 0000005accabb940] 0000005accabc5b0 00007ffdabef6953 [PrestubMethodFrame: 0000005accabc5b0] xxx.xxx.RedisConnectionHelp.get_Instance()0000005accabc870 00007ffd4dd78911 xxx.xxx.RedisCache..ctor(Int32, System.String)0000005accabc8e0 00007ffd4dd78038 xxx.xxx.CacheByRedis.HashGet[[System.__Canon, mscorlib]](System.String, System.String, Int32)0000005accabc988 00007ffdabef1f7c [StubHelperFrame: 0000005accabc988] 0000005accabc9e0 00007ffd4dd77f18 xxx.Core.Cache.xxx.GetCacheNotAreaDataEntity[[System.__Canon, mscorlib]](System.String, System.String, System.String)...
下面的信息不难发现 30
号线程正卡在 RedisConnectionHelp.get_Instance()
处,33
号线曾经进入了 RedisConnectionHelp.get_Instance()
办法中,最初在 GetConfig()
处期待 Result 的后果,按教训来说,30
号线程看样子正在锁期待, 33
号正在期待异步后果,接下来的突破点就是探索下 RedisConnectionHelp.Instance
处代码。
3. 寻找问题代码
接下来用反编译工具 ILSpy 找到问题代码。
public static class RedisConnectionHelp{ public static ConnectionMultiplexer Instance { get { if (_instance == null) { lock (Locker) { if (_instance == null || !_instance.IsConnected) { _instance = GetManager(); } } } return _instance; } }}
30 号线程果然是卡在 Locker
处,接下来深挖下 33 号线程所执行的 GetManager()
办法,简化后代码如下:
public T this[string dataId, string key = ""]{ get { try { string config = GetConfig(configListener, new NacosConfigClient(Base.NacosConfiguration).DoGetConfigAsync); return JsonConvert.DeserializeObject<T>(config); } catch (Exception ex) { return default(T); } }}private string GetConfig(ConfigListener listener, Func<GetConfigRequest, Task<string>> action){ var text2 = action(new GetConfigRequest { DataId = listener.DataId, Group = listener.Group, Tenant = text }).Result; return text2;}internal async Task<string> DoGetConfigAsync(GetConfigRequest request){ IRestResponse restResponse = await HttpUtil.Request(currentServerAddr, Method.GET, request.ParamValues(), xxx); return restResponse.Content;}
能够看到代码卡在了 Result
上无限期期待,到这里我就想到了 同步上下文
,我看他这个程序是 .NET 4.8
下的 ASP.NET MVC
,记得不错上下文应该是 AspNetSynchronizationContext
,具体死锁起因可参见我的这篇文章:一句 Task.Result 就死锁, 这代码还怎么写?,解决办法大略有四种。
- 应用 .ConfigureAwait(false)
- 改成全异步
- 用 Task 再包一层。
- 改成全同步
三:总结
其实本次事变次要还是因为在 同步代码
中做了 异步代码.Result
导致的死锁问题,有十分多的文章都在鞭挞这种景象,在 asp.net core 中曾经移除了这种同步上下文
的大坑,给到敌人的倡议是改成全同步,死锁问题也随之隐没。
哈哈,真替敌人开心!