測試台的登入API突然出現失敗,追至上游在Log發現錯誤訊息:

A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full.)

依經驗,前半段是SQL Server未開機或無法建立網路連線的標準訊息,但後方"because the system lacked sufficient buffer space or because a queue was full."卻很罕見,查到MSND Blog,有兩種可能:OS的TCP Buffer記憶體耗盡,或是TCP的臨時Port號用光了(Windows 2003/XP只有5000個可用)。初步觀察,用netstat檢查當下連線不到20條離上限仍遠,至於TCP Buffer耗盡疑慮則沒找到明確線索證實或排除。找出該API要連線的SQL IP,由其他主機試連OK,因此排除SQL Server端網路出狀況的可能性,聚焦回到API主機端的網路。試著修改API服務加入追蹤Log,再得到不同的誤訊息:

A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - Only one usage of each socket address (protocol/network address/port) is normally permitted.)

由"Only one usage of each socket address (protocol/network address/port) is normally permitted."查到MSDN文件,問題根源再次指向TCP Port號用完,且於API主機試連其他台IIS(telnet xxx.xxx.xxx.xxx 80)也出現無法建立連線錯誤,確認問題出在API主機無法建立對外連線(但可以順利透過VNC連上它遠端遙控偵錯,顯示對內連線未受影響),反覆測試telent偶爾成功,多數失敗,推測可能剩餘TCP Port號或Buffer資源有限,故成功靠運氣,但netstat並沒看到成千條並存連線,缺乏佐證。

問題在重新開機後消失。重開機時TCP等資源會全部重設,Port或Buffer不足狀況消失,一切恢復正常很合理。但先前未由netstat印證資源被大量連線吃光,真相成謎,先歸入X檔案~


Comments

# by Kz

關注+1。 去年遇過幾乎一樣的情況,最後解也是重啟AP主機,歸入X檔案~。 對方系統人員曾說重啟前netstat有見到千萬條TCP TIMEWAIT,但無留存畫面。(........) 重啟後netstat正常如昔.. ( ' Д ' )

# by Jeffrey

to Kz, 使用SQL超過十年第一次遇到,本以為又是冷到極點的茶包,沒想到讀者裡就有人經歷過…(握手) 如果netstat有成千上萬條TCP TIMEWAIT,可檢查都連到哪一台主機,再回頭探索相關程式及服務是否有Bug或故障,方向會明確很多,系統管理人員沒有保存線索實在太可惜了~ (送他整組手套、鑷子跟證物袋,提醒他下回要蒐證好了 :P)

# by Kz

我又遇到相同狀況啦啦啦~ 這次是不同客戶,請對方留存當下netstat畫面, 發現是一大串的本機Port連對SQL伺服器,然後處於TIME_WAIT狀態。多大串?netstat列不完的那麼大串.. 這次當然也是重啟後就好了... 回頭再Google了一下,微軟有一個HotFix說的跟這狀態很像.. https://support.microsoft.com/zh-tw/kb/2553549 (不會吧..( - Д - #))

# by Jeffrey

to Kz, 了不起,能遇到這麼多次,這運氣移到簽樂透多好? MS的那篇Hotfix需要超過一年未重新開機,實務上很罕見(High Availability指數破百 XD),不知你的案例是否真為此狀況?

# by Kz

經查詢客戶伺服器的系統事件紀錄,發現狀況發生當天距離前一次重開機日期有499天之久(請SQL DATEDIFF幫我算的,不曉得是不是要掐頭去尾,哈哈) 所以此案應該算解了(吧)....

# by Jeffrey

to Kz, Wow~ 499天,Cool! 原來真有超過一年不重開的Windows,長見識了,感謝你的實戰分享,哈!

Post a comment