We recently started getting errors from SAM/NPM where it reports a bunch of nodes as "down" because it can't ping them. It looks like some sort of network congestion, on a network that was perfectly healthy for years until last week: high response times on multiple nodes, a high number of errors, and some applications going down as well. I'd appreciate any tips and best practices on how to troubleshoot this.
