How to work with connection timeouts: the Ticketmaster case
- Translation
We regularly share our experience in optimizing the service systems of our IaaS provider:
Today, the Ticketmaster case caught our attention. Let's take a brief look at it.

Photo by Ginny / CC

Audyn Espinoza says that the Ticketmaster team monitors its systems constantly. The project's IT department likes to spend its time optimizing the service, but IT problems still have to be solved as they appear. In this case, it all started when one of the requests began to show an unusually high number of timeouts. The monitoring logs showed that the problem affected the entire cluster, and that the timeouts occurred every minute.
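The two observations from the logs, that the whole cluster was affected and that timeouts recurred every minute, can be checked with a simple tally. The sketch below is only an illustration and not Ticketmaster's tooling: the `(host, timestamp)` record format and the function name `summarize_timeouts` are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def summarize_timeouts(events, all_hosts):
    """Tally timeout events per host and per minute.

    events: iterable of (host, datetime) pairs, one per logged timeout
            (a hypothetical record format).
    all_hosts: the set of hosts in the cluster.
    Returns (fraction_of_hosts_affected, sorted_minutes_with_timeouts).
    """
    per_host = defaultdict(int)
    minutes = set()
    for host, ts in events:
        per_host[host] += 1
        # Truncate to the minute so events in the same minute collapse together.
        minutes.add(ts.replace(second=0, microsecond=0))
    affected = len(set(per_host) & set(all_hosts)) / len(all_hosts)
    return affected, sorted(minutes)
```

A fraction close to 1.0 together with an unbroken run of consecutive minutes would point away from a single faulty node and toward something shared by the whole cluster.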
A further look at the situation with tcpdump showed that the problem could be localized to the point where traffic passes through the firewall. For a more detailed investigation, the team decided to use OPNET and Wireshark.
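One common way to localize a loss to a device such as a firewall is to capture traffic on both sides of it and look for packets that show up on one side only. The sketch below assumes the captures have already been reduced to tuples that identify each packet. The tuple layout and the name `lost_in_transit` are illustrative, and this is not necessarily how the Ticketmaster team processed their tcpdump output.

```python
def lost_in_transit(ingress, egress):
    """Return packets seen entering a device but never seen leaving it.

    ingress, egress: lists of hashable packet keys, for example
    (src, sport, dst, dport, flags, seq), taken from captures on each
    side of the device. The key layout is an assumption for this sketch.
    Input order is preserved in the result.
    """
    seen_out = set(egress)
    return [pkt for pkt in ingress if pkt not in seen_out]
```

If the device rewrites addresses or ports (NAT), the keys would have to be normalized before comparing the two captures.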
These tools showed that the SYN packet passed through without problems, but its counterpart, the SYN/ACK, did not. When a test packet was sent in the opposite direction, the result was similar.
As a result, the team went back to examining the firewall. They found that when the TCP timeout expired, the SYN packet was retransmitted, while the SYN/ACK never made it through the virtual firewall.
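The pattern described here, a SYN that is sent again because no SYN/ACK ever arrives, is straightforward to look for once a capture has been exported to a list of records. The following is a minimal sketch under assumed inputs: the `(time, src, sport, dst, dport, flags)` tuple format, the `"S"` and `"SA"` flag strings, and the function name `unanswered_syns` are all hypothetical.

```python
from collections import defaultdict


def unanswered_syns(packets):
    """Find connection attempts whose SYN was retransmitted without a SYN/ACK.

    packets: iterable of (time, src, sport, dst, dport, flags) tuples as seen
    at one capture point; flags is "S" for SYN and "SA" for SYN/ACK.
    This record format is an assumption for the sketch.
    Returns {flow: [syn_times]} for flows with 2+ SYNs and no SYN/ACK.
    """
    syn_times = defaultdict(list)
    answered = set()
    for time, src, sport, dst, dport, flags in packets:
        if flags == "S":
            syn_times[(src, sport, dst, dport)].append(time)
        elif flags == "SA":
            # A SYN/ACK travels in the reverse direction, so swap the
            # endpoints to match it to the client's flow key.
            answered.add((dst, dport, src, sport))
    return {
        flow: times
        for flow, times in syn_times.items()
        if len(times) > 1 and flow not in answered
    }
```

Run against the capture taken on the client side of the firewall, the flows it returns are the connection attempts that stalled, and the gaps between the recorded SYN times reflect the sender's retransmission timer.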
Final verdict: SNMP monopolized the CPU every 60 seconds, as the firewall's own monitoring showed. It was a bug at the code level. To solve the problem, the team disabled SNMP polling.
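A cause that fires on a fixed schedule tends to leave a recognizable trace: failures cluster at the same offset within each cycle. As a rough sketch of how such a correlation could be checked, the code below folds event timestamps onto a 60-second cycle. The function names and the use of timestamps in seconds are assumptions; the article does not describe how the team made the connection beyond consulting the firewall's monitoring.

```python
from collections import Counter


def phase_histogram(timestamps, period=60):
    """Count events by their offset within a repeating period.

    timestamps: event times in seconds (for example, Unix epoch seconds).
    period: suspected cycle length in seconds; 60 matches the interval
            mentioned in the article.
    Returns a Counter mapping offset (0 .. period - 1) to event count.
    """
    return Counter(int(ts) % period for ts in timestamps)


def peak_share(timestamps, period=60):
    """Fraction of events that fall into the single busiest offset."""
    hist = phase_histogram(timestamps, period)
    total = sum(hist.values())
    return max(hist.values()) / total if total else 0.0
```

A peak share well above `1 / period` suggests the timeouts are aligned with something that runs once per cycle, which is then worth comparing against the schedule of periodic jobs such as polling.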
PS A little about the work of our IaaS provider: