
Looking for a problem in the wrong place
This is a short story from real practice about how a small problem, well camouflaged by fault tolerance, turned into a real headache.
A brief overview
A small branch office has its own PBX (Asterisk + FreePBX) running on desktop-class hardware, plus a similar local terminal server with 1C, file storage, and a virtualized read-only domain controller. A Mikrotik router provides Internet access. For a small branch, that is enough.
It all started with monitoring (for lack of time and out of laziness, not everything is monitored), which reported that one server in the branch, the PBX, was overheating. While the local staff were dealing with that, the old machine crashed and slightly corrupted the MySQL database.
Plenty of signs pointed to trouble, but not this kind
No big deal: the database gets repaired, and everything should work. But the local staff complain that calls are failing. Fine, so FreePBX has problems; I take a backup, restore it, and everything looks OK.
But the trouble has not gone away: the staff are still complaining that calls do not go through properly. Incoming calls to them work normally, but when they call out or call each other, there is a delay of several seconds. I start digging through the voluminous and obscure Asterisk and FreePBX logs and cannot spot the problem in them. I recall an earlier issue with STUN and ICE that caused a similar delay. I turn both off completely; no effect.
Despondency is the road to bad decisions
I get discouraged. Many hours of poking at the PBX have led nowhere, it is already late at night, and the problem is still not solved.
I leave the problem until morning, hoping a fresh head will help. In the morning I make another poor decision: since the system was damaged by the crash (although the crash could hardly have been that destructive), I try to fix it by reinstalling all the packages. The result is slightly better than nothing: the delay gets shorter (not by much, but it already feels like a success).
Then I make one more bad decision. Partially repairing the OS (and restoring the databases from backup) has had little effect, the root cause is still unclear, and a lot of time has already gone into looking for it, so I decide to act radically: wipe the OS and deploy everything from scratch (fortunately, automation makes this possible in an acceptable amount of time). I restore the FreePBX configuration from the backup copy. Another failure. No effect at all!
Despair: the mind clouds over and decisions get worse
I am sinking into despair. Very dark thoughts start to creep in: maybe the configuration in the backup is itself broken (I have seen cases where, after a series of updates, things stopped working and I could not find the reason). There is nothing left but to configure everything from scratch by hand. What a disgrace! The result is exactly zero, and even more time is gone!
Acceptance is the path to awareness
In desperate attempts to understand what is happening, I start studying the logs carefully, and I notice a pattern. A call to a single extension is delayed by exactly 5 seconds, and a call to a ring group of 3 extensions by 15! I start googling call delays again, this time specifying the exact delay. And I come across an answer I had already seen: people say the problem is DNS. But I know for sure there is no DNS problem: all addresses resolve!
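A delay that is an exact multiple of 5 seconds is a typical fingerprint of a resolver timeout. Assuming the PBX host uses the glibc stub resolver with default settings (a 5-second timeout per nameserver), every lookup waits 5 seconds on a dead first server before falling back to the next one, so names still resolve, just slowly. The Python sketch below is not something from the original setup; the function names are made up for illustration. It times a lookup through the system resolver and flags durations that sit suspiciously close to a multiple of that timeout.

```python
import socket
import time


def timed_lookup(hostname):
    """Resolve hostname via the system resolver; return (seconds, addresses)."""
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(hostname, None)
        addresses = sorted({info[4][0] for info in infos})
    except socket.gaierror:
        # Resolution failed outright; the elapsed time is still informative.
        addresses = []
    return time.monotonic() - start, addresses


def looks_like_resolver_timeout(elapsed, timeout=5.0, tolerance=0.5):
    """True if elapsed is within tolerance of 1x, 2x, 3x... the resolver timeout.

    timeout=5.0 is an assumption: the glibc default when /etc/resolv.conf
    does not set "options timeout:N".
    """
    multiples = round(elapsed / timeout)
    return multiples >= 1 and abs(elapsed - multiples * timeout) <= tolerance


if __name__ == "__main__":
    elapsed, addresses = timed_lookup("localhost")
    print(f"resolved to {addresses} in {elapsed:.2f}s")
    if looks_like_resolver_timeout(elapsed):
        print("suspicious: the lookup time matches a resolver timeout")
```

A lookup that succeeds but takes about 5 seconds would have pointed at DNS immediately, even though "all addresses resolve".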
The obvious is the incredible
With nothing else left to try, I pick up nslookup, and bingo (if only I had done that right away!). The primary DNS server, the virtual machine with the domain controller, is down, and I had not noticed! Had there been only one DNS server, the error would have shown up immediately ;)
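The point of the nslookup check is to ask each DNS server directly instead of going through the system resolver, which silently falls back to the secondary. A minimal sketch of the same idea in Python, using only the standard library: it sends one hand-built A-record query over UDP to a given server and reports whether anything came back. The function names and the fixed query ID are illustrative assumptions, it handles IPv4 server addresses only, and it treats any well-formed reply (even an error reply) as "the server is alive".

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary ID, used to match the reply to our query


def build_query(hostname):
    """Build a minimal DNS query packet asking for the A record of hostname."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack(">HHHHHH", QUERY_ID, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed with its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A record), QCLASS=1 (IN).
    return header + qname + struct.pack(">HH", 1, 1)


def dns_server_answers(server, hostname, port=53, timeout=2.0):
    """Return True if this specific DNS server replies to a query in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(hostname), (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Covers timeouts as well as unreachable or refused destinations.
            return False
    if len(reply) < 12:
        return False
    reply_id, flags = struct.unpack(">HH", reply[:4])
    # The reply must carry our ID and have the QR (response) bit set.
    return reply_id == QUERY_ID and bool(flags & 0x8000)
```

Calling `dns_server_answers` once per configured server, with the primary's address first, would have shown the dead primary in a couple of seconds.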
Summary
An elementary problem that monitoring could have caught (it still needs to be set up for all nodes), masked by DNS redundancy, cost almost two working days spent on a silly situation. Laziness is to blame for all of it: setting up monitoring takes a minute, while looking for a problem where it does not exist takes two days.
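The lesson for monitoring is that a redundant service has to be checked per server, not through the fallback that hides the failure. The sketch below is an assumption about how such a check could look, not the monitoring actually used here. It pulls the `nameserver` entries out of resolv.conf-style text and turns per-server results into the usual Nagios-style plugin status (0 OK, 1 WARNING, 2 CRITICAL), warning as soon as any one server is down. The `is_alive` callable is a placeholder for whatever per-server probe is available, and the addresses in the example are invented.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios-style plugin exit codes


def parse_nameservers(resolv_conf_text):
    """Extract nameserver addresses from resolv.conf-style text, in order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;":
            continue  # skip blank lines and comments
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def dns_status(servers, is_alive):
    """Map per-server probe results to (status_code, message).

    is_alive is any callable taking a server address and returning a bool.
    """
    down = [server for server in servers if not is_alive(server)]
    if not servers or len(down) == len(servers):
        return CRITICAL, "no working DNS servers"
    if down:
        # Resolution still works via fallback, which is exactly what hid the fault.
        return WARNING, "DNS servers down: " + ", ".join(down)
    return OK, "all DNS servers answer"


if __name__ == "__main__":
    # Hypothetical configuration: the primary (the domain controller VM) is down.
    sample = "nameserver 192.168.1.10\nnameserver 192.168.1.1\n"
    code, message = dns_status(
        parse_nameservers(sample),
        is_alive=lambda server: server != "192.168.1.10",
    )
    print(code, message)
```

Treating a partial failure as a warning rather than as OK is the design choice that matters: with only an end-to-end "does the name resolve" check, this incident would still have looked green.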
Has this happened to you?
- Yes, very rarely: 31.6% (19 votes)
- Yes, rarely: 46.6% (28 votes)
- Yes, often: 11.6% (7 votes)
- Yes, very often: 3.3% (2 votes)
- No, it happens to anyone but me!: 0% (0 votes)
- No, I am infallible!: 6.6% (4 votes)