crashing in 2.2.2

Richard Robson
I've revisited the gateway failover mechanism I had developed in order
to re-route calls to the next gateway on 500s caused by capacity limits
on the gateways we are using.

We have 3 gateways from one carrier and one from another. The 3 are
limited to 4 CPS and will return a 503 or 500 if we breach this. The
single gateway from the other carrier has plenty of capacity and should
not be a problem, so we want to catch these responses and route to the
next gateway.

We are counting the CPS and channel limits and are routing to the next
gateway if we exceed the limit set, but there are still occasions where
a 5XX is generated, which results in a rejected call.


We want to stop these rejected calls and therefore want to implement the
failover mechanism for the 5XX responses. For 6 months we have been
failing over whenever we think the counts are too high on any one gateway,
without a problem. But when I implement a failover on a 5XX response,
OpenSIPS starts crashing.

It's difficult to generate artificial traffic to mimic the real traffic,
but I've not had a problem with the script in testing. Last night I
rolled out the new script, but by 09:15 this morning OpenSIPS started
crashing: 10 times in 5 minutes. This was as the traffic ramped up. I
rolled back the script and it restarted OK and has not crashed since.
Therefore the failover mechanism in the script is where the crash is
happening.

Core dump: http://pastebin.com/CqnESCm4

I'll add more dumps later

Regards,

Richard


This is the failure route catching the 5XX:

failure_route[dr_fo] {
        xlog("[dr] Received reply to method $rm From: $fd, $fn, $ft, $fu, $fU, $si, $sp, To: $ru");
        if (t_was_cancelled()) {
                xlog("[dr] call cancelled by internal caller");
                rtpengine_manage();
                do_accounting("db", "cdr|missed");
                exit;
        }

        if (t_check_status("[54]03")) {
                route(relay_failover);
        }
        if (t_check_status("500")) {
                route(relay_failover);
        }

        do_accounting("db", "cdr|missed");
        rtpengine_manage();
        exit;
}
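One detail worth noting about a failure_route shaped like this: when route(relay_failover) returns, execution falls through to the trailing do_accounting/rtpengine_manage block even though a failover branch has already been relayed. A sketch of a possible tightening (not claimed as the fix for the crash; it merely combines the status checks and stops script execution once failover has been attempted):

```
# Sketch, assuming the same relay_failover route as above: match
# 403/500/503 with one t_check_status() regex and exit right after
# handing the call off, so the missed-call accounting below only
# runs for genuinely final failures.
failure_route[dr_fo] {
        if (t_was_cancelled()) {
                rtpengine_manage();
                do_accounting("db", "cdr|missed");
                exit;
        }
        if (t_check_status("(403|500|503)")) {
                route(relay_failover);
                exit;   # don't fall through after attempting failover
        }
        do_accounting("db", "cdr|missed");
        rtpengine_manage();
        exit;
}
```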

This is the route taken on the failure:


route[relay_failover] {

        if (use_next_gw()) {
                xlog("[relay_failover-route] Selected Gateway is $rd");
                $avp(trunkratelimit) = $(avp(attrs){s.select,0,:});
                $avp(trunkchannellimit) = $(avp(attrs){s.select,1,:});

                ####### check channel limit ######
                get_profile_size("outbound", "$rd", "$var(size)");
                xlog("[relay_failover-route] Selected Gateway is $rd var(size) = $var(size)");
                xlog("[relay_failover-route] Selected Gateway is $rd avp(trunkchannellimit) = $avp(trunkchannellimit)");
                xlog("[relay_failover-route] Selected Gateway is $rd result = ( $var(size) > $avp(trunkchannellimit) )");
                if ($(var(size){s.int}) > $(avp(trunkchannellimit){s.int})) {
                        xlog("[relay_failover-route] Trunk $rd exceeded $avp(trunkchannellimit) concurrent calls: $var(size)");
                        route(relay_failover);
                }
        } else {
                send_reply("503", "Gateways Exhausted");
                exit;
        }

        ##### We need to check Rate Limiting #######
        if (!rl_check("$rd", "$(avp(trunkratelimit){s.int})", "TAILDROP")) {  # check rate limit; $avp needs changing
                rl_dec_count("$rd");  # decrement the counter since we've not "used" one
                xlog("[ratelimiter-route] [Max CPS: $(avp(trunkratelimit){s.int}) Current CPS: $rl_count($rd)] Call to: $rU from: $fU CPS exceeded, delaying");
                $avp(initial_time) = ($Ts*1000) + ($Tsm/1000);
                async(usleep("200000"), relay_failover_delay);
                xlog("Should not get here after the async request!");
        } else {
                xlog("[relay_outbound-route] [Max CPS: $avp(trunkratelimit) Current CPS: $rl_count($rd)] Call to: $rU from: $fU not ratelimited");
        }

        t_on_failure("dr_fo");
        do_accounting("db", "cdr|missed");
        rtpengine_manage();
        if (!t_relay()) {
                xlog("[relay-route] ERROR: Unable to relay");
                send_reply("500", "Internal Error");
                exit;
        }
}
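Since route(relay_failover) calls itself whenever the selected gateway is over its channel limit, one thing worth ruling out is unbounded recursion when every gateway is busy at once. A sketch of a guard (the $avp(fo_attempts) counter is hypothetical, not part of the original script):

```
# Sketch: cap how many times the failover route may re-enter itself.
# $avp(fo_attempts) and the limit of 5 are illustrative assumptions.
route[relay_failover] {
        if ($avp(fo_attempts) == NULL) {
                $avp(fo_attempts) = 0;
        }
        $avp(fo_attempts) = $avp(fo_attempts) + 1;
        if ($avp(fo_attempts) > 5) {
                send_reply("503", "Gateways Exhausted");
                exit;
        }
        # ... existing use_next_gw() / channel-limit / rate-limit logic ...
}
```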




--
Richard Robson
Greenlight Support
01382 843843
[hidden email]


_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users

Re: crashing in 2.2.2

Richard Robson
More cores:

http://pastebin.com/MXW2VBhi
http://pastebin.com/T7JFAP2U
http://pastebin.com/u44aaVpWquit
http://pastebin.com/SFKKcGxE
http://pastebin.com/dwSgMsJi
http://pastebin.com/9HdGLm96

I've put 2.2.3 on the dev box now and will try to replicate on that box,
but it's difficult to replicate the traffic artificially. I'll try to
replicate the fault on the dev box over the weekend. I can't do it on the
live gateways because it will affect customer traffic.

Regards,

Richard


On 03/03/2017 11:28, Richard Robson wrote:


Re: crashing in 2.2.2

Richard Robson
Hi,

I've tested this on the latest 2.2.3 with the same results.

http://pastebin.com/Uixb3v8G

There were a few of these in the logs too, just before the crash:
Mar  5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]:
WARNING:core:utimer_ticker: utimer task <tm-utimer> already scheduled
for 204079170 ms (now 204079270 ms), it may overlap..
Mar  5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]:
WARNING:core:utimer_ticker: utimer task <tm-utimer> already scheduled
for 204079170 ms (now 204079360 ms), it may overlap..
Mar  5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]:
WARNING:core:utimer_ticker: utimer task <tm-utimer> already scheduled
for 204079170 ms (now 204079460 ms), it may overlap..
Mar  5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]:
WARNING:core:utimer_ticker: utimer task <tm-utimer> already scheduled
for 204079170 ms (now 204079560 ms), it may overlap..
Mar  5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]:
WARNING:core:utimer_ticker: utimer task <tm-utimer> already scheduled
for 204079170 ms (now 204079660 ms), it may overlap..
Mar  5 22:02:28 gl-sip-03 /usr/sbin/opensips[29875]:
WARNING:core:utimer_ticker: utimer task <tm-utimer> already scheduled
for 204079170 ms (now 204079760 ms), it may overlap..


Regards,

Richard



On 03/03/2017 13:15, Richard Robson wrote:


Re: crashing in 2.2.2

Richard Robson
Hi,

I've gone over the script and, as far as I can see, it's working as
expected until the traffic ramps up, and then OpenSIPS crashes.

cores:
http://pastebin.com/CgN0h40K
http://pastebin.com/ay5TS8zD
http://pastebin.com/PGn3AqmU

Regards,

Richard

On 06/03/2017 12:14, Richard Robson wrote:


Re: crashing in 2.2.2

Bogdan-Andrei Iancu-2
Hi Richard,

Sorry for the slow reply. Is this crash happening only under increased
load? Do you see any errors from OpenSIPS prior to the crash?

I noticed that the backtraces in the corefiles are similar. How easy is
it for you to reproduce it?

Regards,

Bogdan-Andrei Iancu
   OpenSIPS Founder and Developer
   http://www.opensips-solutions.com

OpenSIPS Summit May 2017 Amsterdam
   http://www.opensips.org/events/Summit-2017Amsterdam.html

On 03/07/2017 12:28 PM, Richard Robson wrote:

> Hi,
>
> I've gone over the script and as far as I can see its working as
> expected until the traffic remps up and then opensips crashes.
>
> cores:
> http://pastebin.com/CgN0h40K
> http://pastebin.com/ay5TS8zD
> http://pastebin.com/PGn3AqmU
>
> Regards,
>
> Richard
>
> On 06/03/2017 12:14, Richard Robson wrote:
>> Hi<
>>
>> I've tested this on the latest 2.2.3 with the same results.
>>
>> http://pastebin.com/Uixb3v8G
>>
>> there were a few of these in the logsd too just before the crash:
>> Mar  5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]:
>> WARNING:core:utimer_ticker: utimer task <tm-utimer> already scheduled
>> for 204079170 ms (now 204079270 ms), it may overlap..
>> Mar  5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]:
>> WARNING:core:utimer_ticker: utimer task <tm-utimer> already scheduled
>> for 204079170 ms (now 204079360 ms), it may overlap..
>> Mar  5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]:
>> WARNING:core:utimer_ticker: utimer task <tm-utimer> already scheduled
>> for 204079170 ms (now 204079460 ms), it may overlap..
>> Mar  5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]:
>> WARNING:core:utimer_ticker: utimer task <tm-utimer> already scheduled
>> for 204079170 ms (now 204079560 ms), it may overlap..
>> Mar  5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]:
>> WARNING:core:utimer_ticker: utimer task <tm-utimer> already scheduled
>> for 204079170 ms (now 204079660 ms), it may overlap..
>> Mar  5 22:02:28 gl-sip-03 /usr/sbin/opensips[29875]:
>> WARNING:core:utimer_ticker: utimer task <tm-utimer> already scheduled
>> for 204079170 ms (now 204079760 ms), it may overlap..
>>
>>
>> Regards,
>>
>> Richard
>>
>>
>>
>> On 03/03/2017 13:15, Richard Robson wrote:
>>> More cores
>>>
>>> http://pastebin.com/MXW2VBhi
>>> http://pastebin.com/T7JFAP2U
>>> http://pastebin.com/u44aaVpWquit
>>> http://pastebin.com/SFKKcGxE
>>> http://pastebin.com/dwSgMsJi
>>> http://pastebin.com/9HdGLm96
>>>
>>> I've put 2.2.3 on the dev box now and will try to replicate on that
>>> box, but its difficult to replicate the traffic artificially. I'll
>>> try to replicate the fault on the dev box over the weekend. I cant
>>> do it on the live gateways because it will affect customer traffic.
>>>
>>> Regards,
>>>
>>> Richard
>>>
>>>
>>> On 03/03/2017 11:28, Richard Robson wrote:
>>>> I've revisited the gateway failover mechanism I had developed in
>>>> order to re route calls to the next gateway on 500's due to
>>>> capacity on the gateways we are using.
>>>>
>>>> we have 3 gateways from one carrier and one from another. The 3
>>>> have 4 cps and will return a 503 or 500 if we breach this. The
>>>> single gateway from the other carrier has plenty of capacity and
>>>> should not be a problem so we want to catch this . and route to the
>>>> next gateway.
>>>>
>>>> We are counting the CPS and channel limits and are routing to the
>>>> next gateway if we exceed the limit set, but There are still
>>>> occasions where a 5XX is generated, which results in a rejected call.
>>>>
>>>>
>>>> We want to stop these rejected calls and therefore want to
>>>> implement the failover mechanism for the 5XX responses. For 6
>>>> months we have been failing over if we think the counts are to high
>>>> on any one gateway without a problem. But when I implement a
>>>> failover on a 5XX response opensips starts crashing.
>>>>
>>>> It's difficult to generate artificial traffic to mimic the real
>>>> traffic, but I've not had a problem with the script in testing.
>>>> Last night I rolled out the new script but by 09:15 this morning
>>>> opensips started crashing 10 times in 5 minutes. This was as the
>>>> traffic ramped up. I rolled back the script and it restarted OK and
>>>> has not crashed since. Therefore the Failover Mechanism in the
>>>> script is where the crash is happening
>>>>
>>>> Core dump: http://pastebin.com/CqnESCm4
>>>>
>>>> I'll add more dumps later
>>>>
>>>> Regards,
>>>>
>>>> Richard
>>>>
>>>>
>>>> this is the failure route catching the 5XX
>>>>
>>>> failure_route[dr_fo] {
>>>>         xlog (" [dr]  Recieved reply to method $rm From: $fd, $fn,
>>>> $ft, $fu, $fU, $si, $sp, To: $ru");
>>>>         if (t_was_cancelled()) {
>>>>                 xlog("[dr]call cancelled by internal caller");
>>>>                 rtpengine_manage();
>>>>                 do_accounting("db", "cdr|missed");
>>>>                 exit;
>>>>         }
>>>>
>>>>         if ( t_check_status("[54]03")) {
>>>>                 route(relay_failover);
>>>>         }
>>>>         if ( t_check_status("500")) {
>>>>                 route(relay_failover);
>>>>         }
>>>>
>>>>         do_accounting("db", "cdr|missed");
>>>>         rtpengine_manage();
>>>>         exit;
>>>> }
>>>>
>>>> This is the route taken on the failure
>>>>
>>>>
>>>> route[relay_failover]{
>>>>
>>>>         if (use_next_gw()) {
>>>>                 xlog("[relay_failover-route] Selected Gateway is
>>>> $rd");
>>>> $avp(trunkratelimit)=$(avp(attrs){s.select,0,:});
>>>> $avp(trunkchannellimit)=$(avp(attrs){s.select,1,:});
>>>>
>>>>                 ####### check channel limit ######
>>>> get_profile_size("outbound","$rd","$var(size)");
>>>>                 xlog("[relay_failover-route] Selected Gateway is
>>>> $rd var(size) = $var(size)");
>>>>                 xlog("[relay_failover-route] Selected Gateway is
>>>> $rd avp(trunkcalllimit) = $avp(trunkchannellimit)");
>>>>                 xlog("[relay_failover-route] Selected Gateway is
>>>> $rd  result = ( $var(size) > $avp(trunkchannellimit))");
>>>>                 if ( $(var(size){s.int}) >
>>>> $(avp(trunkchannellimit){s.int})) {
>>>>                         xlog("[relay_failover-route] Trunk $rd
>>>> exceeded $avp(trunkchannellimit) concurrent calls $var(size)");
>>>>                         route(relay_failover);
>>>>                 }
>>>>         } else {
>>>>                send_reply("503", "Gateways Exhusted");
>>>>                exit;
>>>>         }
>>>>
>>>>         ##### We need to check Rate Limiting #######
>>>>         if (!rl_check("$rd", "$(avp(trunkratelimit){s.int})",
>>>> "TAILDROP")) { # Check Rate limit $avp needs changing
>>>>                 rl_dec_count("$rd"); # decrement the counter since
>>>> we've not "used" one
>>>>                 xlog("[ratelimiter-route] [Max CPS:
>>>> $(avp(trunkratelimit){s.int}) Current CPS: $rl_count($rd)] Call to:
>>>> $rU from: $fU CPS exceeded, delaying");
>>>>                 $avp(initial_time)=($Ts*1000)+($Tsm/1000);
>>>> async(usleep("200000"),relay_failover_delay);
>>>>                 xlog ("Should not get here!!!! after async requst");
>>>>         } else {
>>>>                 xlog ("[relay_outbound-route] [Max CPS:
>>>> $avp(trunkratelimit) Current CPS: $rl_count($rd)] Call to: $rU
>>>> from: $fU not ratelimited");
>>>>         }
>>>>
>>>>         t_on_failure("dr_fo");
>>>>         do_accounting("db", "cdr|missed");
>>>>         rtpengine_manage();
>>>>         if (!t_relay()) {
>>>>                         xlog("[relay-route] ERROR: Unable to relay");
>>>>                         send_reply("500","Internal Error");
>>>>                         exit;
>>>>         }
>>>> }
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
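[Editorial note: the resume route relay_failover_delay that async(usleep("200000"), relay_failover_delay) hands off to is never shown anywhere in the thread. Purely as a hypothetical sketch -- the body below is assumed, not taken from Richard's script -- such a resume route would typically re-try the rate-limit check and then relay, mirroring the tail of route[relay_failover]:]

```
# HYPOTHETICAL sketch -- the real route[relay_failover_delay] is not shown
# in the thread; this assumes it mirrors the tail of route[relay_failover]
route[relay_failover_delay] {
        # re-check the per-gateway CPS limit after the 200 ms sleep
        if (!rl_check("$rd", "$(avp(trunkratelimit){s.int})", "TAILDROP")) {
                rl_dec_count("$rd");
                send_reply("503", "Rate limit exceeded");
                exit;
        }
        t_on_failure("dr_fo");
        do_accounting("db", "cdr|missed");
        rtpengine_manage();
        if (!t_relay()) {
                xlog("[relay_failover_delay] ERROR: Unable to relay");
                send_reply("500", "Internal Error");
                exit;
        }
}
```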


_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users

Re: crashing in 2.2.2

Richard Robson
Hi Bogdan,

Yup, it's under load.

There aren't any errors apart from the "WARNING:core:utimer_ticker:
utimer task <tm-utimer> already scheduled" warnings.

The live script doesn't have the failover on the 404/50X responses; it
copes with the load no problem and never crashes. It's only when I try
to do the failover with use_next_gw() and the load ramps up to about
1/4 of the normal load. In testing I made a few calls and it works
fine, but when I put it live it starts crashing at about 09:15, when
the users start coming online. Our load is highest after 11:00am, so
there is load at that point, but not a large amount. It starts
crashing at around 50 concurrent calls and maybe 5 or 6 cps.

I can reproduce it on our test server, but that will disrupt traffic,
so I'd rather do it out of hours. If I use sipp to call an invalid
number I can also reproduce it. All the cores look similar to me too,
and they all have the reply message in them, but I'm no expert at
decoding the backtraces.

I can get some more cores for you, but I suspect they will all be
similar. I would have thought, though, that that would make the
debugging easier?


Let me know if you need anything else from me.

Regards,

Richard

On 07/03/2017 21:10, Bogdan-Andrei Iancu wrote:

> Hi Richard,
>
> Sorry for the slow reply - is this crash happening only under some
> load? Do you see any errors from OpenSIPS prior to the crash?
>
> I noticed that the backtraces in the corefiles are similar - how easy
> is it for you to reproduce it?
>
> Regards,
>
> Bogdan-Andrei Iancu
>   OpenSIPS Founder and Developer
>   http://www.opensips-solutions.com
>
> OpenSIPS Summit May 2017 Amsterdam
>   http://www.opensips.org/events/Summit-2017Amsterdam.html
>
> On 03/07/2017 12:28 PM, Richard Robson wrote:
>> Hi,
>>
>> I've gone over the script and as far as I can see it's working as
>> expected until the traffic ramps up, and then opensips crashes.
>>
>> cores:
>> http://pastebin.com/CgN0h40K
>> http://pastebin.com/ay5TS8zD
>> http://pastebin.com/PGn3AqmU
>>
>> Regards,
>>
>> Richard
>>
>> On 06/03/2017 12:14, Richard Robson wrote:
>>> Hi,
>>>
>>> I've tested this on the latest 2.2.3 with the same results.
>>>
>>> http://pastebin.com/Uixb3v8G
>>>
>>> There were a few of these in the logs too, just before the crash:
>>> Mar  5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]:
>>> WARNING:core:utimer_ticker: utimer task <tm-utimer> already
>>> scheduled for 204079170 ms (now 204079270 ms), it may overlap..
>>> Mar  5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]:
>>> WARNING:core:utimer_ticker: utimer task <tm-utimer> already
>>> scheduled for 204079170 ms (now 204079360 ms), it may overlap..
>>> Mar  5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]:
>>> WARNING:core:utimer_ticker: utimer task <tm-utimer> already
>>> scheduled for 204079170 ms (now 204079460 ms), it may overlap..
>>> Mar  5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]:
>>> WARNING:core:utimer_ticker: utimer task <tm-utimer> already
>>> scheduled for 204079170 ms (now 204079560 ms), it may overlap..
>>> Mar  5 22:02:27 gl-sip-03 /usr/sbin/opensips[29875]:
>>> WARNING:core:utimer_ticker: utimer task <tm-utimer> already
>>> scheduled for 204079170 ms (now 204079660 ms), it may overlap..
>>> Mar  5 22:02:28 gl-sip-03 /usr/sbin/opensips[29875]:
>>> WARNING:core:utimer_ticker: utimer task <tm-utimer> already
>>> scheduled for 204079170 ms (now 204079760 ms), it may overlap..
>>>
>>>
>>> Regards,
>>>
>>> Richard
>>>
>>>
>>>
>>> On 03/03/2017 13:15, Richard Robson wrote:
>>>> More cores
>>>>
>>>> http://pastebin.com/MXW2VBhi
>>>> http://pastebin.com/T7JFAP2U
>>>> http://pastebin.com/u44aaVpW
>>>> http://pastebin.com/SFKKcGxE
>>>> http://pastebin.com/dwSgMsJi
>>>> http://pastebin.com/9HdGLm96
>>>>
>>>> I've put 2.2.3 on the dev box now and will try to replicate on that
>>>> box, but it's difficult to replicate the traffic artificially. I'll
>>>> try to replicate the fault on the dev box over the weekend. I can't
>>>> do it on the live gateways because it will affect customer traffic.
>>>>
>>>> Regards,
>>>>
>>>> Richard
>>>>
>>>>
>>>> On 03/03/2017 11:28, Richard Robson wrote:
>>>>> I've revisited the gateway failover mechanism I had developed in
>>>>> order to re-route calls to the next gateway on 500s, due to
>>>>> capacity limits on the gateways we are using.
>>>>>
>>>>> We have 3 gateways from one carrier and one from another. The 3
>>>>> are limited to 4 cps and will return a 503 or 500 if we breach
>>>>> this. The single gateway from the other carrier has plenty of
>>>>> capacity and should not be a problem, so we want to catch this
>>>>> and route to the next gateway.
>>>>>
>>>>> We are counting the CPS and channel limits and are routing to the
>>>>> next gateway if we exceed the limit set, but there are still
>>>>> occasions where a 5XX is generated, which results in a rejected call.
>>>>>
>>>>>
>>>>> We want to stop these rejected calls and therefore want to
>>>>> implement the failover mechanism for the 5XX responses. For 6
>>>>> months we have been failing over when we think the counts are too
>>>>> high on any one gateway, without a problem. But when I implement a
>>>>> failover on a 5XX response opensips starts crashing.
>>>>>
>>>>> It's difficult to generate artificial traffic to mimic the real
>>>>> traffic, but I've not had a problem with the script in testing.
>>>>> Last night I rolled out the new script, but by 09:15 this morning
>>>>> opensips had crashed 10 times in 5 minutes, as the traffic ramped
>>>>> up. I rolled back the script; it restarted OK and has not crashed
>>>>> since. Therefore the failover mechanism in the script is where
>>>>> the crash is happening.
>>>>>
>>>>> Core dump: http://pastebin.com/CqnESCm4
>>>>>
>>>>> I'll add more dumps later
>>>>>
>>>>> Regards,
>>>>>
>>>>> Richard
>>>>>
>>>>>
>>>>> this is the failure route catching the 5XX
>>>>>
>>>>> failure_route[dr_fo] {
>>>>>         xlog(" [dr] Received reply to method $rm From: $fd, $fn, $ft, $fu, $fU, $si, $sp, To: $ru");
>>>>>         if (t_was_cancelled()) {
>>>>>                 xlog("[dr]call cancelled by internal caller");
>>>>>                 rtpengine_manage();
>>>>>                 do_accounting("db", "cdr|missed");
>>>>>                 exit;
>>>>>         }
>>>>>
>>>>>         if ( t_check_status("[54]03")) {
>>>>>                 route(relay_failover);
>>>>>         }
>>>>>         if ( t_check_status("500")) {
>>>>>                 route(relay_failover);
>>>>>         }
>>>>>
>>>>>         do_accounting("db", "cdr|missed");
>>>>>         rtpengine_manage();
>>>>>         exit;
>>>>> }
>>>>>
>>>>> This is the route taken on the failure
>>>>>
>>>>>
>>>>> route[relay_failover]{
>>>>>
>>>>>         if (use_next_gw()) {
>>>>>                 xlog("[relay_failover-route] Selected Gateway is $rd");
>>>>>                 $avp(trunkratelimit) = $(avp(attrs){s.select,0,:});
>>>>>                 $avp(trunkchannellimit) = $(avp(attrs){s.select,1,:});
>>>>>
>>>>>                 ####### check channel limit ######
>>>>>                 get_profile_size("outbound", "$rd", "$var(size)");
>>>>>                 xlog("[relay_failover-route] Selected Gateway is $rd var(size) = $var(size)");
>>>>>                 xlog("[relay_failover-route] Selected Gateway is $rd avp(trunkchannellimit) = $avp(trunkchannellimit)");
>>>>>                 xlog("[relay_failover-route] Selected Gateway is $rd result = ( $var(size) > $avp(trunkchannellimit) )");
>>>>>                 if ( $(var(size){s.int}) > $(avp(trunkchannellimit){s.int}) ) {
>>>>>                         xlog("[relay_failover-route] Trunk $rd exceeded $avp(trunkchannellimit) concurrent calls $var(size)");
>>>>>                         route(relay_failover);
>>>>>                 }
>>>>>         } else {
>>>>>                send_reply("503", "Gateways Exhausted");
>>>>>                exit;
>>>>>         }
>>>>>
>>>>>         ##### We need to check rate limiting #######
>>>>>         if (!rl_check("$rd", "$(avp(trunkratelimit){s.int})", "TAILDROP")) { # check rate limit; the $avp needs changing
>>>>>                 rl_dec_count("$rd"); # decrement the counter since we've not "used" one
>>>>>                 xlog("[ratelimiter-route] [Max CPS: $(avp(trunkratelimit){s.int}) Current CPS: $rl_count($rd)] Call to: $rU from: $fU CPS exceeded, delaying");
>>>>>                 $avp(initial_time) = ($Ts*1000) + ($Tsm/1000);
>>>>>                 async(usleep("200000"), relay_failover_delay);
>>>>>                 xlog("Should not get here!!!! after async request");
>>>>>         } else {
>>>>>                 xlog("[relay_outbound-route] [Max CPS: $avp(trunkratelimit) Current CPS: $rl_count($rd)] Call to: $rU from: $fU not ratelimited");
>>>>>         }
>>>>>
>>>>>         t_on_failure("dr_fo");
>>>>>         do_accounting("db", "cdr|missed");
>>>>>         rtpengine_manage();
>>>>>         if (!t_relay()) {
>>>>>                         xlog("[relay-route] ERROR: Unable to relay");
>>>>>                         send_reply("500","Internal Error");
>>>>>                         exit;
>>>>>         }
>>>>> }
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
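[Editorial note: one small cleanup worth considering in the failure_route quoted above. t_check_status() takes a regular expression, so the two back-to-back checks for "[54]03" and "500" can be collapsed into a single call. A sketch only, preserving the original fall-through behaviour:]

```
        # matches 403, 503 and 500 in one regex instead of two
        # consecutive t_check_status() calls
        if (t_check_status("403|503|500")) {
                route(relay_failover);
        }
```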


--
Richard Robson
Greenlight Support
01382 843843
[hidden email]
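
[Editorial note: in route[relay_failover] as quoted above, the route calls itself recursively whenever the selected gateway is over its channel limit, so a long gateway list with every trunk saturated costs one level of script recursion per gateway. The same selection can be written as a loop. This is a sketch only, reusing the avp/profile names from Richard's script and leaving out the rate-limit/relay tail:]

```
route[relay_failover] {
        # iterate rather than recurse: keep taking the next gateway
        # until one is under its channel limit, or the list is exhausted
        $var(found) = 0;
        while (use_next_gw()) {
                $avp(trunkratelimit) = $(avp(attrs){s.select,0,:});
                $avp(trunkchannellimit) = $(avp(attrs){s.select,1,:});
                get_profile_size("outbound", "$rd", "$var(size)");
                if ($(var(size){s.int}) <= $(avp(trunkchannellimit){s.int})) {
                        $var(found) = 1;
                        break;
                }
                xlog("[relay_failover-route] Trunk $rd over channel limit, trying next");
        }
        if (!$var(found)) {
                send_reply("503", "Gateways Exhausted");
                exit;
        }
        # ... rate-limit check, t_on_failure("dr_fo") and t_relay() as before
}
```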

