CPU 100% with TCP

classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

CPU 100% with TCP

Ben Newlin

Hi,

 

We have implemented TCP recently and are performing TCP<->UDP translation on one of our proxy types. This proxy only exists for that purpose; there are no DB queries, REST calls, or anything like that. It is designed to be very fast and high throughput.

 

Recently we have found that when the remote endpoint of a TCP connection is lost, i.e. the server goes down, while under moderate load OpenSIPS quickly reaches 100% CPU and becomes unresponsive. When this occurs, the “top” command shows that between 30-90% CPU is in System (kernel) space, and each OpenSIPS TCP process shows many times the normal CPU. We are running OpenSIPS 2.4.2 on Amazon Linux.

 

I obtained as much information as I could using ps, strace, and gdb here: https://pastebin.com/JP3DnCqs. We can reproduce the failure consistently by removing a server during call traffic.

 

A few things I noticed:

  • The number of running threads reported by OpenSIPS doesn’t align with our configuration, copied here:

 

####### Global Parameters #########

 

children=32

 

#// Allow 503 to pass back to Control

disable_503_translation=yes

 

#// Even though we are not receiving HEP,

#// this listener is required by OpenSIPS

#// in order to use the proto_hep module. :/

listen=hep_tcp:10.32.40.245:9061 use_children 1

 

#// Configure the listeners

listen=udp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

listen=tcp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

 

#// Transaction Module

loadmodule "tm.so"

modparam("tm", "restart_fr_on_each_reply", 0)

modparam("tm", "timer_partitions", 8)

modparam("tm", "onreply_avp_mode", 1)

modparam("tm", "wt_timer", 10)

 

According to the documentation if “tcp_children” is not set then the value of “children” will be used [1], but we have set “children” to 32 and only have the default 8 TCP processes. Also we appear to only have 1 timer process, although we have set the number of timer partitions to 8.

  • The server that is terminated was using TCP connections exclusively, but all of the CPU seems to be in the UDP threads. The one I looked at appeared to be handling a CANCEL to one of the calls that was active and was attempting to send it out via TCP. I’m not sure why it would be trying to relay the CANCEL as no 100 Trying had been received from the server. I have noticed that in 2.x OpenSIPS will now send CANCELs for transactions even when 100 Trying was not received. Is that intentional? RFC 3261 states that no CANCEL should be sent unless a provisional response has been received.

 

Any assistance with this would be appreciated.

 

[1] - http://www.opensips.org/Documentation/Script-CoreParameters-2-4#toc66

 

 

Ben Newlin


_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users
Reply | Threaded
Open this post in threaded view
|

Re: CPU 100% with TCP

Bogdan-Andrei Iancu-2
Hi Ben,

Could you run "opensipsctl trap" ?

Regards,
Bogdan-Andrei Iancu

OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/
On 10/24/2018 12:56 AM, Ben Newlin wrote:

Hi,

 

We have implemented TCP recently and are performing TCP<->UDP translation on one of our proxy types. This proxy only exists for that purpose; there are no DB queries, REST calls, or anything like that. It is designed to be very fast and high throughput.

 

Recently we have found that when the remote endpoint of a TCP connection is lost, i.e. the server goes down, while under moderate load OpenSIPS quickly reaches 100% CPU and becomes unresponsive. When this occurs, the “top” command shows that between 30-90% CPU is in System (kernel) space, and each OpenSIPS TCP process shows many times the normal CPU. We are running OpenSIPS 2.4.2 on Amazon Linux.

 

I obtained as much information as I could using ps, strace, and gdb here: https://pastebin.com/JP3DnCqs. We can reproduce the failure consistently by removing a server during call traffic.

 

A few things I noticed:

  • The number of running threads reported by OpenSIPS doesn’t align with our configuration, copied here:

 

####### Global Parameters #########

 

children=32

 

#// Allow 503 to pass back to Control

disable_503_translation=yes

 

#// Even though we are not receiving HEP,

#// this listener is required by OpenSIPS

#// in order to use the proto_hep module. :/

listen=hep_tcp:10.32.40.245:9061 use_children 1

 

#// Configure the listeners

listen=udp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

listen=tcp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

 

#// Transaction Module

loadmodule "tm.so"

modparam("tm", "restart_fr_on_each_reply", 0)

modparam("tm", "timer_partitions", 8)

modparam("tm", "onreply_avp_mode", 1)

modparam("tm", "wt_timer", 10)

 

According to the documentation if “tcp_children” is not set then the value of “children” will be used [1], but we have set “children” to 32 and only have the default 8 TCP processes. Also we appear to only have 1 timer process, although we have set the number of timer partitions to 8.

  • The server that is terminated was using TCP connections exclusively, but all of the CPU seems to be in the UDP threads. The one I looked at appeared to be handling a CANCEL to one of the calls that was active and was attempting to send it out via TCP. I’m not sure why it would be trying to relay the CANCEL as no 100 Trying had been received from the server. I have noticed that in 2.x OpenSIPS will now send CANCELs for transactions even when 100 Trying was not received. Is that intentional? RFC 3261 states that no CANCEL should be sent unless a provisional response has been received.

 

Any assistance with this would be appreciated.

 

[1] - http://www.opensips.org/Documentation/Script-CoreParameters-2-4#toc66

 

 

Ben Newlin



_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users


_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users
Reply | Threaded
Open this post in threaded view
|

Re: CPU 100% with TCP

Ben Newlin

Bogdan,

 

I have run the command but the output was too large for pastebin so I have sent it to you directly.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu <[hidden email]>
Date: Wednesday, October 24, 2018 at 5:17 AM
To: OpenSIPS users mailling list <[hidden email]>, Ben Newlin <[hidden email]>
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

Could you run "opensipsctl trap" ?

Regards,

Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/24/2018 12:56 AM, Ben Newlin wrote:

Hi,

 

We have implemented TCP recently and are performing TCP<->UDP translation on one of our proxy types. This proxy only exists for that purpose; there are no DB queries, REST calls, or anything like that. It is designed to be very fast and high throughput.

 

Recently we have found that when the remote endpoint of a TCP connection is lost, i.e. the server goes down, while under moderate load OpenSIPS quickly reaches 100% CPU and becomes unresponsive. When this occurs, the “top” command shows that between 30-90% CPU is in System (kernel) space, and each OpenSIPS TCP process shows many times the normal CPU. We are running OpenSIPS 2.4.2 on Amazon Linux.

 

I obtained as much information as I could using ps, strace, and gdb here: https://pastebin.com/JP3DnCqs. We can reproduce the failure consistently by removing a server during call traffic.

 

A few things I noticed:

  • The number of running threads reported by OpenSIPS doesn’t align with our configuration, copied here:

 

####### Global Parameters #########

 

children=32

 

#// Allow 503 to pass back to Control

disable_503_translation=yes

 

#// Even though we are not receiving HEP,

#// this listener is required by OpenSIPS

#// in order to use the proto_hep module. :/

listen=hep_tcp:10.32.40.245:9061 use_children 1

 

#// Configure the listeners

listen=udp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

listen=tcp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

 

#// Transaction Module

loadmodule "tm.so"

modparam("tm", "restart_fr_on_each_reply", 0)

modparam("tm", "timer_partitions", 8)

modparam("tm", "onreply_avp_mode", 1)

modparam("tm", "wt_timer", 10)

 

According to the documentation if “tcp_children” is not set then the value of “children” will be used [1], but we have set “children” to 32 and only have the default 8 TCP processes. Also we appear to only have 1 timer process, although we have set the number of timer partitions to 8.

  • The server that is terminated was using TCP connections exclusively, but all of the CPU seems to be in the UDP threads. The one I looked at appeared to be handling a CANCEL to one of the calls that was active and was attempting to send it out via TCP. I’m not sure why it would be trying to relay the CANCEL as no 100 Trying had been received from the server. I have noticed that in 2.x OpenSIPS will now send CANCELs for transactions even when 100 Trying was not received. Is that intentional? RFC 3261 states that no CANCEL should be sent unless a provisional response has been received.

 

Any assistance with this would be appreciated.

 

[1] - http://www.opensips.org/Documentation/Script-CoreParameters-2-4#toc66

 

 

Ben Newlin




_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users




_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users
Reply | Threaded
Open this post in threaded view
|

Re: CPU 100% with TCP

Bogdan-Andrei Iancu-2
Hi Ben,

Thank you for the info.

It looks like the processes get stuck into a HEP related internal lock - do you see any HEP related errors in your logs, prior to the dead-lock ?

Also, as PoC, could you disabled HEP tracing to see if the problem goes away ?

Thanks,

Bogdan-Andrei Iancu

OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/
On 10/24/2018 10:18 PM, Ben Newlin wrote:

Bogdan,

 

I have run the command but the output was too large for pastebin so I have sent it to you directly.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Wednesday, October 24, 2018 at 5:17 AM
To: OpenSIPS users mailling list [hidden email], Ben Newlin [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

Could you run "opensipsctl trap" ?

Regards,

Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/24/2018 12:56 AM, Ben Newlin wrote:

Hi,

 

We have implemented TCP recently and are performing TCP<->UDP translation on one of our proxy types. This proxy only exists for that purpose; there are no DB queries, REST calls, or anything like that. It is designed to be very fast and high throughput.

 

Recently we have found that when the remote endpoint of a TCP connection is lost, i.e. the server goes down, while under moderate load OpenSIPS quickly reaches 100% CPU and becomes unresponsive. When this occurs, the “top” command shows that between 30-90% CPU is in System (kernel) space, and each OpenSIPS TCP process shows many times the normal CPU. We are running OpenSIPS 2.4.2 on Amazon Linux.

 

I obtained as much information as I could using ps, strace, and gdb here: https://pastebin.com/JP3DnCqs. We can reproduce the failure consistently by removing a server during call traffic.

 

A few things I noticed:

  • The number of running threads reported by OpenSIPS doesn’t align with our configuration, copied here:

 

####### Global Parameters #########

 

children=32

 

#// Allow 503 to pass back to Control

disable_503_translation=yes

 

#// Even though we are not receiving HEP,

#// this listener is required by OpenSIPS

#// in order to use the proto_hep module. :/

listen=hep_tcp:10.32.40.245:9061 use_children 1

 

#// Configure the listeners

listen=udp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

listen=tcp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

 

#// Transaction Module

loadmodule "tm.so"

modparam("tm", "restart_fr_on_each_reply", 0)

modparam("tm", "timer_partitions", 8)

modparam("tm", "onreply_avp_mode", 1)

modparam("tm", "wt_timer", 10)

 

According to the documentation if “tcp_children” is not set then the value of “children” will be used [1], but we have set “children” to 32 and only have the default 8 TCP processes. Also we appear to only have 1 timer process, although we have set the number of timer partitions to 8.

  • The server that is terminated was using TCP connections exclusively, but all of the CPU seems to be in the UDP threads. The one I looked at appeared to be handling a CANCEL to one of the calls that was active and was attempting to send it out via TCP. I’m not sure why it would be trying to relay the CANCEL as no 100 Trying had been received from the server. I have noticed that in 2.x OpenSIPS will now send CANCELs for transactions even when 100 Trying was not received. Is that intentional? RFC 3261 states that no CANCEL should be sent unless a provisional response has been received.

 

Any assistance with this would be appreciated.

 

[1] - http://www.opensips.org/Documentation/Script-CoreParameters-2-4#toc66

 

 

Ben Newlin




_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users





_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users
Reply | Threaded
Open this post in threaded view
|

Re: CPU 100% with TCP

Ben Newlin

Bogdan,

 

Actually, yes we do. Looking back I can see these errors just before the issue occurs:

 

Oct 24 19:00:36 [5700] ERROR:proto_hep:send_hep_message: Cannot send hep message!

Oct 24 19:00:36 [5700] ERROR:proto_hep:msg_send: send() to 10.32.163.211:9061 for proto hep_tcp/9 failed

Oct 24 19:00:36 [5700] ERROR:proto_hep:hep_tcp_send: failed to send

Oct 24 19:00:36 [5700] ERROR:proto_hep:async_tsend_stream: Failed first TCP async send : (32) Broken pipe

 

I will try disabling HEP and see if we can reproduce.

 

Just for information, I have been reproducing the issue in our testing environment which uses TCP for HEP, however the issue is occurring in our production environment as well which is still using UDP for HEP.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu <[hidden email]>
Date: Friday, October 26, 2018 at 3:06 AM
To: Ben Newlin <[hidden email]>, OpenSIPS users mailling list <[hidden email]>
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

Thank you for the info.

It looks like the processes get stuck into a HEP related internal lock - do you see any HEP related errors in your logs, prior to the dead-lock ?

Also, as PoC, could you disabled HEP tracing to see if the problem goes away ?

Thanks,


Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/24/2018 10:18 PM, Ben Newlin wrote:

Bogdan,

 

I have run the command but the output was too large for pastebin so I have sent it to you directly.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Wednesday, October 24, 2018 at 5:17 AM
To: OpenSIPS users mailling list [hidden email], Ben Newlin [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

Could you run "opensipsctl trap" ?

Regards,


Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/24/2018 12:56 AM, Ben Newlin wrote:

Hi,

 

We have implemented TCP recently and are performing TCP<->UDP translation on one of our proxy types. This proxy only exists for that purpose; there are no DB queries, REST calls, or anything like that. It is designed to be very fast and high throughput.

 

Recently we have found that when the remote endpoint of a TCP connection is lost, i.e. the server goes down, while under moderate load OpenSIPS quickly reaches 100% CPU and becomes unresponsive. When this occurs, the “top” command shows that between 30-90% CPU is in System (kernel) space, and each OpenSIPS TCP process shows many times the normal CPU. We are running OpenSIPS 2.4.2 on Amazon Linux.

 

I obtained as much information as I could using ps, strace, and gdb here: https://pastebin.com/JP3DnCqs. We can reproduce the failure consistently by removing a server during call traffic.

 

A few things I noticed:

  • The number of running threads reported by OpenSIPS doesn’t align with our configuration, copied here:

 

####### Global Parameters #########

 

children=32

 

#// Allow 503 to pass back to Control

disable_503_translation=yes

 

#// Even though we are not receiving HEP,

#// this listener is required by OpenSIPS

#// in order to use the proto_hep module. :/

listen=hep_tcp:10.32.40.245:9061 use_children 1

 

#// Configure the listeners

listen=udp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

listen=tcp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

 

#// Transaction Module

loadmodule "tm.so"

modparam("tm", "restart_fr_on_each_reply", 0)

modparam("tm", "timer_partitions", 8)

modparam("tm", "onreply_avp_mode", 1)

modparam("tm", "wt_timer", 10)

 

According to the documentation if “tcp_children” is not set then the value of “children” will be used [1], but we have set “children” to 32 and only have the default 8 TCP processes. Also we appear to only have 1 timer process, although we have set the number of timer partitions to 8.

  • The server that is terminated was using TCP connections exclusively, but all of the CPU seems to be in the UDP threads. The one I looked at appeared to be handling a CANCEL to one of the calls that was active and was attempting to send it out via TCP. I’m not sure why it would be trying to relay the CANCEL as no 100 Trying had been received from the server. I have noticed that in 2.x OpenSIPS will now send CANCELs for transactions even when 100 Trying was not received. Is that intentional? RFC 3261 states that no CANCEL should be sent unless a provisional response has been received.

 

Any assistance with this would be appreciated.

 

[1] - http://www.opensips.org/Documentation/Script-CoreParameters-2-4#toc66

 

 

Ben Newlin





_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users







_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users
Reply | Threaded
Open this post in threaded view
|

Re: CPU 100% with TCP

Bogdan-Andrei Iancu-2
Hi Ben,

I checked the error trace and it should not leave any dangling lock (due mishandled error). Before disabling HEP, try to disable the async support for HEP.

If you claim that the same 100% CPU happens with HEP + UDP, send me a trap for that too, as in the previous case, the deadlock was exclusively HEP + TCP related.

Anyhow, as the original trap showed a deadlock, next step will be to recompile with the DBG_LOCK option - this enables extra code to debug/troubleshoot locking related issues - are you able to do it?

Regards,
Bogdan-Andrei Iancu

OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/
On 10/26/2018 04:14 PM, Ben Newlin wrote:

Bogdan,

 

Actually, yes we do. Looking back I can see these errors just before the issue occurs:

 

Oct 24 19:00:36 [5700] ERROR:proto_hep:send_hep_message: Cannot send hep message!

Oct 24 19:00:36 [5700] ERROR:proto_hep:msg_send: send() to 10.32.163.211:9061 for proto hep_tcp/9 failed

Oct 24 19:00:36 [5700] ERROR:proto_hep:hep_tcp_send: failed to send

Oct 24 19:00:36 [5700] ERROR:proto_hep:async_tsend_stream: Failed first TCP async send : (32) Broken pipe

 

I will try disabling HEP and see if we can reproduce.

 

Just for information, I have been reproducing the issue in our testing environment which uses TCP for HEP, however the issue is occurring in our production environment as well which is still using UDP for HEP.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Friday, October 26, 2018 at 3:06 AM
To: Ben Newlin [hidden email], OpenSIPS users mailling list [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

Thank you for the info.

It looks like the processes get stuck into a HEP related internal lock - do you see any HEP related errors in your logs, prior to the dead-lock ?

Also, as PoC, could you disabled HEP tracing to see if the problem goes away ?

Thanks,


Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/24/2018 10:18 PM, Ben Newlin wrote:

Bogdan,

 

I have run the command but the output was too large for pastebin so I have sent it to you directly.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Wednesday, October 24, 2018 at 5:17 AM
To: OpenSIPS users mailling list [hidden email], Ben Newlin [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

Could you run "opensipsctl trap" ?

Regards,


Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/24/2018 12:56 AM, Ben Newlin wrote:

Hi,

 

We have implemented TCP recently and are performing TCP<->UDP translation on one of our proxy types. This proxy only exists for that purpose; there are no DB queries, REST calls, or anything like that. It is designed to be very fast and high throughput.

 

Recently we have found that when the remote endpoint of a TCP connection is lost, i.e. the server goes down, while under moderate load OpenSIPS quickly reaches 100% CPU and becomes unresponsive. When this occurs, the “top” command shows that between 30-90% CPU is in System (kernel) space, and each OpenSIPS TCP process shows many times the normal CPU. We are running OpenSIPS 2.4.2 on Amazon Linux.

 

I obtained as much information as I could using ps, strace, and gdb here: https://pastebin.com/JP3DnCqs. We can reproduce the failure consistently by removing a server during call traffic.

 

A few things I noticed:

  • The number of running threads reported by OpenSIPS doesn’t align with our configuration, copied here:

 

####### Global Parameters #########

 

children=32

 

#// Allow 503 to pass back to Control

disable_503_translation=yes

 

#// Even though we are not receiving HEP,

#// this listener is required by OpenSIPS

#// in order to use the proto_hep module. :/

listen=hep_tcp:10.32.40.245:9061 use_children 1

 

#// Configure the listeners

listen=udp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

listen=tcp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

 

#// Transaction Module

loadmodule "tm.so"

modparam("tm", "restart_fr_on_each_reply", 0)

modparam("tm", "timer_partitions", 8)

modparam("tm", "onreply_avp_mode", 1)

modparam("tm", "wt_timer", 10)

 

According to the documentation if “tcp_children” is not set then the value of “children” will be used [1], but we have set “children” to 32 and only have the default 8 TCP processes. Also we appear to only have 1 timer process, although we have set the number of timer partitions to 8.

  • The server that is terminated was using TCP connections exclusively, but all of the CPU seems to be in the UDP threads. The one I looked at appeared to be handling a CANCEL to one of the calls that was active and was attempting to send it out via TCP. I’m not sure why it would be trying to relay the CANCEL as no 100 Trying had been received from the server. I have noticed that in 2.x OpenSIPS will now send CANCELs for transactions even when 100 Trying was not received. Is that intentional? RFC 3261 states that no CANCEL should be sent unless a provisional response has been received.

 

Any assistance with this would be appreciated.

 

[1] - http://www.opensips.org/Documentation/Script-CoreParameters-2-4#toc66

 

 

Ben Newlin





_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users








_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users
Reply | Threaded
Open this post in threaded view
|

Re: CPU 100% with TCP

Ben Newlin

Bogdan,

 

For the first test I have done as you suggested and disabled only async operation for HEP, so it is still using TCP. I will send you the trap info directly as it is too large. I also compiled with the DBG_LOCK option, but am unsure whether that extra information will be available in the trap output or do you need something else?

 

I am now going to switch HEP to use UDP to mirror our production environment and try to reproduce again. Wish me luck!

 

Ben Newlin

 

From: Bogdan-Andrei Iancu <[hidden email]>
Date: Monday, October 29, 2018 at 2:19 PM
To: Ben Newlin <[hidden email]>, OpenSIPS users mailling list <[hidden email]>
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

I checked the error trace and it should not leave any dangling lock (due mishandled error). Before disabling HEP, try to disable the async support for HEP.

If you claim that the same 100% CPU happens with HEP + UDP, send me a trap for that too, as in the previous case, the deadlock was exclusively HEP + TCP related.

Anyhow, as the original trap showed a deadlock, next step will be to recompile with the DBG_LOCK option - this enables extra code to debug/troubleshoot locking related issues - are you able to do it?

Regards,

Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/26/2018 04:14 PM, Ben Newlin wrote:

Bogdan,

 

Actually, yes we do. Looking back I can see these errors just before the issue occurs:

 

Oct 24 19:00:36 [5700] ERROR:proto_hep:send_hep_message: Cannot send hep message!

Oct 24 19:00:36 [5700] ERROR:proto_hep:msg_send: send() to 10.32.163.211:9061 for proto hep_tcp/9 failed

Oct 24 19:00:36 [5700] ERROR:proto_hep:hep_tcp_send: failed to send

Oct 24 19:00:36 [5700] ERROR:proto_hep:async_tsend_stream: Failed first TCP async send : (32) Broken pipe

 

I will try disabling HEP and see if we can reproduce.

 

Just for information, I have been reproducing the issue in our testing environment which uses TCP for HEP, however the issue is occurring in our production environment as well which is still using UDP for HEP.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Friday, October 26, 2018 at 3:06 AM
To: Ben Newlin [hidden email], OpenSIPS users mailling list [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

Thank you for the info.

It looks like the processes get stuck into a HEP related internal lock - do you see any HEP related errors in your logs, prior to the dead-lock ?

Also, as PoC, could you disabled HEP tracing to see if the problem goes away ?

Thanks,



Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/24/2018 10:18 PM, Ben Newlin wrote:

Bogdan,

 

I have run the command but the output was too large for pastebin so I have sent it to you directly.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Wednesday, October 24, 2018 at 5:17 AM
To: OpenSIPS users mailling list [hidden email], Ben Newlin [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

Could you run "opensipsctl trap" ?

Regards,



Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/24/2018 12:56 AM, Ben Newlin wrote:

Hi,

 

We have implemented TCP recently and are performing TCP<->UDP translation on one of our proxy types. This proxy only exists for that purpose; there are no DB queries, REST calls, or anything like that. It is designed to be very fast and high throughput.

 

Recently we have found that when the remote endpoint of a TCP connection is lost, i.e. the server goes down, while under moderate load OpenSIPS quickly reaches 100% CPU and becomes unresponsive. When this occurs, the “top” command shows that between 30-90% CPU is in System (kernel) space, and each OpenSIPS TCP process shows many times the normal CPU. We are running OpenSIPS 2.4.2 on Amazon Linux.

 

I obtained as much information as I could using ps, strace, and gdb here: https://pastebin.com/JP3DnCqs. We can reproduce the failure consistently by removing a server during call traffic.

 

A few things I noticed:

  • The number of running threads reported by OpenSIPS doesn’t align with our configuration, copied here:

 

####### Global Parameters #########

 

children=32

 

#// Allow 503 to pass back to Control

disable_503_translation=yes

 

#// Even though we are not receiving HEP,

#// this listener is required by OpenSIPS

#// in order to use the proto_hep module. :/

listen=hep_tcp:10.32.40.245:9061 use_children 1

 

#// Configure the listeners

listen=udp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

listen=tcp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

 

#// Transaction Module

loadmodule "tm.so"

modparam("tm", "restart_fr_on_each_reply", 0)

modparam("tm", "timer_partitions", 8)

modparam("tm", "onreply_avp_mode", 1)

modparam("tm", "wt_timer", 10)

 

According to the documentation if “tcp_children” is not set then the value of “children” will be used [1], but we have set “children” to 32 and only have the default 8 TCP processes. Also we appear to only have 1 timer process, although we have set the number of timer partitions to 8.

  • The server that is terminated was using TCP connections exclusively, but all of the CPU seems to be in the UDP threads. The one I looked at appeared to be handling a CANCEL to one of the calls that was active and was attempting to send it out via TCP. I’m not sure why it would be trying to relay the CANCEL as no 100 Trying had been received from the server. I have noticed that in 2.x OpenSIPS will now send CANCELs for transactions even when 100 Trying was not received. Is that intentional? RFC 3261 states that no CANCEL should be sent unless a provisional response has been received.

 

Any assistance with this would be appreciated.

 

[1] - http://www.opensips.org/Documentation/Script-CoreParameters-2-4#toc66

 

 

Ben Newlin






_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users











_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users
Reply | Threaded
Open this post in threaded view
|

Re: CPU 100% with TCP

Bogdan-Andrei Iancu-2
Hi Ben,

First be sure you have the DBG_LOCK option compiled in. Do the "opensips -V" and see the output flags.

Next step will be to force an SIGSEGV to opensips (killall -11 opensips) when the deadlock occurs - I need a core file to inspect (assuming that runtime inspection with gdb is not possible).


Regards,
Bogdan-Andrei Iancu

OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/
On 10/31/2018 09:07 PM, Ben Newlin wrote:

Bogdan,

 

For the first test I have done as you suggested and disabled only async operation for HEP, so it is still using TCP. I will send you the trap info directly as it is too large. I also compiled with the DBG_LOCK option, but am unsure whether that extra information will be available in the trap output or do you need something else?

 

I am now going to switch HEP to use UDP to mirror our production environment and try to reproduce again. Wish me luck!

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Monday, October 29, 2018 at 2:19 PM
To: Ben Newlin [hidden email], OpenSIPS users mailling list [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

I checked the error trace and it should not leave any dangling lock (due mishandled error). Before disabling HEP, try to disable the async support for HEP.

If you claim that the same 100% CPU happens with HEP + UDP, send me a trap for that too, as in the previous case, the deadlock was exclusively HEP + TCP related.

Anyhow, as the original trap showed a deadlock, next step will be to recompile with the DBG_LOCK option - this enables extra code to debug/troubleshoot locking related issues - are you able to do it?

Regards,

Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/26/2018 04:14 PM, Ben Newlin wrote:

Bogdan,

 

Actually, yes we do. Looking back I can see these errors just before the issue occurs:

 

Oct 24 19:00:36 [5700] ERROR:proto_hep:send_hep_message: Cannot send hep message!

Oct 24 19:00:36 [5700] ERROR:proto_hep:msg_send: send() to 10.32.163.211:9061 for proto hep_tcp/9 failed

Oct 24 19:00:36 [5700] ERROR:proto_hep:hep_tcp_send: failed to send

Oct 24 19:00:36 [5700] ERROR:proto_hep:async_tsend_stream: Failed first TCP async send : (32) Broken pipe

 

I will try disabling HEP and see if we can reproduce.

 

Just for information, I have been reproducing the issue in our testing environment which uses TCP for HEP, however the issue is occurring in our production environment as well which is still using UDP for HEP.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Friday, October 26, 2018 at 3:06 AM
To: Ben Newlin [hidden email], OpenSIPS users mailling list [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

Thank you for the info.

It looks like the processes get stuck into a HEP related internal lock - do you see any HEP related errors in your logs, prior to the dead-lock ?

Also, as PoC, could you disabled HEP tracing to see if the problem goes away ?

Thanks,



Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/24/2018 10:18 PM, Ben Newlin wrote:

Bogdan,

 

I have run the command but the output was too large for pastebin so I have sent it to you directly.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Wednesday, October 24, 2018 at 5:17 AM
To: OpenSIPS users mailling list [hidden email], Ben Newlin [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

Could you run "opensipsctl trap" ?

Regards,



Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/24/2018 12:56 AM, Ben Newlin wrote:

Hi,

 

We have implemented TCP recently and are performing TCP<->UDP translation on one of our proxy types. This proxy only exists for that purpose; there are no DB queries, REST calls, or anything like that. It is designed to be very fast and high throughput.

 

Recently we have found that when the remote endpoint of a TCP connection is lost, i.e. the server goes down, while under moderate load OpenSIPS quickly reaches 100% CPU and becomes unresponsive. When this occurs, the “top” command shows that between 30-90% CPU is in System (kernel) space, and each OpenSIPS TCP process shows many times the normal CPU. We are running OpenSIPS 2.4.2 on Amazon Linux.

 

I obtained as much information as I could using ps, strace, and gdb here: https://pastebin.com/JP3DnCqs. We can reproduce the failure consistently by removing a server during call traffic.

 

A few things I noticed:

  • The number of running threads reported by OpenSIPS doesn’t align with our configuration, copied here:

 

####### Global Parameters #########

 

children=32

 

#// Allow 503 to pass back to Control

disable_503_translation=yes

 

#// Even though we are not receiving HEP,

#// this listener is required by OpenSIPS

#// in order to use the proto_hep module. :/

listen=hep_tcp:10.32.40.245:9061 use_children 1

 

#// Configure the listeners

listen=udp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

listen=tcp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

 

#// Transaction Module

loadmodule "tm.so"

modparam("tm", "restart_fr_on_each_reply", 0)

modparam("tm", "timer_partitions", 8)

modparam("tm", "onreply_avp_mode", 1)

modparam("tm", "wt_timer", 10)

 

According to the documentation if “tcp_children” is not set then the value of “children” will be used [1], but we have set “children” to 32 and only have the default 8 TCP processes. Also we appear to only have 1 timer process, although we have set the number of timer partitions to 8.

  • The server that is terminated was using TCP connections exclusively, but all of the CPU seems to be in the UDP threads. The one I looked at appeared to be handling a CANCEL to one of the calls that was active and was attempting to send it out via TCP. I’m not sure why it would be trying to relay the CANCEL as no 100 Trying had been received from the server. I have noticed that in 2.x OpenSIPS will now send CANCELs for transactions even when 100 Trying was not received. Is that intentional? RFC 3261 states that no CANCEL should be sent unless a provisional response has been received.

 

Any assistance with this would be appreciated.

 

[1] - http://www.opensips.org/Documentation/Script-CoreParameters-2-4#toc66

 

 

Ben Newlin






_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users












_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users
Reply | Threaded
Open this post in threaded view
|

Re: CPU 100% with TCP

Ben Newlin

Bogdan,

 

I am trying to obtain this information for you but I am having trouble getting the core files. Is it really necessary to kill every opensips process? This generates almost 40 core files and each is quite large (~1GB). I simply don’t have that disk space currently. I can make a change to get more but it is slowing the process. Would it be sufficient to get just one core file?

 

Also, runtime inspection with gdb is possible in this case if you can provide me with the commands you would want to see. I would need very specific commands as I am not very familiar with gdb.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu <[hidden email]>
Date: Thursday, November 1, 2018 at 1:29 PM
To: Ben Newlin <[hidden email]>, OpenSIPS users mailling list <[hidden email]>
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

First be sure you have the DBG_LOCK option compiled in. Do the "opensips -V" and see the output flags.

Next step will be to force an SIGSEGV to opensips (killall -11 opensips) when the deadlock occurs - I need a core file to inspect (assuming that runtime inspection with gdb is not possible).

Regards,

Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/31/2018 09:07 PM, Ben Newlin wrote:

Bogdan,

 

For the first test I have done as you suggested and disabled only async operation for HEP, so it is still using TCP. I will send you the trap info directly as it is too large. I also compiled with the DBG_LOCK option, but am unsure whether that extra information will be available in the trap output or do you need something else?

 

I am now going to switch HEP to use UDP to mirror our production environment and try to reproduce again. Wish me luck!

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Monday, October 29, 2018 at 2:19 PM
To: Ben Newlin [hidden email], OpenSIPS users mailling list [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

I checked the error trace and it should not leave any dangling lock (due mishandled error). Before disabling HEP, try to disable the async support for HEP.

If you claim that the same 100% CPU happens with HEP + UDP, send me a trap for that too, as in the previous case, the deadlock was exclusively HEP + TCP related.

Anyhow, as the original trap showed a deadlock, next step will be to recompile with the DBG_LOCK option - this enables extra code to debug/troubleshoot locking related issues - are you able to do it?

Regards,


Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/26/2018 04:14 PM, Ben Newlin wrote:

Bogdan,

 

Actually, yes we do. Looking back I can see these errors just before the issue occurs:

 

Oct 24 19:00:36 [5700] ERROR:proto_hep:send_hep_message: Cannot send hep message!

Oct 24 19:00:36 [5700] ERROR:proto_hep:msg_send: send() to 10.32.163.211:9061 for proto hep_tcp/9 failed

Oct 24 19:00:36 [5700] ERROR:proto_hep:hep_tcp_send: failed to send

Oct 24 19:00:36 [5700] ERROR:proto_hep:async_tsend_stream: Failed first TCP async send : (32) Broken pipe

 

I will try disabling HEP and see if we can reproduce.

 

Just for information, I have been reproducing the issue in our testing environment which uses TCP for HEP, however the issue is occurring in our production environment as well which is still using UDP for HEP.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Friday, October 26, 2018 at 3:06 AM
To: Ben Newlin [hidden email], OpenSIPS users mailling list [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

Thank you for the info.

It looks like the processes get stuck into a HEP related internal lock - do you see any HEP related errors in your logs, prior to the dead-lock ?

Also, as PoC, could you disabled HEP tracing to see if the problem goes away ?

Thanks,




Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/24/2018 10:18 PM, Ben Newlin wrote:

Bogdan,

 

I have run the command but the output was too large for pastebin so I have sent it to you directly.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Wednesday, October 24, 2018 at 5:17 AM
To: OpenSIPS users mailling list [hidden email], Ben Newlin [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

Could you run "opensipsctl trap" ?

Regards,




Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/24/2018 12:56 AM, Ben Newlin wrote:

Hi,

 

We have implemented TCP recently and are performing TCP<->UDP translation on one of our proxy types. This proxy only exists for that purpose; there are no DB queries, REST calls, or anything like that. It is designed to be very fast and high throughput.

 

Recently we have found that when the remote endpoint of a TCP connection is lost, i.e. the server goes down, while under moderate load OpenSIPS quickly reaches 100% CPU and becomes unresponsive. When this occurs, the “top” command shows that between 30-90% CPU is in System (kernel) space, and each OpenSIPS TCP process shows many times the normal CPU. We are running OpenSIPS 2.4.2 on Amazon Linux.

 

I obtained as much information as I could using ps, strace, and gdb here: https://pastebin.com/JP3DnCqs. We can reproduce the failure consistently by removing a server during call traffic.

 

A few things I noticed:

  • The number of running threads reported by OpenSIPS doesn’t align with our configuration, copied here:

 

####### Global Parameters #########

 

children=32

 

#// Allow 503 to pass back to Control

disable_503_translation=yes

 

#// Even though we are not receiving HEP,

#// this listener is required by OpenSIPS

#// in order to use the proto_hep module. :/

listen=hep_tcp:10.32.40.245:9061 use_children 1

 

#// Configure the listeners

listen=udp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

listen=tcp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

 

#// Transaction Module

loadmodule "tm.so"

modparam("tm", "restart_fr_on_each_reply", 0)

modparam("tm", "timer_partitions", 8)

modparam("tm", "onreply_avp_mode", 1)

modparam("tm", "wt_timer", 10)

 

According to the documentation if “tcp_children” is not set then the value of “children” will be used [1], but we have set “children” to 32 and only have the default 8 TCP processes. Also we appear to only have 1 timer process, although we have set the number of timer partitions to 8.

  • The server that is terminated was using TCP connections exclusively, but all of the CPU seems to be in the UDP threads. The one I looked at appeared to be handling a CANCEL to one of the calls that was active and was attempting to send it out via TCP. I’m not sure why it would be trying to relay the CANCEL as no 100 Trying had been received from the server. I have noticed that in 2.x OpenSIPS will now send CANCELs for transactions even when 100 Trying was not received. Is that intentional? RFC 3261 states that no CANCEL should be sent unless a provisional response has been received.

 

Any assistance with this would be appreciated.

 

[1] - http://www.opensips.org/Documentation/Script-CoreParameters-2-4#toc66

 

 

Ben Newlin







_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users
















_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users
Reply | Threaded
Open this post in threaded view
|

Re: CPU 100% with TCP

Bogdan-Andrei Iancu-2
Hi Ben,

Sorry for not being able to answer you before sending the new set of BTs. Indeed, getting the corefile of only one process will do it as the locks (and debug info) are in the shared memory. So, the deadlock happens again, do the "opensipsctl trap" and get the corefile of one process (ideally an UDP worker - get its pid via "opensipsctl fifo ps").
Keep the core as we will have to dig into it together :).

Many thanks,

Bogdan-Andrei Iancu

OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/
On 11/06/2018 10:14 PM, Ben Newlin wrote:

Bogdan,

 

I am trying to obtain this information for you but I am having trouble getting the core files. Is it really necessary to kill every opensips process? This generates almost 40 core files and each is quite large (~1GB). I simply don’t have that disk space currently. I can make a change to get more but it is slowing the process. Would it be sufficient to get just one core file?

 

Also, runtime inspection with gdb is possible in this case if you can provide me with the commands you would want to see. I would need very specific commands as I am not very familiar with gdb.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Thursday, November 1, 2018 at 1:29 PM
To: Ben Newlin [hidden email], OpenSIPS users mailling list [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

First be sure you have the DBG_LOCK option compiled in. Do the "opensips -V" and see the output flags.

Next step will be to force an SIGSEGV to opensips (killall -11 opensips) when the deadlock occurs - I need a core file to inspect (assuming that runtime inspection with gdb is not possible).

Regards,

Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/31/2018 09:07 PM, Ben Newlin wrote:

Bogdan,

 

For the first test I have done as you suggested and disabled only async operation for HEP, so it is still using TCP. I will send you the trap info directly as it is too large. I also compiled with the DBG_LOCK option, but am unsure whether that extra information will be available in the trap output or do you need something else?

 

I am now going to switch HEP to use UDP to mirror our production environment and try to reproduce again. Wish me luck!

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Monday, October 29, 2018 at 2:19 PM
To: Ben Newlin [hidden email], OpenSIPS users mailling list [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

I checked the error trace and it should not leave any dangling lock (due mishandled error). Before disabling HEP, try to disable the async support for HEP.

If you claim that the same 100% CPU happens with HEP + UDP, send me a trap for that too, as in the previous case, the deadlock was exclusively HEP + TCP related.

Anyhow, as the original trap showed a deadlock, next step will be to recompile with the DBG_LOCK option - this enables extra code to debug/troubleshoot locking related issues - are you able to do it?

Regards,


Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/26/2018 04:14 PM, Ben Newlin wrote:

Bogdan,

 

Actually, yes we do. Looking back I can see these errors just before the issue occurs:

 

Oct 24 19:00:36 [5700] ERROR:proto_hep:send_hep_message: Cannot send hep message!

Oct 24 19:00:36 [5700] ERROR:proto_hep:msg_send: send() to 10.32.163.211:9061 for proto hep_tcp/9 failed

Oct 24 19:00:36 [5700] ERROR:proto_hep:hep_tcp_send: failed to send

Oct 24 19:00:36 [5700] ERROR:proto_hep:async_tsend_stream: Failed first TCP async send : (32) Broken pipe

 

I will try disabling HEP and see if we can reproduce.

 

Just for information, I have been reproducing the issue in our testing environment which uses TCP for HEP, however the issue is occurring in our production environment as well which is still using UDP for HEP.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Friday, October 26, 2018 at 3:06 AM
To: Ben Newlin [hidden email], OpenSIPS users mailling list [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

Thank you for the info.

It looks like the processes get stuck into a HEP related internal lock - do you see any HEP related errors in your logs, prior to the dead-lock ?

Also, as PoC, could you disabled HEP tracing to see if the problem goes away ?

Thanks,




Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/24/2018 10:18 PM, Ben Newlin wrote:

Bogdan,

 

I have run the command but the output was too large for pastebin so I have sent it to you directly.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Wednesday, October 24, 2018 at 5:17 AM
To: OpenSIPS users mailling list [hidden email], Ben Newlin [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

Could you run "opensipsctl trap" ?

Regards,




Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/24/2018 12:56 AM, Ben Newlin wrote:

Hi,

 

We have implemented TCP recently and are performing TCP<->UDP translation on one of our proxy types. This proxy only exists for that purpose; there are no DB queries, REST calls, or anything like that. It is designed to be very fast and high throughput.

 

Recently we have found that when the remote endpoint of a TCP connection is lost, i.e. the server goes down, while under moderate load OpenSIPS quickly reaches 100% CPU and becomes unresponsive. When this occurs, the “top” command shows that between 30-90% CPU is in System (kernel) space, and each OpenSIPS TCP process shows many times the normal CPU. We are running OpenSIPS 2.4.2 on Amazon Linux.

 

I obtained as much information as I could using ps, strace, and gdb here: https://pastebin.com/JP3DnCqs. We can reproduce the failure consistently by removing a server during call traffic.

 

A few things I noticed:

  • The number of running threads reported by OpenSIPS doesn’t align with our configuration, copied here:

 

####### Global Parameters #########

 

children=32

 

#// Allow 503 to pass back to Control

disable_503_translation=yes

 

#// Even though we are not receiving HEP,

#// this listener is required by OpenSIPS

#// in order to use the proto_hep module. :/

listen=hep_tcp:10.32.40.245:9061 use_children 1

 

#// Configure the listeners

listen=udp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

listen=tcp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

 

#// Transaction Module

loadmodule "tm.so"

modparam("tm", "restart_fr_on_each_reply", 0)

modparam("tm", "timer_partitions", 8)

modparam("tm", "onreply_avp_mode", 1)

modparam("tm", "wt_timer", 10)

 

According to the documentation if “tcp_children” is not set then the value of “children” will be used [1], but we have set “children” to 32 and only have the default 8 TCP processes. Also we appear to only have 1 timer process, although we have set the number of timer partitions to 8.

  • The server that is terminated was using TCP connections exclusively, but all of the CPU seems to be in the UDP threads. The one I looked at appeared to be handling a CANCEL to one of the calls that was active and was attempting to send it out via TCP. I’m not sure why it would be trying to relay the CANCEL as no 100 Trying had been received from the server. I have noticed that in 2.x OpenSIPS will now send CANCELs for transactions even when 100 Trying was not received. Is that intentional? RFC 3261 states that no CANCEL should be sent unless a provisional response has been received.

 

Any assistance with this would be appreciated.

 

[1] - http://www.opensips.org/Documentation/Script-CoreParameters-2-4#toc66

 

 

Ben Newlin







_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users

















_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users
Reply | Threaded
Open this post in threaded view
|

Re: CPU 100% with TCP

Ben Newlin

Bogdan,

 

Can you clarify if you’re saying you need more information beyond the dumps I’ve just provided to you off-list?

 

Ben Newlin

 

From: Bogdan-Andrei Iancu <[hidden email]>
Date: Tuesday, November 13, 2018 at 11:11 AM
To: Ben Newlin <[hidden email]>, OpenSIPS users mailling list <[hidden email]>
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

Sorry for not being able to answer you before sending the new set of BTs. Indeed, getting the corefile of only one process will do it as the locks (and debug info) are in the shared memory. So, the deadlock happens again, do the "opensipsctl trap" and get the corefile of one process (ideally an UDP worker - get its pid via "opensipsctl fifo ps").
Keep the core as we will have to dig into it together :).

Many thanks,


Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 11/06/2018 10:14 PM, Ben Newlin wrote:

Bogdan,

 

I am trying to obtain this information for you but I am having trouble getting the core files. Is it really necessary to kill every opensips process? This generates almost 40 core files and each is quite large (~1GB). I simply don’t have that disk space currently. I can make a change to get more but it is slowing the process. Would it be sufficient to get just one core file?

 

Also, runtime inspection with gdb is possible in this case if you can provide me with the commands you would want to see. I would need very specific commands as I am not very familiar with gdb.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Thursday, November 1, 2018 at 1:29 PM
To: Ben Newlin [hidden email], OpenSIPS users mailling list [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

First be sure you have the DBG_LOCK option compiled in. Do the "opensips -V" and see the output flags.

Next step will be to force an SIGSEGV to opensips (killall -11 opensips) when the deadlock occurs - I need a core file to inspect (assuming that runtime inspection with gdb is not possible).

Regards,


Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/31/2018 09:07 PM, Ben Newlin wrote:

Bogdan,

 

For the first test I have done as you suggested and disabled only async operation for HEP, so it is still using TCP. I will send you the trap info directly as it is too large. I also compiled with the DBG_LOCK option, but am unsure whether that extra information will be available in the trap output or do you need something else?

 

I am now going to switch HEP to use UDP to mirror our production environment and try to reproduce again. Wish me luck!

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Monday, October 29, 2018 at 2:19 PM
To: Ben Newlin [hidden email], OpenSIPS users mailling list [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

I checked the error trace and it should not leave any dangling lock (due mishandled error). Before disabling HEP, try to disable the async support for HEP.

If you claim that the same 100% CPU happens with HEP + UDP, send me a trap for that too, as in the previous case, the deadlock was exclusively HEP + TCP related.

Anyhow, as the original trap showed a deadlock, next step will be to recompile with the DBG_LOCK option - this enables extra code to debug/troubleshoot locking related issues - are you able to do it?

Regards,



Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/26/2018 04:14 PM, Ben Newlin wrote:

Bogdan,

 

Actually, yes we do. Looking back I can see these errors just before the issue occurs:

 

Oct 24 19:00:36 [5700] ERROR:proto_hep:send_hep_message: Cannot send hep message!

Oct 24 19:00:36 [5700] ERROR:proto_hep:msg_send: send() to 10.32.163.211:9061 for proto hep_tcp/9 failed

Oct 24 19:00:36 [5700] ERROR:proto_hep:hep_tcp_send: failed to send

Oct 24 19:00:36 [5700] ERROR:proto_hep:async_tsend_stream: Failed first TCP async send : (32) Broken pipe

 

I will try disabling HEP and see if we can reproduce.

 

Just for information, I have been reproducing the issue in our testing environment which uses TCP for HEP, however the issue is occurring in our production environment as well which is still using UDP for HEP.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Friday, October 26, 2018 at 3:06 AM
To: Ben Newlin [hidden email], OpenSIPS users mailling list [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

Thank you for the info.

It looks like the processes get stuck into a HEP related internal lock - do you see any HEP related errors in your logs, prior to the dead-lock ?

Also, as PoC, could you disabled HEP tracing to see if the problem goes away ?

Thanks,





Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/24/2018 10:18 PM, Ben Newlin wrote:

Bogdan,

 

I have run the command but the output was too large for pastebin so I have sent it to you directly.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Wednesday, October 24, 2018 at 5:17 AM
To: OpenSIPS users mailling list [hidden email], Ben Newlin [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

Could you run "opensipsctl trap" ?

Regards,





Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/24/2018 12:56 AM, Ben Newlin wrote:

Hi,

 

We have implemented TCP recently and are performing TCP<->UDP translation on one of our proxy types. This proxy only exists for that purpose; there are no DB queries, REST calls, or anything like that. It is designed to be very fast and high throughput.

 

Recently we have found that when the remote endpoint of a TCP connection is lost, i.e. the server goes down, while under moderate load OpenSIPS quickly reaches 100% CPU and becomes unresponsive. When this occurs, the “top” command shows that between 30-90% CPU is in System (kernel) space, and each OpenSIPS TCP process shows many times the normal CPU. We are running OpenSIPS 2.4.2 on Amazon Linux.

 

I obtained as much information as I could using ps, strace, and gdb here: https://pastebin.com/JP3DnCqs. We can reproduce the failure consistently by removing a server during call traffic.

 

A few things I noticed:

  • The number of running threads reported by OpenSIPS doesn’t align with our configuration, copied here:

 

####### Global Parameters #########

 

children=32

 

#// Allow 503 to pass back to Control

disable_503_translation=yes

 

#// Even though we are not receiving HEP,

#// this listener is required by OpenSIPS

#// in order to use the proto_hep module. :/

listen=hep_tcp:10.32.40.245:9061 use_children 1

 

#// Configure the listeners

listen=udp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

listen=tcp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

 

#// Transaction Module

loadmodule "tm.so"

modparam("tm", "restart_fr_on_each_reply", 0)

modparam("tm", "timer_partitions", 8)

modparam("tm", "onreply_avp_mode", 1)

modparam("tm", "wt_timer", 10)

 

According to the documentation if “tcp_children” is not set then the value of “children” will be used [1], but we have set “children” to 32 and only have the default 8 TCP processes. Also we appear to only have 1 timer process, although we have set the number of timer partitions to 8.

  • The server that is terminated was using TCP connections exclusively, but all of the CPU seems to be in the UDP threads. The one I looked at appeared to be handling a CANCEL to one of the calls that was active and was attempting to send it out via TCP. I’m not sure why it would be trying to relay the CANCEL as no 100 Trying had been received from the server. I have noticed that in 2.x OpenSIPS will now send CANCELs for transactions even when 100 Trying was not received. Is that intentional? RFC 3261 states that no CANCEL should be sent unless a provisional response has been received.

 

Any assistance with this would be appreciated.

 

[1] - http://www.opensips.org/Documentation/Script-CoreParameters-2-4#toc66

 

 

Ben Newlin








_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users






















_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users
Reply | Threaded
Open this post in threaded view
|

Re: CPU 100% with TCP

Bogdan-Andrei Iancu-2
Ben,

The dump you send is actually the backtrace from all the procs. What is the Holly Grail is the corefile to be inspected via gdb (off list, of course).

Thanks and regards,

Bogdan-Andrei Iancu

OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/
On 11/13/2018 07:36 PM, Ben Newlin wrote:

Bogdan,

 

Can you clarify if you’re saying you need more information beyond the dumps I’ve just provided to you off-list?

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Tuesday, November 13, 2018 at 11:11 AM
To: Ben Newlin [hidden email], OpenSIPS users mailling list [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

Sorry for not being able to answer you before sending the new set of BTs. Indeed, getting the corefile of only one process will do it as the locks (and debug info) are in the shared memory. So, the deadlock happens again, do the "opensipsctl trap" and get the corefile of one process (ideally an UDP worker - get its pid via "opensipsctl fifo ps").
Keep the core as we will have to dig into it together :).

Many thanks,


Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 11/06/2018 10:14 PM, Ben Newlin wrote:

Bogdan,

 

I am trying to obtain this information for you but I am having trouble getting the core files. Is it really necessary to kill every opensips process? This generates almost 40 core files and each is quite large (~1GB). I simply don’t have that disk space currently. I can make a change to get more but it is slowing the process. Would it be sufficient to get just one core file?

 

Also, runtime inspection with gdb is possible in this case if you can provide me with the commands you would want to see. I would need very specific commands as I am not very familiar with gdb.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Thursday, November 1, 2018 at 1:29 PM
To: Ben Newlin [hidden email], OpenSIPS users mailling list [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

First be sure you have the DBG_LOCK option compiled in. Do the "opensips -V" and see the output flags.

Next step will be to force an SIGSEGV to opensips (killall -11 opensips) when the deadlock occurs - I need a core file to inspect (assuming that runtime inspection with gdb is not possible).

Regards,


Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/31/2018 09:07 PM, Ben Newlin wrote:

Bogdan,

 

For the first test I have done as you suggested and disabled only async operation for HEP, so it is still using TCP. I will send you the trap info directly as it is too large. I also compiled with the DBG_LOCK option, but am unsure whether that extra information will be available in the trap output or do you need something else?

 

I am now going to switch HEP to use UDP to mirror our production environment and try to reproduce again. Wish me luck!

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Monday, October 29, 2018 at 2:19 PM
To: Ben Newlin [hidden email], OpenSIPS users mailling list [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

I checked the error trace and it should not leave any dangling lock (due mishandled error). Before disabling HEP, try to disable the async support for HEP.

If you claim that the same 100% CPU happens with HEP + UDP, send me a trap for that too, as in the previous case, the deadlock was exclusively HEP + TCP related.

Anyhow, as the original trap showed a deadlock, next step will be to recompile with the DBG_LOCK option - this enables extra code to debug/troubleshoot locking related issues - are you able to do it?

Regards,



Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/26/2018 04:14 PM, Ben Newlin wrote:

Bogdan,

 

Actually, yes we do. Looking back I can see these errors just before the issue occurs:

 

Oct 24 19:00:36 [5700] ERROR:proto_hep:send_hep_message: Cannot send hep message!

Oct 24 19:00:36 [5700] ERROR:proto_hep:msg_send: send() to 10.32.163.211:9061 for proto hep_tcp/9 failed

Oct 24 19:00:36 [5700] ERROR:proto_hep:hep_tcp_send: failed to send

Oct 24 19:00:36 [5700] ERROR:proto_hep:async_tsend_stream: Failed first TCP async send : (32) Broken pipe

 

I will try disabling HEP and see if we can reproduce.

 

Just for information, I have been reproducing the issue in our testing environment which uses TCP for HEP, however the issue is occurring in our production environment as well which is still using UDP for HEP.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Friday, October 26, 2018 at 3:06 AM
To: Ben Newlin [hidden email], OpenSIPS users mailling list [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

Thank you for the info.

It looks like the processes get stuck into a HEP related internal lock - do you see any HEP related errors in your logs, prior to the dead-lock ?

Also, as PoC, could you disabled HEP tracing to see if the problem goes away ?

Thanks,





Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/24/2018 10:18 PM, Ben Newlin wrote:

Bogdan,

 

I have run the command but the output was too large for pastebin so I have sent it to you directly.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Wednesday, October 24, 2018 at 5:17 AM
To: OpenSIPS users mailling list [hidden email], Ben Newlin [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

Could you run "opensipsctl trap" ?

Regards,





Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/24/2018 12:56 AM, Ben Newlin wrote:

Hi,

 

We have implemented TCP recently and are performing TCP<->UDP translation on one of our proxy types. This proxy only exists for that purpose; there are no DB queries, REST calls, or anything like that. It is designed to be very fast and high throughput.

 

Recently we have found that when the remote endpoint of a TCP connection is lost, i.e. the server goes down, while under moderate load OpenSIPS quickly reaches 100% CPU and becomes unresponsive. When this occurs, the “top” command shows that between 30-90% CPU is in System (kernel) space, and each OpenSIPS TCP process shows many times the normal CPU. We are running OpenSIPS 2.4.2 on Amazon Linux.

 

I obtained as much information as I could using ps, strace, and gdb here: https://pastebin.com/JP3DnCqs. We can reproduce the failure consistently by removing a server during call traffic.

 

A few things I noticed:

  • The number of running threads reported by OpenSIPS doesn’t align with our configuration, copied here:

 

####### Global Parameters #########

 

children=32

 

#// Allow 503 to pass back to Control

disable_503_translation=yes

 

#// Even though we are not receiving HEP,

#// this listener is required by OpenSIPS

#// in order to use the proto_hep module. :/

listen=hep_tcp:10.32.40.245:9061 use_children 1

 

#// Configure the listeners

listen=udp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

listen=tcp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

 

#// Transaction Module

loadmodule "tm.so"

modparam("tm", "restart_fr_on_each_reply", 0)

modparam("tm", "timer_partitions", 8)

modparam("tm", "onreply_avp_mode", 1)

modparam("tm", "wt_timer", 10)

 

According to the documentation if “tcp_children” is not set then the value of “children” will be used [1], but we have set “children” to 32 and only have the default 8 TCP processes. Also we appear to only have 1 timer process, although we have set the number of timer partitions to 8.

  • The server that is terminated was using TCP connections exclusively, but all of the CPU seems to be in the UDP threads. The one I looked at appeared to be handling a CANCEL to one of the calls that was active and was attempting to send it out via TCP. I’m not sure why it would be trying to relay the CANCEL as no 100 Trying had been received from the server. I have noticed that in 2.x OpenSIPS will now send CANCELs for transactions even when 100 Trying was not received. Is that intentional? RFC 3261 states that no CANCEL should be sent unless a provisional response has been received.

 

Any assistance with this would be appreciated.

 

[1] - http://www.opensips.org/Documentation/Script-CoreParameters-2-4#toc66

 

 

Ben Newlin








_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users























_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users
Reply | Threaded
Open this post in threaded view
|

Re: CPU 100% with TCP

Ben Newlin

Bogdan,

 

I do still have the corefiles, but isn’t that useless to you without my exact system (binaries, modules,etc)? Either way, I’m happy to send it.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu <[hidden email]>
Date: Tuesday, November 13, 2018 at 3:29 PM
To: Ben Newlin <[hidden email]>, OpenSIPS users mailling list <[hidden email]>
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Ben,

The dump you send is actually the backtrace from all the procs. What is the Holly Grail is the corefile to be inspected via gdb (off list, of course).

Thanks and regards,

Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 11/13/2018 07:36 PM, Ben Newlin wrote:

Bogdan,

 

Can you clarify if you’re saying you need more information beyond the dumps I’ve just provided to you off-list?

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Tuesday, November 13, 2018 at 11:11 AM
To: Ben Newlin [hidden email], OpenSIPS users mailling list [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

Sorry for not being able to answer you before sending the new set of BTs. Indeed, getting the corefile of only one process will do it as the locks (and debug info) are in the shared memory. So, the deadlock happens again, do the "opensipsctl trap" and get the corefile of one process (ideally an UDP worker - get its pid via "opensipsctl fifo ps").
Keep the core as we will have to dig into it together :).

Many thanks,



Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 11/06/2018 10:14 PM, Ben Newlin wrote:

Bogdan,

 

I am trying to obtain this information for you but I am having trouble getting the core files. Is it really necessary to kill every opensips process? This generates almost 40 core files and each is quite large (~1GB). I simply don’t have that disk space currently. I can make a change to get more but it is slowing the process. Would it be sufficient to get just one core file?

 

Also, runtime inspection with gdb is possible in this case if you can provide me with the commands you would want to see. I would need very specific commands as I am not very familiar with gdb.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Thursday, November 1, 2018 at 1:29 PM
To: Ben Newlin [hidden email], OpenSIPS users mailling list [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

First be sure you have the DBG_LOCK option compiled in. Do the "opensips -V" and see the output flags.

Next step will be to force an SIGSEGV to opensips (killall -11 opensips) when the deadlock occurs - I need a core file to inspect (assuming that runtime inspection with gdb is not possible).

Regards,



Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/31/2018 09:07 PM, Ben Newlin wrote:

Bogdan,

 

For the first test I have done as you suggested and disabled only async operation for HEP, so it is still using TCP. I will send you the trap info directly as it is too large. I also compiled with the DBG_LOCK option, but am unsure whether that extra information will be available in the trap output or do you need something else?

 

I am now going to switch HEP to use UDP to mirror our production environment and try to reproduce again. Wish me luck!

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Monday, October 29, 2018 at 2:19 PM
To: Ben Newlin [hidden email], OpenSIPS users mailling list [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

I checked the error trace and it should not leave any dangling lock (due mishandled error). Before disabling HEP, try to disable the async support for HEP.

If you claim that the same 100% CPU happens with HEP + UDP, send me a trap for that too, as in the previous case, the deadlock was exclusively HEP + TCP related.

Anyhow, as the original trap showed a deadlock, next step will be to recompile with the DBG_LOCK option - this enables extra code to debug/troubleshoot locking related issues - are you able to do it?

Regards,




Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/26/2018 04:14 PM, Ben Newlin wrote:

Bogdan,

 

Actually, yes we do. Looking back I can see these errors just before the issue occurs:

 

Oct 24 19:00:36 [5700] ERROR:proto_hep:send_hep_message: Cannot send hep message!

Oct 24 19:00:36 [5700] ERROR:proto_hep:msg_send: send() to 10.32.163.211:9061 for proto hep_tcp/9 failed

Oct 24 19:00:36 [5700] ERROR:proto_hep:hep_tcp_send: failed to send

Oct 24 19:00:36 [5700] ERROR:proto_hep:async_tsend_stream: Failed first TCP async send : (32) Broken pipe

 

I will try disabling HEP and see if we can reproduce.

 

Just for information, I have been reproducing the issue in our testing environment which uses TCP for HEP, however the issue is occurring in our production environment as well which is still using UDP for HEP.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Friday, October 26, 2018 at 3:06 AM
To: Ben Newlin [hidden email], OpenSIPS users mailling list [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

Thank you for the info.

It looks like the processes get stuck into a HEP related internal lock - do you see any HEP related errors in your logs, prior to the dead-lock ?

Also, as PoC, could you disabled HEP tracing to see if the problem goes away ?

Thanks,






Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/24/2018 10:18 PM, Ben Newlin wrote:

Bogdan,

 

I have run the command but the output was too large for pastebin so I have sent it to you directly.

 

Ben Newlin

 

From: Bogdan-Andrei Iancu [hidden email]
Date: Wednesday, October 24, 2018 at 5:17 AM
To: OpenSIPS users mailling list [hidden email], Ben Newlin [hidden email]
Subject: Re: [OpenSIPS-Users] CPU 100% with TCP

 

Hi Ben,

Could you run "opensipsctl trap" ?

Regards,






Bogdan-Andrei Iancu
 
OpenSIPS Founder and Developer
  http://www.opensips-solutions.com
OpenSIPS Bootcamp 2018
  http://opensips.org/training/OpenSIPS_Bootcamp_2018/

On 10/24/2018 12:56 AM, Ben Newlin wrote:

Hi,

 

We have implemented TCP recently and are performing TCP<->UDP translation on one of our proxy types. This proxy only exists for that purpose; there are no DB queries, REST calls, or anything like that. It is designed to be very fast and high throughput.

 

Recently we have found that when the remote endpoint of a TCP connection is lost, i.e. the server goes down, while under moderate load OpenSIPS quickly reaches 100% CPU and becomes unresponsive. When this occurs, the “top” command shows that between 30-90% CPU is in System (kernel) space, and each OpenSIPS TCP process shows many times the normal CPU. We are running OpenSIPS 2.4.2 on Amazon Linux.

 

I obtained as much information as I could using ps, strace, and gdb here: https://pastebin.com/JP3DnCqs. We can reproduce the failure consistently by removing a server during call traffic.

 

A few things I noticed:

  • The number of running threads reported by OpenSIPS doesn’t align with our configuration, copied here:

 

####### Global Parameters #########

 

children=32

 

#// Allow 503 to pass back to Control

disable_503_translation=yes

 

#// Even though we are not receiving HEP,

#// this listener is required by OpenSIPS

#// in order to use the proto_hep module. :/

listen=hep_tcp:10.32.40.245:9061 use_children 1

 

#// Configure the listeners

listen=udp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

listen=tcp:10.32.40.245:5060 as XXX.XXX.XXX.XXX

 

#// Transaction Module

loadmodule "tm.so"

modparam("tm", "restart_fr_on_each_reply", 0)

modparam("tm", "timer_partitions", 8)

modparam("tm", "onreply_avp_mode", 1)

modparam("tm", "wt_timer", 10)

 

According to the documentation if “tcp_children” is not set then the value of “children” will be used [1], but we have set “children” to 32 and only have the default 8 TCP processes. Also we appear to only have 1 timer process, although we have set the number of timer partitions to 8.

  • The server that is terminated was using TCP connections exclusively, but all of the CPU seems to be in the UDP threads. The one I looked at appeared to be handling a CANCEL to one of the calls that was active and was attempting to send it out via TCP. I’m not sure why it would be trying to relay the CANCEL as no 100 Trying had been received from the server. I have noticed that in 2.x OpenSIPS will now send CANCELs for transactions even when 100 Trying was not received. Is that intentional? RFC 3261 states that no CANCEL should be sent unless a provisional response has been received.

 

Any assistance with this would be appreciated.

 

[1] - http://www.opensips.org/Documentation/Script-CoreParameters-2-4#toc66

 

 

Ben Newlin









_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users





























_______________________________________________
Users mailing list
[hidden email]
http://lists.opensips.org/cgi-bin/mailman/listinfo/users