

+1 vote
270 views 6 comments
by

We have a production environment with around 100 RUT955 devices. A couple of months ago we updated several devices to FW version RUT9_R_00.07.01.4.

After about 50 days of uptime we are getting high ping response times from those devices. Remote SSH is not available (connection refused or timeout), and WebGUI login fails with "device busy".

Device logs show out-of-memory problems with the process port_eventsd as the source:

[4268791.398774] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),task=port_eventsd,pid=2632,uid=0 
[4268791.407894] Out of memory: Killed process 2632 (port_eventsd) total-vm:67244kB, anon-rss:65532kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:80kB oom_score_adj:0 
[4268792.032643] oom_reaper: reaped process 2632 (port_eventsd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[4586670.162124] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),task=port_eventsd,pid=2632,uid=0
[4586670.171269] Out of memory: Killed process 2632 (port_eventsd) total-vm:71784kB, anon-rss:70076kB, file-rss:4kB, shmem-rss:0kB, UID:0 pgtables:88kB oom_score_adj:0
[4586670.965407] oom_reaper: reaped process 2632 (port_eventsd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
After the oom-kill, device response time is back to normal, but before that we see 24-72 hours of reduced performance and lost connectivity.
Is this a firmware-related bug? Is there any workaround?
by

As far as we can see this is connected to firmware v7 and occurs on both RUT955 and RUT956 devices. The memory usage of port_eventsd keeps increasing over time.

Firmware v7, RUT955:

[email protected]:~# cat /etc/version && uptime && free && top -n 1 |grep eventsd

RUT9_R_00.07.01.4

 10:09:29 up 48 days, 16:13,  load average: 0.03, 0.04, 0.00

              total        used        free      shared  buff/cache   available

Mem:         124832       93832       21396         280        9604        1484

Swap:             0           0           0

 4599     1 root     S    66844  53%   0% /usr/bin/port_eventsd --suppress-topol

[email protected]:~#

Firmware v7, RUT956:

[email protected]:~# cat /etc/version && uptime && free && top -n 1 |grep eventsd

RUT9M_R_00.07.01.7

 12:08:34 up 32 days, 22:11,  load average: 0.25, 0.30, 0.34

              total        used        free      shared  buff/cache   available

Mem:         123268       73828       25900         220       23540       12988

Swap:             0           0           0

 2575     1 root     S    45924  37%   9% /usr/bin/port_eventsd --suppress-topol

[email protected]:~#

Firmware v6, RUT955:

[email protected]:~# cat /etc/version && uptime && free && top -n 1 |grep eventsd

RUT9XX_R_00.06.06.1

 10:10:48 up 27 days,  5:01,  load average: 0.27, 0.06, 0.02

              total        used        free      shared  buff/cache   available

Mem:         125984       25728       74764         492       25492       99040

Swap:             0           0           0

25175 25148 root     S     1536   1%   0% grep eventsd

[email protected]:~#

1 Answer

0 votes
by
Hello,

Would it be possible to get a full troubleshoot file in which these logs, or more of them, are visible? That would allow me to build a better case for our RnD department to look into this more deeply.

Thank you
by
Any updates on this issue? Over the last week the situation has become a lot worse, since we are now around 50 days from the rollout of RUTOS v7. We have implemented a temporary workaround: we check the free memory and, if the value drops below 30000 kB, we run killall port_eventsd (a sketch of this check is included at the end of this comment).

What is the purpose of the port_eventsd process?

Could it be disabled?
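
A minimal sketch of the watchdog mentioned above, assuming a BusyBox shell on the device; the script name, log tag and the 30000 kB threshold are our own choices, and whether port_eventsd is restarted automatically after the kill depends on how it is managed on the device:

#!/bin/sh
# port_eventsd_watchdog.sh (hypothetical helper, not part of RUTOS):
# kill port_eventsd when free memory drops below a threshold.
THRESHOLD_KB=30000
# The 4th field of the "Mem:" line from free is the free memory in kB.
FREE_KB=$(free | awk '/^Mem:/ {print $4}')
if [ "$FREE_KB" -lt "$THRESHOLD_KB" ]; then
    logger -t port_eventsd_watchdog "free=${FREE_KB}kB, killing port_eventsd"
    killall port_eventsd
fi

We run it periodically from cron; the interval can be adjusted as needed.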
by
Could you post the output of cat /proc/(pid of port_eventsd)/status ?
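
For example (assuming pidof is available on the device, as it usually is with BusyBox), the PID can be looked up inline:

cat /proc/$(pidof port_eventsd)/status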
by
[email protected]:~# cat /proc/2558/status
Name:   port_eventsd
Umask:  0022
State:  S (sleeping)
Tgid:   2558
Ngid:   0
Pid:    2558
PPid:   1
TracerPid:      0
Uid:    0       0       0       0
Gid:    0       0       0       0
FDSize: 32
Groups:
NStgid: 2558
NSpid:  2558
NSpgid: 1
NSsid:  1
VmPeak:    37844 kB
VmSize:    37844 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:     37020 kB
VmRSS:     37020 kB
RssAnon:           36128 kB
RssFile:             892 kB
RssShmem:              0 kB
VmData:    36156 kB
VmStk:       132 kB
VmExe:        16 kB
VmLib:      1536 kB
VmPTE:        52 kB
VmSwap:        0 kB
CoreDumping:    0
THP_enabled:    0
Threads:        1
SigQ:   0/953
SigPnd: 00000000000000000000000000000000
ShdPnd: 00000000000000000000000000000000
SigBlk: 00000000000000000000000000000000
SigIgn: 00000000000000000000000000001000
SigCgt: 00000000000000000000000000024002
CapInh: 0000000000000000
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
NoNewPrivs:     0
Seccomp:        0
Speculation_Store_Bypass:       unknown
Cpus_allowed:   1
Cpus_allowed_list:      0
voluntary_ctxt_switches:        2305029
nonvoluntary_ctxt_switches:     1480045
[email protected]:~#
by

The most interesting field to monitor is VmRSS; it already looks high, which probably indicates a memory leak. Re-check in a few hours and post the new cat output. The value of cat /proc/2558/oom_score could also be of interest.

htop gives a better view of the memory usage than top:

opkg update; opkg install htop
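
To capture the growth over time, a minimal logging sketch (assumptions: BusyBox shell, pidof available, a single port_eventsd process; the log path and file name are arbitrary, and /tmp is tmpfs, so the log is lost on reboot):

#!/bin/sh
# Hypothetical sketch: record VmRSS and oom_score of port_eventsd so the
# growth can be reviewed later, e.g. when run hourly from a cron job.
PID=$(pidof port_eventsd)
[ -n "$PID" ] || exit 0
RSS_KB=$(awk '/^VmRSS:/ {print $2}' /proc/$PID/status)
SCORE=$(cat /proc/$PID/oom_score)
echo "$(date '+%Y-%m-%d %H:%M:%S') pid=$PID VmRSS=${RSS_KB}kB oom_score=$SCORE" >> /tmp/port_eventsd_mem.log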

by
Hello,

We have found the memory leak on our side; it will be gone with the 7.2.6 FW release (as currently planned).