VMEM vs MEM - what is the difference?


#1

Hi,
My job stopped and I don't know why. I suspect that it ran out of mem or vmem.
I want to set resources_max.mem and resources_max.vmem. I will set resources_max.mem to the sum of the physical memory on all nodes. But what value should I set for resources_max.vmem? Physical memory + swap, or just swap? What exactly is vmem?

I want to set these values because when I query the resources on my PBS (qstat -Qf) I get just "0" for mem, vmem, etc. Maybe PBS does not know how many resources are available and hands them out to the job dynamically, without knowing when to stop.


#2

MoM polls the job's mem, cput and vmem usage and reports it to the PBS Server every 120 seconds.
If your job ran for less than 120 seconds, MoM would not have reported any usage.

Please read this section of the manual: 3.6.1 Configuring MoM Polling Cycle


#3

I'd advise you to first verify that a lack of resources was indeed the reason why your job didn't run. The 'comment' attribute on the job might give you a better idea. If that doesn't help, you can look at the scheduler logs to see what happened. Scheduler logs are located at $PBS_HOME/sched_logs/. If it was a resource problem, they will tell you exactly which resource was lacking (mem, vmem, or both). The reason why your job didn't run might also be something else entirely.


#4

The job was terminated after about 22 hours, so much longer than 120 seconds :frowning:
The sched_logs say that the job was run. The mom_logs on the node say a little more:

11/16/2018 07:46:05;0080;pbs_mom;Job;266.xx.yy;task 00000001 terminated
11/16/2018 07:46:05;0008;pbs_mom;Job;266.xx.yy;Terminated
11/16/2018 07:46:05;0100;pbs_mom;Job;266.xx.yy;task 00000001 cput=22:48:23
11/16/2018 07:46:05;0008;pbs_mom;Job;266.xx.yy;kill_job
11/16/2018 07:46:05;0100;pbs_mom;Job;266.xx.yy;n1 cput=22:48:23 mem=257467964kb
11/16/2018 07:46:05;0008;pbs_mom;Job;266.xx.yy;no active tasks
11/16/2018 07:46:05;0100;pbs_mom;Job;266.xx.yy;Obit sent
11/16/2018 07:46:05;0100;pbs_mom;Req;;Type 54 request received from root@10.10.0.1:15001, sock=1
11/16/2018 07:46:05;0080;pbs_mom;Job;266.xx.yy;copy file request received
11/16/2018 07:46:06;0100;pbs_mom;Job;266.xx.yy;staged 2 items out over 0:00:01
11/16/2018 07:46:06;0008;pbs_mom;Job;266.xx.yy;no active tasks
11/16/2018 07:46:06;0100;pbs_mom;Req;;Type 6 request received from root@10.10.0.1:15001, sock=1
11/16/2018 07:46:06;0080;pbs_mom;Job;266.xx.yy;delete job request received
11/16/2018 07:46:06;0008;pbs_mom;Job;266.xx.yy;kill_job

But I don't know why the job was killed.

The user gets this message:

/var/spool/pbs/mom_priv/jobs/266.xx.yy.SC: line 13: 28828 Killed Rscript nameOfRscript.R

Maybe this is an error in the R script?

In server_logs I found these messages:

11/16/2018 07:46:05;0080;Server@pbs;Job;266.xx.yy;Obit received momhop:1 serverhop:1 state:4 substate:42
11/16/2018 07:46:06;0010;Server@pbs;Job;266.xx.yy;Exit_status=137 resources_used.cpupercent=98 resources_used.cput=22:48:23 resources_used.mem=257467964kb resources_used.ncpus=38 resources_used.vmem=268250680kb resources_used.walltime=22:52:57
11/16/2018 07:46:06;0100;Server@pbs;Job;266.xx.yy;dequeuing from workq, state 5

Exit_status=137 -> 137 - 128 = 9, which in my shell means signal 9: kill
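
The exit-status arithmetic above can be sketched in shell. This is only a minimal illustration, assuming the convention used in this post that an exit status above 128 means the process was terminated by signal (status - 128); decode_exit_status is a hypothetical helper name, not a PBS command.

```shell
#!/usr/bin/env bash
# Decode an exit status the way the shell reports it:
# values above 128 are read as "terminated by signal (status - 128)".
decode_exit_status() {
    local status=$1
    if [ "$status" -gt 128 ]; then
        local signum=$((status - 128))
        # kill -l <number> prints the signal name without the SIG prefix
        echo "killed by signal $signum (SIG$(kill -l "$signum"))"
    else
        echo "exited normally with code $status"
    fi
}

decode_exit_status 137   # prints: killed by signal 9 (SIGKILL)
decode_exit_status 1     # prints: exited normally with code 1
```

The exit status alone does not say who sent the signal, so the MoM logs and the system logs on the node still need to be checked.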

How do I use the job comment? Is it automatically added to the sched_log, or must I do something when I run the job?
I set resources_available on the nodes: mem (to the amount of physical memory) and vmem (to the amount of swap). Are these settings right? The user has not yet checked whether this works, but I am wondering whether I am doing this right.


#5

qstat -fx, not qstat -Qf, will give you more information about the job.

qstat -fx 266 # This gives you full information about the job when queued and when running (check the comment field of this output; it is updated by the PBS Scheduler)

tracejob 266 # This gives you the entire life cycle of the job across server, scheduler and accounting (MoM as well, if the PBS complex is all on one node)

It must have been terminated by the R application itself. Otherwise, if it had been terminated by the PBS services, you would have information about it in the PBS MoM logs.

The job's comment is automatically updated by the PBS Scheduler.
#qstat -answ1 will show you the comment of a queued/running job
#qstat -fx will show it for a job which is queued, running or completed.

Please share the pbsnodes -av output (if you have a small number of nodes) or, if you have a large number of nodes, share the output for a couple of them.


#6

I can't see any details about job 266 because I didn't have job history switched on. We will start that job once more, and then I will look at the logs again. Thanks for the information about qstat -fx, tracejob, etc.
This is the output from pbsnodes -av:

$ pbsnodes -av
n1
Mom = n1.xx.yy
Port = 15002
pbs_version = 14.1.2
ntype = PBS
state = free
pcpus = 40
resources_available.arch = linux
resources_available.host = n1
resources_available.mem = 249gb
resources_available.ncpus = 40
resources_available.vmem = 15gb
resources_available.vnode = n1
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.netwins = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

n2
Mom = n2.xx.yy
Port = 15002
pbs_version = 14.1.2
ntype = PBS
state = free
pcpus = 40
resources_available.arch = linux
resources_available.host = n2
resources_available.mem = 249gb
resources_available.ncpus = 40
resources_available.vmem = 15gb
resources_available.vnode = n2
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.netwins = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared

n3
Mom = n3.xx.yy
Port = 15002
pbs_version = 14.1.2
ntype = PBS
state = free
pcpus = 40
resources_available.arch = linux
resources_available.host = n3
resources_available.mem = 249gb
resources_available.ncpus = 40
resources_available.vmem = 15gb
resources_available.vnode = n3
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 0
resources_assigned.netwins = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared


#7

Thank you for the pbsnodes -av output.

Please enable job history:
qmgr -c "set server job_history_enable=True"


#8

Yes, I did this a few days ago. I have history from job 299 onwards :slight_smile: Thanks!


#9

From the MoM logs, it does seem like the program itself terminated, and that's when PBS killed the job. I'd recommend that you investigate what happened with the program (the R script). Maybe the stageout files have some info?


#10

The user split the R script into smaller pieces and ran it. After 4 minutes the job was killed with Exit_status=1 and this R error:

Error: cannot allocate vector of size 12.5 Mb
Execution halted

This means that R did not have enough memory. I wondered why, since the user had requested all available memory:

Resource_List.select = 3:ncpus=39:mem=249gb:vmem=15gb

I decided to unset resources_available.vmem and told the user to run the job again without requesting vmem.
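
Unsetting the node-level value can be done with qmgr. The sketch below only prints the commands rather than running them; the node names n1 to n3 are taken from the pbsnodes output earlier in the thread, and print_unset_vmem_cmds is a hypothetical helper name. It is an illustration under those assumptions, not necessarily the exact commands that were run here.

```shell
#!/usr/bin/env bash
# Print (not run) the qmgr commands that would remove
# resources_available.vmem from each node given as an argument.
# Node names are assumptions; adjust them for another cluster.
print_unset_vmem_cmds() {
    local node
    for node in "$@"; do
        printf 'qmgr -c "unset node %s resources_available.vmem"\n' "$node"
    done
}

print_unset_vmem_cmds n1 n2 n3
```

Reviewing the printed lines before pasting them into a shell on the PBS server keeps the change explicit and easy to undo.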
Success! After almost six hours the job ended with Exit_status=0. I looked at the logs:

resources_used.mem=177313292kb
resources_used.vmem=177999696kb

This means that vmem is not just swap: it is the memory the job uses across RAM and swap together, as needed. Maybe the term stands for the memory used on a vnode.

Thanks for the help!
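
The two usage figures above can be converted into the same unit as the request to make the comparison easier. This is a minimal sketch, assuming PBS size suffixes are binary (1 kb = 1024 bytes, 1 gb = 1024 * 1024 kb); kb_to_gb is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Convert a PBS-style "<number>kb" value to gb with one decimal place.
# Assumes binary units: 1 gb = 1024 * 1024 kb.
kb_to_gb() {
    local kb=${1%kb}   # strip a trailing "kb" suffix if present
    LC_ALL=C awk -v kb="$kb" 'BEGIN { printf "%.1f\n", kb / (1024 * 1024) }'
}

kb_to_gb 177313292kb   # resources_used.mem  -> 169.1
kb_to_gb 177999696kb   # resources_used.vmem -> 169.8
```

Under that assumption the successful run used about 169.8 gb of vmem, far above the vmem=15gb that was requested per chunk in the failed run, which is consistent with R being unable to allocate memory while that request was in place.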