Jobs Immediately Exiting (Email Spamming)


#1

Hello,

We had simple jobs failing to run, so I tried an interactive job. As soon as the job is ready, it exits:

[ccastest@bright01-thx ~]$ qsub -I -l select=1:ncpus=5:mem=8gb -l walltime=01:00:00
qsub: waiting for job 7106.bright01-thx to start
qsub: job 7106.bright01-thx ready


qsub: job 7106.bright01-thx completed

After reducing the ncpus to four or lower, the job got assigned to a node (and it was always the same node - node0108 - regardless of whether the ncpus was 1, 2 or 3).

When I increased the ncpus count to 5 or higher, I got the behavior above.

Keep in mind there are 90 fully free nodes in the cluster, so there are more than enough cores and memory to satisfy this request. It seems to me PBS Pro is stuck on feeding node0108 rather than the other free nodes. This behavior persisted even after restarting the PBS Pro server, the scheduler and all other processes (except the moms on the vnodes). We even modified smp_cluster_dist in $PBS_HOME/sched_priv/sched_config and restarted the PBS Pro server, to no avail.

Next, I took node0108 offline and, to my amazement, the interactive jobs with ncpus=4 or lower, which had been getting assigned to node0108 successfully, began starting and completing immediately, just as observed above for ncpus=5 or higher. So it seems PBS Pro stays stuck on node0108 regardless of its status.

Can someone please explain this behavior? I would expect the PBS Pro scheduler to move to another available node if a previously selected node becomes hosed or unavailable for whatever reason. And if the job (interactive or not) cannot be satisfied at all due to unavailable resources, it should sit in the queue until resources become available rather than start and quit immediately.

Also, how do I fix this? Strangely, newly submitted jobs are getting held and spamming users’ email with at least 20 failure emails, which I suspect is related to this persistent behavior of feeding just that one node…

Thanks in advance,
Siji


#2

Hey @sijisaula
The scheduler is deterministic. If node0108 is the first free node it finds, it will schedule a job on that node. If you start a job and it immediately ends, then the node is free again to start a new job on. The issue isn’t the scheduler, it’s the fact that your jobs immediately end. I’d look in the mom logs and see if it had a reason for why the job had problems.

If you want PBS to stop using a node it thinks is up and perfectly fine, you will need to tell PBS that. This is usually done in a mom hook. The execjob_begin and execjob_prologue hook events are run before the job starts. You can do some node health checks in the hook and, if needed, put the node in the offline state. If the hook event is rejected, the job is requeued. Since the node is now in the offline state, the scheduler will ignore that node in its future scheduling decisions. If Python is not your language of choice, you can use our older prologue. It is a shell script that is run before the job starts.
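A minimal sketch of such a hook might look like the following. It is an illustration under assumptions, not a tested production hook: the load-average check and the `max_load_per_cpu` threshold are hypothetical choices, and the `pbs` calls (`pbs.event()`, `vnode_list`, `pbs.get_local_nodename()`, `pbs.ND_OFFLINE`) should be checked against the hooks guide for your PBS version. The `pbs` module only exists inside the PBS hook interpreter, so the import is guarded to let the check itself be tried outside PBS.

```python
import multiprocessing
import os


def node_is_healthy(load1, ncpus, max_load_per_cpu=2.0):
    """Hypothetical check: True when the 1-minute load is within the allowed ratio."""
    return load1 <= ncpus * max_load_per_cpu


try:
    import pbs  # only importable when run by PBS as a hook
except ImportError:
    pbs = None

if pbs is not None:
    e = pbs.event()
    load1 = os.getloadavg()[0]
    ncpus = multiprocessing.cpu_count()
    if node_is_healthy(load1, ncpus):
        e.accept()
    else:
        me = pbs.get_local_nodename()
        # Offline this node so the scheduler skips it in later cycles,
        # then reject the event so the job is requeued.
        e.vnode_list[me].state = pbs.ND_OFFLINE
        e.vnode_list[me].comment = "offlined by health check, load %.1f" % load1
        e.reject("node failed health check")
```

Load average is used here only as an example; any site-specific test (file system mounted, scratch space free, and so on) could go into `node_is_healthy`.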

Other interesting things:
When you said you offlined the node, did you mean you put it in the offline state (pbsnodes -o node0108)? If so, the scheduler should ignore it from that point forward. There is a short race window: if the scheduler is mid-cycle when you offline the node, it won’t notice until the next cycle.

I don’t think smp_cluster_dist will do what you want it to do. I’d avoid using it since it has been deprecated for many years now. If set to round_robin, it will try to cycle around the nodes within a single scheduling cycle. From the sounds of it you are running one job per cycle, so it won’t help in this case. In any case, the more modern method of achieving round robin is to set node_sort_key: "ncpus HIGH unused" in sched_config. This will sort your nodes based on the number of unused cpus. Once some of the cpus on a node are used, the node is lowered in priority.
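To illustrate the ordering that sorting on unused cpus produces, here is a small sketch. It is not the scheduler's implementation; the node names and counts are made up for the example.

```python
def sort_by_unused_ncpus(nodes):
    """nodes maps name -> (ncpus_available, ncpus_assigned).

    Returns the node names ordered with the most unused cpus first.
    """
    return sorted(nodes, key=lambda n: nodes[n][0] - nodes[n][1], reverse=True)


# Hypothetical cluster state: (available, assigned) cpus per node.
nodes = {
    "nodeA": (36, 0),
    "nodeB": (36, 20),
    "nodeC": (36, 4),
}
before = sort_by_unused_ncpus(nodes)

# Place an 8-cpu job on the first node, then sort again.
first = before[0]
nodes[first] = (nodes[first][0], nodes[first][1] + 8)
after = sort_by_unused_ncpus(nodes)

print(before)
print(after)
```

In this example nodeA is first while it is empty, and drops behind nodeC once 8 of its cpus are in use, which is the round-robin-like effect described above.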

The reason your users get 20 emails is that PBS will try to run a job 20 times before holding it. It figures that if a job has been run 20 times, something is wrong and running it again will not help.

So once again, the problem is on the mom side. Please take a look at the mom log around the time the job starts. It might give you more information on why the job is ending immediately.
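One way to pull the relevant lines out of the mom logs is a small script like the one below, run on the execution node. It is only a sketch: the default directory `/var/spool/pbs/mom_logs` is an assumption (the actual PBS_HOME is site-specific), and it simply returns every line that mentions the job id without interpreting the log format.

```python
import os


def find_job_lines(log_dir, job_id):
    """Return (filename, line) pairs from every file in log_dir that mention job_id."""
    hits = []
    if not os.path.isdir(log_dir):
        return hits
    for name in sorted(os.listdir(log_dir)):
        path = os.path.join(log_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "r", errors="replace") as fh:
            for line in fh:
                if job_id in line:
                    hits.append((name, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Assumed location; adjust to $PBS_HOME/mom_logs on your system.
    pbs_home = os.environ.get("PBS_HOME", "/var/spool/pbs")
    log_dir = os.path.join(pbs_home, "mom_logs")
    for name, line in find_job_lines(log_dir, "7106.bright01-thx"):
        print(name, line)
```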

Bhroam


#3

Hi @bhroam,

Thanks for sharing your thoughts.

I’ve checked the mom_logs directory, but it has no logs at all, and my mom_priv/config looks like this:

#cat mom_priv/config
$clienthost bright01-thx

Additionally, I did some more testing and noticed that when mem=8gb is reduced to 1gb, the job does not exit immediately. For mem values greater than 1gb, however, it does:

[ccastest@bright01-thx ~]$ qsub -I -q default -l select=1:ncpus=1:mem=1gb -l walltime=01:00:00
qsub: waiting for job 7168.bright01-thx to start
qsub: job 7168.bright01-thx ready

[ccastest@node0115 ~]$ logout

qsub: job 7168.bright01-thx completed
[ccastest@bright01-thx ~]$ qsub -I -q default -l select=1:ncpus=1:mem=2gb -l walltime=01:00:00
qsub: waiting for job 7169.bright01-thx to start
qsub: job 7169.bright01-thx ready


qsub: job 7169.bright01-thx completed 

Is there some setting that may explain this behavior? node0108 is out of the question now. And node0115, which the job is being assigned to, surely has enough memory:

[ccastest@bright01-thx ~]$ pbsnodes node0115
node0115
Mom = node0115.thunder.ccast
ntype = PBS
state = free
pcpus = 36
jobs = 6781.bright01-thx/0, 6781.bright01-thx/1, 6781.bright01-thx/2, 6781.bright01-thx/3, 6781.bright01-thx/4, 6781.bright01-thx/5, 6781.bright01-thx/6, 6781.bright01-thx/7, 6781.bright01-thx/8, 6781.bright01-thx/9, 6781.bright01-thx/10, 6781.bright01-thx/11, 6781.bright01-thx/12, 6781.bright01-thx/13, 6781.bright01-thx/14, 6781.bright01-thx/15, 6781.bright01-thx/16, 6781.bright01-thx/17, 6781.bright01-thx/18, 6781.bright01-thx/19
resources_available.arch = linux
resources_available.host = node0115
resources_available.mem = 1503238553b
resources_available.ncpus = 36
resources_available.ngpus = 8
resources_available.vnode = node0115
resources_assigned.accelerator_memory = 0kb
resources_assigned.mem = 0kb
resources_assigned.mic_cores = 0
resources_assigned.mic_density = 0kb
resources_assigned.mic_size = 0kb
resources_assigned.naccelerators = 0
resources_assigned.ncpus = 20
resources_assigned.netwins = 0
resources_assigned.ngpus = 0
resources_assigned.nmics = 0
resources_assigned.res-condo01 = 0
resources_assigned.vmem = 0kb
resv_enable = True
sharing = default_shared
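As a sanity check on the figures above, the sketch below converts the reported resources_available.mem to bytes and compares it with the requested sizes. It assumes PBS size suffixes are powers of 1024, which is consistent with an 8gb request showing up as 8388608kb in the qstat output in this thread. Under that assumption node0115 advertises roughly 1.4gb, so a 1gb request fits there while 2gb and 8gb requests would have to be placed on some other node.

```python
# Size suffixes, assumed to be powers of 1024.
UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3, "tb": 1024 ** 4}


def size_to_bytes(text):
    """Convert strings such as '1503238553b' or '8gb' to a byte count."""
    text = text.strip().lower()
    # Check two-letter suffixes first, since 'kb' also ends with 'b'.
    for suffix in ("kb", "mb", "gb", "tb", "b"):
        if text.endswith(suffix):
            return int(text[: -len(suffix)]) * UNITS[suffix]
    return int(text)


available = size_to_bytes("1503238553b")  # resources_available.mem on node0115
for request in ("1gb", "2gb", "8gb"):
    fits = size_to_bytes(request) <= available
    print(request, "fits on node0115" if fits else "does not fit on node0115")
```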

I did a check on one of the jobs that exited immediately and noticed that qstat claimed it ran on node0107:

#qstat -xf 7166
Job Id: 7166.bright01-thx
Job_Name = STDIN
Job_Owner = ccastest@bright01-thx.thunder.ccast
resources_used.cpupercent = 0
resources_used.cput = 00:00:00
resources_used.mem = 0kb
resources_used.ncpus = 1
resources_used.vmem = 0kb
resources_used.walltime = 00:00:03
job_state = F
queue = def-devel
server = bright01-thx
Checkpoint = u
ctime = Thu Jun 28 14:50:45 2018
Error_Path = /dev/pts/0
exec_host = node0107/1
exec_vnode = (node0107:ncpus=1:mem=8388608kb)
Hold_Types = n
interactive = True
Join_Path = n
Keep_Files = n
Mail_Points = a
mtime = Thu Jun 28 14:50:48 2018
Output_Path = /dev/pts/0
Priority = 0
qtime = Thu Jun 28 14:50:45 2018
Rerunable = False
Resource_List.mem = 8gb
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.place = free
Resource_List.select = 1:ncpus=1:mem=8gb
Resource_List.walltime = 01:00:00
stime = Thu Jun 28 14:50:45 2018
session_id = 182109
jobdir = /home/ccastest
substate = 92
Variable_List = PBS_O_HOME=/home/ccastest,PBS_O_LANG=en_US.UTF-8,
PBS_O_LOGNAME=ccastest,
PBS_O_PATH=/gpfs1/apps/spack/bin:/home/ccastest/.rvm/gems/ruby-2.2.7/b
in:/home/ccastest/.rvm/gems/ruby-2.2.7@global/bin:/home/ccastest/.rvm/r
ubies/ruby-2.2.7/bin:/cm/local/apps/environment-modules/4.0.0//bin:/usr
/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/ibutils/bin:/sb
in:/usr/sbin:/cm/local/apps/environment-modules/4.0.0/bin:/cm/shared/ap
ps/pbspro-ce/current/bin:/home/ccastest/.rvm/bin:/home/ccastest/bin:/us
r/local/maui/bin:/gpfs1/apps/global/opt/ansys_inc/electronics_v190/inst
all/AnsysEM19.0/Linux64,PBS_O_MAIL=/var/spool/mail/ccastest,
PBS_O_SHELL=/bin/bash,PBS_O_WORKDIR=/home/ccastest,PBS_O_SYSTEM=Linux,
PBS_O_QUEUE=def-devel,PBS_O_HOST=bright01-thx.thunder.ccast
comment = Job run at Thu Jun 28 at 14:50 on (node0107:ncpus=1:mem=8388608kb
) and finished
etime = Thu Jun 28 14:50:45 2018
run_count = 1
Exit_status = 0
Submit_arguments = -I -q def-devel -l select=1:ncpus=1:mem=8gb -l walltime=
01:00:00
history_timestamp = 1530215448
project = _pbs_project_default
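Checking many jobs this way by hand gets tedious, so a small parser over the qstat -xf text can flag jobs that ended within seconds and report where they ran. This is a sketch only: it assumes the "key = value" layout shown above, treats any line without " = " as a continuation of the previous value, and the 10-second cutoff is an arbitrary choice. The sample text is a subset of the output above.

```python
def parse_qstat_full(text):
    """Parse 'key = value' lines; lines without ' = ' continue the previous value."""
    attrs = {}
    last = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("Job Id:"):
            continue
        if " = " in line:
            key, value = line.split(" = ", 1)
            attrs[key] = value
            last = key
        elif last is not None:
            attrs[last] += line
    return attrs


def walltime_seconds(hms):
    """Convert HH:MM:SS to seconds."""
    h, m, s = (int(p) for p in hms.split(":"))
    return h * 3600 + m * 60 + s


sample = """Job Id: 7166.bright01-thx
resources_used.walltime = 00:00:03
exec_host = node0107/1
interactive = True
run_count = 1
Exit_status = 0
"""

job = parse_qstat_full(sample)
if walltime_seconds(job["resources_used.walltime"]) < 10:
    print("short-lived job on", job["exec_host"].split("/")[0])
```

Collecting the exec_host of every short-lived job should show quickly whether they all landed on the same node.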

Any additional thoughts would be appreciated as the probable root cause is still not clear to me.

Thanks,
Siji


#4

Did some further digging and apparently node0107 was the source of the immediate job completions as it had really high load averages.

Once I took that node offline, the immediate completions of interactive jobs ceased, regardless of resource requests.

Thanks again for all your assistance!

Siji