PBS submitting job to wrong node


#1

Hi All,

We are facing an intermittent issue with our PBS installation.

We have the following queues configured:

Normal (default)
Large
Xlarge
XXlarge

Each queue has separate nodes configured to receive jobs only from that queue. This is achieved with the following command:

qmgr -c "set node queue="

Problem: Sometimes when a job is submitted to the large queue, PBS places the job on a machine configured to accept jobs only from the normal (default) queue.
The command qstat -a shows that the job is running in the large queue; however, pbsnodes -a and qstat -f <job_id> show that the machine running the job has its queue parameter set to Normal.
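To compare the two views side by side, something like the following could help. It is only a sketch: it assumes that pbsnodes -a prints each node name unindented, followed by indented "attribute = value" lines. node_queues is a hypothetical helper (not a PBS command), and the node names in the sample input are made up.

```shell
#!/bin/sh
# node_queues: read `pbsnodes -a`-style text on stdin and print
# "<node> <queue>" for every node that has a queue attribute set.
# Assumption: node names start in column 0, attributes are indented.
node_queues() {
    awk '
        /^[^ \t]/ { node = $1; next }                 # unindented line starts a node record
        $1 == "queue" && $2 == "=" { print node, $3 } # indented "queue = <name>" line
    '
}

# Demo on made-up sample text; on a real system use: pbsnodes -a | node_queues
node_queues <<'EOF'
node01
     Mom = node01
     state = job-busy
     queue = normal

node02
     Mom = node02
     state = free
     queue = large
EOF
```

The resulting node/queue pairs can then be checked against the queue and exec_host fields reported by qstat -f <job_id> for the misplaced job.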

While PBS works properly most of the time, this issue appears once in a while and then gets fixed automatically, and we are not able to reproduce the issue.

Any idea what may be causing this? What can be done to prevent this from happening?

Here is the server configuration:

# Create resources and set their properties.

# Create and define resource slot_type

create resource slot_type
set resource slot_type type = string
set resource slot_type flag = h

# Create and define resource cpuf

create resource cpuf
set resource cpuf type = string
set resource cpuf flag = h

# Create and define resource ndisks

create resource ndisks
set resource ndisks type = string
set resource ndisks flag = h

# Create and define resource SPEED

create resource SPEED
set resource SPEED type = string
set resource SPEED flag = h

# Create and define resource physical_srv

create resource physical_srv
set resource physical_srv type = string
set resource physical_srv flag = h

# Create and define resource OSNAME

create resource OSNAME
set resource OSNAME type = string
set resource OSNAME flag = h

# Create and define resource model

create resource model
set resource model type = string
set resource model flag = h

# Create queues and set their attributes.

# Create and define queue xxlarge

create queue xxlarge
set queue xxlarge queue_type = Execution
set queue xxlarge resources_default.slot_type = xxlarge
set queue xxlarge enabled = True
set queue xxlarge started = True

# Create and define queue xlarge

create queue xlarge
set queue xlarge queue_type = Execution
set queue xlarge resources_default.slot_type = xlarge
set queue xlarge enabled = True
set queue xlarge started = True

# Create and define queue normal

create queue normal
set queue normal queue_type = Execution
set queue normal resources_default.slot_type = execute
set queue normal enabled = True
set queue normal started = True

# Create and define queue large

create queue large
set queue large queue_type = Execution
set queue large resources_default.slot_type = large
set queue large enabled = True
set queue large started = True

# Set server attributes.

set server scheduling = True
set server managers = root@*
set server default_queue = normal
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.ncpus = 1
set server resources_default.slot_type = execute
set server default_chunk.ncpus = 1
set server scheduler_iteration = 15
set server flatuid = True
set server resv_enable = True
set server node_fail_requeue = 310
set server max_array_size = 10000
set server pbs_license_min = 0
set server pbs_license_max = 2147483647
set server pbs_license_linger_time = 31536000
set server eligible_time_enable = False
set server job_history_enable = True
set server max_concurrent_provision = 5


#2

Hi,

I am also facing exactly the same problem. Does anybody have any idea what is missing in the configuration?

Thanks,
-Partha


#3

It is not a good configuration to hard-code a node to a queue (qmgr -c "set node queue="); instead, use Qlists.

  1. qmgr -c "create resource nodetype type=string_array,flag=h"

  2. Add "nodetype" to the resources: line in sched_config (resources: "…,nodetype"), then send kill -HUP to the scheduler (pbs_sched) so that it rereads the file

  3. qmgr -c "set queue normal default_chunk.nodetype=normal"
    qmgr -c "set queue large default_chunk.nodetype=large"
    qmgr -c "set queue xlarge default_chunk.nodetype=xlarge"
    qmgr -c "set queue xxlarge default_chunk.nodetype=xxlarge"

  4. for i in normal_node_types ; do qmgr -c "set node $i resources_available.nodetype=normal" ; done
    for i in large_node_types ; do qmgr -c "set node $i resources_available.nodetype=large" ; done
    for i in xlarge_node_types ; do qmgr -c "set node $i resources_available.nodetype=xlarge" ; done
    for i in xxlarge_node_types ; do qmgr -c "set node $i resources_available.nodetype=xxlarge" ; done

  5. Now submit jobs to the respective queues; they will land on the respective nodes
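
Steps 3 and 4 could be scripted roughly as below. This is a dry-run sketch, not a tested procedure: it only prints the qmgr commands so they can be reviewed first. The node names (node01, node02, node03) are hypothetical placeholders for your own node lists, and queue_cmd and node_cmds are helper names made up for this example.

```shell
#!/bin/sh
# Dry run: print the qmgr commands for steps 3 and 4 instead of executing them.

# Step 3: print the command that sets a queue's default_chunk.nodetype
# to the queue's own name.
queue_cmd() {
    printf 'qmgr -c "set queue %s default_chunk.nodetype=%s"\n' "$1" "$1"
}

# Step 4: first argument is the nodetype value, the remaining arguments
# are node names (hypothetical here).
node_cmds() {
    nodetype=$1
    shift
    for node in "$@"; do
        printf 'qmgr -c "set node %s resources_available.nodetype=%s"\n' "$node" "$nodetype"
    done
}

queue_cmd normal
queue_cmd large
node_cmds normal node01 node02
node_cmds large node03
```

Once the printed commands look right, the output can be piped to sh on the PBS server host to apply them.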

Please check this section 4.9.2.2.i Procedure to Associate Vnodes with Multiple Queues from

Thank you


#4

Hi Adarsh,

Thank you. This worked.