Exclude the node from qsub


#1

Hello! Can I exclude a host when I submit a job? If so, how do I do that?


#2

When submitting a job via qsub, we request the resources required to run the job.
If you would like to exclude one or more resources (compute nodes), you can tag the nodes with a custom resource and then request that resource, so that the scheduler selects only the nodes that have it set.

For example, say you have three nodes: n1, n2 and n3.

  • Create a custom resource called “node_select”
    qmgr -c "create resource node_select type=string_array,flag=h"

  • Add node_select to the resources: line in sched_config, then send the scheduler a HUP signal: kill -HUP <PID of pbs_sched>

  • Add the custom resource “node_select” to all the nodes:

qmgr -c 's n n1 resources_available.node_select=yes'
qmgr -c 's n n2 resources_available.node_select=yes'
qmgr -c 's n n3 resources_available.node_select=no'

  • Say you now want to avoid node n3; your qsub statement should then look like this:

qsub -l select=1:ncpus=1:mem=100mb:node_select=yes -- /bin/sleep 1000
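
With more than a handful of nodes, the tagging commands can be generated instead of typed by hand. Here is a minimal sketch in Python; the helper name node_select_commands is made up for illustration, and it spells out "set node" where the commands above use the "s n" abbreviation. It only prints the command lines, it does not run qmgr.

```python
import shlex


def node_select_commands(all_nodes, excluded):
    """Build one qmgr command per node, tagging it node_select=yes or no.

    Hypothetical helper: nodes listed in `excluded` get "no", all others "yes".
    """
    excluded = set(excluded)
    commands = []
    for node in all_nodes:
        value = "no" if node in excluded else "yes"
        directive = f"set node {node} resources_available.node_select={value}"
        # shlex.quote wraps the directive in plain single quotes for the shell
        commands.append(f"qmgr -c {shlex.quote(directive)}")
    return commands


if __name__ == "__main__":
    # Same layout as the example: keep n1 and n2, avoid n3
    for line in node_select_commands(["n1", "n2", "n3"], ["n3"]):
        print(line)
```

Review the printed lines before pasting them into a shell as a PBS manager.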

Thank you


#3

Thank you for your answer!

But what if it needs to be dynamic? One user wants to exclude n1, a second user wants to exclude n3.
I have a farm with 30 compute nodes. Sometimes users want to exclude different servers for various reasons.

What shall I do?


#4

Resources cannot be excluded dynamically once they have been matched to a job.
You can qalter the resource request of a QUEUED job, and the job-wide resources of a RUNNING job.

The user can request specific hosts as below:
qsub -l select=1:ncpus=1:mem=100mb:host=n1+1:ncpus=1:mem=100mb:host=n2
qsub -l nodes=n1+n2
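
For a farm with many nodes, a host list like the one above can be generated rather than typed. This is a minimal sketch in Python, assuming a made-up helper name (select_on_hosts) and the chunk sizes from the example. Note that it asks for one chunk on every host that is not excluded, just like the request above; it does not mean "any one of these hosts".

```python
def select_on_hosts(all_nodes, excluded, ncpus=1, mem="100mb"):
    """Build a select statement with one chunk per non-excluded host.

    Hypothetical helper; ncpus and mem default to the values in the example.
    """
    excluded = set(excluded)
    hosts = [node for node in all_nodes if node not in excluded]
    if not hosts:
        raise ValueError("every node is excluded, nothing left to request")
    chunks = [f"1:ncpus={ncpus}:mem={mem}:host={host}" for host in hosts]
    return "select=" + "+".join(chunks)


if __name__ == "__main__":
    # A user who wants to stay off n3 in a three-node farm
    print("qsub -l " + select_on_hosts(["n1", "n2", "n3"], ["n3"]))
```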


#5

It’s possible, but it requires a lot of work (for the admin, not for the user; once it is set up, the user only needs a single qsub to the route queue).

  • create a route queue, say route
  • create a queue for every user, say q_userA q_userB q_userC …
  • create a boolean node-level resource (flag=h, type=boolean) for every user, say run_userA, run_userB, run_userC …
  • use the ACLs of q_userA q_userB q_userC … to ensure that only the specific user can enter each queue
    for example, to have userA routed to q_userA, and to assign the queue a default chunk:
qmgr -c "s q q_userA acl_user_enable=t"
qmgr -c "s q q_userA acl_users+=userA@*"
qmgr -c "s q q_userA default_chunk.run_userA=t"
  • add each of these queues as a destination of route
qmgr -c "s q route route_destinations+=q_userA"

… (and likewise for the other queues)

  • mark the nodes each user may use: as in your example, set the resource run_userA=t on every node except n1 (… do this for every user you want to control)

After this, the user only needs to run:

qsub -l select=1:ncpus=1 -q route -- /bin/sleep 1000
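
Since the per-user setup is repetitive, the admin could generate the qmgr directives with a small script. This is only a sketch in Python under several assumptions: the helper name user_setup_directives is made up, the route queue and the per-user queues are assumed to exist already, and the attribute names acl_user_enable and route_destinations are taken from my reading of the PBS Professional documentation, so check them against your version. It prints the commands and does not run them.

```python
import shlex


def user_setup_directives(user, all_nodes, excluded, route_queue="route"):
    """Return qmgr directives for one user's queue, resource and nodes.

    Hypothetical helper: `excluded` lists the nodes this user must avoid.
    """
    queue = f"q_{user}"
    resource = f"run_{user}"
    excluded = set(excluded)
    directives = [
        f"create resource {resource} type=boolean,flag=h",
        f"set queue {queue} acl_user_enable=true",
        f"set queue {queue} acl_users+={user}@*",
        f"set queue {queue} default_chunk.{resource}=true",
        f"set queue {route_queue} route_destinations+={queue}",
    ]
    # Mark every node the user is allowed on; excluded nodes stay unmarked
    directives += [
        f"set node {node} resources_available.{resource}=true"
        for node in all_nodes
        if node not in excluded
    ]
    return directives


if __name__ == "__main__":
    # userA wants to stay off n1
    for directive in user_setup_directives("userA", ["n1", "n2", "n3"], ["n1"]):
        print(f"qmgr -c {shlex.quote(directive)}")
```

The new resource would still need to be added to the resources: line in sched_config by hand.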

Btw, there may be some typos in the commands, as I’m typing on the fly without testing, but this approach should do the trick.

FYI


#6

Another solution might be to create a custom host-level string_array resource (allowed_users) and enable it in the sched_config file.

  1. Use a server hook (or a periodic MoM hook) to read a centralised text file whose contents might be in this format:
    Node Users_Allowed
    n1 user1, user2
    n2 user2
    n3 user3

  2. The hook reads this file and updates allowed_users in each node's configuration accordingly. You can update the text file dynamically.
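
The file-reading part of such a hook could look roughly like the sketch below. It is plain Python with made-up helper names (parse_allowed_users, allowed_users_directive) and assumes the file layout shown above with a "Node Users_Allowed" header line. How the hook then applies the values to the nodes is left out; the qmgr directive it builds, including the quoted comma-separated value, is my assumption of how a string_array would be set.

```python
def parse_allowed_users(text):
    """Parse lines of the form 'n1 user1, user2' into {node: [users]}."""
    mapping = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and the header line
        if not line or line.split()[:2] == ["Node", "Users_Allowed"]:
            continue
        parts = line.split(None, 1)  # node name, then the rest of the line
        users = parts[1] if len(parts) > 1 else ""
        mapping[parts[0]] = [u.strip() for u in users.split(",") if u.strip()]
    return mapping


def allowed_users_directive(node, users):
    """Build a qmgr directive setting allowed_users on one node (assumed syntax)."""
    return f'set node {node} resources_available.allowed_users="{",".join(users)}"'


if __name__ == "__main__":
    sample = "Node Users_Allowed\nn1 user1, user2\nn2 user2\nn3 user3\n"
    for node, users in parse_allowed_users(sample).items():
        print(allowed_users_directive(node, users))
```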


#7

@adarsh: If I follow your suggestion, I get the following error message:

$ qmgr -c “create resource node_select type=string_array,flag=h”
Can not resolve name for server resource. (rc = -1 - )
Cannot resolve specified server host ‘resource’.
qmgr: cannot connect to server resource (errno=15010) Access from host not allowed, or unknown host
Can not resolve name for server node_select. (rc = -1 - )
Cannot resolve specified server host ‘node_select’.
qmgr: cannot connect to server node_select (errno=15010) Access from host not allowed, or unknown host
Can not resolve name for server type=string_array. (rc = -1 - )
Can not resolve name for server flag=h”. (rc = -1 - )
Cannot resolve specified server host ‘type=string_array,flag=h”’.
qmgr: cannot connect to server type=string_array,flag=h” (errno=15010) Access from host not allowed, or unknown host

Do I need to be an administrator to do this? Here is the PBS version installed:

$ qstat --version
Version: 6.1.2
Commit: 661e092552de43a785c15d39a3634a541d86898e

Thank you for your help


#8

@sacha89: You are not running PBS Pro OSS; you must be running Torque.
Please get the latest version of PBS Pro OSS from the link below (the latest version is 18.1.3)

Documentation: https://www.pbsworks.com/SupportGT.aspx?d=PBS-Professional,-Documentation

Some information:


#9

You are right, I am running Torque. The administrator won’t let me install PBS Pro OSS. Do you know whether there is a way to exclude specific nodes with Torque, or where I could find such information?

I have found this answer but again, it requires that the administrator assigns some properties to the nodes, and he won’t do it.


#10

The qsub man page on the cluster should have information about this kind of job request, that is, how to specify exactly which nodes you want your job to run on.

It is common practice, in order to optimize scheduling, for administrators to tag nodes by application, specific hardware, or network topology (switch, router, rack, etc.).


#11

adarsh: Man page of qsub on the cluster should have information about this kind of job request, to specify on which exact nodes you want your job to run.

Like I said, I want to exclude specific nodes, not to specify on what exact nodes I want my job to run.


#12

It seems this cannot be achieved without the admin’s help.
Specifying the exact nodes seems to be the best workaround for now.


#13

I realize that the node I want to exclude already has some properties that differ from the others, and I wonder whether these can be used to exclude it, see this post.


#14

Please try:

qsub -l properties=diag -- /bin/hostname
qsub -l nodes=1:ppn=2:diag -- /bin/hostname
qsub -l nodes=1:ppn=1:properties=diag -- /bin/hostname

#15

When dealing with the same problem, I found that a simple way to exclude a malfunctioning node is to create a dummy job that requests all of its resources (ppn, mem).

Example:
qsub -l nodes=<node>:ppn=<ppn> -l mem=<mem> -N <job_name> -- date

While the system is busy trying to run the dummy job on the undesired node, I can submit the actual jobs that I want to run on other nodes.


#16

Hi Adarsh,

Another Torque user... you are becoming a Torque expert.

Lightbulb! Now I have an idea: I need fresh info on how to migrate from other workload managers to PBS, and if you ever get the time, I’d love to get instructions on migrating from Torque to PBS. But this is not yet a pressing matter. Thanks. (Evil grin.)

Cheers,

-Anne