How to query the number of available cores to your job


#1

Dear folks, I was wondering if you could let me know how to query the number of cores available to your batch job after you’ve set up your nodes with something like #PBS -l nodes=3,cores=24.

I know there’s pbsnodes -a, but I’m not sure if there is anything like pbscores. I’m expecting it to return the number of cores that I’ve requested for this particular job, which is 24.

Thanks!


#2

Please note the syntax: nodes and cores are old syntax. Please update your scripts to use select, ncpus, mem, mpiprocs, etc.

#PBS -l select=3:ncpus=24
This means you are requesting 3 chunks with 24 cores each.
In total you are requesting 72 cores for this job.

Please follow the quick start guide: https://www.pbsworks.com/pdfs/PBSQuickStartGuide18.2.pdf

The below commands might also help you to get the information you need:

  1. pbsnodes -aSj
  2. qstat -fx jobid
  3. qstat -answ1
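In addition, inside a running job you can count the CPU slots the scheduler assigned, since PBS writes one line per allocated slot to the file named in $PBS_NODEFILE. Here is a minimal sketch; note the node file is faked for illustration (node1/node2 are placeholder hostnames), since $PBS_NODEFILE only exists inside a job:

```shell
# Inside a real job you would use the file PBS provides:
#   NODEFILE="$PBS_NODEFILE"
# Faked here: a 2-node x 3-core allocation (one line per core slot).
NODEFILE=$(mktemp)
printf 'node1\nnode1\nnode1\nnode2\nnode2\nnode2\n' > "$NODEFILE"

# Total cores assigned to the job = number of lines in the node file.
NCORES=$(wc -l < "$NODEFILE" | tr -d ' ')

# Number of distinct nodes.
NNODES=$(sort -u "$NODEFILE" | wc -l | tr -d ' ')

echo "cores=$NCORES nodes=$NNODES"
rm -f "$NODEFILE"
```

With a real nodes=3,cores=24 style request, NCORES would come out as the total core count the job was granted.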

#3

Thank you so much for the response and the link. I’ll definitely have a good look at them. I was wondering if you could point me to a similar guide or material which shows how to use Spark with PBS?


#4

pbsnodes -aSj seems to be an invalid command

-bash-4.1$ pbsnodes -aSj
pbsnodes: invalid option -- 'S'
pbsnodes: invalid option -- 'j'
usage: pbsnodes [-{c|d|l|o|p|r}] [-s server] [-n] [-N "note"] [-A "append note"] [-q] [-m standby|suspend|sleep|hibernate|shutdown] node ...
pbsnodes [-{a|x}] [-s server] [-q] [node]


#5

Please note the command is valid for PBS Pro OSS. Please share the version of PBS Pro you are running.

This is what I see on my system:

> [root@pbsserver ~]# pbsnodes -aSj
>                                                         mem       ncpus   nmics   ngpus
> vnode           state           njobs   run   susp      f/t        f/t     f/t     f/t   jobs
> --------------- --------------- ------ ----- ------ ------------ ------- ------- ------- -------
> cn1             state-unknown        0     0      0      0kb/0kb     0/0     0/0     0/0 --
> cn2             state-unknown        0     0      0      0kb/0kb     0/0     0/0     0/0 --
> cn3             free                 0     0      0    23gb/23gb     4/4     0/0     2/2 --
> 
> [root@pbsserver ~]# pbsnodes --version
> pbs_version = 18.1.0

#6

@adarsh Hi and thank you so much for your help. My apologies for the late reply.
Here’s the version of pbs running on the cluster

-bash-4.1$ pbsnodes --version
Version: 6.0.0.1
Commit: 21cc3d8b18424c77871cd95265148deb97993d14

#7

Please note, the flavour and version of the scheduler is not PBS Pro OSS.
The PBS Pro OSS version history is 13.x.x, 14.x.x, and the latest is 18.1.2.
Could you please check whether you have downloaded the software from the link below?

https://www.pbspro.org/Download.aspx#download



#8

Hi @adarsh, thank you for your clarifications. I don’t have any admin access on this system, and I haven’t installed anything. This is a shared environment which uses SGE and PBS, but to be honest I don’t know if it’s PBS Pro. The other question that I have is whether PBS Pro and PBS are different pieces of software.


#9

Both are the same piece of software with respect to scheduling, workload management, and job management, but:
PBS Professional commercial (licensed): www.pbsworks.com
PBS Professional OSS (open source): www.pbspro.org

Please run these commands on the headnode and check the output:
rpm -qa | grep pbs
rpm -qa | grep sge
rpm -qa | grep torque


#10

I should have noticed this earlier: your cluster is not running PBS Pro OSS, it is running Torque.
Please use the above snippet and search on Google.


#11

@adarsh thanks so much. The commands don’t output anything. I suppose that Torque is commercial since it’s from the Adaptive Computing company. Torque is installed in:

-bash-4.1$ which pbsnodes
/opt/torque/torque/bin/pbsnodes

#12

Thank you @kirk for this information. This answers everything related to the information displayed by pbsnodes.


#13

@adarsh thank you for your patience with all my stupid questions. If you don’t mind me asking one last stupid question:
I have the following scenario, where I want to run a Spark job across N nodes.
To do that, I have to log in to node 0 and set it up as the master by executing a particular Spark command, and finally log in to the remaining N-1 nodes and make them workers by executing another Spark command.

The way that I’ve found is using pbsdsh.

  1. Is there any other way to log in to nodes selectively? Plain ssh doesn’t work, since it requires a password which is apparently different from the password I use to ssh into the cluster’s login node.

  2. The main issue that I have with pbsdsh is that it doesn’t forward any environment variables to the nodes on which you try to execute a command. Is there any way around this?

Minimal working example:

#!/bin/bash -l 

#PBS -l nodes=1:ppn=12

# set the walltime in hours:minutes:seconds
#PBS -l walltime=00:30:00

#PBS -N TestJob

# join the standard output and error files
#PBS -j oe

# send me email when job begins (b), ends (e) and/or aborts (a)
#PBS -m bea

# inherit the current environment
#PBS -V

#load a module file 
module purge
module load Miniconda
source activate conda-environment

cd ${PBS_O_WORKDIR}

pbsdsh -n 0 -- /bin/bash -c "module load Miniconda; source activate conda-environment; which python;"

exit 0

Output from pbsdsh command:

module: command not found
activate: No such file or directory
/usr/bin/python

What am I missing here? Am I doing something wrong?

Many thanks in advance!


#14

#15

+1 @mkaro

No problem @kirk, please keep them coming. It would be good to see PBS Pro OSS (in the near future) on your cluster :slight_smile:

Please try the below

  1. Save the above script in a common location accessible by all the compute nodes (main_pbs.sh).
  2. Save the list of commands to be called by pbsdsh in another script in the same location
    (client_pbs.sh):
    cat client_pbs.sh
    #!/bin/bash -l
    module load Miniconda
    source activate conda-environment
    which python
    hostname

  3. Add the below line (lines, in the case of multiple nodes):
    pbsdsh -n 0 -- /bin/bash -c "/shared/path/to/client_pbs.sh" # this will execute on the mother superior
    # pbsdsh -w node2,node3 "/shared/path/to/somescript"

node2,node3 is the list of sister nodes, which excludes the mother superior; you need to create this list from $PBS_NODEFILE.
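One way to build that comma-separated sister-node list from $PBS_NODEFILE is sketched below. The node file is faked for illustration (node1/node2/node3 are placeholder hostnames), since $PBS_NODEFILE only exists inside a running job:

```shell
# Inside a real job you would use NODEFILE="$PBS_NODEFILE".
# Faked here: 3 nodes with 2 core slots each.
NODEFILE=$(mktemp)
printf 'node1\nnode1\nnode2\nnode2\nnode3\nnode3\n' > "$NODEFILE"

# Unique node names, preserving first-seen order
# (the mother superior is listed first).
UNIQ_NODES=$(awk '!seen[$0]++' "$NODEFILE")

# First unique entry: the mother superior.
MASTER=$(echo "$UNIQ_NODES" | head -n 1)

# Remaining entries, comma-joined, ready for pbsdsh -w node2,node3 ...
SISTERS=$(echo "$UNIQ_NODES" | tail -n +2 | paste -sd, -)

echo "master=$MASTER sisters=$SISTERS"
rm -f "$NODEFILE"
```

With that in place, the master command can target $MASTER via pbsdsh -n 0 and the worker command can target the sisters via pbsdsh -w $SISTERS.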

Passwordless SSH needs to be configured for all users; it should work from server to node, node to server, and node to node (StrictHostKeyChecking should be turned off). This is a basic requirement.


#16

@adarsh Thank you! Your suggestion helped me a lot. I also found that it is possible to pass environment variables while submitting the job, using qsub -V BASH_ENV=~/.bash_profile "your job";

I managed to achieve what I wanted with this simple for loop, after your suggestions:

i=0
for node in $UNIQ_NODES; do
    if [ $i == 0 ]; then
        pbsdsh -u -h $node "$PBS_O_WORKDIR/master.sh"
    else
        pbsdsh -u -h $node "$PBS_O_WORKDIR/worker.sh"
    fi
    i=$((i + 1))
done

After the above snippet there’s a simple Python script to verify that things have been set up properly:

python verify.py

There’s one thing that still puzzles me. It seems that the for loop gets executed properly and things are set up as they should be, but the Python script is never executed. The job gets cancelled when it reaches the requested walltime, and after going through the debug messages everything seems to be set up properly, except that the Python script is never executed.

Any ideas on why that might happen?


#17

Nice one! I like your script :slight_smile:
I hope you got those $UNIQ_NODES from $PBS_NODEFILE.

[quote="kirk, post:16, topic:1185"]
qsub -V
[/quote]
FYI:
Using qsub -V is not a best practice. If the environment variable list set on the system is too large, it is too much information to store in the PBS datastore. It is better to use qsub -v VAR1=a,VAR2=b,VAR3=c or to set specific variables in the script.

Please try (assuming verify.py is in the same location as master.sh):
/usr/bin/python $PBS_O_WORKDIR/verify.py

Note: make sure any script that you would like to execute is accessible by all the nodes and is available in the same working directory.

If you submit a job without a walltime, then the default walltime set on the job(s) is 5 years.


#18

Thanks @adarsh

Exactly, after piping into uniq.

That is good to know. Thank you!

Indeed, verify.py is in the same folder as master.sh. The only thing I can see is that I forgot to prepend $PBS_O_WORKDIR. Would that explain why it is not being executed, though?

Pretty much all the scripts are in the same directory, a.k.a. $PBS_O_WORKDIR, from which I launch the job.

If only! The max we get on the uni cluster is 15 days.

Let me give it a try with $PBS_O_WORKDIR prepended, and I'll get back to you with any updates.