Trying to get CUDA_VISIBLE_DEVICES set with hook


#1

I’ve set up Advanced Scheduling for GPUs as recommended by the 18.2 Admin Guide, and the scheduling of the GPUs seems to be working. However, the environment variable CUDA_VISIBLE_DEVICES is not being set, which I think it should be according to section 16.5.1 of the Admin Guide.

I’ve also set up a hook of the following form that is set to run on the exit:

{
    "cgroup_prefix"         : "pbspro",
    "exclude_hosts"         : ["ds-rnd-gpu"],
    "exclude_vntypes"       : [],
    "run_only_on_hosts"     : [],
    "periodic_resc_update"  : true,
    "vnode_per_numa_node"   : false,
    "online_offlined_nodes" : true,
    "use_hyperthreads"      : false,
    "ncpus_are_cores"       : false,
    "nvidia-smi"            : "/usr/bin/nvidia-smi",
    "cgroup" : {
        "devices" : {
            "enabled"         : true,
            "exclude_hosts"   : ["ds-rnd-gpu"],
            "exclude_vntypes" : [],
            "allow"           : [
                "b *:* rwm",
                "c *:* rwm",
                ["nvidiactl", "rwm", "*"],
                ["nvidia-uvm", "rwm"]
            ]
        }
    }
}
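Since a configuration file of this form is meant to be parsed as JSON, a quick syntax check before importing it can rule out one class of problems. The following is only a minimal sketch: `check_hook_config` is a hypothetical helper name, and the two fragments are made-up examples rather than excerpts from a real installation.

```python
import json


def check_hook_config(text):
    """Return (True, parsed) if text is valid JSON, else (False, reason)."""
    try:
        return True, json.loads(text)
    except json.JSONDecodeError as err:
        return False, "line %d, column %d: %s" % (err.lineno, err.colno, err.msg)


# Hypothetical fragments used only to exercise the check.
good = '{"cgroup": {"devices": {"enabled": true}}}'
bad = '{"cgroup": {"devices": {"enabled": true},}}'  # trailing comma

ok_good, _ = check_hook_config(good)
ok_bad, why = check_hook_config(bad)
print("good fragment valid:", ok_good)
print("bad fragment valid:", ok_bad, "-", why)
```

Unquoted string values and trailing commas are the usual culprits when a hand-edited file fails this kind of check.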

I’m pretty new to hooks, so I’m wondering if there is something more I need to do to make this happen.

Any advice would be appreciated.

P.S. I do see a message in the mom_log

09/21/2018 09:42:16;0004;pbs_mom;Act;get_wm;libmemacct.so.1 not found

Is that perhaps related?

Joe Hellmers


#2

Hi Joe,

Thanks for your questions. The message you’re seeing about libmemacct is benign. If you were on an Altix system where libmemacct.so is present, you wouldn’t see it. It may safely be ignored.

The cgroup hook sets CUDA_VISIBLE_DEVICES in the job’s environment. Your configuration looks correct, but some versions of nvidia-smi report information about the device IDs differently. There is another thread where this was discussed here: GPU Access Limited by CGroup

Take a look at the pbs_mom logs in /var/spool/pbs/mom_logs and see if there is anything helpful there. If not, you can increase the verbosity of the logs by adding a line to /var/spool/pbs/mom_priv/config that looks like this:
$logevent 0xffff

You will need to restart pbs_mom so that it rereads its configuration. Try running another job and see if the logs provide any clues. Feel free to post excerpts here if you need additional help.
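With the verbosity turned up the logs get noisy, so it can help to filter them. Here is a rough sketch that splits a mom log line on semicolons, based only on the shape of the line quoted earlier in this thread. The field names (`timestamp`, `event_code`, `daemon`, `obj_type`, `obj_name`, `message`) are my own labels, not official terminology, and the helper names are hypothetical.

```python
from collections import namedtuple

# Field names are illustrative labels inferred from the sample line.
MomLogEntry = namedtuple(
    "MomLogEntry", "timestamp event_code daemon obj_type obj_name message"
)


def parse_mom_log_line(line):
    """Split one log line into six fields; return None if it does not fit."""
    parts = line.rstrip("\n").split(";", 5)  # the message may itself contain ';'
    if len(parts) != 6:
        return None
    return MomLogEntry(*parts)


def entries_of_type(lines, obj_type):
    """Keep only the parsed entries whose fourth field equals obj_type."""
    parsed = (parse_mom_log_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry.obj_type == obj_type]


sample = "09/21/2018 09:42:16;0004;pbs_mom;Act;get_wm;libmemacct.so.1 not found"
entry = parse_mom_log_line(sample)
print(entry.daemon, "|", entry.obj_type, "|", entry.message)
```

To use it on a real log, pass the lines of a file under /var/spool/pbs/mom_logs to `entries_of_type` with whichever fourth-field value you are interested in.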

Thanks,

Mike


#3

Thanks for the help Mike!

After setting the log verbosity higher, I see the following message in the mom log, which indicates the hook isn’t working:

09/21/2018 11:10:46;0400;pbs_mom;Hook;dsrndgpuhook;Hook has no script content. Skipping hook.

When I look in the file /var/spool/pbs/mom_priv/hooks/dsrndgpuhook.CF on the mom host, I see exactly what I have above.

When I look at the file dsrndgpuhook.HK in the same folder on that server, I see:

hook_name=dsrndgpuhook
event=execjob_launch

I’ve removed this hook and added it back in a few times already as well as reimported the json file a couple of times.

Is there some other process I need to follow to get the hook sorted out on the GPU compute nodes?

Joe


#4

Glad to help. While you found the directory where the hooks are stored, you should not modify those files. You must use qmgr to import/export hooks and manage their configuration so that the changes get propagated to your execution hosts where pbs_mom runs. See chapter 3 of the PBS Pro admin guide located here: https://www.pbsworks.com/SupportGT.aspx?d=PBS-Professional,-Documentation

There should be a pbs_cgroups hook already present…

# qmgr -c "list hook"
Hook pbs_cgroups
    type = site
    enabled = false
    event = execjob_begin,execjob_epilogue,execjob_end,execjob_launch,
        execjob_attach,
        exechost_periodic,
        exechost_startup
    user = pbsadmin
    alarm = 90
    freq = 120
    order = 100
    debug = false
    fail_action = offline_vnodes

If you need to re-import it, you may find it here: https://github.com/PBSPro/pbspro/tree/master/src/hooks/cgroups
In this case, you want the most up-to-date hook on the master branch.

Go ahead and fiddle around with it and let us know if you have questions.

Thanks,

Mike


#5

I don’t have a pbs_cgroups hook.

FYI, I’m actually using PBS Pro 14.2. Is this hook something that is only available with Open PBS 18.x?

I’ve stood up a very small Open PBS 18.2 VM cluster, but I don’t have any GPUs on that one.

Can I just copy the hook from the Open PBS cluster to the PBS Pro 14.2 cluster?

The listing you show has enabled = false. Should this hook be enabled?


#6

The cgroups hook configuration file you listed in your initial post led me to believe you were using the cgroups hook supplied with the PBS Pro package. I made a poor assumption. In this case, if all you want to do is set CUDA_VISIBLE_DEVICES, the cgroups hook may be overkill. You may still want to take a look at how it gets set in the cgroups hook. Search for CUDA_VISIBLE_DEVICES and you’ll see the commands that add it to the environment for the job. If you do choose to use the cgroups hook, it is documented in chapter 16 of the admin guide.
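For the "just set the variable" route, here is a rough sketch of what a standalone execjob_launch hook script could look like. It is not the cgroups hook's actual code. `lookup_assigned_gpus` is a hypothetical placeholder that you would have to fill in for your site, and the use of `event.env` to modify the job environment is an assumption you should confirm against the hooks guide for your PBS version.

```python
def format_cuda_visible_devices(gpu_ids):
    """Build the comma-separated value from GPU indices, sorted and de-duplicated."""
    return ",".join(str(i) for i in sorted(set(gpu_ids)))


def lookup_assigned_gpus(job):
    """Hypothetical placeholder: return the GPU indices assigned to this job.

    How to determine them is site-specific and is not covered by this sketch.
    """
    return []


def run_hook():
    try:
        import pbs  # only importable when PBS runs this file as a hook
    except ImportError:
        return False  # running outside PBS; nothing to do
    event = pbs.event()
    gpu_ids = lookup_assigned_gpus(event.job)
    if gpu_ids:
        # Assumption: the launch event exposes the job environment as event.env.
        event.env["CUDA_VISIBLE_DEVICES"] = format_cuda_visible_devices(gpu_ids)
    event.accept()
    return True


run_hook()
```

The formatting logic is kept in its own function so it can be tried outside of PBS; only `run_hook` touches the `pbs` module.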

You should be able to take the current cgroups hook from master and run it on a 14.x installation, but you will need to import it as though you wrote it yourself. And don’t forget to import the hook configuration file as well.
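To make the import steps concrete, here is a small sketch that only builds and prints the qmgr command lines for creating a hook and importing both its script and its configuration file. It does not run anything. The file paths are hypothetical, the event list is copied from the listing earlier in this thread, and whether to enable the hook right away is your call.

```python
import shlex


def qmgr_hook_commands(hook_name, script_path, config_path, events):
    """Return qmgr invocations (as argv lists) to create and populate a hook."""
    directives = [
        "create hook %s" % hook_name,
        'set hook %s event = "%s"' % (hook_name, ",".join(events)),
        "import hook %s application/x-python default %s" % (hook_name, script_path),
        "import hook %s application/x-config default %s" % (hook_name, config_path),
        "set hook %s enabled = true" % hook_name,
    ]
    return [["qmgr", "-c", directive] for directive in directives]


# Hypothetical paths; adjust to wherever the downloaded files were saved.
commands = qmgr_hook_commands(
    "pbs_cgroups",
    "/tmp/pbs_cgroups.PY",
    "/tmp/pbs_cgroups.CF",
    ["execjob_begin", "execjob_epilogue", "execjob_end", "execjob_launch",
     "execjob_attach", "exechost_periodic", "exechost_startup"],
)
for argv in commands:
    print(" ".join(shlex.quote(arg) for arg in argv))
```

A hook that has only had its configuration file imported, with no script, has nothing to execute, so both import steps matter.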

Thanks,

Mike


#7

FYI - If you’re running PBS Pro 14.2, feel free to contact your technical support representative. That is a commercial version of the product.


#8

Getting the pbs_cgroups hook and configuring it based on the documentation did the trick.

Thanks a bunch.


#9

Glad to hear you got it working! :+1: