GPU memory as a custom resource


#1

Hello,

I’m currently setting up PBS Pro 17.1.0 on our CentOS 6 cluster with GPUs.
I’d like different users to share the same GPU (i.e. run multiple processes on the same GPU simultaneously) as long as GPU memory is available. Users are asked to specify the amount of GPU memory they will use. I thought this could be realized by defining the GPU memory as a custom static vnode-level consumable shared resource.

Let’s assume that we have one host with two GPUs, each with 6GB of memory. I’ve defined one vnode per GPU:

  • host01
    – vnode0: 6GB available
    – vnode1: 6GB available
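
For reference, a per-GPU vnode layout like this can be described in a Mom “Version 2” vnode definition file, roughly as follows (a sketch; the mem/ncpus values here are assumptions, not my actual file):

```
$configversion 2
host01: resources_available.ncpus = 8
host01: resources_available.mem = 64gb
host01[0]: resources_available.ngpus = 1
host01[0]: resources_available.gpumem = 6gb
host01[1]: resources_available.ngpus = 1
host01[1]: resources_available.gpumem = 6gb
```

The file goes under mom_priv/config.d/ and is read when pbs_mom starts.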

When a user submits two jobs each requesting 4GB of GPU memory (e.g. running “qsub ./foo.sh -l gpumem=4gb” twice), they should be allocated to GPU0 and GPU1 respectively:

  • host01
    – vnode0: 4GB assigned to Job1, 2GB available
    – vnode1: 4GB assigned to Job2, 2GB available

In my current configuration, however, the jobs are actually allocated across multiple GPUs:

  • host01
    – vnode0: 4GB assigned to Job1, 2GB assigned to Job2
    – vnode1: 2GB assigned to Job2, 4GB available

So the question is: how can I configure a consumable shared resource (representing GPU memory) so that a job does not span multiple vnodes?

My current configuration is as follows:

# cat /var/spool/pbs/server_priv/resourcedef
ngpus type=long flag=hn
gpumem type=size flag=hn

# grep '^resources:' /var/spool/pbs/sched_priv/sched_config
resources: "ncpus, mem, arch, host, vnode, aoe, eoe, ngpus, gpumem"

$ pbsnodes -av
host01
     Mom = host01
     Port = 15002
     pbs_version = 17.1.0
     ntype = PBS
     state = free
     pcpus = 8
     resources_available.arch = linux
     resources_available.host = host01
     resources_available.mem = 65936176kb
     resources_available.ncpus = 8
     resources_available.ngpus = 0
     resources_available.vnode = host01
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.gpumem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.ngpus = 0
     resources_assigned.vmem = 0kb
     resv_enable = True
     sharing = default_shared
     in_multivnode_host = 1
     license = l

host01[0]
     Mom = host01
     Port = 15002
     pbs_version = 17.1.0
     ntype = PBS
     state = free
     pcpus = 1
     resources_available.arch = linux
     resources_available.gpumem = 6077mb
     resources_available.host = host01
     resources_available.ngpus = 1
     resources_available.vnode = host01[0]
     resources_assigned.accelerator_memory = 0kb
     resources_assigned.gpumem = 0kb
     resources_assigned.mem = 0kb
     resources_assigned.naccelerators = 0
     resources_assigned.ncpus = 0
     resources_assigned.ngpus = 0
     resources_assigned.vmem = 0kb
     resv_enable = True
     sharing = default_shared
     in_multivnode_host = 1
     license = l

host01[1]
     <same as above>

Any comments and suggestions would be deeply appreciated. Thank you for your cooperation!


#2

The way to handle this would be to use the cgroups hook. It can add the devices to a cgroup, and the job would then only have access to the resources that it requested. For CentOS 6, the cgroup hook that was checked in around last July should work for you (it needs some modifications for CentOS 7 due to the systemd changes). Please let me know if that works for you.


#3

Hi Jon,

Thank you for your kind advice.
Following your advice, I tried the pbs_cgroups hook in addition to my current configuration.
In short, it did not work.
After setting up the pbs_cgroups hook (config shown below), I ran “qsub ./foo.sh -lselect=ngpus=1”, but the job did not run; it just stayed in the queue forever.

Honestly speaking, I am wondering how cgroups could solve my problem.
What I want is not to isolate an assigned resource from other jobs, but to confine the assignment of a (single kind of) consumable resource to a single vnode.
I would be happy if you could give me a concrete idea of how to configure pbs_cgroups for this purpose.
Any workarounds other than pbs_cgroups would also be appreciated.

My current configuration is as follows (in addition to those in the first post):

# qmgr -c "list hook"
Hook pbs_cgroups
    type = site
    enabled = true
    event = execjob_begin,execjob_epilogue,execjob_end,execjob_launch,
        execjob_attach,
        exechost_periodic,
        exechost_startup
    user = pbsadmin
    alarm = 90
    freq = 120
    order = 100
    debug = false
    fail_action = offline_vnodes

# cat /var/spool/pbs/server_priv/hooks/pbs_cgroups.CF
{
    "cgroup_prefix"         : "pbspro",
    "exclude_hosts"         : [],
    "exclude_vntypes"       : ["no_cgroups"],
    "run_only_on_hosts"     : [],
    "periodic_resc_update"  : true,
    "vnode_per_numa_node"   : false,
    "online_offlined_nodes" : true,
    "use_hyperthreads"      : true,
    "cgroup" : {
        "cpuacct" : {
            "enabled"         : false,
            "exclude_hosts"   : [],
            "exclude_vntypes" : []
        },
        "cpuset" : {
            "enabled"         : false,
            "exclude_hosts"   : [],
            "exclude_vntypes" : []
        },
        "devices" : {
            "enabled"         : true,
            "exclude_hosts"   : [],
            "exclude_vntypes" : [],
            "allow"           : [
                "b *:* rwm",
                "c *:* rwm",
                ["nvidiactl", "rwm", "*"],
                ["nvidia-uvm", "rwm"]
            ]
        },
        "hugetlb" : {
            "enabled"         : false,
            "exclude_hosts"   : [],
            "exclude_vntypes" : [],
            "default"         : "0MB",
            "reserve_percent" : "0",
            "reserve_amount"  : "0MB"
        },
        "memory" : {
            "enabled"         : false,
            "exclude_hosts"   : [],
            "exclude_vntypes" : [],
            "soft_limit"      : false,
            "default"         : "256MB",
            "reserve_percent" : "0",
            "reserve_amount"  : "1GB"
        },
        "memsw" : {
            "enabled"         : false,
            "exclude_hosts"   : [],
            "exclude_vntypes" : [],
            "default"         : "256MB",
            "reserve_percent" : "0",
            "reserve_amount"  : "1GB"
        }
    }
}

Mom’s log after running “qsub ./foo.sh -lselect=ngpus=1” looks like this:

# cat /var/spool/pbs/mom_logs/20171011

10/11/2017 18:46:27;0002;pbs_mom;Svr;Log;Log opened
10/11/2017 18:46:27;0002;pbs_mom;Svr;pbs_mom;pbs_version=17.1.0
10/11/2017 18:46:27;0002;pbs_mom;Svr;pbs_mom;pbs_build=mach=N/A:security=N/A:configure_args=N/A
10/11/2017 18:46:27;0002;pbs_mom;Svr;pbs_mom;hostname=host01;pbs_leaf_name=N/A;pbs_mom_node_name=N/A
10/11/2017 18:46:27;0002;pbs_mom;Svr;pbs_mom;ipv4 interface lo: localhost localhost.localdomain localhost4 localhost4.localdomain4
10/11/2017 18:46:27;0002;pbs_mom;Svr;pbs_mom;ipv4 interface bond0: host01
10/11/2017 18:46:27;0002;pbs_mom;Svr;pbs_mom;ipv6 interface lo: localhost localhost.localdomain localhost6 localhost6.localdomain6
10/11/2017 18:46:27;0100;pbs_mom;Svr;parse_config;file config
10/11/2017 18:46:27;0002;pbs_mom;Svr;pbs_mom;Adding IP address ***.***.***.*** as authorized
10/11/2017 18:46:27;0002;pbs_mom;n/a;set_restrict_user_maxsys;setting 499
10/11/2017 18:46:27;0100;pbs_mom;Svr;parse_config;file /var/spool/pbs/mom_priv/config.d/host01.vnodes
10/11/2017 18:46:27;0002;pbs_mom;n/a;read_config;max_check_poll = 120, min_check_poll = 10
10/11/2017 18:46:27;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP set to use reserved port authentication
10/11/2017 18:46:27;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Initializing TPP transport Layer
10/11/2017 18:46:27;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Max files allowed = 204800
10/11/2017 18:46:27;0d80;pbs_mom;TPP;pbs_mom(Main Thread);TPP initialization done
10/11/2017 18:46:27;0c06;pbs_mom;TPP;pbs_mom(Main Thread);Single pbs_comm configured, TPP Fault tolerant mode disabled
10/11/2017 18:46:27;0d80;pbs_mom;TPP;pbs_mom(Main Thread);Connecting to pbs_comm host01
10/11/2017 18:46:27;0002;pbs_mom;Svr;pbs_mom;Adding IP address 127.0.0.1 as authorized
10/11/2017 18:46:27;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Registering address ***.***.***.***:15003 to pbs_comm
10/11/2017 18:46:27;0c06;pbs_mom;TPP;pbs_mom(Thread 0);Connected to pbs_comm host01
10/11/2017 18:46:27;0002;pbs_mom;Svr;set_checkpoint_path;Using default checkpoint path.
10/11/2017 18:46:27;0002;pbs_mom;Svr;set_checkpoint_path;Setting checkpoint path to /var/spool/pbs/checkpoint/
10/11/2017 18:46:27;0086;pbs_mom;Svr;pbs_mom;Found hook pbs_cgroups type=site
10/11/2017 18:46:27;0086;pbs_mom;Svr;pbs_mom;Found hook PBS_alps_inventory_check type=pbs
10/11/2017 18:46:27;0086;pbs_mom;Svr;pbs_mom;Found hook PBS_power type=pbs
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;ALLHOOKS hook[0] = {pbs_cgroups, order=100, type=0, enabled=1 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,exechost_periodic,exechost_startup), alarm=90, freq=120}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;ALLHOOKS hook[1] = {PBS_alps_inventory_check, order=1, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(exechost_periodic), alarm=90, freq=300}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;ALLHOOKS hook[2] = {PBS_power, order=2000, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(execjob_begin,execjob_prologue,execjob_epilogue,execjob_end,exechost_periodic,exechost_startup), alarm=180, freq=300}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;execjob_begin hook[0] = {pbs_cgroups, order=100, type=0, enabled=1 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,exechost_periodic,exechost_startup), alarm=90, freq=120}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;execjob_begin hook[1] = {PBS_power, order=2000, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(execjob_begin,execjob_prologue,execjob_epilogue,execjob_end,exechost_periodic,exechost_startup), alarm=180, freq=300}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;execjob_prologue hook[0] = {PBS_power, order=2000, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(execjob_begin,execjob_prologue,execjob_epilogue,execjob_end,exechost_periodic,exechost_startup), alarm=180, freq=300}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;execjob_launch hook[0] = {pbs_cgroups, order=100, type=0, enabled=1 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,exechost_periodic,exechost_startup), alarm=90, freq=120}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;execjob_epilogue hook[0] = {pbs_cgroups, order=100, type=0, enabled=1 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,exechost_periodic,exechost_startup), alarm=90, freq=120}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;execjob_epilogue hook[1] = {PBS_power, order=2000, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(execjob_begin,execjob_prologue,execjob_epilogue,execjob_end,exechost_periodic,exechost_startup), alarm=180, freq=300}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;execjob_end hook[0] = {pbs_cgroups, order=100, type=0, enabled=1 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,exechost_periodic,exechost_startup), alarm=90, freq=120}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;execjob_end hook[1] = {PBS_power, order=2000, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(execjob_begin,execjob_prologue,execjob_epilogue,execjob_end,exechost_periodic,exechost_startup), alarm=180, freq=300}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;exechost_periodic hook[0] = {PBS_alps_inventory_check, order=1, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(exechost_periodic), alarm=90, freq=300}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;exechost_periodic hook[1] = {pbs_cgroups, order=100, type=0, enabled=1 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,exechost_periodic,exechost_startup), alarm=90, freq=120}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;exechost_periodic hook[2] = {PBS_power, order=2000, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(execjob_begin,execjob_prologue,execjob_epilogue,execjob_end,exechost_periodic,exechost_startup), alarm=180, freq=300}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;exechost_startup hook[0] = {pbs_cgroups, order=100, type=0, enabled=1 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,exechost_periodic,exechost_startup), alarm=90, freq=120}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;exechost_startup hook[1] = {PBS_power, order=2000, type=1, enabled=0 user=0, debug=(0) fail_action=(1), event=(execjob_begin,execjob_prologue,execjob_epilogue,execjob_end,exechost_periodic,exechost_startup), alarm=180, freq=300}
10/11/2017 18:46:27;0080;pbs_mom;Hook;print_hook;execjob_attach hook[0] = {pbs_cgroups, order=100, type=0, enabled=1 user=0, debug=(0) fail_action=(2), event=(execjob_begin,execjob_epilogue,execjob_end,execjob_launch,execjob_attach,exechost_periodic,exechost_startup), alarm=90, freq=120}
10/11/2017 18:46:27;0002;pbs_mom;n/a;ncpus;hyperthreading enabled
10/11/2017 18:46:27;0002;pbs_mom;n/a;initialize;pcpus=24, OS reports 24 cpu(s)
10/11/2017 18:46:27;0100;pbs_python;Hook;pbs_python;main: Hook name is pbs_cgroups
10/11/2017 18:46:27;0100;pbs_python;Hook;pbs_python;main: Event type is exechost_startup
10/11/2017 18:46:27;0100;pbs_python;Hook;pbs_python;main: Hook utility class instantiated
10/11/2017 18:46:27;0100;pbs_python;Hook;pbs_python;__get_vnode_type: Could not determine vntype
10/11/2017 18:46:27;0100;pbs_python;Hook;pbs_python;main: Cgroup utility class instantiated
10/11/2017 18:46:27;0100;pbs_python;Hook;pbs_python;GPUs: {'nvidia0': '0000:04:00.0'}
10/11/2017 18:46:27;0100;pbs_python;Hook;pbs_python;create_vnodes: vnode_per_numa_node is disabled
10/11/2017 18:46:27;0100;pbs_python;Hook;pbs_python;main: Hook handler returned success
10/11/2017 18:46:27;0080;pbs_python;Hook;pbs_python;Elapsed time: 0.4408
10/11/2017 18:46:27;0006;pbs_mom;Fil;pbs_mom;Version 17.1.0, started, initialization type = 0
10/11/2017 18:46:27;0002;pbs_mom;Svr;pbs_mom;Mom pid = 11699 ready, using ports Server:15001 MOM:15002 RM:15003
10/11/2017 18:46:27;0d80;pbs_mom;TPP;pbs_mom(Main Thread);net restore handler called
10/11/2017 18:46:27;0002;pbs_mom;Svr;pbs_mom;Restart sent to server at host01:15001
10/11/2017 18:46:27;0d80;pbs_mom;TPP;pbs_mom(Thread 0);sd 0, Received noroute to dest ***.***.***.***:15001, msg="tfd=18, pbs_comm:***.***.***.***:17001: Dest not found"
10/11/2017 18:46:27;0d80;pbs_mom;TPP;pbs_mom(Thread 0);sd 0, Received noroute to dest ***.***.***.***:15001, msg="tfd=18, pbs_comm:***.***.***.***:17001: Dest not found"

(Note: IP addresses are masked)

Thank you for your kind cooperation!


#4

On re-reading your description above, the current cgroup hook would not be a solution if you want users to share the same GPU. As to why your job is not starting, I am not sure; I would need additional logs (add this line to the mom config: logevent 0xffffffff) to determine what is happening.
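
In case it helps, that logging line goes in the Mom configuration file (I believe the directive takes a leading “$”; the path below is the default install location, so adjust if yours differs):

```
# in /var/spool/pbs/mom_priv/config
$logevent 0xffffffff
```

Then send pbs_mom a HUP (or restart it) so the config is re-read.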


#5

Hi jon,

Thank you for your concern and I’m sorry for my delayed reply.

As for my initial problem, I’ve written my own hook that appends “-lplace=shared:group=vnode” when a user requests gpumem, and I’ve set “sharing = default_excl” on each vnode. It works as I expected. :smiley:
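
For anyone trying the same thing, the decision logic of my hook boils down to the sketch below. This is plain Python with an illustrative function name, not my actual hook source; in the real hook the same rule runs in a queuejob event and the result is assigned back to the job via the PBS hook API (e.g. as pbs.place(...) on Resource_List["place"]):

```python
# Sketch of the placement rule (function name and dict-based interface
# are illustrative).  In a real hook, pbs.event().job.Resource_List
# would be inspected instead of a plain dict.

def place_for_request(resource_list):
    """Given a dict-like view of a job's Resource_List, return the
    place spec that confines a gpumem request to a single vnode."""
    if "gpumem" not in resource_list:
        # No GPU memory requested: leave the user's placement untouched.
        return resource_list.get("place")
    # "group=vnode" keeps the whole job on one vnode, so gpumem can
    # never span GPUs; "shared" lets several gpumem jobs coexist there
    # despite the vnodes' sharing = default_excl setting.
    return "shared:group=vnode"
```

With sharing = default_excl on the vnodes, jobs that do not request gpumem get a vnode to themselves, while gpumem jobs explicitly opt into sharing.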

A remaining concern is that a user can only request gpumem on a single GPU, because the placement is confined to a single vnode with this setting.
There seems to be no way to request partial resources on different vnodes, e.g. 2GB on GPU1 and 4GB on GPU2.
We currently have no real use case of this kind, so I’m OK with it, but I’m curious whether there is a workaround or whether this is an essential limitation of PBS Pro.

As for the issue of my job not starting with cgroups, I’m sorry, but I have lost the logs and have no time to reproduce it.
I’ll report it in detail the next time I hit the same problem.

Best,


#6

Just a quick note: -l place=group=vnode will force a job to be placed on a single node. While that will work, it makes the scheduler do more work than it needs to. The suggested method for running a job on a single node is -lplace=pack.
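
For concreteness, the two forms would look something like this on the command line (script name and chunk request are just examples):

```
$ qsub -l select=1:ngpus=1 -l place=pack ./foo.sh
$ qsub -l select=1:ngpus=1 -l place=group=vnode ./foo.sh
```

With pack the scheduler only has to find one node for all chunks; with group=vnode it additionally has to group chunks by the value of the vnode resource.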

Bhroam


#7

Hi bhroam,

Thank you for your comment, and I’m very sorry for my long-delayed reply!

According to Administrator’s Guide 4.8.6 (AG-97), I understand that “place=pack” places a job on a single host, whereas I want to place a job on a single vnode.
Though I’m satisfied with my current config, I’ll experiment with your suggested setting later.

Best,