We are seeing a lot of errors that look like this in the mom logs:
pbs_mom; Processing error in pbs_cgroups handling execjob_begin event for job
CgroupProcessingError ('Failed to assign resources',)
We discovered them when a lot of jobs were put into H state after repeatedly trying and failing to launch.
What’s going on here? Why is the scheduler giving jobs to machines that don’t have the resources and how can we stop it?