Large job arrays fail to be completely scheduled


#1

Hi,

I’m submitting large job arrays (10^3, 10^4 subjobs). At some point scheduling stalls, with the following messages in the scheduler log:

05/23/2017 23:29:08;0040;pbs_sched;Job;28[8453].ip-172-16-255-9;Job run
05/23/2017 23:29:08;0080;pbs_sched;Job;28[].ip-172-16-255-9;Considering job to run
05/23/2017 23:29:08;0040;pbs_sched;Job;28[8454].ip-172-16-255-9;Job run
05/23/2017 23:29:08;0080;pbs_sched;Job;28[].ip-172-16-255-9;Considering job to run
05/23/2017 23:29:08;0040;pbs_sched;Job;28[8455].ip-172-16-255-9;Job run
05/23/2017 23:29:08;0080;pbs_sched;Job;28[].ip-172-16-255-9;Considering job to run
05/23/2017 23:29:08;0040;pbs_sched;Job;28[].ip-172-16-255-9;Failed to run: Request invalid for state of job (15016)
05/23/2017 23:29:08;0080;pbs_sched;Job;28[].ip-172-16-255-9;Considering job to run
05/23/2017 23:29:08;0040;pbs_sched;Job;28[].ip-172-16-255-9;Failed to run: Request invalid for state of job (15016)
05/23/2017 23:29:08;0080;pbs_sched;Job;28[].ip-172-16-255-9;Considering job to run

The “Failed to run: Request invalid for state of job (15016)” message goes on …
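
To see where an array stops being scheduled, the log can be tallied with a short script. This is only a minimal sketch: it assumes the semicolon-separated layout of the lines quoted above (timestamp;event code;daemon;object type;object id;message), and the names `parse_line` and `summarize` are hypothetical helpers, not part of PBS Pro.

```python
import re
from collections import Counter

# A few lines copied from the scheduler log excerpt above.
SAMPLE_LOG = """\
05/23/2017 23:29:08;0040;pbs_sched;Job;28[8454].ip-172-16-255-9;Job run
05/23/2017 23:29:08;0040;pbs_sched;Job;28[8455].ip-172-16-255-9;Job run
05/23/2017 23:29:08;0080;pbs_sched;Job;28[].ip-172-16-255-9;Considering job to run
05/23/2017 23:29:08;0040;pbs_sched;Job;28[].ip-172-16-255-9;Failed to run: Request invalid for state of job (15016)
05/23/2017 23:29:08;0040;pbs_sched;Job;28[].ip-172-16-255-9;Failed to run: Request invalid for state of job (15016)
"""

# Matches ids such as "28[8455].host" (a subjob) or "28[].host" (the array itself).
ARRAY_ID_RE = re.compile(r"^(\d+)\[(\d*)\]")


def parse_line(line):
    """Return array id, subjob index and message, or None if the line does not fit."""
    parts = line.strip().split(";", 5)
    if len(parts) != 6:
        return None
    _timestamp, _code, _daemon, _obj_type, obj_id, message = parts
    match = ARRAY_ID_RE.match(obj_id)
    if match is None:
        return None  # not a job array entry
    return {"array": match.group(1), "index": match.group(2), "message": message}


def summarize(lines):
    """Count subjob runs and failed run attempts per array, and track the last subjob run."""
    runs = Counter()
    failures = Counter()
    last_index = {}
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        if rec["message"] == "Job run" and rec["index"]:
            runs[rec["array"]] += 1
            last_index[rec["array"]] = int(rec["index"])
        elif rec["message"].startswith("Failed to run"):
            failures[rec["array"]] += 1
    return runs, failures, last_index


if __name__ == "__main__":
    runs, failures, last_index = summarize(SAMPLE_LOG.splitlines())
    for array in sorted(set(runs) | set(failures)):
        print(array, "runs:", runs[array], "failures:", failures[array],
              "last subjob run:", last_index.get(array))
```

To use it on a real log, pass the lines of an open scheduler log file to `summarize` instead of `SAMPLE_LOG.splitlines()`. The last subjob index reported for an array shows where scheduling stopped making progress.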

Has anyone seen this before?


#2

Update:
This appears to have been fixed by the commits in this pull request:
https://github.com/PBSPro/pbspro/pull/264/commits

I don’t see the issue using a build off the master branch.