Preemption optimization - phase 1


#21

Thanks for explaining, Bhroam. I think it might be worth exploring how often Step 1 fails (preempting jobs on the first try). If Step 1 passes in a single try in most cases, it might be worth optimizing for the common case.

If we want to maintain the robustness, here’s a bad idea that you will hate: when Step 1 fails, the server sets an invisible attribute on the highp job listing the jobs that it could not preempt, so that in the next cycle, when the scheduler tries to preempt jobs for the highp job again, it ignores the jobs listed in that attribute.
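To make the idea concrete, here is a rough Python sketch. All names here are hypothetical (the real server and scheduler are C code); it only illustrates the bookkeeping being proposed.

```python
# Sketch of the "invisible attribute" idea (hypothetical names).
# After a failed preemption attempt, the server records which jobs it
# could not preempt on the high-priority job; in the next cycle the
# scheduler skips those jobs when picking preemption candidates.

def record_failed_preemption(highp_job, failed_jobs):
    """Server side: remember which jobs could not be preempted."""
    highp_job.setdefault("failed_preempt_list", set()).update(failed_jobs)

def pick_candidates(highp_job, low_priority_jobs):
    """Scheduler side: skip jobs that already failed to preempt."""
    skip = highp_job.get("failed_preempt_list", set())
    return [j for j in low_priority_jobs if j not in skip]

highp = {}
record_failed_preemption(highp, ["17.svr", "23.svr"])
candidates = pick_candidates(highp, ["12.svr", "17.svr", "31.svr"])
# "17.svr" is skipped on the retry
```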

We could also just try to preempt the jobs again without any information from the past (except maybe a max_preempt_attempts); maybe the world changes enough that we can find the right set next time, or maybe the job will simply run next time. So, we could do a POC to see how the wait time of highp jobs is affected if we simplify things for better performance.


#22

Has there been any analysis into how often Step 1 fails? Was the POC Ravi mentions ever tried?

@Bhroam I’m not sure I understand how this is less “robust” than running regular jobs. In either case if there is a failure, whether it be a node down for a regular job or a job that fails to be preempted, the job that’s trying to run goes back in the queue until the next cycle.


#23

In the case of running jobs, the scheduler does not find out that a job failed to run until the next cycle; the job is returned to the queue on the server. For running jobs, this is perfectly fine. All that matters is that the scheduler assumes the resources used by those jobs are unavailable, so it simply has fewer resources to run other jobs. No oversubscription happens in this case.

In the case of preemption, we are going in the opposite direction: we are freeing up resources. If a preemption fails, those resources are still in use. If the scheduler just assumes all jobs were preempted correctly, it will run the high-priority job on those still-used resources. If any of the low-priority jobs failed to be preempted, we now have oversubscription.
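The asymmetry can be shown with a toy accounting example (hypothetical numbers and names; real PBS tracks usage per vnode in C):

```python
# Toy model of the resource-accounting asymmetry described above.
# Only jobs that truly stopped give their CPUs back; if the scheduler
# assumes every preemption succeeded, it overcounts free CPUs.

def free_cpus_after_preemption(node_ncpus, running, preempt_results):
    """Count CPUs actually free after a preemption round.

    running maps job -> CPUs used; preempt_results maps job -> True
    if the job was really preempted.
    """
    still_used = sum(cpus for job, cpus in running.items()
                     if not preempt_results.get(job, False))
    return node_ncpus - still_used

running = {"low1": 4, "low2": 4}         # two 4-CPU jobs on an 8-CPU node
results = {"low1": True, "low2": False}  # low2 failed to preempt

free = free_cpus_after_preemption(8, running, results)
# Only 4 CPUs are actually free. If the scheduler assumed both
# preemptions succeeded and started an 8-CPU high-priority job, the
# node would be oversubscribed by 4 CPUs.
```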

As for the analysis, I don’t believe this has made any further progress in the last month. @prakashcv13 can correct me if I am wrong.

Bhroam


#24

Hi @bhroam, @smgoosen - I am working on the POC to compare the before-and-after performance. The IFL that I am currently implementing is as per the design that is already posted. I will be able to post the findings by end of next week.

Coming to the question of analyzing how often Step 1 fails: I have not worked on that, because I feel it is better to focus on transferring only the job-preemption logic to the server.


#25

I take it back, @arungrover is working on making the scheduler choose its set of preemption candidates faster. We occasionally work on scalability of the scheduler, and this time we’re working on preemption. It should be checked in soon.


#26

Hi All,

I finally have some information to share. After shifting the job-preemption logic from the scheduler to the server and implementing the new batch request as per the proposal, I see an improvement in performance.

The implementation that I have done so far only suspends the jobs (yet to implement checkpointing and re-queueing).

The test that I performed submits a set of normal jobs that get preempted by one express job which uses all the ncpus in the complex. The test records the time taken to preempt the normal jobs. The number of normal jobs increases from 1 to 149.

I have attached the test script and the output to the Open Confluence page.

As can be seen, the time taken to preempt the jobs has been reduced by a good percentage.

Thanks,
Prakash


#27

Hey @prakashcv13

Thanks for doing the testing. I’m somewhat surprised we get a 3x speedup when preempting 150 jobs. I would have thought it would be smaller.

My suggestions are the following:

  1. Don’t put preempt_order in the IFL call itself. Just move it from the sched_config file to the scheduler object. That way both the server and the scheduler have access to it.
  2. The IFL calls shouldn’t return a string with data embedded in it. They should return a structure. Either create a new structure, or return a batch_status. It would be a somewhat hokey use of batch_status, since you’d have to report whether the preemption succeeded or failed as an “attribute”.
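To picture suggestion 2, here is a toy model of returning per-job results as a batch_status-like list rather than an encoded string. The field and attribute names are illustrative, not the real IFL API.

```python
# Illustrative model of a batch_status-style reply for preemption.
# Attribute names ("preempt_result", "preempt_method") are hypothetical.

class BatchStatus:
    """Minimal stand-in for one entry of an IFL batch_status list."""
    def __init__(self, name, attribs):
        self.name = name          # job id
        self.attribs = attribs    # attribute name -> value

def preemption_reply(results):
    """results maps job id -> preempt method used, or None on failure."""
    reply = []
    for jobid, method in results.items():
        attrs = {"preempt_result": "success" if method else "failure"}
        if method:
            attrs["preempt_method"] = method   # e.g. "S"uspend, "C"heckpoint
        reply.append(BatchStatus(jobid, attrs))
    return reply

# One job suspended successfully, one preemption failed.
reply = preemption_reply({"12.svr": "S", "17.svr": None})
```

The point is that success/failure and the method used travel as structured per-job data the caller can walk, instead of being parsed out of a string.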

#28

Hi @bhroam,

Thank you for going through the test and the results, and for the feedback. Below is my understanding:

As far as I understand how preemption works, the scheduler determines the preempt order for each job individually, based on the preempt_order value in sched_config. The server would not need the configuration setting itself, only the order in which to try preempting each job, which the scheduler dynamically “calculates” at the time of running a high-priority job.

The implementation that I have done so far is using a new structure in the union of the batch_reply structure.


#29

There is only one global preempt_order. It just affects a job differently depending on how much time is left in the job. My suggestion was to move preempt_order to the server; it would most likely work best in the sched object. Once the server has the preempt_order, it can calculate how much time is left and find the correct preempt order for a job.
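As a sketch of what that server-side calculation could look like: the code below assumes the sched_config format where method letters may be followed by percentage-of-walltime-remaining thresholds, e.g. “SCR 80 SC 50 C”. This is illustrative Python, not the actual scheduler code.

```python
# Illustrative parser for a preempt_order-style string, assuming
# "SCR 80 SC 50 C" means: 100%-80% of walltime remaining -> try
# Suspend, Checkpoint, Requeue; 80%-50% -> S, C; below 50% -> C.

def parse_preempt_order(spec):
    """Return (threshold_pct, methods) pairs, highest threshold first."""
    levels, methods, threshold = [], None, 100
    for tok in spec.split():
        if tok.isdigit():
            levels.append((threshold, methods))
            threshold = int(tok)
            methods = None
        else:
            methods = list(tok)
    levels.append((threshold, methods))
    return levels

def methods_for_job(levels, pct_time_left):
    """Pick the method list for a job with pct_time_left remaining."""
    chosen = levels[0][1]
    for threshold, methods in levels:
        if threshold >= pct_time_left:   # last match is the tightest band
            chosen = methods
    return chosen

levels = parse_preempt_order("SCR 80 SC 50 C")
# A job with 90% of its time left gets S, then C, then R tried;
# one with only 30% left gets checkpoint only.
```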

Can you update the document to reflect how you plan to implement this?

Thanks,
Bhroam


#30

Hi @bhroam

I am not sure we would gain any value by doing this. To me, preempt_order is more of a scheduler parameter, and your suggestion does keep it as one. If we just move the logic of determining the preemption order to the server, it will no longer be of any use as a scheduling configuration parameter. So, I am of the view that we should keep the logic in the scheduler itself.

Thanks,
Prakash


#31

The only purpose of preempt_order is to determine which preemption method is used for jobs to be preempted. If we go with your suggested method, the only thing the scheduler will use it for is to pass it to the server. Why have this pass-through? Why not move it to the server where it is needed? The scheduler won’t care any more once it isn’t determining what preemption method is used.

Bhroam


#32

Hi @bhroam,

I agree that the scheduler need not send preempt_order in the request. However, the scheduler uses the preempt method to determine the state of the job in the “_end” functions, so the response does need to include the preempt method.

Also, I propose that instead of making preempt_order a part of the scheduler object, why not move it completely to the server and make it something that can be set through qmgr?

Am I proposing something that would affect a lot of test scripts?

Thanks,
Prakash


#33

@prakashcv13
You are correct, the scheduler does need to know how each job was preempted. If a job was suspended and restrict_res_to_release_on_suspend is set, the scheduler needs to know not to release all of its resources. If the job was checkpointed, it needs to know to release its entire exec_vnode.

So yes, the return batch_status will need to tell the scheduler how the jobs were preempted.
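A sketch of the scheduler-side handling (hypothetical names; it only illustrates the distinction between the suspend and checkpoint cases described above):

```python
# Sketch: decide which of a preempted job's resources become free,
# based on the preempt method returned by the server.

def resources_to_release(preempt_method, exec_vnode_resources,
                         restricted_resources=None):
    """Return the subset of a job's resources that become free.

    A suspended job with restrict_res_to_release_on_suspend set
    releases only the listed resources; a checkpointed (requeued)
    job releases its whole exec_vnode.
    """
    if preempt_method == "C":      # checkpointed: job leaves the node
        return dict(exec_vnode_resources)
    if preempt_method == "S":      # suspended: maybe a restricted set
        if restricted_resources is None:
            return dict(exec_vnode_resources)
        return {r: v for r, v in exec_vnode_resources.items()
                if r in restricted_resources}
    return {}

usage = {"ncpus": 8, "mem": "16gb", "ngpus": 1}
# Suspended, with only ncpus listed for release:
released = resources_to_release("S", usage, restricted_resources={"ncpus"})
```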

I totally agree with moving preempt_order completely to the sched object. You will need to remember to handle upgrades: there is code in the scheduler which basically does a qmgr (the internal IFL equivalent of qmgr) on certain attributes during the first scheduling cycle. If preempt_order is found in the sched_config file, you’ll want to set it on the sched object.

Bhroam