PP-1287: do not purge moved job from history before the job is finished


#1

Hello,

As requested in the PR, I have written the design doc for the issue.

My suggestion is to purge the M job only after it has actually finished on the target server.

First, the current behavior does not make sense: the M job is removed before it actually finishes. I understand that the job is still trackable through the tracking file, but we should be able to qstat both the M job and the job on the target server during the whole job life cycle. For example, we implemented a web application that informs users about their jobs through the API, but we have to ignore M jobs entirely because we cannot rely on the M job still being in the system.

Second, there is a bug with dependencies: while the job is still running, you should be able to submit a new dependent job with -W depend=afterok:<moved jobid>. However, if you try to add this dependency after job_history_duration has elapsed, the <moved jobid> has already been purged and is unknown to the dependency check.
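The failing sequence can be sketched as a dry run. The job ID, queue, destination server, and script name below are all hypothetical; on a real two-server setup the leading `echo` would be dropped so the commands actually execute.

```shell
#!/bin/sh
# Dry-run sketch of the dependency problem. The job ID "123.server1",
# queue "workq", host "server2", and "followup.sh" are placeholders.

JOBID="123.server1"                 # a job originally submitted on server1

# Move the job to server2; on server1 it now shows state 'M' in qstat.
echo qmove "workq@server2" "$JOBID"

# While the M record still exists on server1, this dependent submission works:
echo qsub -W "depend=afterok:$JOBID" followup.sh

# Once job_history_duration elapses on server1, the M record is purged,
# so the same submission is rejected because $JOBID is no longer known there,
# even though the job may still be running on server2.
```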

One imperfection in this solution is that the M job is not stored at all when job_history_enable = False. I think M jobs should remain part of the history, but I am not sure how to handle this case.

Please, provide your comments.

Vasek


#2

@vchlum Document looks good to me.
It may not need to go in the design doc, but with this change in place I think we should also add a check to purge the moved job at the point where we set its substate to finished, provided it has exceeded the job history duration. Otherwise it may linger until the next job-history work task runs.
What do you think?


#3

The history work task runs every two minutes, so I would not consider this strictly necessary, but it seems like a good idea. OK, I agree.

I also have a question concerning the PTL test. I am not able to start the PTL test with two servers. I suppose the correct syntax to start it is 'pbs_benchpress -p servers=<server1,server2> ...', but there is only one server available:

self.logger.info("servers: " + str(self.servers.keys()))

shows:

2018-07-31 13:27:31,087 INFO servers: ['took27']

Using 'pbs_benchpress -p servers=server1:server2 ...', server2 is added as a node to server1.

Are multiple servers supported in PTL?


#4

@vchlum Can you please try with self.servers.host_keys()? If you still face the problem, please provide the exact pbs_benchpress command and its output (i.e. the -o file).


#5

I agree with your assessment of not considering the change necessary, given that the history task runs every 2 minutes.

For your PTL test, I think you might have found a bug in the PTL framework. If you see both servers showing up as nodes, please try it like this:
-p servers=s1:s2,moms=s1
This will make sure that the second server is not considered as a node.
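The comma/colon convention in that parameter string can be sketched as follows. This is a rough illustration of how such a -p value decomposes (comma-separated keys, colon-separated values per key); it is not PTL's actual parser, and the hostnames s1/s2 are placeholders.

```shell
#!/bin/sh
# Illustrative decomposition of a pbs_benchpress "-p" string (not PTL's
# actual parser). Keys are comma-separated; multiple values for one key
# are colon-separated.
PARAMS="servers=s1:s2,moms=s1"

# Split on commas, pick the value for each key, then split on colons.
SERVERS=$(echo "$PARAMS" | tr ',' '\n' | sed -n 's/^servers=//p' | tr ':' ' ')
MOMS=$(echo "$PARAMS" | tr ',' '\n' | sed -n 's/^moms=//p' | tr ':' ' ')

echo "servers: $SERVERS"   # servers: s1 s2
echo "moms: $MOMS"         # moms: s1  (s2 is not registered as a MoM/node)
```

Listing moms=s1 explicitly is what keeps s2 from being treated as a node of s1.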


#6

Thank you @arungrover and @hirenvadalia. The workaround '-p servers=s1:s2,moms=s1' works like a charm.