Cannot delete Job after Checkpoint/Restart


#1

Hi,

I was testing checkpoint/restart on 18.1.2, and MoM was segfaulting during job deletion while the job was already in the E state.
I ran into several other issues too. Going through the GitHub issues, some of them seemed to have fixes that had already been merged into the master branch.



Therefore I decided to test the master branch. I built the packages for CentOS 7.5.1804 and tested checkpoint/restart again. MoM does not segfault anymore, but after checkpointing/restarting, the job cannot be deleted and seems to stay registered on the server forever, even when the processes on the execution node are cleaned up. One way to delete the job is to stop MoM, run qdel -W force, and start MoM again. Another way that sometimes works is:

while : ; do qdel JOBID ; done
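
A bounded variant of that loop may be easier to leave unattended. This is only a sketch: retry_qdel is a hypothetical helper name, the retry count and pause are arbitrary defaults, and it assumes that qstat JOBID exits with a non-zero status once the server no longer knows the job.

```shell
# Retry qdel a limited number of times instead of looping forever.
# retry_qdel is a hypothetical helper, not part of PBS.
retry_qdel() {
    jobid=$1
    max_tries=${2:-20}   # arbitrary default: give up after 20 qdel calls
    pause=${3:-1}        # seconds to wait between attempts
    attempts=0

    # Assumption: qstat exits non-zero once the server has dropped the job.
    while qstat "$jobid" >/dev/null 2>&1; do
        if [ "$attempts" -ge "$max_tries" ]; then
            echo "job $jobid still present after $attempts qdel attempt(s)" >&2
            return 1
        fi
        qdel "$jobid" >/dev/null 2>&1
        attempts=$((attempts + 1))
        sleep "$pause"
    done

    echo "job $jobid no longer known to the server after $attempts qdel attempt(s)"
    return 0
}
```

It could be called as retry_qdel JOBID 30 2, which tries up to 30 times with a 2 second pause. A non-zero return means the job is still registered and the stop MoM / qdel -W force route is probably needed.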

I used the following repo to test checkpointing: https://github.com/scottaltair/PBS-Professional-CPR-Example.git
The only change I made was in the checkpoint(_abort).sh scripts, where I changed “kill -SIGTSTP” to “kill -TSTP”.

Can someone test this and confirm? Could it be that I found another bug?

Thanks


#2

Hi @mae,

Could you please pull the changes from here and let us know if they work for you? If yes, we will merge the fix for the issue after the code review.

If not, we would need to create a new ticket.

Thanks,
Prakash


#3

Hi @prakashcv13,

I just pulled those two commits and built the packages. They seem to have fixed the behavior that I saw last week.
At least I can’t reproduce it anymore.

Thanks,
MaE


#4

Good to know, @mae.