Subjobs in job array fail randomly


#1

Hi All

I am submitting array jobs with 80 subjobs in them.
In the job script, each subjob reads a line from its assigned text file to get the parameters for the Matlab script, substitutes these values into the .m file with ‘sed’, and runs it with Matlab.
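
For readers following along, the workflow described above might look roughly like the sketch below. It is not the actual job script: the file names `params.txt` and `template.m`, the Matlab variable `alpha`, and the helper `set_param` are all hypothetical, and `PBS_ARRAY_INDEX` and `PBS_O_WORKDIR` assume a PBS Professional array job. The post also does not say whether the subjobs edit one shared .m file or separate copies; the sketch assumes a separate copy per subjob.

```shell
#!/bin/bash
# Sketch of the per-subjob workflow. Hypothetical names: params.txt, template.m,
# and the Matlab variable "alpha". PBS_ARRAY_INDEX and PBS_O_WORKDIR assume
# a PBS Professional array job.

# Write a copy of TEMPLATE to OUT with its "alpha = ..." line set to VALUE.
# VALUE is assumed to be a plain number (no "/" or "&", which sed would interpret).
set_param() {
    local template=$1 out=$2 value=$3
    sed "s/^alpha *=.*/alpha = ${value};/" "$template" > "$out"
}

# Only do the real work when the scheduler has provided an array index.
if [ -n "${PBS_ARRAY_INDEX:-}" ]; then
    cd "${PBS_O_WORKDIR:-.}" || exit 1
    # Read the line of params.txt whose number matches this subjob's index.
    value=$(sed -n "${PBS_ARRAY_INDEX}p" params.txt)
    # Assumption: each subjob writes its own copy instead of editing a shared file.
    script="run_${PBS_ARRAY_INDEX}"
    set_param template.m "${script}.m" "$value"
    matlab -batch "$script"
fi
```

Outside the scheduler the script only defines `set_param`, so the substitution step can be tried on a login node before submitting.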

However, some subjobs fail at random: they fail almost instantly after starting, with no error report, and their status goes from Q to X. When I delete the whole job array and resubmit it, the subjobs that failed before may run smoothly, while others that previously ran smoothly fail.

Does anyone know what might cause this?
I think the parameters and the Matlab code are unlikely to be the cause: I tried some of the failed parameter sets, and they ran without error both on my laptop and on the cluster while the surviving subjobs were left to finish.

Thank you very much!

Best wishes,
Chen-Chia


#2

If you run this with only echo or print statements, so that each subjob outputs the input value it was given to process, we will know whether the non-processing part of the workflow is working correctly.

Then couple it with your application again; at that point you can narrow down where in the workflow the issue lies.
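
A minimal sketch of such a dry run is below. It only prints what each subjob would receive and does not start Matlab. The parameter file name `params.txt`, the optional `PARAMS_FILE` override, and the helper `get_params` are hypothetical, and `PBS_ARRAY_INDEX` assumes a PBS Professional array job (other schedulers use a different variable name).

```shell
#!/bin/bash
# Dry-run sketch: print the parameters a subjob would receive, without running Matlab.
# Assumptions (not from the original post): the scheduler sets PBS_ARRAY_INDEX,
# and a hypothetical file params.txt holds one line of parameters per subjob.

# Print line number $2 of file $1; fail if the file or the line is missing or empty.
get_params() {
    local file=$1 index=$2 line
    line=$(sed -n "${index}p" "$file") || return 1
    [ -n "$line" ] || return 1
    printf '%s\n' "$line"
}

# Only run the check when the scheduler has provided an array index.
if [ -n "${PBS_ARRAY_INDEX:-}" ]; then
    params_file=${PARAMS_FILE:-params.txt}
    echo "host=$(hostname) subjob=${PBS_ARRAY_INDEX}"
    if params=$(get_params "$params_file" "$PBS_ARRAY_INDEX"); then
        echo "params: $params"
    else
        echo "no parameters found for subjob ${PBS_ARRAY_INDEX} in ${params_file}" >&2
        exit 1
    fi
fi
```

If every subjob prints its host and the expected parameter line across several submissions, the reading step is probably fine and the failure is more likely in a later step or on particular nodes.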