PP-516: Direct write of job's stdout/err files


#1

Hello,

This topic is to inform the community that the design document for direct write of a job’s stdout/err files is now available at:
https://pbspro.atlassian.net/wiki/pages/viewpage.action?pageId=51901651

Please review and post your comments here.

Thanks,
Nithin.


#2

Hi Nithin,

I see that we are introducing only 2 new interfaces; interface 3 looks more like a description of behavior with an example.

Thanks,
Prakash


#3

Agreed. Interface 3 is induced by the “-d” option, so it is better to put them together.


#4

Hello Nithin, great work! I have a few comments:

  • Interface 1 mentions a log message if direct write was requested but the path(s) are not usecp-able from the primary execution host. Shouldn’t that log message be an additional, separate interface in the EDD?

  • Following on from above, in addition to having mom log a message, I think we should consider writing the same message to the job’s stderr stream so that the job submitter has a chance to see it as well and correct their submissions in the future (it is highly unlikely they’ll be looking in the mom logs). If you add this, please be sure to update the example scenarios to mention it.

  • In interface 2 you say:
    When sandbox is used with -k option, it will delete .o and .e files from hosts. Going forward, deletion will happen only by using "-d" option, not by "-k" option as It is not expected to delete using "-k" option when it stands for "Keep_Files".
    I had not previously been aware of this behavior. Do you know if this behavior is documented anywhere?

  • It is somewhat confusing to be introducing 2 new interfaces both called “d” (even though one is a modifier for qsub -k and the other is a stand-alone qsub -d option). Maybe change -d to -R (for Remove)?


#5

Torque already has this option (by setting $spool_as_final_name true).
Could you confirm that in the latest version of pbspro, stdout and stderr files are moved back to ${PBS_O_WORKDIR}/ only when the job completes?

thanks,

Sue


#6

I’ve created another interface for the message logging. I think it will be useful to write the warning into the stderr file. I also changed the option to remove files from “-d” to “-R”. You can review the changes.

[quote=“scc, post:4, topic:544”]
Do you know if this behavior is documented anywhere?
[/quote]

You can find this in Table 4-3:
Keep_Files attribute: Determines whether output and/or error files remain on execution host. User-settable per job via qsub -k or through a PBS directive. If the Keep_Files attribute is set to o and/or e (output and/or error files remain in the staging and execution directory) and the job’s sandbox attribute is set to PRIVATE, standard out and/or error files are removed when the staging directory is removed at job end along with its contents.


#7

AFAIK pbspro follows a spool-then-stage model, except when the “-k” option is used to write directly into the user’s home directory.


#8

In the PBS Professional 14.2 Administrator’s Guide:

10.14.9.1 Output and Error with Job-specific Staging and Execution Directories
If the qsub -k option is used, the stdout and stderr files will not be automatically copied out of the staging and execution directory at job end; they will be deleted when the directory is automatically removed.

10.14.9.2 Output and Error with User Home Directory as Staging and Execution Directory
If the -k option to qsub is used, standard out and/or standard error files are retained on the primary execution host instead of being returned to the submission host, and are not deleted after job end.

These two sections are confusing.

I tried it like this,

/usr/physics/pbspro/bin/qsub -k pbs-pbspro-1.csh

It got stuck and the job was not submitted. In the PBS command options for qsub, I can’t find the -k option.


#9

You can find the usage of “-k” in the qsub man page. It has to be used with e, o, eo, or oe.

           e    The standard error stream is retained on the execution host, in the job's staging and execution directory.  The filename is
                     job_name.e<sequence number>

           o    The standard output stream is retained on the execution host, in the job's staging and execution directory.  The filename is
                     job_name.o<sequence number>

           eo, oe
                Both standard output and standard error streams are retained on the execution host, in the job's staging and execution directory.

If you use the “-k” option alone with qsub, you should receive a usage error. I think the -n option is not that relevant, since you can achieve the same without using -k altogether.
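
As a rough sketch of the rule above, a wrapper script could check the value meant for -k before calling qsub. The function name check_keep_arg and the script name job.sh are made up for illustration, and the check only accepts the values quoted from the man page in this post; the real qsub may accept others.

```shell
#!/bin/sh
# Hypothetical helper: check a value intended for "qsub -k" against the
# arguments quoted from the man page above (e, o, eo, oe).
check_keep_arg() {
    case "$1" in
        e|o|eo|oe) echo "ok" ;;
        *)         echo "invalid"; return 1 ;;
    esac
}

# Only build the qsub command line when the value is accepted.
# "job.sh" is a placeholder script name.
keep="oe"
if check_keep_arg "$keep" >/dev/null; then
    echo "qsub -k $keep job.sh"
fi
```

With this check, a command line such as "qsub -k pbs-pbspro-1.csh" would be flagged before submission, because the script name ends up in the position of the -k argument.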


#10

Yes. They will get deleted when you use job-specific staging and execution directories, that is, when -k is used along with the qsub -Wsandbox=PRIVATE option. A change to this scenario is mentioned in interface 2.

When sandbox is used with -k option, it will delete .o and .e files from hosts. Going forward, deletion will happen only by using "-R" option, not by "-k" option as It is not expected to delete using "-k" option when it stands for "Keep_Files".

This will be the case if you do not use any job specific staging directory. This is the default scenario when “-k” is used with qsub.


#11

How will this work in conjunction with the qsub -j parameter? If the user specifies -joe then both the stdout and stderr get streamed to the .o file. What if -ke is specified in this case? Is it an error or silently ignored?


#12

Hi @mkaro, currently PBS silently ignores it, which I do not think we should change:

[user1@centos7 ~]$ echo "sleep 1" | qsub -joe -ke
8.centos7
[user1@centos7 ~]$ echo "sleep 1" | qsub -joe -keo
9.centos7
[user1@centos7 ~]$ ls -lrt | tail -n3
-rw-------. 1 user1 user1     0 Jun  2 16:24 STDIN.o7
-rw-------. 1 user1 user1     0 Jun  2 16:25 STDIN.o8
-rw-------. 1 user1 user1     0 Jun  2 16:25 STDIN.o9
[user1@centos7 ~]$
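
The listing above is consistent with the following reading, sketched here as an assumption rather than as PBS's actual logic: with -j oe both streams are merged into the .o file, so a request to keep only the .e file has nothing to act on. The function name files_written is made up for illustration.

```shell
#!/bin/sh
# Illustrative model only, not PBS source: which output files a job named
# $1 with sequence number $2 would leave behind for a given -j value ($3).
files_written() {
    name=$1
    seq=$2
    join=$3
    case "$join" in
        oe) echo "$name.o$seq" ;;               # stderr merged into stdout
        eo) echo "$name.e$seq" ;;               # stdout merged into stderr
        *)  echo "$name.o$seq $name.e$seq" ;;   # streams kept separate
    esac
}

files_written STDIN 8 oe    # corresponds to STDIN.o8 in the listing above
```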

#13

Sounds fine to me, but we may want to acknowledge that this is the case.


#14

Hi Nithin, I agree it is initially somewhat confusing for the combination of -koe and -Wsandbox=PRIVATE to result in the .o and .e files being deleted, but it does make sense and I don’t see a compelling reason to change it with this RFE.

With or without -koe, a private sandbox job creates the .o and .e files in the sandbox while the job is running (rather than in $PBS_HOME/spool).

If -koe is not used with a private sandbox job, the .o and .e files are staged to the Error_Path and Output_Path when the job finishes.

When -koe is used with a private sandbox job, I believe the “keep” is taken to mean “keep the files in the sandbox where they were while the job was running”, which of course gets removed once the job is finished. “qsub -koe -Roe -Wsandbox=PRIVATE” would just do the same thing as though -Roe had not been specified, though of course the files will be removed whether the job was successful or not. If that’s not what you wanted, don’t use -koe. If we changed this behavior, the meaning of “keep” would change (we would now stage the files even though -k says not to) with no real benefit, since the user can already have it either way they want it.

P.S. As I was thinking about this, I was worried that the current Interface 1 proposal using -k violates the existing definition of “keep_files”, but I think I talked myself out of it. Here’s how: we currently define “keep_files” as “Determines whether output and/or error files remain on execution host”. The current EDD interface 1, using -koed to enable direct write, adheres to this definition in spirit if not in actual language, since we are using it to say not to stage the files as a separate step, even though the location they were directly written to may not actually be a local filesystem.
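
To make the current behavior described in this post easier to follow, here is a small model of it. It is only a sketch of what is stated in the thread, not PBS source; the function name output_fate is made up, and HOME stands for the default case where the user's home directory is the staging and execution directory.

```shell
#!/bin/sh
# Illustrative model of the current behavior as described in this thread.
# $1 = sandbox (HOME or PRIVATE)
# $2 = 1 if -k covers the stream, 0 otherwise
output_fate() {
    sandbox=$1
    kept=$2
    if [ "$kept" -eq 0 ]; then
        echo "staged to Output_Path/Error_Path at job end"
    elif [ "$sandbox" = "PRIVATE" ]; then
        echo "left in sandbox, removed with it at job end"
    else
        echo "retained on primary execution host"
    fi
}

# Print the four combinations side by side.
for sandbox in HOME PRIVATE; do
    for kept in 0 1; do
        printf '%-8s kept=%s: %s\n' "$sandbox" "$kept" \
            "$(output_fate "$sandbox" "$kept")"
    done
done
```

The PRIVATE row with kept=1 is the case under discussion: “keep” leaves the files in the sandbox, and the sandbox itself is removed when the job ends.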


#15

And thank you for incorporating the previous feedback! I updated my post directly above to change “doe” to “Roe” to reflect the current EDD proposal, sorry I had missed these revisions before posting.


#16

I believe the EDD also needs to specify how the new job attributes, and possibly the new values for existing attributes, will appear in qstat -f. I propose:

Keep_Files = deo

Basically, exactly what was specified is shown (doe, deo, etc.), just like oe or eo today

and:

Remove_Files = oe

Same as above, but the Remove_Files attribute is new.
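
If the attributes are shown in the proposed "name = value" form, a script could read them back as sketched below. The helper name job_attr is hypothetical, and the sample text is made up to mirror the proposal; it is not real qstat output.

```shell
#!/bin/sh
# Hypothetical helper: pull one attribute value out of "qstat -f"-style
# text read from stdin. Assumes the "name = value" layout proposed above.
job_attr() {
    awk -v key="$1" '
        { line = $0; sub(/^[ \t]+/, "", line) }
        index(line, key " = ") == 1 {
            print substr(line, length(key) + 4)
            exit
        }
    '
}

# Made-up sample mirroring the proposed display.
sample='    Keep_Files = deo
    Remove_Files = oe'

printf '%s\n' "$sample" | job_attr Keep_Files
printf '%s\n' "$sample" | job_attr Remove_Files
```

Against a live server the input would instead come from qstat -f for the job in question, piped into job_attr.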


#17

https://pbspro.atlassian.net/wiki/pages/viewpage.action?pageId=51901651

Interface 1: New option to make output (stdout/stderr) files go to the final destination, instead of being staged, if the final destination is known to be writable from the job execution node.

Apparently, this is a very useful function for users in practice. It makes it easier for users to monitor job status via the stdout/stderr files while the job runs. Torque has had this function for a long time.

Sue


#18

I’ve added this scenario as an example. Thanks!


#19

From an end user’s perspective, the only difference between using -Wsandbox=PRIVATE with and without -koe is that the files get deleted when -koe is used and not otherwise. This might puzzle the user, as k stands for Keep_Files.

We need to support -Roe with all the options, including -Wsandbox=PRIVATE. So, if -Roe is used, the files will get deleted upon successful completion of a job, which is similar to what we achieve with -koe.

In the current proposal, -koe has no meaning when used along with -Wsandbox=PRIVATE, because I believe a similar effect can be obtained with -Roe. -Wsandbox=PRIVATE behaves the same with and without -koe.

The only question is: do we need to preserve the meaning of -koe with sandbox? This is not required if it does not harm the users (as we are providing an alternative). Please share your views.


#20

Thanks for the catch! Updated the EDD with this.