qstat -xf intermittently missing stime field?


#1

Roughly half (74 out of 146) of my jobs don’t get an ‘stime’ field reported via ‘qstat -xf’. I’ve spot-checked ~75% of all records in server_priv/accounting/ with some one-shot scripting and eyeballs and found no issues, omissions, or missing job records. All other information in the qstat output matches up just fine, and accounting and qstat contents agree with stats reported by the jobs themselves. The missing stime field does not correlate with job type, size, duration, topology, core placement, MPI use, redirections, PBS options, use of qalter, rerunnability, etc. There are multiple instances of one job script run several times in succession, with one instance in the middle of the series exhibiting the missing ‘stime’ and the rest not.
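For anyone wanting to reproduce the tally, here is a sketch of the kind of one-shot filter I used: it reads qstat -fx style text and prints the Job Ids whose attribute block has no stime line. It assumes the usual qstat -f layout (a “Job Id:” header followed by indented “name = value” lines); adjust to whatever your server actually emits.

```shell
# Filter qstat -f-style text on stdin; print Job Ids that lack an stime line.
missing_stime() {
    awk '
        /^Job Id:/      { if (id != "" && !seen) print id; id = $3; seen = 0 }
        /^[ \t]+stime = / { seen = 1 }
        END             { if (id != "" && !seen) print id }
    '
}
# Usage: qstat -fx | missing_stime
```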

Anybody else? Just me? Am I drinking too much (or not enough)?

(I can produce specific output after it’s sanitized, later tomorrow maybe.)


#2

I assume that this is for finished jobs(?)
I have not (yet) encountered your issue - just checked a bunch of jobs from yesterday, and they all had stime set.

For what it’s worth,

/Bjarne


#3

FWIW, if you find this unreliable in qstat -fx output you can of course also dig the stime out of the accounting logs by hunting for the job-start record. Of course you can’t insert that back into the server data structures, but still…
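A minimal sketch of that dig: the job-start (“S”) accounting record carries a start=&lt;epoch&gt; field, and the records are semicolon-separated (date;type;jobid;key=value list). That matches the accounting files I’ve seen, but double-check yours; the record in the test and the path in the usage line are illustrative.

```shell
# Recover a job's start time (epoch seconds) from accounting-log text on stdin.
stime_from_acct() {
    # $1 = job id; only the "S" (job start) records are consulted
    awk -F';' -v id="$1" '
        $2 == "S" && $3 == id {
            if (match($4, /start=[0-9]+/))
                print substr($4, RSTART + 6, RLENGTH - 6)
        }'
}
# Usage: cat $PBS_HOME/server_priv/accounting/2016* | stime_from_acct 123.server
```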

If you do a “tracejob”, do you see anything different about the sequence of events for the jobs that have missing stime vs. those that don’t (like a first attempt to run on a node that aborts early and requeues the job, or something funny at the end of the job)?


#4

Yes, for finished jobs.


#5

Sorry, I missed putting that in the original post. Yes, I’m aware of all three levels of logging (accounting, server, mom), and of tracejob. Nothing unusual to report there (not that there would be, if it’s not in the server or mom logs). These are not failing jobs with questions about zero walltime or requeue/rerun/hold cycles; they run for a nonzero duration and accumulate verifiably correct stats in the accounting logs, and the qstat output agrees with those logs. It’s just that (a) stime goes missing, (b) intermittently.

BTW I’ve been working with PBSPro on and off since mid-1999; Judith sat down the hall from me in the old TL300 here at ERDC when we were still running NQS on most of the big iron. I’m pretty familiar with it up through v12. But for v14… less so… :frowning:


#6

You’re not the only one to have seen it, but I don’t think I’ve ever been able to reproduce it or to identify what triggers it. You could strace the server and its children with -s 8192 to see what updates were posted to the datastore, and hunt for updates at job finish that are missing the stime attribute. But my guess is that we’ll only find out exactly what it is by going into gdb and examining the internal data structures when a job is finished and the final update is posted.
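A rough sketch of that strace mining, assuming the SQL the server sends to the datastore is visible as plain text in the traced write()/send() buffers (the exact wire framing will vary, so treat the pattern as a starting point):

```shell
# Attach roughly like:
#   strace -f -s 8192 -o pbs_server.strace -p <pbs_server pid>
# then pull the INSERT/UPDATE statements out of the capture.
sql_from_strace() {
    grep -oE '(INSERT INTO|UPDATE) [A-Za-z_.]+[^"]*'
}
# Usage: sql_from_strace < pbs_server.strace | grep job_attr
```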

Of course, what would really help is to know whether stime is missing in qstat -f (without -x) for these jobs, i.e. whether the error occurs when the job is started, or whether stime is actually there while the job is running but disappears once the job is moved to history. If you periodically grab qstat -f output, that should be easy to determine.
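A throwaway sketch of that periodic grab; the QSTAT override just lets you dry-run it without a live server, and the interval and snapshot directory in the usage line are arbitrary choices:

```shell
# Take one timestamped snapshot of `qstat -f` (no -x) into a directory.
QSTAT=${QSTAT:-"qstat -f"}
snapshot() {
    dir=$1
    mkdir -p "$dir"
    $QSTAT > "$dir/qstat-$(date +%Y%m%d-%H%M%S).out" 2>/dev/null
}
# Run from cron, or: while :; do snapshot /tmp/qstat-snaps; sleep 60; done
```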


#7

Just for the record, on a side note: I do not think it matters whether you add ‘-x’ if the job is still running. (At least on my system, the result seems identical with or without -x.) Of course, once the job has finished it matters.

/Bjarne


#8

Quite. But the point of not adding -x is that if you sometimes discover running jobs in that output that don’t have a start time, you know the error occurs when the job starts and not when it’s moved to history.

If qstat -f never shows this but once the job finishes qstat -fx suddenly stops showing the stime attribute, then you know something happened when the job was moved to history.


#9

-x is not needed for running jobs.

I’ll see about qstat while the job is actually running.

On a related note, I tried a quick one-shot compiled code with pbs_statjob(connection, NULL, NULL, "x"), pretty-printed the resulting struct chain, and grepped that for a running job. The resulting struct did have the start time in it, but once the job completed that field was zeroed out. I’m probably doing this fetch with the wrong extend value, or pbs_statjob is inappropriate for this, because it’s the same for all reported jobs and qstat clearly reports stime for ~50% of them. You wouldn’t happen to know which API call is appropriate for pulling all details for all jobs in the store, would you?


#10

That’s extremely strange, since (you can read the source for qstat.c!) that is exactly what qstat -fx does.


#11

What would be interesting is to see whether the datastore update after the job is started always consistently adds an stime attribute in the job_attr table. stime is saved in a column of the job table, but it should also appear as a row in the job_attr table…

You can connect to the datastore using psql (-U pbsdata -p 15007 -d pbs_datastore). The password can be set using pbs_ds_password.

Once you’re connected, you can set search_path to pbs; and use \dt to display the tables and SQL commands to inspect what was saved by the server to the datastore.
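Something like this sketch, for instance; the connection parameters are the ones above, but the job_attr column name used in the query (attr_name) is a guess on my part -- check the real schema with \dt and \d job_attr before trusting it:

```shell
# Build the query to check whether a job's stime row made it into job_attr.
sql="set search_path to pbs;
select * from job_attr where attr_name = 'stime';"
# Run it against the datastore (password set via pbs_ds_password):
#   psql -U pbsdata -p 15007 -d pbs_datastore -c "$sql"
echo "$sql"
```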

Of course an strace will also reveal all the INSERT and UPDATE commands, but it would be best to first see at what moment things go awry.


#12

Hi,

We are currently facing the very same issue: many running jobs are missing their stime.

Is there some progress on this issue?
Thank you!

Dalibor


#13

Unfortunately not. I could never reliably reproduce the issue, and since there was a workaround we moved forward with a change to our scripting to use the accounting logs. Since then that project area, for us, has shut down, so I no longer have a sufficient test environment in which to chase the issue. The workaround is to use the local accounting logs instead of ‘qstat’ or pbs_statjob() – when the stime component was dropped, it did so after having been successfully written to the {PBS}/server/server_priv/accounting/{YYYYMMDD…} records.

