Summarizing a Skype interaction between @jon and me:
The per-job interface fits because...
Although the main use case is node-based (the admin wants to perform maintenance on a subset of nodes, e.g., a rack, or on the entire system, i.e., the set of all nodes), there is also a requirement to minimize the impact of unresolved issues. In particular, if multiple jobs were suspended for maintenance, the admin would like to resume one (or just a few) of them to ensure everything is working correctly before resuming all jobs. This progressive testing requires a per-job interface (at least at the time of resumption).
On another topic, now that the design has been changed to require that the admin explicitly disable scheduling of new jobs (by offlining nodes, explicitly stopping the scheduler(s), or scheduling dedicated time), there may not be a need to invent new job and node states. It may be reasonable to simply allow an admin to resume a suspended job (either without invoking the scheduler, or at least onto a node that is offlined or in dedicated time). In other words, the simplified path for maintenance would be something like:
a. offline the set of nodes on which to perform maintenance
b. suspend any jobs still running on those nodes
-. Now no jobs are running and no jobs will be started/resumed on those nodes
c. Perform maintenance
d. resume one (or a few) jobs and ensure the system is healthy (requires a new ability for the admin to resume a job on an offlined node)
e. if the system is healthy, admin resumes the rest of the jobs and then marks nodes online
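
To make the ordering of steps a-e concrete, here is a minimal in-memory sketch in Python. It is only a toy model, not the actual implementation or any real scheduler API: the `Job` and `Cluster` classes, `admin_resume`, `maintain`, and the `canary_count` parameter are all hypothetical names invented for illustration. The one assumption it encodes from the discussion above is that an admin-initiated resume bypasses the scheduler and is allowed on offlined nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: str
    nodes: frozenset          # nodes the job occupies
    state: str = "running"    # "running" or "suspended"


@dataclass
class Cluster:
    """Toy model: tracks offlined nodes and job states only (hypothetical)."""
    offline: set = field(default_factory=set)
    jobs: dict = field(default_factory=dict)

    def offline_nodes(self, nodes):
        # Step a: no new jobs will be started on these nodes.
        self.offline |= set(nodes)

    def suspend_jobs_on(self, nodes):
        # Step b: suspend every running job that touches the given nodes.
        nodes = set(nodes)
        suspended = []
        for job in self.jobs.values():
            if job.state == "running" and job.nodes & nodes:
                job.state = "suspended"
                suspended.append(job.job_id)
        return suspended

    def admin_resume(self, job_id):
        # Assumed new ability: resume without invoking the scheduler,
        # even though the job's nodes are still offline.
        job = self.jobs[job_id]
        if job.state != "suspended":
            raise ValueError(f"job {job_id} is not suspended")
        job.state = "running"

    def online_nodes(self, nodes):
        self.offline -= set(nodes)


def maintain(cluster, nodes, do_maintenance, is_healthy, canary_count=1):
    """Run steps a-e; return True if all jobs were resumed."""
    cluster.offline_nodes(nodes)                    # a
    suspended = cluster.suspend_jobs_on(nodes)      # b
    do_maintenance(nodes)                           # c
    canaries = suspended[:canary_count]
    rest = suspended[canary_count:]
    for job_id in canaries:                         # d: progressive test
        cluster.admin_resume(job_id)
    if not is_healthy():
        # Leave the remaining jobs suspended and the nodes offline.
        return False
    for job_id in rest:                             # e
        cluster.admin_resume(job_id)
    cluster.online_nodes(nodes)
    return True
```

If the health check fails after step d, the sketch leaves the remaining jobs suspended and the nodes offline, so only the few jobs resumed as a test are exposed to an unresolved issue. This is the reason resumption needs to be per-job even though the rest of the workflow is per-node.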