I agree, we should start with a simple counter.
We had some more discussion and concluded that we should have 3 simple counters, so I am summarizing them here.
While we want to save space by limiting the amount of failure analysis data collected, we also want to save time by stopping the rest of the test cases when there are too many failures. We could use the same threshold both for limiting the saved failure data and for stopping the test cases, but there may be cases where we want to limit saving data yet still run more test cases. So we may, at times, want different values for these thresholds.
These thresholds are at a test suite level. At a regression level, we may want to stop running the regression if there are too many failures. All three can be understood and implemented together, so I am broadening the scope of this discussion.
Have 3 global parameters:
TC_FAILURE_THRESHOLD - If the number of failures in a test suite exceeds this count, execution of the remaining test cases in that test suite is stopped.
CUMULATIVE_TC_FAILURE_THRESHOLD - If the number of failures in the whole regression exceeds this count, the execution of the regression is stopped.
MAX_DIAG_THRESHOLD - The pbs_diag data is collected per failure. When the number of failures within a test suite exceeds MAX_DIAG_THRESHOLD, we stop collecting the diagnostic data for that suite.
TC_FAILURE_THRESHOLD & CUMULATIVE_TC_FAILURE_THRESHOLD are intended to save time running the regression when there are too many failures at a test suite level or at an overall level, respectively.
MAX_DIAG_THRESHOLD is intended to save space when there are too many failures in a build.
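To make the interaction between the three thresholds concrete, here is a minimal sketch of the bookkeeping described above. The class name, method hooks, and default values are my assumptions for illustration, not existing benchpress code; only the parameter names and the "0 means no limit" rule come from this discussion.

```python
# Hypothetical sketch of the three proposed thresholds; not benchpress API.
class ThresholdTracker:
    """Tracks failures per suite and across the whole regression."""

    def __init__(self, tc_failure_threshold=20,
                 cumulative_tc_failure_threshold=100,
                 max_diag_threshold=5):
        # Defaults are placeholders; a threshold of 0 means "no limit".
        self.tc_failure_threshold = tc_failure_threshold
        self.cumulative_tc_failure_threshold = cumulative_tc_failure_threshold
        self.max_diag_threshold = max_diag_threshold
        self.suite_failures = 0
        self.total_failures = 0

    def start_suite(self):
        # Per-suite counters reset at each suite boundary;
        # the cumulative counter keeps running.
        self.suite_failures = 0

    def on_failure(self):
        """Record one test-case failure and report what to do next.

        Returns (collect_diag, skip_rest_of_suite, stop_regression).
        """
        self.suite_failures += 1
        self.total_failures += 1
        collect_diag = (self.max_diag_threshold == 0 or
                        self.suite_failures <= self.max_diag_threshold)
        skip_rest_of_suite = (self.tc_failure_threshold != 0 and
                              self.suite_failures > self.tc_failure_threshold)
        stop_regression = (self.cumulative_tc_failure_threshold != 0 and
                           self.total_failures >
                           self.cumulative_tc_failure_threshold)
        return collect_diag, skip_rest_of_suite, stop_regression
```

The point of keeping the three checks separate is exactly the case raised above: diag collection can stop while test cases keep running, and a suite can be skipped without ending the whole regression.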
We should also have a command line option to benchpress to override these values.
--tc_failure_threshold=20 if you want 20 as the threshold
A value of "0" can be specified if you do not want the threshold to apply.