lava_wait_jobs.py: Disable inefficient top-level retries

These retries cause the whole processing to restart from the beginning
on the slightest error. E.g., if there were 500 test jobs, and there was
an error fetching a result for the last one, all 500 would be resubmitted,
waited for, and fetched again. This is a highly inefficient way to handle
retries: it puts LAVA under heavy load for prolonged periods of time,
which affects the TrustedFirmware project in general (i.e. not only
TF-M, but also TF-A and maybe other sub-projects).

Instead, retries should happen on the lowest reasonable level, e.g.
on the level of a particular request, or a particular job. A good
example of doing it right is a recent change 81ff0ad8cde35276859b.

So, disable the inefficient retries. It is better to get an immediate
exception first, then gradually add retries at the right spots.

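As an illustration of retrying at the level of a single request, here is
a minimal sketch. It is not the code from change 81ff0ad8cde35276859b,
and the names retry_call and fetch_job_result are hypothetical, chosen
only for this example. The point is that a failure retries just the one
call that failed, not the whole batch of jobs.

```python
import time


def retry_call(func, attempts=3, delay=0.0, exceptions=(Exception,)):
    """Call func() and retry only this call if it raises.

    Hypothetical helper for illustration; not part of lava_wait_jobs.py.
    Re-raises the last exception once all attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay)


# Assumed usage, where fetch_job_result is a stand-in for whatever
# fetches the result of one LAVA job:
#
#     result = retry_call(lambda: fetch_job_result(job_id), delay=5)
```

With this shape, an error on job 500 costs at most a few extra requests
for that job, while the other 499 results are left alone.
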
Signed-off-by: Paul Sokolovsky <paul.sokolovsky@linaro.org>
Change-Id: Ib35568d2bbf17ec6afb56557fb57201aa9166c5f
diff --git a/lava_helper/lava_wait_jobs.py b/lava_helper/lava_wait_jobs.py
index be69a29..2f4268f 100755
--- a/lava_helper/lava_wait_jobs.py
+++ b/lava_helper/lava_wait_jobs.py
@@ -186,14 +186,20 @@
if not silent:
print("INFO: {}".format(line))
+
+# WARNING: Setting this to >1 is a last resort, temporary stop-gap measure,
+# which will overload LAVA and jeopardize stability of the entire TF CI.
+INEFFICIENT_RETRIES = 1
+
+
def main(user_args):
""" Main logic """
- for try_time in range(3):
+ for try_time in range(INEFFICIENT_RETRIES):
try:
finished_jobs = wait_for_jobs(user_args)
break
except Exception as e:
- if try_time < 2:
+ if try_time < INEFFICIENT_RETRIES - 1:
_log.exception("Exception in wait_for_jobs")
_log.info("Will try to get LAVA jobs again, this was try: %d", try_time)
else: