Error Handling and Debugging

rush provides an error-handling mechanism for errors that occur while tasks are being evaluated. It covers a range of scenarios, from standard R errors to more severe failures such as segmentation faults and network errors. If these mechanisms are not sufficient, the user can debug the worker loop manually.

Simple R Errors

To illustrate the error-handling mechanism in rush, we employ the random search example from the main vignette. This time we introduce a random error with a 50% probability. Within the worker loop, users are responsible for catching errors and marking the corresponding task as "failed" using the $push_failed() method.

library(rush)

branin = function(x1, x2) {
  (x2 - 5.1 / (4 * pi^2) * x1^2 + 5 / pi * x1 - 6)^2 + 10 * (1 - 1 / (8 * pi)) * cos(x1) + 10
}

wl_random_search = function(rush) {

  while (rush$n_finished_tasks < 100) {

    # sample a random point and register it as a running task
    xs = list(x1 = runif(1, -5, 10), x2 = runif(1, 0, 15))
    key = rush$push_running_tasks(xss = list(xs))

    tryCatch({
      # simulate a failure in roughly half of the evaluations
      if (runif(1) < 0.5) stop("Random Error")
      ys = list(y = branin(xs$x1, xs$x2))
      rush$push_results(key, yss = list(ys))
    }, error = function(e) {
      # mark the task as failed and store the error message
      condition = list(message = e$message)
      rush$push_failed(key, conditions = list(condition))
    })
  }
}

We start the workers.

rush = rsh(
  network = "test-simply-error",
  config = redux::redis_config())

rush$start_local_workers(
  worker_loop = wl_random_search,
  n_workers = 4,
  globals = "branin")

When an error occurs, the task is marked as "failed", and the error message is stored in the "message" column. This approach ensures that errors do not interrupt the overall execution process. It allows for subsequent inspection of errors and the reevaluation of failed tasks as necessary.

rush$fetch_failed_tasks()

             x1         x2   pid     worker_id       message          keys
          <num>      <num> <int>        <char>        <char>        <char>
 1: -4.62937145  5.1332725  9967 moneygrubb... Random Err... 496dc558-0...
 2: -0.03978035 10.0390920  9967 moneygrubb... Random Err... a3c12aee-5...
 3:  5.92615034 10.0873599  9941 xylophagou... Random Err... 85e908e9-8...
 4:  7.22822967 13.4950279  9941 xylophagou... Random Err... 55af02e3-4...
 5: -1.81592018 14.2190794  9941 xylophagou... Random Err... 42119695-8...
 6:  2.65214272 12.2201477  9941 xylophagou... Random Err... 29b08557-1...
 7:  2.39096117  8.4918450  9941 xylophagou... Random Err... 7018191d-6...
 8: -0.03962529  5.2141385  9941 xylophagou... Random Err... 5e526dbf-5...
 9:  8.93351642  5.5300935  9957 dimwitted_... Random Err... cdaeecc2-3...
10:  4.20961951 11.5359657  9941 xylophagou... Random Err... 2342f3c6-b...
11:  2.22512382  6.9920344  9941 xylophagou... Random Err... 0b468427-9...
12: -3.26578878  9.8612512  9957 dimwitted_... Random Err... 7cb38e87-f...
13:  2.52353405  4.7189409  9941 xylophagou... Random Err... ea1de82b-4...
14:  1.36828127  8.9738609  9952 antiromant... Random Err... 7109d2d1-6...
15:  3.93337447  8.7322436  9957 dimwitted_... Random Err... 4256e6dc-7...
16:  2.70468913  2.8866594  9952 antiromant... Random Err... 15b8e938-4...
17:  1.56072991  8.7641788  9941 xylophagou... Random Err... 92848197-2...
18:  1.28836758  0.4760839  9957 dimwitted_... Random Err... d1cbde40-8...
19: -2.15227660  7.2202855  9952 antiromant... Random Err... 84139c66-2...
20:  3.46805490  9.4016300  9941 xylophagou... Random Err... 0bf4ce02-d...
21:  4.18840201  9.8557463  9957 dimwitted_... Random Err... 0ca82443-6...
22: -1.62089735  6.3709387  9941 xylophagou... Random Err... d20e0b0f-2...
23:  5.71826451  9.9298201  9967 moneygrubb... Random Err... 81d858c7-c...
24: -0.85904890  0.2268903  9941 xylophagou... Random Err... 7e9bc5c8-e...
25:  6.37296755 11.5260240  9952 antiromant... Random Err... 95c6e66d-b...
26:  2.12691557 10.1337243  9957 dimwitted_... Random Err... 0d05c042-7...
27:  7.20241783 14.8599650  9967 moneygrubb... Random Err... 5325d0ca-7...
28:  6.88437629 14.5943116  9957 dimwitted_... Random Err... 5b5d0187-9...
29:  3.00837592  4.3066591  9957 dimwitted_... Random Err... e18451b7-6...
30:  3.95936323  4.2601726  9967 moneygrubb... Random Err... cb054f50-7...
31:  5.67562683  1.8337402  9941 xylophagou... Random Err... d38095b3-c...
32: -0.43410169 11.2058779  9952 antiromant... Random Err... a54e3766-6...
33: -2.79465405 12.5674139  9967 moneygrubb... Random Err... 08fbe176-3...
34:  0.74052743  0.4296175  9941 xylophagou... Random Err... e5e96cb5-d...
             x1         x2   pid     worker_id       message          keys
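
The re-evaluation of failed tasks mentioned above can be sketched in base R. This is only an illustration: the `failed` data frame is a hypothetical stand-in for the result of `rush$fetch_failed_tasks()`, filled with the first two rows of the output shown above, so that the snippet runs without a Redis connection.

```r
branin = function(x1, x2) {
  (x2 - 5.1 / (4 * pi^2) * x1^2 + 5 / pi * x1 - 6)^2 + 10 * (1 - 1 / (8 * pi)) * cos(x1) + 10
}

# Hypothetical stand-in for rush$fetch_failed_tasks()
# (first two rows of the table above)
failed = data.frame(
  x1 = c(-4.62937145, -0.03978035),
  x2 = c(5.1332725, 10.0390920),
  message = c("Random Error", "Random Error")
)

# Re-evaluate the objective at the points of the failed tasks
failed$y = mapply(branin, failed$x1, failed$x2)
failed
```

In a real run, the re-evaluated points would be pushed back to the network as new tasks rather than stored in a local data frame.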

Handling Failing Workers

The rush package provides mechanisms to address situations in which workers fail due to crashes or lost connections. Such failures may leave tasks in the "running" state indefinitely. To illustrate this, we define a worker loop that simulates a segmentation fault by killing its own process with SIGKILL.

wl_failed_worker = function(rush) {
  xs = list(x1 = runif(1, -5, 10), x2 = runif(1, 0, 15))
  key = rush$push_running_tasks(xss = list(xs))

  tools::pskill(Sys.getpid(), tools::SIGKILL)
}

rush = rsh(network = "test-failed-workers")

worker_ids = rush$start_local_workers(
  worker_loop = wl_failed_worker,
  n_workers = 2)

The $detect_lost_workers() method identifies lost workers and reports the tasks they were evaluating.

rush$detect_lost_workers()

ERROR [10:06:21.865] [rush] Lost worker 'ritzy_asiaticgreaterfreshwaterclam'
ERROR [10:06:21.909] [rush] Lost 1 task(s): 80ecb4d1-dc41-4880-b88e-99e859af639f
ERROR [10:06:21.917] [rush] Lost worker 'overweak_kouprey'
ERROR [10:06:21.919] [rush] Lost 1 task(s): 0a1a08e3-d9d3-408c-809c-133a04b2450f

This method works for workers started with $start_local_workers() and $start_remote_workers(). Workers started with $worker_script() must be started with a heartbeat mechanism (see vignette).

The $detect_lost_workers() method also supports automatic restarting of lost workers when the option restart_workers = TRUE is specified. Alternatively, lost workers may be restarted manually using the $restart_workers() method. Automatic restarting is only available for local workers. When a worker fails, the status of the task that caused the failure is set to "failed".

rush$fetch_failed_tasks()

             x1       x2   pid     worker_id       message          keys
          <num>    <num> <int>        <char>        <char>        <char>
1:  0.004870265 7.680448 10090 ritzy_asia... Worker has... 80ecb4d1-d...
2: -0.932877616 2.933450 10092 overweak_k... Worker has... 0a1a08e3-d...
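
The automatic restart described above could be wrapped in a small polling helper. The sketch below is only an illustration: `poll_lost_workers()`, `n_checks`, and `interval` are hypothetical names that are not part of rush, and the `restart_workers` argument is taken from the text above, so it should be checked against the installed rush version.

```r
# Hypothetical helper (not part of rush): call $detect_lost_workers()
# a fixed number of times and ask rush to restart lost local workers.
# `rush` is assumed to be a controller created with rsh().
poll_lost_workers = function(rush, n_checks = 3, interval = 5) {
  for (i in seq_len(n_checks)) {
    # argument name as given in the text above
    rush$detect_lost_workers(restart_workers = TRUE)
    Sys.sleep(interval)
  }
  invisible(n_checks)
}

# Usage with a running network (requires Redis, hence commented out):
# rush = rsh(network = "test-failed-workers")
# poll_lost_workers(rush, n_checks = 10, interval = 1)
```

A fixed number of checks keeps the sketch simple; a long-running controller would more likely poll until all tasks are finished.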

Debugging

When the worker loop fails unexpectedly due to an uncaught error, it is necessary to debug the worker loop. Consider the following example, in which the worker loop randomly generates an error.

wl_error = function(rush) {

  repeat {
    x1 = runif(1)
    x2 = runif(1)

    xss = list(list(x1 = x1, x2 = x2))

    key = rush$push_running_tasks(xss = xss)

    if (x1 > 0.90) {
      stop("Unexpected error")
    }

    rush$push_results(key, yss = list(list(y = x1 + x2)))
  }
}

To begin debugging, the worker loop is executed locally. This requires the initialization of a RushWorker instance. Although the rush worker is typically created during worker initialization, it can also be instantiated manually. The worker instance is then passed as an argument to the worker loop.

rush_worker = RushWorker$new("test", remote = FALSE)

wl_error(rush_worker)

Error in wl_error(rush_worker): Unexpected error
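
Because the loop now runs in the main session, the call stack at the point of failure can be inspected with ordinary R tools. As a minimal sketch in base R, the hypothetical helper `capture_calls()` below records the stack with `sys.calls()` before the error unwinds; `failing_loop()` is a toy stand-in for a worker loop, so the snippet runs without Redis.

```r
# Hypothetical helper: run f(...) and, if it fails, keep both the
# error condition and the call stack at the point of failure.
capture_calls = function(f, ...) {
  calls = NULL
  result = tryCatch(
    withCallingHandlers(
      f(...),
      # calling handler runs before the stack is unwound
      error = function(e) calls <<- sys.calls()
    ),
    # exiting handler returns the condition instead of stopping
    error = function(e) e
  )
  list(result = result, calls = calls)
}

# Toy stand-in for a worker loop that fails
failing_loop = function() stop("Unexpected error")

out = capture_calls(failing_loop)
conditionMessage(out$result)
length(out$calls)
```

With a manually created worker as above, the same helper could be called as `capture_calls(wl_error, rush_worker)`.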

When an error is raised in the main process, the traceback() function can be invoked to examine the stack trace. Breakpoints may also be set within the worker loop to inspect the program state. Certain errors, such as missing packages or undefined global variables, may not occur when the loop runs locally, because the main session already has them available. Such errors surface only on the workers and can be identified with the $detect_lost_workers() method.

rush = rsh("test-error")

rush$start_local_workers(
  worker_loop = wl_error,
  n_workers = 1
)

Calling $detect_lost_workers() reports the lost worker together with the error output of the worker process.

rush$detect_lost_workers()

ERROR [10:06:23.044] [rush] Lost worker 'agrarian_peccary'
ERROR [10:06:23.046] [rush] Error in start_args$worker_loop(rush = rush) : Unexpected error
ERROR [10:06:23.047] [rush] Calls: <Anonymous> ... <Anonymous> -> eval.parent -> eval -> eval -> <Anonymous>
ERROR [10:06:23.047] [rush] Execution halted
ERROR [10:06:23.049] [rush] Lost 1 task(s): 9b4df655-e0f9-4b5f-afb9-4c975e4d98bc