
rush provides an error-handling mechanism for errors that occur while tasks are executed. It covers a range of scenarios, from standard R errors to more severe failures such as segmentation faults and network errors. If these mechanisms are not sufficient, the user can debug the worker loop manually.

Simple R Errors

To illustrate the error-handling mechanism in rush, we employ the random search example from the main vignette. This time we introduce a random error with a 50% probability. Within the worker loop, users are responsible for catching errors and marking the corresponding task as "failed" using the $push_failed() method.

library(rush)

branin = function(x1, x2) {
  (x2 - 5.1 / (4 * pi^2) * x1^2 + 5 / pi * x1 - 6)^2 + 10 * (1 - 1 / (8 * pi)) * cos(x1) + 10
}

wl_random_search = function(rush) {

  while(rush$n_finished_tasks < 100) {

    xs = list(x1 = runif(1, -5, 10), x2 = runif(1, 0, 15))
    key = rush$push_running_tasks(xss = list(xs))

    tryCatch({
      # simulate a task that fails about half of the time
      if (runif(1) < 0.5) stop("Random Error")
      ys = list(y = branin(xs$x1, xs$x2))
      rush$push_results(key, yss = list(ys))
    }, error = function(e) {
      # record the error message and mark the task as "failed"
      condition = list(message = e$message)
      rush$push_failed(key, conditions = list(condition))
    })
  }
}
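
The catch-and-record pattern used in the worker loop can be tried out without a Redis server. The following is a minimal sketch in base R. The helper eval_task() is hypothetical and not part of rush; it returns the condition list instead of passing it to $push_failed().

```r
# Hypothetical helper (not part of rush): evaluate `fun` on the point `xs`
# and report either the result or the caught error as a plain list.
eval_task = function(fun, xs) {
  tryCatch({
    ys = list(y = do.call(fun, xs))
    list(state = "finished", ys = ys)
  }, error = function(e) {
    # same shape as the condition passed to $push_failed() in the worker loop
    condition = list(message = conditionMessage(e))
    list(state = "failed", condition = condition)
  })
}

ok = eval_task(function(x1, x2) x1 + x2, list(x1 = 1, x2 = 2))
bad = eval_task(function(x1, x2) stop("Random Error"), list(x1 = 1, x2 = 2))

ok$state                 # "finished"
bad$condition$message    # "Random Error"
```

Returning a list rather than signalling keeps the loop body free of control flow around the error, which is the same reason the worker loop records the failure and simply continues with the next iteration.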

We start the workers.

rush = rsh(
  network = "test-simply-error",
  config = redux::redis_config())

rush$start_local_workers(
  worker_loop = wl_random_search,
  n_workers = 4,
  globals = "branin")

When an error occurs, the task is marked as "failed", and the error message is stored in the "message" column. This approach ensures that errors do not interrupt the overall execution process. It allows for subsequent inspection of errors and the reevaluation of failed tasks as necessary.

rush$fetch_failed_tasks()
              x1         x2   pid     worker_id       message          keys
           <num>      <num> <int>        <char>        <char>        <char>
 1: -3.939262037 11.0938948  9957 nondenomin... Random Err... ce0960b7-c...
 2: -0.174922138  9.6034227  9957 nondenomin... Random Err... e2374d2e-0...
 3: -1.716376172  1.5509859  9957 nondenomin... Random Err... 54451a5c-2...
 4:  3.995382390  7.6453966  9957 nondenomin... Random Err... 17d7cedb-3...
 5:  8.379411657  2.8247865  9957 nondenomin... Random Err... 6ee656e0-0...
 6:  9.239812285 10.5330109  9957 nondenomin... Random Err... 7f838397-d...
 7:  7.703772677 10.7531472  9957 nondenomin... Random Err... 7fa2002e-a...
 8: -4.388852264 10.4850933  9946 unsyllable... Random Err... 39035362-b...
 9: -2.890116418 10.8811033  9957 nondenomin... Random Err... 2a3eed07-3...
10:  0.001017108  7.9614377  9946 unsyllable... Random Err... 959fa9fc-1...
11: -4.481060121  5.7738797  9946 unsyllable... Random Err... f182015f-e...
12:  5.061318610  9.3136034  9957 nondenomin... Random Err... 64699843-b...
13:  3.171451255 10.8128029  9946 unsyllable... Random Err... cf7a2012-3...
14:  2.379436444  3.2896780  9946 unsyllable... Random Err... 8da7a6b6-8...
15:  8.114242236  2.6245788  9957 nondenomin... Random Err... 7197c19c-6...
16:  7.135142714  4.6033143  9946 unsyllable... Random Err... b11c8363-a...
17: -4.112069301  1.5860487  9946 unsyllable... Random Err... be76504e-1...
18: -4.675281935 13.2788318  9957 nondenomin... Random Err... 00c107e2-5...
19:  1.001759325 12.6048660  9946 unsyllable... Random Err... b12cfe98-9...
20: -1.577583251  2.9221014  9957 nondenomin... Random Err... 33c533ab-e...
21:  9.742985336 11.6440664  9957 nondenomin... Random Err... 57b4fa1f-9...
22: -2.018399487  4.5518021  9957 nondenomin... Random Err... fb2f8085-e...
23: -4.486697530 12.2606335  9957 nondenomin... Random Err... d8a8e870-5...
24:  0.554225253  8.2232417  9946 unsyllable... Random Err... 55d6e6eb-1...
25:  1.756557941  6.0753188  9946 unsyllable... Random Err... c0b18f6c-5...
26: -1.012255911  7.7493397  9946 unsyllable... Random Err... 5e7f08fd-2...
27: -1.242484332 11.0614507  9966 ladylike_a... Random Err... f47ed3db-d...
28:  6.488138094  0.8755216  9957 nondenomin... Random Err... 40616403-6...
29: -2.462076162 13.0287375  9946 unsyllable... Random Err... 6115c0f0-5...
30:  7.420961624  7.4996376  9966 ladylike_a... Random Err... eeafb719-5...
31:  6.899610372 14.4046736  9978 loopy_quil... Random Err... c3dcdb2f-4...
32:  7.430139106 14.7777983  9957 nondenomin... Random Err... 69a9b4fb-d...
33:  7.446088118  0.8903654  9946 unsyllable... Random Err... 9e23a0f9-0...
34:  8.063165565 12.3431853  9978 loopy_quil... Random Err... ceb26416-5...
              x1         x2   pid     worker_id       message          keys
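
Because failed tasks keep their inputs, they can be re-evaluated later. The sketch below assumes that the inputs of the first two rows of the table above have been copied into a plain data.frame named failed. This is a stand-in for the result of $fetch_failed_tasks(), so that no Redis connection is needed. The objective is then recomputed in the main session.

```r
branin = function(x1, x2) {
  (x2 - 5.1 / (4 * pi^2) * x1^2 + 5 / pi * x1 - 6)^2 + 10 * (1 - 1 / (8 * pi)) * cos(x1) + 10
}

# stand-in for rush$fetch_failed_tasks(): inputs of the first two failed tasks
failed = data.frame(
  x1 = c(-3.939262037, -0.174922138),
  x2 = c(11.0938948, 9.6034227),
  message = "Random Error"
)

# re-evaluate each failed point; mapply() pairs up the x1 and x2 columns
failed$y = mapply(branin, failed$x1, failed$x2)
failed
```

Whether the recomputed values are pushed back to the network or only inspected locally is a design decision that this sketch leaves open.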

Handling Failing Workers

The rush package provides mechanisms to address situations in which workers fail due to crashes or lost connections. Such failures may leave tasks in the "running" state indefinitely. To illustrate this, we define a worker loop that pushes a single task and then simulates a segmentation fault by killing its own process.

wl_failed_worker = function(rush) {
  xs = list(x1 = runif(1, -5, 10), x2 = runif(1, 0, 15))
  key = rush$push_running_tasks(xss = list(xs))

  tools::pskill(Sys.getpid(), tools::SIGKILL)
}

rush = rsh(network = "test-failed-workers")

worker_ids = rush$start_local_workers(
  worker_loop = wl_failed_worker,
  n_workers = 2)

The package offers the $detect_lost_workers() method, which is designed to identify and manage these occurrences.

rush$detect_lost_workers()
ERROR [10:02:54.742] [rush] Lost worker 'hilarious_harvestmen'
ERROR [10:02:54.788] [rush] Lost 1 task(s): f6c59a59-bf49-458f-9e9d-44afac776912
ERROR [10:02:54.797] [rush] Lost worker 'loony_buckeyebutterfly'
ERROR [10:02:54.799] [rush] Lost 1 task(s): 2f005ed1-5d64-4f1c-bc99-ef2135cc07ad
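
Lost workers are reported only when $detect_lost_workers() is called, so a long-running experiment may want to call it at regular intervals. The following is a minimal sketch of such a polling loop. The helper poll_until() is hypothetical and not part of rush; it takes the detection call and a stopping rule as plain functions, which lets it be tried out with stubs.

```r
# Hypothetical helper (not part of rush): call `detect` once per iteration
# until `done()` returns TRUE or `max_polls` iterations have passed.
# Returns the number of polls that were made.
poll_until = function(detect, done, interval = 1, max_polls = 60) {
  for (i in seq_len(max_polls)) {
    detect()
    if (done()) return(invisible(i))
    Sys.sleep(interval)
  }
  invisible(max_polls)
}

# Stub usage: the counter stands in for the state of the network.
n_checks = 0
n_polls = poll_until(
  detect = function() n_checks <<- n_checks + 1,
  done = function() n_checks >= 3,
  interval = 0)

# With rush, the call might look like this (not run here, requires Redis):
# poll_until(
#   detect = function() rush$detect_lost_workers(),
#   done = function() rush$n_finished_tasks >= 100)
```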

This method works for workers started with $start_local_workers() and $start_remote_workers(). Workers started with $worker_script() must be started with a heartbeat mechanism (see vignette).

The $detect_lost_workers() method also supports automatic restarting of lost workers when the option restart_workers = TRUE is specified. Alternatively, lost workers may be restarted manually using the $restart_workers() method. Automatic restarting is only available for local workers. When a worker fails, the status of the task that caused the failure is set to "failed".

rush$fetch_failed_tasks()
        x1       x2   pid     worker_id       message          keys
     <num>    <num> <int>        <char>        <char>        <char>
1: 5.76647 4.842074 10096 hilarious_... Worker has... f6c59a59-b...
2: 5.12228 5.844016 10099 loony_buck... Worker has... 2f005ed1-5...

Debugging

When the worker loop fails unexpectedly due to an uncaught error, it is necessary to debug the worker loop. Consider the following example, in which the worker loop randomly generates an error.

wl_error = function(rush) {

  repeat {
    x1 = runif(1)
    x2 = runif(1)

    xss = list(list(x1 = x1, x2 = x2))

    key = rush$push_running_tasks(xss = xss)

    if (x1 > 0.90) {
      stop("Unexpected error")
    }

    rush$push_results(key, yss = list(list(y = x1 + x2)))
  }
}

To begin debugging, the worker loop is executed locally. This requires the initialization of a RushWorker instance. Although the rush worker is typically created during worker initialization, it can also be instantiated manually. The worker instance is then passed as an argument to the worker loop.

rush_worker = RushWorker$new("test", remote = FALSE)

wl_error(rush_worker)
Error in wl_error(rush_worker): Unexpected error
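
If no Redis server is at hand, the control flow of the loop can also be exercised with a stub. The sketch below is an assumption-laden stand-in, not the rush API: fake_worker() is a hypothetical object that only mimics the two methods that wl_error() calls. The call stack is captured at the point of the error, which is the information that traceback() would show interactively.

```r
wl_error = function(rush) {
  repeat {
    x1 = runif(1)
    x2 = runif(1)
    key = rush$push_running_tasks(xss = list(list(x1 = x1, x2 = x2)))
    if (x1 > 0.90) stop("Unexpected error")
    rush$push_results(key, yss = list(list(y = x1 + x2)))
  }
}

# Hypothetical stand-in for a RushWorker. It counts pushed tasks in an
# environment and never talks to Redis.
fake_worker = function() {
  state = new.env()
  state$n_pushed = 0L
  list(
    push_running_tasks = function(xss) {
      state$n_pushed = state$n_pushed + length(xss)
      sprintf("task-%d", state$n_pushed)
    },
    push_results = function(key, yss) invisible(key),
    n_pushed = function() state$n_pushed
  )
}

worker = fake_worker()

# record the call stack when the error is signalled, then catch the error
calls = NULL
err = tryCatch(
  withCallingHandlers(
    wl_error(worker),
    error = function(e) calls <<- sys.calls()),
  error = function(e) e)

conditionMessage(err)   # "Unexpected error"
worker$n_pushed()       # number of tasks pushed before the loop stopped
```

A stub of this kind only checks the logic of the loop. Anything that depends on Redis, such as task states or worker registration, still requires a real RushWorker instance.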

When an error is raised in the main process, the traceback() function can be invoked to examine the stack trace. Breakpoints may also be set within the worker loop to inspect the program state. Some errors, such as missing packages or undefined global variables, may not occur when the loop runs locally, because the main session already provides the packages and objects in question. Such errors surface only on the workers and can be identified with the $detect_lost_workers() method.

rush = rsh("test-error")

rush$start_local_workers(
  worker_loop = wl_error,
  n_workers = 1
)

Calling $detect_lost_workers() reports the lost worker together with the error message and the call stack logged by the worker process.

rush$detect_lost_workers()
ERROR [10:02:55.912] [rush] Lost worker 'definite_jerboa'
ERROR [10:02:55.914] [rush] Error in start_args$worker_loop(rush = rush) : Unexpected error
ERROR [10:02:55.915] [rush] Calls: <Anonymous> ... <Anonymous> -> eval.parent -> eval -> eval -> <Anonymous>
ERROR [10:02:55.915] [rush] Execution halted
ERROR [10:02:55.917] [rush] Lost 1 task(s): 98a18cc2-86e6-4b81-b3a7-4870f3900806
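
Undefined global variables on the workers can often be spotted before any worker is started. The sketch below uses codetools::findGlobals(), from the codetools package that ships with standard R installations, to list the names a worker loop references and keeps those that are defined next to the loop. The helper user_globals() is hypothetical and not part of rush; its result is only a hint for what to pass to the globals argument of $start_local_workers().

```r
branin = function(x1, x2) {
  (x2 - 5.1 / (4 * pi^2) * x1^2 + 5 / pi * x1 - 6)^2 + 10 * (1 - 1 / (8 * pi)) * cos(x1) + 10
}

wl = function(rush) {
  xs = list(x1 = runif(1, -5, 10), x2 = runif(1, 0, 15))
  key = rush$push_running_tasks(xss = list(xs))
  rush$push_results(key, yss = list(list(y = branin(xs$x1, xs$x2))))
}

# Hypothetical helper (not part of rush): names used by `fun` that are
# defined directly in the environment where `fun` was created.
user_globals = function(fun, env = environment(fun)) {
  found = codetools::findGlobals(fun, merge = FALSE)
  candidates = unique(c(found$functions, found$variables))
  Filter(function(name) exists(name, envir = env, inherits = FALSE), candidates)
}

user_globals(wl)   # includes "branin", which the workers would need
```

Static analysis of this kind cannot see names that are constructed at run time, for example via get() or do.call() with a string, so it complements rather than replaces a check with $detect_lost_workers().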