rush provides an error-handling mechanism for errors that occur while tasks are being executed. It covers scenarios ranging from standard R errors to more severe failures such as segmentation faults and network errors. If these mechanisms are not sufficient, the user can manually debug the worker loop.

Simple R Errors

To illustrate the error-handling mechanism in rush, we employ the random search example from the main vignette. This time we introduce a random error with a 50% probability. Within the worker loop, users are responsible for catching errors and marking the corresponding task as "failed" using the $push_failed() method.

library(rush)

branin = function(x1, x2) {
  (x2 - 5.1 / (4 * pi^2) * x1^2 + 5 / pi * x1 - 6)^2 + 10 * (1 - 1 / (8 * pi)) * cos(x1) + 10
}

wl_random_search = function(rush) {

  while(rush$n_finished_tasks < 100) {

    xs = list(x1 = runif(1, -5, 10), x2 = runif(1, 0, 15))
    key = rush$push_running_tasks(xss = list(xs))

    # Evaluate the task; on error, mark it as failed instead of stopping the loop
    tryCatch({
      if (runif(1) < 0.5) stop("Random Error")
      ys = list(y = branin(xs$x1, xs$x2))
      rush$push_results(key, yss = list(ys))
    }, error = function(e) {
      # Store the error message so it appears in the "message" column
      condition = list(message = e$message)
      rush$push_failed(key, conditions = list(condition))
    })
  }
}
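The catch-and-report pattern used in the loop above can be tried without a Redis connection. The following is a minimal sketch and not part of rush: `mock_rush()` is a hypothetical stand-in that only records calls to the three methods used in the loop, and `evaluate_task()` and `flaky()` are illustrative names. The error is made deterministic here so that the outcome is reproducible.

```r
# Hypothetical stand-in for a rush worker: it only records what was pushed.
mock_rush = function() {
  self = new.env()
  self$results = list()
  self$failed = list()
  self$counter = 0L
  self$push_running_tasks = function(xss) {
    self$counter = self$counter + 1L
    sprintf("task-%i", self$counter)
  }
  self$push_results = function(key, yss) {
    self$results[[key]] = yss[[1]]
    invisible(NULL)
  }
  self$push_failed = function(key, conditions) {
    self$failed[[key]] = conditions[[1]]
    invisible(NULL)
  }
  self
}

# Illustrative helper: evaluate one task and report either a result or a failure.
evaluate_task = function(rush, key, fun, xs) {
  tryCatch({
    ys = list(y = do.call(fun, xs))
    rush$push_results(key, yss = list(ys))
  }, error = function(e) {
    condition = list(message = conditionMessage(e))
    rush$push_failed(key, conditions = list(condition))
  })
  invisible(key)
}

# Toy objective that fails for negative x1 (assumption, for reproducibility).
flaky = function(x1, x2) {
  if (x1 < 0) stop("Random Error")
  x1 + x2
}

rush = mock_rush()
for (xs in list(list(x1 = 1, x2 = 2), list(x1 = -1, x2 = 2))) {
  key = rush$push_running_tasks(xss = list(xs))
  evaluate_task(rush, key, flaky, xs)
}
```

After the loop, the first task is stored under `rush$results` and the second under `rush$failed`, which mirrors the split between finished and failed tasks in a real network.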

We start the workers.

rush = rsh(
  network = "test-simply-error",
  config = redux::redis_config())

rush$start_local_workers(
  worker_loop = wl_random_search,
  n_workers = 4,
  globals = "branin")

When an error occurs, the task is marked as "failed", and the error message is stored in the "message" column. This approach ensures that errors do not interrupt the overall execution process. It allows for subsequent inspection of errors and the reevaluation of failed tasks as necessary.

rush$fetch_failed_tasks()
            x1         x2   pid     worker_id       message          keys
         <num>      <num> <int>        <char>        <char>        <char>
 1:  8.0906074 13.3591459  9848 compactabl... Random Err... 30420130-d...
 2:  5.8933154  0.6728762  9848 compactabl... Random Err... cbcc66a6-1...
 3:  5.4836588  4.2697229  9859 forlorn_dr... Random Err... ece4b7c7-d...
 4:  0.3803802 10.4098960  9848 compactabl... Random Err... 50fac141-3...
 5:  3.8186293  5.1883939  9848 compactabl... Random Err... 0364d271-7...
 6:  7.8743191 11.6032228  9848 compactabl... Random Err... 77c338c5-2...
 7:  5.2847575 12.6725262  9848 compactabl... Random Err... 2ebd2a49-c...
 8:  8.9994572 10.4234423  9848 compactabl... Random Err... 3c458c64-8...
 9:  7.8056723 10.3059636  9848 compactabl... Random Err... 115c4891-6...
10: -0.3684205  3.9795342  9864 flirty_nut... Random Err... 3f82d4f2-3...
11:  6.0778035  8.5303972  9859 forlorn_dr... Random Err... 68604671-9...
12:  1.7511949 10.2204729  9864 flirty_nut... Random Err... d8fbd5b8-5...
13:  1.7116343 12.4479887  9864 flirty_nut... Random Err... d350eb4d-d...
14: -4.7299152 12.5371453  9859 forlorn_dr... Random Err... c213abbb-7...
15:  3.1162054  3.9181199  9848 compactabl... Random Err... 2efea82a-0...
16: -4.3605668 10.5965080  9848 compactabl... Random Err... 92de9743-a...
17: -3.6551491  7.0420734  9859 forlorn_dr... Random Err... f1cc5999-c...
18:  9.7344035  8.3605810  9848 compactabl... Random Err... fb21e19b-3...
19: -4.8265745  0.2360828  9874 olympic_ha... Random Err... 58293682-0...
20:  3.9698489  9.9293751  9848 compactabl... Random Err... ec77ec02-5...
21: -2.9853390  0.3679575  9864 flirty_nut... Random Err... 88e2e84f-c...
22:  9.7877724  5.2888907  9864 flirty_nut... Random Err... 2adcc992-9...
23:  6.0454680  9.3768576  9859 forlorn_dr... Random Err... 1af1b35e-5...
24:  4.5997959  4.7931469  9848 compactabl... Random Err... 38396835-6...
25:  0.4833615  8.7592558  9848 compactabl... Random Err... a974380b-f...
26: -3.7787325  9.3050629  9874 olympic_ha... Random Err... e0bff727-c...
            x1         x2   pid     worker_id       message          keys
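Failed tasks keep their input values, so they can be re-evaluated later. The sketch below does this locally in the main process for the first three points of the table above; the data frame is typed in by hand for illustration, and with a live network it would be replaced by the result of `rush$fetch_failed_tasks()`. The results are not written back to rush here.

```r
branin = function(x1, x2) {
  (x2 - 5.1 / (4 * pi^2) * x1^2 + 5 / pi * x1 - 6)^2 + 10 * (1 - 1 / (8 * pi)) * cos(x1) + 10
}

# First three failed points, copied from the printed table
failed = data.frame(
  x1 = c(8.0906074, 5.8933154, 5.4836588),
  x2 = c(13.3591459, 0.6728762, 4.2697229)
)

# Re-evaluate each failed point without the injected random error
failed$y = mapply(branin, failed$x1, failed$x2)
failed
```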

Handling Failing Workers

The rush package provides mechanisms to address situations in which workers fail due to crashes or lost connections. Such failures may result in tasks remaining in the “running” state indefinitely. To illustrate this, we define a function that simulates a segmentation fault by terminating the worker process.

wl_failed_worker = function(rush) {
  xs = list(x1 = runif(1, -5, 10), x2 = runif(1, 0, 15))
  key = rush$push_running_tasks(xss = list(xs))

  tools::pskill(Sys.getpid(), tools::SIGKILL)
}

rush = rsh(network = "test-failed-workers")

worker_ids = rush$start_local_workers(
  worker_loop = wl_failed_worker,
  n_workers = 2)

The package offers the $detect_lost_workers() method, which is designed to identify and manage these occurrences.

rush$detect_lost_workers()

This method works for workers started with $start_local_workers() and $start_remote_workers(). Workers started with $worker_script() must be started with a heartbeat mechanism (see vignette).

The $detect_lost_workers() method can also restart lost workers automatically when the option restart_workers = TRUE is specified; automatic restarting is only available for local workers. Alternatively, lost workers can be restarted manually using the $restart_workers() method. When a worker fails, the status of the task that caused the failure is set to "failed".

rush$fetch_failed_tasks()
          x1        x2   pid     worker_id       message          keys
       <num>     <num> <int>        <char>        <char>        <char>
1: 3.3845317 14.766141  9997 fortified_... Worker has... bd33cd71-4...
2: 0.9302363  1.979994  9999 supersecul... Worker has... b6a395c8-8...
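Assuming lost workers are only picked up when $detect_lost_workers() is called, a controller process may want to call it at regular intervals. The helper below is a minimal sketch of such a polling loop; `watch_workers()` and its arguments `n_checks`, `interval` and `restart` are illustrative names, not part of rush. Only the `restart_workers` option mentioned above is passed through.

```r
# Illustrative helper: poll for lost workers a fixed number of times.
watch_workers = function(rush, n_checks = 10, interval = 1, restart = FALSE) {
  for (i in seq_len(n_checks)) {
    # Marks tasks of lost workers as failed; optionally restarts local workers
    rush$detect_lost_workers(restart_workers = restart)
    Sys.sleep(interval)
  }
  invisible(rush)
}

# Usage with a live network (requires a running Redis server):
# watch_workers(rush, n_checks = 30, interval = 2)
# rush$fetch_failed_tasks()
```

A fixed number of checks keeps the sketch from running forever; a real controller would more likely stop once all tasks are finished.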

Debugging

When the worker loop fails unexpectedly due to an uncaught error, it is necessary to debug the worker loop. Consider the following example, in which the worker loop randomly generates an error.

wl_error = function(rush) {

  repeat {
    x1 = runif(1)
    x2 = runif(1)

    xss = list(list(x1 = x1, x2 = x2))

    key = rush$push_running_tasks(xss = xss)

    if (x1 > 0.90) {
      stop("Unexpected error")
    }

    rush$push_results(key, yss = list(list(y = x1 + x2)))
  }
}

To begin debugging, the worker loop is executed locally. This requires the initialization of a RushWorker instance. Although the rush worker is typically created during worker initialization, it can also be instantiated manually. The worker instance is then passed as an argument to the worker loop.

rush_worker = RushWorker$new("test", remote = FALSE)

wl_error(rush_worker)
Error in wl_error(rush_worker): Unexpected error
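When the loop is run in the main process as above, the call stack at the point of failure can also be captured programmatically. This is a minimal base R sketch that does not involve rush: `wl_error_demo()` is a hypothetical stand-in for a worker loop that always fails, and `withCallingHandlers()` records the stack before `tryCatch()` unwinds it.

```r
# Hypothetical stand-in for a worker loop that raises an error
wl_error_demo = function() {
  stop("Unexpected error")
}

calls = NULL
msg = tryCatch(
  withCallingHandlers(
    wl_error_demo(),
    error = function(e) {
      # The handler runs while the failing frames are still on the stack
      calls <<- sys.calls()
    }
  ),
  error = function(e) conditionMessage(e)
)

msg                # error message
length(calls) > 0  # TRUE if a call stack was recorded
```

Printing `calls` shows the sequence of calls that led to the error, similar to what `traceback()` reports interactively.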

When an error is raised in the main process, the traceback() function can be called to examine the stack trace. Breakpoints can also be set within the worker loop to inspect the program state. This gives fine-grained control over the debugging process. Some errors, such as missing packages or undefined global variables, may not occur when the loop runs locally, because the main session already has them available. Workers that crash for such reasons can be identified using the $detect_lost_workers() method.

rush = rsh("test-error")

rush$start_local_workers(
  worker_loop = wl_error,
  n_workers = 1
)

The $detect_lost_workers() method can be used to identify lost workers.

rush$detect_lost_workers()

Output and message logs can be written to files by specifying the message_log and output_log arguments.

rush = rsh("test-error")

message_log = tempdir()
output_log = tempdir()

worker_ids = rush$start_local_workers(
  worker_loop = wl_error,
  n_workers = 1,
  message_log = message_log,
  output_log = output_log
)

Sys.sleep(5)

readLines(file.path(message_log, sprintf("message_%s.log", worker_ids[1])))
[1] "Debug message logging on worker glistening_australiancurlew started"
[2] "Error in start_args$worker_loop(rush = rush) : Unexpected error"
[3] "Calls: <Anonymous> ... <Anonymous> -> eval.parent -> eval -> eval -> <Anonymous>"
[4] "Execution halted"                                                                
readLines(file.path(output_log, sprintf("output_%s.log", worker_ids[1])))
[1] "[1] \"Debug output logging on worker glistening_australiancurlew started\""
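With several workers, reading each log file by hand becomes tedious. The sketch below collects error lines from all message logs in a directory. It assumes the `message_<worker_id>.log` naming used above; `scan_message_logs()` is an illustrative helper, not part of rush. To stay runnable without Redis, it first writes a demo log containing the lines shown above.

```r
# Illustrative helper: return lines matching `pattern` from every message log.
scan_message_logs = function(log_dir, pattern = "^Error") {
  files = list.files(log_dir, pattern = "^message_.*\\.log$", full.names = TRUE)
  hits = lapply(files, function(f) grep(pattern, readLines(f), value = TRUE))
  names(hits) = basename(files)
  # Keep only logs that contain at least one matching line
  hits[lengths(hits) > 0]
}

# Demo log, reproducing the message log printed above
log_dir = tempfile("rush-logs-")
dir.create(log_dir)
writeLines(c(
  "Debug message logging on worker glistening_australiancurlew started",
  "Error in start_args$worker_loop(rush = rush) : Unexpected error",
  "Calls: <Anonymous> ... <Anonymous> -> eval.parent -> eval -> eval -> <Anonymous>",
  "Execution halted"
), file.path(log_dir, "message_glistening_australiancurlew.log"))

hits = scan_message_logs(log_dir)
hits
```

In a real session, `log_dir` would be the directory passed as `message_log` to $start_local_workers().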