rush provides an error-handling mechanism for errors that occur while tasks are executed. It covers a range of scenarios, from standard R errors to harder cases such as segmentation faults and network errors. If all of this fails, the user can manually debug the worker loop.

Simple R Errors

To illustrate the error-handling mechanism in rush, we employ the random search example from the main vignette. This time we introduce a random error with a 50% probability. Within the worker loop, users are responsible for catching errors and marking the corresponding task as "failed" using the $push_failed() method.

library(rush)

branin = function(x1, x2) {
  (x2 - 5.1 / (4 * pi^2) * x1^2 + 5 / pi * x1 - 6)^2 + 10 * (1 - 1 / (8 * pi)) * cos(x1) + 10
}

wl_random_search = function(rush) {

  while(rush$n_finished_tasks < 100) {

    xs = list(x1 = runif(1, -5, 10), x2 = runif(1, 0, 15))
    key = rush$push_running_tasks(xss = list(xs))

    tryCatch({
      if (runif(1) < 0.5) stop("Random Error")
      ys = list(y = branin(xs$x1, xs$x2))
      rush$push_results(key, yss = list(ys))
    }, error = function(e) {
      condition = list(message = e$message)
      rush$push_failed(key, conditions = list(condition))
    })
  }
}
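
The capture pattern used in the worker loop can be tried without a Redis connection. The following is a minimal sketch in base R; `try_task()` and `toy()` are hypothetical names introduced for illustration and are not part of rush. The helper returns either the result or a condition list of the same shape that the worker loop passes to `$push_failed()`.

```r
# Hypothetical helper, not part of rush: evaluates a function on one
# parameter set and returns either the result or the captured condition.
try_task = function(fun, xs) {
  tryCatch({
    list(state = "finished", ys = list(y = do.call(fun, xs)))
  }, error = function(e) {
    # conditionMessage() extracts the error text, equivalent to e$message
    list(state = "failed", condition = list(message = conditionMessage(e)))
  })
}

# Toy objective (an assumption for this sketch) that fails for negative x1
toy = function(x1, x2) {
  if (x1 < 0) stop("x1 must be non-negative")
  x1 + x2
}

ok = try_task(toy, list(x1 = 1, x2 = 2))
bad = try_task(toy, list(x1 = -1, x2 = 2))

ok$state               # "finished"
bad$state              # "failed"
bad$condition$message  # "x1 must be non-negative"
```

Because the error is converted into a plain list, the loop continues with the next task instead of terminating the worker.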

We start the workers.

rush = rsh(
  network = "test-simply-error",
  config = redux::redis_config())

rush$start_local_workers(
  worker_loop = wl_random_search,
  n_workers = 4,
  globals = "branin")

When an error occurs, the task is marked as "failed", and the error message is stored in the "message" column. This approach ensures that errors do not interrupt the overall execution process. It allows for subsequent inspection of errors and the reevaluation of failed tasks as necessary.

rush$fetch_failed_tasks()
            x1         x2   pid     worker_id       message          keys
         <num>      <num> <int>        <char>        <char>        <char>
 1: -2.4888021 11.5313612 10106 spellbound... Random Err... 340836bb-4...
 2: -4.6737684 14.7951022 10106 spellbound... Random Err... bcd2a153-d...
 3:  8.6032609 10.8376433 10106 spellbound... Random Err... d9e667df-f...
 4:  7.4260542  2.7853557 10106 spellbound... Random Err... b105c0b4-6...
 5:  6.9466814  9.5998517 10106 spellbound... Random Err... 64f073b3-2...
 6: -2.5192185 13.5206717 10133 semiagricu... Random Err... 4370bf48-8...
 7:  7.4862806  6.7489541 10117 famished_b... Random Err... 5d5814be-2...
 8: -3.0374625 10.0479613 10106 spellbound... Random Err... a2bc642f-d...
 9: -4.0258764  5.4925936 10106 spellbound... Random Err... e1a81029-b...
10:  8.3215845  4.8944095 10126 proportion... Random Err... efd1d921-1...
11:  1.6669209  8.7738077 10106 spellbound... Random Err... c80dd70b-a...
12:  4.2122728  3.9982916 10106 spellbound... Random Err... 6bf46caa-2...
13:  2.5754890  5.5273478 10106 spellbound... Random Err... 121eaadf-9...
14: -4.8361909  2.0938835 10117 famished_b... Random Err... e997c641-4...
15: -1.9505247 13.7498667 10106 spellbound... Random Err... 195aa7d1-9...
16: -1.3537847 13.6833866 10133 semiagricu... Random Err... 3123f6ac-9...
17:  8.6005729 12.4618982 10117 famished_b... Random Err... 65170e84-1...
18:  4.3166025  1.8707294 10133 semiagricu... Random Err... 142ec2b0-7...
19:  4.8502890  4.4391489 10126 proportion... Random Err... a48e5453-c...
20:  5.0583694  1.7940419 10133 semiagricu... Random Err... 6c0b61a3-9...
21:  4.0437361 14.5826796 10117 famished_b... Random Err... 269487e8-c...
22: -0.7518501  3.2461866 10126 proportion... Random Err... 77194ff7-9...
23: -0.2936387  2.0670232 10133 semiagricu... Random Err... 92324856-f...
24:  4.0451331 12.0258418 10106 spellbound... Random Err... a71b9706-8...
25:  8.9136333  7.9830262 10133 semiagricu... Random Err... 6cc04b86-f...
26:  0.9300570  4.5352705 10106 spellbound... Random Err... c57963ed-c...
27: -2.3083398  5.4769315 10117 famished_b... Random Err... e6bec24d-5...
28:  7.5681165  5.9405446 10126 proportion... Random Err... d059b27f-d...
29:  7.5113645  0.4358976 10106 spellbound... Random Err... fef7e9de-d...
30: -4.2595155  5.0444093 10117 famished_b... Random Err... b1956399-1...
31:  5.5176711 12.3243642 10133 semiagricu... Random Err... d5c16408-3...
32:  0.1168967  0.6783157 10106 spellbound... Random Err... ae56c754-f...
33:  3.7366811  7.4044096 10117 famished_b... Random Err... 20336fdd-1...
34:  4.4154256  2.8982955 10126 proportion... Random Err... 056fa2c2-a...
35:  5.1985705 11.5843411 10133 semiagricu... Random Err... d8fa3035-2...
36:  5.2449016  2.2199768 10117 famished_b... Random Err... bb30b3a6-5...
37:  5.3981037 12.3487347 10106 spellbound... Random Err... af5fa42b-b...
38:  5.9883665  6.3681756 10126 proportion... Random Err... 28485731-e...
39:  3.0295212  2.8790237 10117 famished_b... Random Err... 5827df09-2...
40: -4.6868331  1.4067136 10106 spellbound... Random Err... 807a105c-8...
41: -1.7844081 10.2121463 10126 proportion... Random Err... 7994afde-2...
            x1         x2   pid     worker_id       message          keys
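
Re-evaluating failed tasks amounts to applying the objective again to the stored parameter values. The sketch below assumes the failed tasks are available as a table with columns `x1` and `x2`; it uses a plain `data.frame` holding the first two rows printed above instead of a live `fetch_failed_tasks()` call, and it does not write anything back to rush.

```r
branin = function(x1, x2) {
  (x2 - 5.1 / (4 * pi^2) * x1^2 + 5 / pi * x1 - 6)^2 +
    10 * (1 - 1 / (8 * pi)) * cos(x1) + 10
}

# Stand-in for the table of failed tasks (first two rows from the output above)
failed = data.frame(
  x1 = c(-2.4888021, -4.6737684),
  x2 = c(11.5313612, 14.7951022)
)

# Evaluate the objective once per failed row
failed$y = mapply(branin, failed$x1, failed$x2)
failed
```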

Handling Failing Workers

The rush package provides mechanisms to address situations in which workers fail due to crashes or lost connections. Such failures may leave tasks in the "running" state indefinitely. To illustrate this, we define a worker loop that simulates a segmentation fault by killing its own process.

wl_failed_worker = function(rush) {
  xs = list(x1 = runif(1, -5, 10), x2 = runif(1, 0, 15))
  key = rush$push_running_tasks(xss = list(xs))

  tools::pskill(Sys.getpid(), tools::SIGKILL)
}

rush = rsh(network = "test-failed-workers")

worker_ids = rush$start_local_workers(
  worker_loop = wl_failed_worker,
  n_workers = 2)

The package offers the $detect_lost_workers() method, which is designed to identify and manage these occurrences.

rush$detect_lost_workers()

This method works for workers started with $start_local_workers() and $start_remote_workers(). Workers started with $worker_script() must be started with a heartbeat mechanism (see vignette).

The $detect_lost_workers() method also supports automatic restarting of lost workers when the option restart_workers = TRUE is specified. Alternatively, lost workers may be restarted manually using the $restart_workers() method. Automatic restarting is only available for local workers.

When a worker fails, the status of the task that caused the failure is set to "failed".

rush$fetch_failed_tasks()
          x1       x2   pid     worker_id       message          keys
       <num>    <num> <int>        <char>        <char>        <char>
1:  8.166792 5.300527 10255 nonpoisono... Worker has... 33de51d9-6...
2: -3.191851 7.284502 10258 cactuslike... Worker has... 9ff00d83-a...

Debugging

When the worker loop fails unexpectedly due to an uncaught error, it is necessary to debug the worker loop. Consider the following example, in which the worker loop randomly generates an error.

wl_error = function(rush) {

  repeat {
    x1 = runif(1)
    x2 = runif(1)

    xss = list(list(x1 = x1, x2 = x2))

    key = rush$push_running_tasks(xss = xss)

    if (x1 > 0.90) {
      stop("Unexpected error")
    }

    rush$push_results(key, yss = list(list(y = x1 + x2)))
  }
}

To begin debugging, the worker loop is executed locally. This requires the initialization of a RushWorker instance. Although the rush worker is typically created during worker initialization, it can also be instantiated manually. The worker instance is then passed as an argument to the worker loop.

rush_worker = RushWorker$new("test", remote = FALSE)

wl_error(rush_worker)
Error in wl_error(rush_worker): Unexpected error
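
Once an error is reproduced in the main process, the call stack can also be recorded programmatically, which helps in non-interactive sessions. The following base R sketch is independent of rush: `inner()` and `outer()` are hypothetical stand-ins for code called from a worker loop, and `withCallingHandlers()` with `sys.calls()` is used here as one possible technique, not as something rush does internally.

```r
# Hypothetical stand-ins for functions called inside a worker loop
inner = function(x) {
  if (x > 0.90) stop("Unexpected error")
  x
}
outer = function(x) inner(x)

calls = NULL
result = tryCatch(
  withCallingHandlers(
    outer(0.95),
    error = function(e) {
      # The handler runs while the failing frames are still on the stack
      calls <<- as.list(sys.calls())
    }
  ),
  error = function(e) conditionMessage(e)
)

# Deparse the recorded calls to see which functions were active
stack = vapply(calls, function(cl) paste(deparse(cl), collapse = " "), character(1))

result  # "Unexpected error"
stack
```

The calling handler only records the stack; the error is still passed on to the enclosing `tryCatch()`, so the failure is not silently swallowed.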

When an error is raised in the main process, the traceback() function can be invoked to examine the stack trace. Breakpoints may also be set within the worker loop to inspect the program state. This approach provides substantial control over the debugging process. Certain errors, such as missing packages or undefined global variables, may not be encountered when running locally. However, such issues can be readily identified using the $detect_lost_workers() method.

rush = rsh("test-error")

rush$start_local_workers(
  worker_loop = wl_error,
  n_workers = 1
)

The $detect_lost_workers() method can be used to identify lost workers.

rush$detect_lost_workers()

Output and message logs can be written to files by specifying the message_log and output_log arguments.

rush = rsh("test-error")

message_log = tempdir()
output_log = tempdir()

worker_ids = rush$start_local_workers(
  worker_loop = wl_error,
  n_workers = 1,
  message_log = message_log,
  output_log = output_log
)

Sys.sleep(5)

readLines(file.path(message_log, sprintf("message_%s.log", worker_ids[1])))
[1] "Debug message logging on worker subspheric_tomtit started"
[2] "Error in start_args$worker_loop(rush = rush) : Unexpected error"
[3] "Calls: <Anonymous> ... <Anonymous> -> eval.parent -> eval -> eval -> <Anonymous>"
[4] "Execution halted"
readLines(file.path(output_log, sprintf("output_%s.log", worker_ids[1])))
[1] "[1] \"Debug output logging on worker subspheric_tomtit started\""
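
With several workers, reading each log file by hand becomes tedious. The sketch below collects all lines that start with "Error" from every message log in a directory. It assumes the `message_<worker_id>.log` naming shown above; `find_log_errors()` is a hypothetical helper, not a rush function, and the demonstration writes a mock log file instead of starting real workers.

```r
# Hypothetical helper: collect error lines from all message logs in a directory
find_log_errors = function(log_dir) {
  files = list.files(log_dir, pattern = "^message_.*\\.log$", full.names = TRUE)
  errors = lapply(files, function(f) grep("^Error", readLines(f), value = TRUE))
  names(errors) = basename(files)
  errors
}

# Demonstration with a mock log file (worker name is made up)
demo_dir = file.path(tempdir(), "rush-log-demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines(c(
  "Debug message logging on worker demo_worker started",
  "Error in start_args$worker_loop(rush = rush) : Unexpected error",
  "Execution halted"
), file.path(demo_dir, "message_demo_worker.log"))

log_errors = find_log_errors(demo_dir)
log_errors
```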