Controller • rush

A rush network consists of thousands of workers started on different machines. This article explains how to start workers in a rush network. Rush offers three ways to start workers: local workers with processx, remote workers with mirai and script workers.

Local Workers

We use the random search example from the Rush article to demonstrate how the controller works.

library(rush)

wl_random_search = function(rush, branin) {
  while(TRUE) {

    xs = list(x1 = runif(1, -5, 10), x2 = runif(1, 0, 15))
    key = rush$push_running_tasks(xss = list(xs))

    ys = list(y = branin(xs$x1, xs$x2))
    rush$push_results(key, yss = list(ys))
  }
}

rush = rsh(
  network = "test-network",
  config = redux::redis_config())

branin = function(x1, x2) {
  (x2 - 5.1 / (4 * pi^2) * x1^2 + 5 / pi * x1 - 6)^2 + 10 * (1 - 1 / (8 * pi)) * cos(x1) + 10
}

Start Workers

Workers may be initiated locally or remotely. Local workers run on the same machine as the controller, whereas remote workers operate on separate machines. The $start_local_workers() method initiates local workers using the processx package. The n_workers parameter specifies the number of workers to launch. The worker_loop parameter defines the function executed by each worker. If the worker_loop function depends on additional objects, these can be passed as arguments to worker_loop. Required packages for the worker_loop can be specified using the packages parameter.

worker_ids = rush$start_local_workers(
  worker_loop = wl_random_search,
  branin = branin,
  n_workers = 2)

Worker information is accessible through the $worker_info method. Each worker is identified by a worker_id. The pid field denotes the process identifier, and the hostname field indicates the machine name. The remote column specifies whether the worker is remote, and the heartbeat column indicates the presence of a heartbeat process. The state column indicates the current state of the worker. Possible states include "running", "terminated", "killed", and "lost". Heartbeat mechanisms are discussed in the Error Handling section.

rush$worker_info

       worker_id   pid      hostname heartbeat   state
          <char> <int>        <char>    <lgcl>  <char>
1: overrated_...  8888 runnervmwf...     FALSE running
2: addlebrain...  8899 runnervmwf...     FALSE running

Additional workers may be added to the network at any time.

rush$start_local_workers(
  worker_loop = wl_random_search,
  branin = branin,
  n_workers = 2)

rush$worker_info

       worker_id   pid      hostname heartbeat   state
          <char> <int>        <char>    <lgcl>  <char>
1: overrated_...  8888 runnervmwf...     FALSE running
2: addlebrain...  8899 runnervmwf...     FALSE running
3: nonmathema...  8963 runnervmwf...     FALSE running
4: rollable_a...  8965 runnervmwf...     FALSE running

Rush Plan

When rush is integrated into a third-party package, the starting of workers is typically managed by the package itself. In such cases, users may configure worker options by invoking the rush_plan() function. This function allows explicit specification of the number of workers, the type of workers, and the configuration for connecting to the Redis database.

rush_plan(n_workers = 2, config = redux::redis_config(), worker_type = "local")

Passing data to workers

Objects required by the worker loop can be passed as arguments to $start_local_workers() / $start_remote_workers(). These arguments are serialized and stored in the Redis database as part of the worker configuration. Upon initialization, each worker retrieves and unserializes the worker configuration before invoking the worker loop.

Note

The maximum size of a Redis string is 512 MiB. If the serialized worker configuration exceeds this limit, Rush will raise an error. In scenarios where both the controller and the workers have access to a shared file system, Rush will instead write large objects to disk. The large_objects_path argument of rush_plan() specifies the directory used for storing such large objects.

Stop Worker

Workers can be stopped individually or all at once. To terminate a specific worker, the $stop_workers() method is invoked with the corresponding worker_ids argument.

rush$stop_workers(worker_ids = worker_ids[1])

This command terminates the selected worker process.

rush$worker_info

       worker_id   pid      hostname heartbeat      state
          <char> <int>        <char>    <lgcl>     <char>
1: addlebrain...  8899 runnervmwf...     FALSE    running
2: nonmathema...  8963 runnervmwf...     FALSE    running
3: rollable_a...  8965 runnervmwf...     FALSE    running
4: overrated_...  8888 runnervmwf...     FALSE terminated

To stop all workers and reset the network, the $reset() method is used.

rush$reset()

Instead of killing the worker processes, it is also possible to send a terminate signal to the worker. The worker then terminates after it has finished the current task. The worker loop must implement the rush$terminated flag. Then the rush controller can terminate the optimization.

wl_random_search = function(rush, branin) {
  while(!rush$terminated) {

    xs = list(x1 = runif(1, -5, 10), x2 = runif(1, 0, 15))
    key = rush$push_running_tasks(xss = list(xs))

    ys = list(y = branin(xs$x1, xs$x2))
    rush$push_results(key, yss = list(ys))
  }
}

rush = rsh(
  network = "test-random-search-terminate",
  config = redux::redis_config())

rush$start_local_workers(
  worker_loop = wl_random_search,
  n_workers = 2,
  branin = branin)

The random search proceeds as usual.

rush$fetch_finished_tasks()

         worker_id         x1        x2         y
            <char>      <num>     <num>     <num>
  1: decipherab...  0.7669128  1.126755 30.816859
  2: decipherab...  3.3627731  1.471132  1.038462
  3: decipherab...  9.0965619 10.273772 65.901677
  4: decipherab... -1.7619444  7.322755 11.719586
  5: decipherab... -2.1657590 14.239285 22.144437
 ---
567: bohrium_an... -2.8609888 10.838236  1.370318
568: decipherab...  3.5747760 11.577518 93.754153
569: bohrium_an...  1.8872181  2.360237  8.213926
570: decipherab... -1.1501926  4.866211 23.750643
571: bohrium_an...  3.2400858  3.508041  2.156854
                                     keys
                                   <list>
  1: 52f12f31-59b2-47f8-a953-e0f86ef18820
  2: 283a640d-d297-4ba6-9281-246887bde5ab
  3: 32c44c29-4388-4ec6-be42-6a7f51d431f6
  4: 6e654531-42e7-404b-9872-bec0f0ab8250
  5: fd8a27e6-611a-406b-bd6f-868f45730677
 ---
567: 67cab484-c4c9-411a-a9bc-c8b7654a2523
568: 2e2a2e00-81c1-4414-a44d-f98067fd69cb
569: 9f6f9395-5f9e-4c89-ad0c-15f04ad208f8
570: ada98194-db22-412d-a957-056f407443e5
571: 89588a7d-d7e8-4eb2-97b8-29fc655a1645

To terminate the optimization, the following command is used.

rush$stop_workers(type = "terminate")

The workers are terminated.

rush$worker_info

       worker_id   pid      hostname heartbeat      state
          <char> <int>        <char>    <lgcl>     <char>
1: bohrium_an...  9062 runnervmwf...     FALSE terminated
2: decipherab...  9064 runnervmwf...     FALSE terminated

Failed Workers

We simulate a segfault on the worker by killing the worker process.

rush = rsh(network = "test-failed-workers")

wl_failed_worker = function(rush) {
  tools::pskill(Sys.getpid(), tools::SIGKILL)
}

Workers are then started using the faulty worker loop.

worker_ids =  rush$start_local_workers(
  worker_loop = wl_failed_worker,
  n_workers = 2)

The $detect_lost_workers() method is used to identify failed workers. When a worker is detected, its state is updated to "lost".

rush$detect_lost_workers()

[1] "nervous_aardwolf"        "aerophobic_whitepelican"

rush$worker_info

       worker_id   pid      hostname heartbeat      state
          <char> <int>        <char>    <lgcl>     <char>
1: nervous_aa...  9138 runnervmwf...     FALSE terminated
2: aerophobic...  9140 runnervmwf...     FALSE terminated

Log Messages

Workers record all messages generated using the lgr package to the database. The lgr_thresholds argument of $start_local_workers() specifies the logging level for each logger, e.g. c("mlr3/rush" = "debug"). While enabling log message storage introduces a minor performance overhead, it is valuable for debugging purposes. By default, log messages are not stored. To enable logging, workers are started with the desired logging threshold.

rush = rsh(network = "test-log-messages")

wl_log_message = function(rush) {
  lg = lgr::get_logger("mlr3/rush")
  lg$info("This is an info message from worker %s", rush$worker_id)
}

rush$start_local_workers(
  worker_loop = wl_log_message,
  n_workers = 2,
  lgr_thresholds = c(rush = "info"))

The most recent log messages can be printed as follows.

Sys.sleep(1)
rush$print_log()

To retrieve all log entries, use the $read_log() method.

rush$read_log()

Null data.table (0 rows and 0 cols)

We reset the network.

rush$reset()

Remote Workers

The mirai package provides a straightforward mechanism for launching rush workers on remote machines. mirai manages daemons, which are persistent background processes capable of executing arbitrary R code in parallel. These daemons communicate with the main session.

library(mirai)

Start Workers

Usually mirai is used to start daemons on remote machines but it can also be used to start local daemons.

daemons(
  n = 2,
  url = host_url()
)

Daemons may also be launched on a remote machine via SSH.

daemons(
  n = 2L,
  url = host_url(port = 5555),
  remote = ssh_config(remotes = "ssh://10.75.32.90")
)

On high performance computing clusters, daemons can be started using a scheduler.

daemons(
  n = 2L,
  url = host_url(),
  remote = remote_config(
    command = "sbatch",
    args = c("--mem 512", "-n 1", "--wrap", "."),
    rscript = file.path(R.home("bin"), "Rscript"),
    quote = TRUE
  )
)

We define a worker loop.

wl_random_search = function(rush, branin) {
  while(TRUE) {

    xs = list(x1 = runif(1, -5, 10), x2 = runif(1, 0, 15))
    key = rush$push_running_tasks(xss = list(xs))

    ys = list(y = branin(xs$x1, xs$x2))
    rush$push_results(key, yss = list(ys))
  }
}

rush = rsh(
  network = "test-network",
  config = redux::redis_config())

We start two new daemons.

daemons(2)

After the daemons are started, we can start the remote workers.

worker_ids = rush$start_remote_workers(
  worker_loop = wl_random_search,
  n_workers = 2,
  branin = branin)

Warning:
✖ $start_remote_workers() is deprecated and will be removed in the future.
→ Class: Mlr3WarningDeprecated

rush$worker_info

       worker_id   pid      hostname heartbeat   state
          <char> <int>        <char>    <lgcl>  <char>
1: unrelative...  9326 runnervmwf...     FALSE running
2: worldlywis...  9324 runnervmwf...     FALSE running

We stop the daemons.

rush$reset()

Failed Workers

Failed workers started with mirai are also detected by the controller. We simulate a segfault on the worker by killing the worker process.

rush = rsh(network = "test-failed-mirai-workers")

wl_failed_worker = function(rush) {
  tools::pskill(Sys.getpid(), tools::SIGKILL)
}

Start two new daemons.

daemons(2)

Start two remote workers.

worker_ids = rush$start_remote_workers(
  worker_loop = wl_failed_worker,
  n_workers = 2)

rush$detect_lost_workers()

[1] "overrational_woollybearcaterpillar" "acrid_lhasaapso"

A segmentation fault also terminates the associated mirai daemon. Therefore, it is necessary to restart the daemon before restarting the workers.

daemons(2)

Script Workers

The most flexible method for starting workers is to use a script generated with the $worker_script() method. This script can be executed either on the local machine or on a remote machine. The only requirement is that the machine is capable of running R scripts and has access to the Redis database.

rush = rsh(
  network = "test-script-workers",
  config = redux::redis_config())

rush$worker_script(
  worker_loop = wl_random_search)

Error Handling

Workers started with processx or mirai are monitored by these packages. The heartbeat is a mechanism to monitor the status of script workers. The mechanism consists of a heartbeat key with a set expiration timeout and a dedicated heartbeat process that refreshes the timeout periodically. The heartbeat process is started with callr and is linked to the main process of the worker. In the event of a worker’s failure, the associated heartbeat process also ceases to function, thus halting the renewal of the timeout. The absence of the heartbeat key acts as an indicator to the controller that the worker is no longer operational. Consequently, the controller updates the worker’s status to "lost".

Heartbeats are initiated upon worker startup by specifying the heartbeat_period and heartbeat_expire parameters. The heartbeat_period defines the frequency at which the heartbeat process will update the timeout. The heartbeat_expire sets the duration, in seconds, before the heartbeat key expires. The expiration time should be set to a value greater than the heartbeat period to ensure that the heartbeat process has sufficient time to refresh the timeout.

rush$worker_script(
  worker_loop = wl_random_search,
  heartbeat_period = 1,
  heartbeat_expire = 3)

The heartbeat process is also the only way to kill a script worker. The $stop_workers(type = "kill") method pushes a kill signal to the heartbeat process. The heartbeat process terminates the main process of the worker.