Overview
tq_apply() provides a simplified workflow for running
parallel tasks on HPC clusters. It combines multiple steps (project
creation, resource assignment, task addition, and worker scheduling)
into a single function call, similar to base R’s lapply()
or sapply().
This is the easiest way to get started with taskqueue if
you:
- Have a simple function to run multiple times
- Don’t need complex project management
- Want to quickly parallelize work on an HPC cluster
Before using taskqueue, ensure you have:
- PostgreSQL installed and configured (see the PostgreSQL Setup vignette)
- SSH access configured for remote resources (see the SSH Setup vignette)
- The database initialized
- A resource already defined:

resource_add(
  name = "hpc",
  type = "slurm",
  host = "hpc.example.com",
  nodename = "hpc",
  workers = 500,
  log_folder = "/home/user/log_folder/"
)
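Before calling tq_apply(), you can confirm the resource is registered; resource_list() (used again under Troubleshooting below) shows the available resources:

# Should list the "hpc" resource defined above
resource_list()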
Basic Usage
The simplest use of tq_apply() requires just a few
arguments:
library(taskqueue)
# Define your function
my_simulation <- function(i) {
  # Your computation here
  result <- i^2
  Sys.sleep(1) # Simulate some work
  return(result)
}

# Run 100 tasks in parallel
tq_apply(
  n = 100,
  fun = my_simulation,
  project = "my_project",
  resource = "hpc"
)

This will:
- Create or update the project “my_project”
- Add the resource “hpc” to the project
- Create 100 tasks
- Schedule workers on the SLURM cluster
- Execute my_simulation(1), my_simulation(2), …, my_simulation(100) in parallel
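Conceptually, this is the cluster analogue of a plain lapply() call. The sketch below shows the serial equivalent; note that tq_apply() does not collect results back into your session, so your function should write its own output (see Best Practices below):

# Serial base R equivalent; runs locally and returns a list of results,
# whereas tq_apply() runs the tasks on the cluster instead
results <- lapply(1:100, my_simulation)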
Function Arguments
Required Arguments
- n: Number of tasks to run (integer)
- fun: The function to execute for each task
- project: Project name (string)
- resource: Resource name (string, must already exist)

Optional Arguments
- memory: Memory per task in GB (default: 10)
- hour: Maximum runtime in hours (default: 24)
- account: Account name for cluster billing (optional)
- working_dir: Working directory on the cluster (default: getwd())
- ...: Additional arguments passed to your function
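As a sketch, here is a call that sets every optional argument; the account name and paths are placeholders, not values the package requires:

tq_apply(
  n = 100,
  fun = my_simulation,
  project = "my_project",
  resource = "hpc",
  memory = 16,             # 16 GB per task (default: 10)
  hour = 6,                # 6 hour time limit (default: 24)
  account = "my_account",  # placeholder billing account
  working_dir = "/home/user/my_project" # default: getwd()
)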
Passing Arguments to Your Function
You can pass additional arguments to your function using
...:
my_function <- function(i, multiplier, offset = 0) {
  result <- i * multiplier + offset
  return(result)
}

tq_apply(
  n = 50,
  fun = my_function,
  project = "test_args",
  resource = "hpc",
  multiplier = 10, # Passed to my_function
  offset = 5       # Passed to my_function
)

Each task will call:
- Task 1: my_function(1, multiplier = 10, offset = 5)
- Task 2: my_function(2, multiplier = 10, offset = 5)
- And so on…
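This mirrors how base R's lapply() forwards extra arguments through ..., so the serial equivalent would be:

# Serial equivalent: lapply() forwards extra arguments the same way
results <- lapply(1:50, my_function, multiplier = 10, offset = 5)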
Complete Example
Here’s a practical example running a Monte Carlo simulation:
library(taskqueue)
# Define simulation function
run_monte_carlo <- function(task_id, n_samples = 10000, seed_base = 12345) {
  # Set unique seed for each task
  set.seed(seed_base + task_id)

  # Run simulation
  samples <- rnorm(n_samples)
  result <- list(
    task_id = task_id,
    mean = mean(samples),
    sd = sd(samples),
    quantiles = quantile(samples, probs = c(0.025, 0.5, 0.975))
  )

  # Save results
  out_file <- sprintf("results/simulation_%04d.Rds", task_id)
  dir.create("results", showWarnings = FALSE)
  saveRDS(result, out_file)

  return(invisible(NULL))
}
# Run 1000 simulations in parallel
tq_apply(
  n = 1000,
  fun = run_monte_carlo,
  project = "monte_carlo_study",
  resource = "hpc",
  memory = 8,        # 8 GB per task
  hour = 2,          # 2 hour time limit
  working_dir = "/home/user/monte_carlo",
  n_samples = 50000, # Argument for run_monte_carlo
  seed_base = 99999  # Argument for run_monte_carlo
)

Monitoring Progress
After calling tq_apply(), monitor your tasks:
# Check task status
task_status("monte_carlo_study")
# Check overall project status
project_status("monte_carlo_study")
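If you would rather not re-run these calls by hand, a simple polling loop works too. This is a minimal sketch that assumes task_status() returns (or prints) a readable summary; interrupt it once all tasks are done:

# Re-display the task summary every minute; interrupt to stop
while (TRUE) {
  print(task_status("monte_carlo_study"))
  Sys.sleep(60)
}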
Collecting Results
After all tasks complete, collect your results:
# Read all result files
result_files <- list.files("results", pattern = "simulation_.*\\.Rds$",
                           full.names = TRUE)

# Combine results
all_results <- lapply(result_files, readRDS)

# Analyze
means <- sapply(all_results, function(x) x$mean)
hist(means, main = "Distribution of Means")
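Because each saved result is a named list (see run_monte_carlo() above), you can also flatten the fields of interest into a data frame; a minimal sketch:

# Flatten the per-task result lists into one summary data frame
summary_df <- data.frame(
  task_id = sapply(all_results, function(x) x$task_id),
  mean    = sapply(all_results, function(x) x$mean),
  sd      = sapply(all_results, function(x) x$sd)
)
head(summary_df)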
Best Practices
1. Save Results to Files
Your function should save results to the file system:
my_task <- function(i) {
  out_file <- sprintf("output/result_%04d.Rds", i)

  # Skip if already done
  if (file.exists(out_file)) {
    return(invisible(NULL))
  }

  # Do computation
  result <- expensive_computation(i)

  # Save result
  saveRDS(result, out_file)
}

2. Make Functions Idempotent
Check if output already exists to avoid re-running completed tasks:
my_task <- function(i) {
  out_file <- sprintf("output/task_%d.Rds", i)
  if (file.exists(out_file)) return(invisible(NULL))
  # ... computation and save
}

3. Specify Working Directory
Ensure your working directory on the cluster is correct:
tq_apply(
  n = 100,
  fun = my_function,
  project = "my_project",
  resource = "hpc",
  working_dir = "/home/user/project_folder"
)

4. Set Appropriate Resources
Configure memory and time limits based on your task requirements:
tq_apply(
  n = 100,
  fun = memory_intensive_task,
  project = "big_analysis",
  resource = "hpc",
  memory = 64, # 64 GB for large tasks
  hour = 48    # 48 hour time limit
)

Comparison with Manual Workflow
tq_apply() simplifies the workflow by combining these
steps:
Manual approach:
# Multiple steps
project_add("test", memory = 10)
project_resource_add("test", "hpc", working_dir = "/path", hours = 24)
task_add("test", num = 100, clean = TRUE)
project_reset("test")
worker_slurm("test", "hpc", fun = my_function)

With tq_apply():

# Single step
tq_apply(n = 100, fun = my_function, project = "test", resource = "hpc",
         working_dir = "/path", hour = 24)

Troubleshooting
Tasks fail immediately:
- Check the log folder specified in your resource configuration
- Verify your function works locally first
- Ensure the working directory exists on the cluster

Tasks remain in “idle” status:
- Check that the project is started: project_start("my_project")
- Verify the resource is correctly configured
- Check the SLURM queue: squeue -u $USER

“Resource not found” error:
- The resource must be created before using tq_apply()
- Use resource_list() to see available resources
- Create the resource with resource_add()
When to Use tq_apply()
Use tq_apply() when:
- You have a simple parallel task
- You want to quickly run many iterations of a function
- You don’t need fine-grained control over project settings

Use the manual workflow when:
- You need to manage multiple projects simultaneously
- You want to reuse a project for different task sets
- You need more control over resource scheduling
- You’re running different types of tasks in the same project