---
title: "GPU-Accelerated Ordinary Differential Equations (ODE) in R with diffeqr"
author: "Chris Rackauckas"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{GPU-Accelerated Ordinary Differential Equations (ODE) in R with diffeqr}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

In many cases one is interested in solving the same ODE many times over many different initial conditions and parameters. In diffeqr parlance this is called an ensemble solve. diffeqr inherits the parallelism tools of the [SciML ecosystem](https://sciml.ai/) that are used for things like [automated equation discovery and acceleration](https://arxiv.org/abs/2001.04385). Here we will demonstrate using these parallel tools to accelerate the solving of an ensemble.

First, let's define the JIT-accelerated Lorenz equation like before:

```R
de <- diffeqr::diffeq_setup()

# Right-hand side of the Lorenz system: u is the state, p the parameters
lorenz <- function (u,p,t){
  du1 = p[1]*(u[2]-u[1])
  du2 = u[1]*(p[2]-u[3]) - u[2]
  du3 = u[1]*u[2] - p[3]*u[3]
  c(du1,du2,du3)
}
u0 <- c(1.0,1.0,1.0)
tspan <- c(0.0,100.0)
p <- c(10.0,28.0,8/3)
prob <- de$ODEProblem(lorenz,u0,tspan,p)
# JIT-compile the R function so the solver does not call back into R
fastprob <- diffeqr::jitoptimize_ode(de,prob)
```

Now we use the `EnsembleProblem` as defined on the [ensemble parallelism page of the documentation](https://diffeq.sciml.ai/stable/features/ensemble/). Let's build an ensemble by using uniform random numbers to randomize the initial conditions and parameters:

```R
# Called once per trajectory to produce a randomized copy of the problem
prob_func <- function (prob,i,rep){
  de$remake(prob,u0=runif(3)*u0,p=runif(3)*p)
}
ensembleprob = de$EnsembleProblem(fastprob, prob_func = prob_func, safetycopy=FALSE)
```

Now we solve the ensemble in serial:

```R
sol = de$solve(ensembleprob,de$Tsit5(),de$EnsembleSerial(),trajectories=10000,saveat=0.01)
```

To add GPUs to the mix, we need to bring in [DiffEqGPU](https://github.com/SciML/DiffEqGPU.jl).
The `diffeqr::diffeqgpu_setup()` helper function will install CUDA for you and bring all of the bindings into the returned object:

```R
degpu <- diffeqr::diffeqgpu_setup("CUDA")
```

#### Note: `diffeqr::diffeqgpu_setup` can take a while to run the first time as it installs the drivers!

Now we simply use `EnsembleGPUKernel(degpu$CUDABackend())` with a GPU-specialized ODE solver `GPUTsit5()` to solve 10,000 ODEs on the GPU in parallel:

```R
sol <- de$solve(ensembleprob,degpu$GPUTsit5(),degpu$EnsembleGPUKernel(degpu$CUDABackend()),trajectories=10000,saveat=0.01)
```

For the full list of choices for specialized GPU solvers, see [the DiffEqGPU.jl documentation](https://docs.sciml.ai/DiffEqGPU/stable/manual/ensemblegpukernel/).

Note that `EnsembleGPUArray` can be used as well, like:

```R
sol <- de$solve(ensembleprob,de$Tsit5(),degpu$EnsembleGPUArray(degpu$CUDABackend()),trajectories=10000,saveat=0.01)
```

though we highly recommend the `EnsembleGPUKernel` methods for more speed. Because the JIT compilation step also ensures that the faster kernel generation methods work, `EnsembleGPUKernel` is almost certainly the better choice in most applications.

### Benchmark

To see how much of an effect the parallelism has, let's test this against R's deSolve package.
This is exactly the same problem as the documentation example for deSolve, so let's copy that verbatim and then add a function to do the ensemble generation:

```R
library(deSolve)
Lorenz <- function(t, state, parameters) {
  with(as.list(c(state, parameters)), {
    dX <-  a * X + Y * Z
    dY <-  b * (Y - Z)
    dZ <- -X * Y + c * Y - Z
    list(c(dX, dY, dZ))
  })
}

parameters <- c(a = -8/3, b = -10, c = 28)
state      <- c(X = 1, Y = 1, Z = 1)
times      <- seq(0, 100, by = 0.01)
out <- ode(y = state, times = times, func = Lorenz, parms = parameters)

# Solve one trajectory with randomized initial conditions and parameters
lorenz_solve <- function (i){
  state      <- c(X = runif(1), Y = runif(1), Z = runif(1))
  parameters <- c(a = -8/3 * runif(1), b = -10 * runif(1), c = 28 * runif(1))
  out <- ode(y = state, times = times, func = Lorenz, parms = parameters)
}
```

Using `lapply` to generate the ensemble we get:

```
> system.time({ lapply(1:1000,lorenz_solve) })
   user  system elapsed
 225.81    0.46  226.63
```

Now let's see how the JIT-accelerated serial Julia version stacks up against that:

```
> system.time({ de$solve(ensembleprob,de$Tsit5(),de$EnsembleSerial(),trajectories=1000,saveat=0.01) })
   user  system elapsed
   2.75    0.30    3.08
```

Julia is already about 73x faster than the pure R solvers here! Now let's add GPU-acceleration to the mix:

```
> system.time({ de$solve(ensembleprob,degpu$GPUTsit5(),degpu$EnsembleGPUKernel(degpu$CUDABackend()),trajectories=1000,saveat=0.01) })
   user  system elapsed
   0.11    0.00    0.12
```

That is already about 26x faster than the serial Julia solve! But the GPU acceleration is made for massively parallel problems, so let's up the trajectories a bit.
We will not use more trajectories from R because that would take too much computing power, so let's see what happens to the Julia serial and GPU at 10,000 trajectories:

```
> system.time({ de$solve(ensembleprob,de$Tsit5(),de$EnsembleSerial(),trajectories=10000,saveat=0.01) })
   user  system elapsed
  35.02    4.19   39.25
```

```
> system.time({ de$solve(ensembleprob,degpu$GPUTsit5(),degpu$EnsembleGPUKernel(degpu$CUDABackend()),trajectories=10000,saveat=0.01) })
   user  system elapsed
   1.22    0.23    1.50
```

To compare this to the pure Julia code:

```julia
using OrdinaryDiffEq, DiffEqGPU, CUDA, StaticArrays

function lorenz(u, p, t)
    σ = p[1]
    ρ = p[2]
    β = p[3]
    du1 = σ * (u[2] - u[1])
    du2 = u[1] * (ρ - u[3]) - u[2]
    du3 = u[1] * u[2] - β * u[3]
    return SVector{3}(du1, du2, du3)
end

u0 = SA[1.0f0; 0.0f0; 0.0f0]
tspan = (0.0f0, 10.0f0)
p = SA[10.0f0, 28.0f0, 8 / 3.0f0]
prob = ODEProblem{false}(lorenz, u0, tspan, p)
prob_func = (prob, i, repeat) -> remake(prob, p = (@SVector rand(Float32, 3)) .* p)
monteprob = EnsembleProblem(prob, prob_func = prob_func, safetycopy = false)
@time sol = solve(monteprob, GPUTsit5(), EnsembleGPUKernel(CUDA.CUDABackend()),
    trajectories = 10_000,
    saveat = 1.0f0);

# 0.015064 seconds (257.68 k allocations: 13.132 MiB)
```

which is about two orders of magnitude faster for computing 10,000 trajectories. Note that the major factors are that we cannot define 32-bit floating point values from R, and that the `prob_func` for generating the initial conditions and parameters is a major bottleneck since this function is written in R. Note also that this Julia script uses a shorter time span and a coarser `saveat` than the R runs above, so the two timings are not an exact like-for-like comparison.

To see how this scales in Julia, let's take it to insane heights. First, let's reduce the amount we're saving:

```julia
@time sol = solve(monteprob,GPUTsit5(),EnsembleGPUKernel(CUDA.CUDABackend()),trajectories=10_000,saveat=1.0f0)
# 0.015040 seconds (257.64 k allocations: 13.130 MiB)
```

This highlights that controlling memory pressure is key with GPU usage: you will get much better performance when requiring fewer saved points on the GPU.

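
To get a feel for the memory pressure involved, here is a back-of-the-envelope sketch in base R. The helper `saved_bytes` is hypothetical (it is not part of diffeqr or DiffEqGPU), and it only counts the saved state values, ignoring the time vectors and any per-solution overhead, so treat the numbers as rough lower bounds:

```R
# Hypothetical helper: approximate bytes needed for the saved states of an ensemble.
# Assumes one value per state per saved time point, with t = 0 included.
saved_bytes <- function(trajectories, t_end, saveat, n_states = 3, bytes_per_value = 8) {
  n_points <- round(t_end / saveat) + 1
  trajectories * n_points * n_states * bytes_per_value
}

# R-side runs above: tspan = c(0, 100), saveat = 0.01, 64-bit doubles
r_run <- saved_bytes(10000, t_end = 100, saveat = 0.01, bytes_per_value = 8)

# Julia script above: tspan = (0, 10), saveat = 1.0, 32-bit floats
julia_run <- saved_bytes(10000, t_end = 10, saveat = 1.0, bytes_per_value = 4)

cat(sprintf("diffeqr run: %.2f GB\n", r_run / 1e9))
cat(sprintf("Julia run:   %.2f MB\n", julia_run / 1e6))
```

Under these assumptions the 10,000-trajectory R-side run has to store roughly 2.4 GB of state values, while the Julia run stores about 1.3 MB, which is one reason the coarser `saveat` matters so much on the GPU.
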
```julia
@time sol = solve(monteprob,GPUTsit5(),EnsembleGPUKernel(CUDA.CUDABackend()),trajectories=100_000,saveat=1.0f0)
# 0.150901 seconds (2.60 M allocations: 131.576 MiB)
```

compared to serial:

```julia
@time sol = solve(monteprob,Tsit5(),EnsembleSerial(),trajectories=100_000,saveat=1.0f0)
# 22.136743 seconds (16.40 M allocations: 1.628 GiB, 42.98% gc time)
```

And now we start to see that scaling power! Let's solve 1 million trajectories:

```julia
@time sol = solve(monteprob,GPUTsit5(),EnsembleGPUKernel(CUDA.CUDABackend()),trajectories=1_000_000,saveat=1.0f0)
# 1.031295 seconds (3.40 M allocations: 241.075 MiB)
```

For reference, let's look at deSolve with the change to only save that much:

```R
times <- seq(0, 100, by = 1.0)
lorenz_solve <- function (i){
  state      <- c(X = runif(1), Y = runif(1), Z = runif(1))
  parameters <- c(a = -8/3 * runif(1), b = -10 * runif(1), c = 28 * runif(1))
  out <- ode(y = state, times = times, func = Lorenz, parms = parameters)
}

system.time({ lapply(1:1000,lorenz_solve) })
```

```
   user  system elapsed
  49.69    3.36   53.42
```

The GPU version is solving 1000x as many trajectories, about 50x as fast!

In conclusion, if you need the most speed, you may want to move to the Julia version to get the most out of your GPU, since it can use Float32 values. When using GPUs, make sure the problem has relatively average or low memory pressure. Under those conditions, these methods will give orders of magnitude acceleration compared to what you might be used to.
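
The speedup factors quoted in this benchmark follow directly from the elapsed times reported above. As a minimal sketch in base R, they can be recomputed as follows; the vector names and the `speedup` helper are hypothetical labels chosen for illustration, and the numbers are simply copied from the timings in this vignette:

```R
# Elapsed wall-clock times in seconds, copied from the timings above
elapsed <- c(
  desolve_1k        = 226.63,    # deSolve, 1,000 trajectories, saveat = 0.01
  serial_1k         = 3.08,      # diffeqr EnsembleSerial, 1,000 trajectories
  gpu_1k            = 0.12,      # diffeqr EnsembleGPUKernel, 1,000 trajectories
  serial_10k        = 39.25,     # diffeqr EnsembleSerial, 10,000 trajectories
  gpu_10k           = 1.50,      # diffeqr EnsembleGPUKernel, 10,000 trajectories
  julia_gpu_10k     = 0.015064,  # pure Julia GPU, 10,000 trajectories
  desolve_1k_coarse = 53.42,     # deSolve, 1,000 trajectories, saveat = 1.0
  julia_gpu_1m      = 1.031295   # pure Julia GPU, 1,000,000 trajectories
)

# Hypothetical helper: how many times faster `fast` is than `slow`
speedup <- function(slow, fast) unname(elapsed[slow] / elapsed[fast])

speedups <- c(
  serial_vs_desolve          = speedup("desolve_1k", "serial_1k"),
  gpu_vs_serial_1k           = speedup("serial_1k", "gpu_1k"),
  gpu_vs_serial_10k          = speedup("serial_10k", "gpu_10k"),
  julia_vs_diffeqr_gpu       = speedup("gpu_10k", "julia_gpu_10k"),
  julia_gpu_1m_vs_desolve_1k = speedup("desolve_1k_coarse", "julia_gpu_1m")
)
print(round(speedups, 1))
```

The last entry compares wall-clock time only: the Julia GPU run in that ratio is also solving 1000x as many trajectories as the deSolve run.
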