--- title: "Hyperparameters Tuning" author: "Gabriele Pittarello" date: "`r Sys.Date()`" bibliography: '`r system.file("references.bib", package="ReSurv")`' output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Hyperparameters Tuning} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r eval=FALSE, include=TRUE} knitr::opts_chunk$set(echo = TRUE) library(ReSurv) library(reticulate) use_virtualenv("pyresurv") ``` # Introduction This vignette shows how to tune the hyperparameters of the machine learning algorithms implemented in `ReSurv` using the approach in @snoek12 implemented in the R package `ParBayesianOptimization` (@ParBayesianOptimization). For illustrative purposes, we will simulate daily claim notifications from one of the scenarios introduced in @hiabu23 (scenario Alpha). ```{r eval=FALSE, include=TRUE} input_data_0 <- data_generator( random_seed = 1964, scenario = 0, time_unit = 1 / 360, years = 4, period_exposure = 200 ) ``` The counts data are then pre-processed using the `IndividualDataPP` function. ```{r eval=FALSE, include=TRUE} individual_data <- IndividualDataPP( data = input_data_0, id = NULL, categorical_features = "claim_type", continuous_features = "AP", accident_period = "AP", calendar_period = "RP", input_time_granularity = "days", output_time_granularity = "quarters", years = 4 ) ``` # Optimal hyperparameters In the manuscript we fit the proportional likelihood of a cox model using extreme gradient boosting (XGB) and feed-forward neural networks (NN). The advantage of using the bayesian hyperparameters selection algorithm is in terms the computation time, and of a wide range of parameter choices (see again @snoek12). We show the ranges we inspected in the manuscript in the following table. | Model | Hyperparameter | Range | |-------|-----------------------------|-------------------------| | NN | `num_layers` | $\left[2,10\right]$ | | | `num_nodes` | $\left[2,10\right]$ | | | `optim` | $\left[1,2\right]$ | | | `activation` | $\left[1,2\right]$ | | | `lr` | $\left[.005,0.5\right]$ | | | `xi` | $\left[0,0.5\right]$ | | | `eps` | $\left[0,0.5\right]$ | | XGB | `eta` | $\left[0,1\right]$ | | | `max_depth` | $\left[0,25\right]$ | | | `min_child_weight` | $\left[0,50\right]$ | | | `lambda` | $\left[0,50\right]$ | | | `alpha` | $\left[0,50\right]$ | In the following part of this vignette, we will discuss the steps required to optimize the hyperparameters with NN using the approach of @snoek12. We will provide the code we used for XGB as it follows a similar flow. ## Notes on K-Fold cross validation In `ReSurv`, we have our own implementation of a standard K-Fold cross-validation (@hastie09), namely the `ReSurvCV` method of an `IndividualDataPP` object. We show an illustrative example for XGB below. Even if this routine can be used alone for choosing the optimal parameters, we suggest to use it within the bayesian approach of @snoek12. An example is provided in the following chunk for NN. 
```{r eval=FALSE, include=TRUE}
resurv_cv_xgboost <- ReSurvCV(
  IndividualDataPP = individual_data,
  model = "XGB",
  hparameters_grid = list(
    booster = "gbtree",
    eta = c(.001, .01, .2, .3),
    max_depth = c(3, 6, 8),
    subsample = c(1),
    alpha = c(0, .2, 1),
    lambda = c(0, .2, 1),
    min_child_weight = c(.5, 1)
  ),
  print_every_n = 1L,
  nrounds = 500,
  verbose = FALSE,
  verbose.cv = TRUE,
  early_stopping_rounds = 100,
  folds = 5,
  parallel = TRUE,
  ncores = 2,
  random_seed = 1
)
```

## Neural networks

The `ReSurv` neural network implementation uses `reticulate` to interface R with Python. It is based on an approach similar to @katzman18, adjusted to account for left-truncation and ties in the data. As in the original implementation, we rely on the Python library `pytorch` (@pytorch); the syntax of our NN therefore follows the `pytorch` conventions. See the [reference guide](https://pytorch.org/) for further information on the NN parametrization.

In order to use the `ParBayesianOptimization` package, we first need to collect the NN hyperparameter ranges of interest in a named list. The ranges we chose for each algorithm were discussed in the previous section.

```{r eval=FALSE, include=TRUE}
bounds <- list(
  num_layers = c(2L, 10L),
  num_nodes = c(2L, 10L),
  optim = c(1L, 2L),
  activation = c(1L, 2L),
  lr = c(.005, 0.5),
  xi = c(0, 0.5),
  eps = c(0, 0.5)
)
```

Secondly, we need to specify an objective function to be optimized with the Bayesian approach. The `ParBayesianOptimization` package can be loaded as

```{r eval=FALSE, include=TRUE}
library(ParBayesianOptimization)
```

The score metric we inspect is the negative (partial) likelihood. The likelihood is returned with a negative sign because @ParBayesianOptimization maximizes the objective function.

```{r eval=FALSE, include=TRUE}
obj_func <- function(num_layers, num_nodes, optim, activation, lr, xi, eps) {
  # Map the integer-coded hyperparameters to their actual values.
  optim <- switch(optim, "Adam", "SGD")
  activation <- switch(activation, "LeakyReLU", "SELU")

  batch_size <- as.integer(5000)
  num_layers <- as.integer(num_layers)
  num_nodes <- as.integer(num_nodes)

  deepsurv_cv <- ReSurvCV(
    IndividualDataPP = individual_data,
    model = "NN",
    hparameters_grid = list(
      num_layers = num_layers,
      num_nodes = num_nodes,
      optim = optim,
      activation = activation,
      lr = lr,
      xi = xi,
      eps = eps,
      tie = "Efron",
      batch_size = batch_size,
      early_stopping = "TRUE",
      patience = 20
    ),
    epochs = as.integer(300),
    num_workers = 0,
    verbose = FALSE,
    verbose.cv = TRUE,
    folds = 3,
    parallel = FALSE,
    random_seed = as.integer(Sys.time())
  )

  lst <- list(
    # The sign is flipped because bayesOpt maximizes the Score.
    Score = -deepsurv_cv$out.cv.best.oos$test.lkh,
    train.lkh = deepsurv_cv$out.cv.best.oos$train.lkh
  )

  return(lst)
}
```

As a last step, we use the `bayesOpt` function to perform the optimization.

```{r eval=FALSE, include=TRUE}
bayes_out <- bayesOpt(
  FUN = obj_func,
  bounds = bounds,
  initPoints = 50,
  iters.n = 1000,
  iters.k = 50,
  otherHalting = list(timeLimit = 18000)
)
```

To select the optimal hyperparameters we inspect the `bayes_out$scoreSummary` output. Below we print the first five rows of one of our runs.
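These first rows can be printed with a call along the following lines (a minimal sketch, assuming `bayes_out` is the object returned by `bayesOpt` above):

```{r eval=FALSE, include=TRUE}
# Print the first five evaluated hyperparameter combinations, together with
# the elapsed computation time and the Score defined in obj_func above.
head(bayes_out$scoreSummary, 5)
```

Alternatively, the helper `getBestPars(bayes_out)` from `ParBayesianOptimization` returns the best-scoring combination directly.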
Observe that `scoreSummary` is a `data.table` that contains some fields specific to the original implementation (see @ParBayesianOptimization for more details):

| Epoch | Iteration | gpUtility | acqOptimum | inBounds | errorMessage |
|-------|-----------|-----------|------------|----------|--------------|
| 0     | 1         |           | FALSE      | TRUE     |              |
| 0     | 2         |           | FALSE      | TRUE     |              |
| 0     | 3         |           | FALSE      | TRUE     |              |
| 0     | 4         |           | FALSE      | TRUE     |              |
| 0     | 5         |           | FALSE      | TRUE     |              |

and the parameters we specified during the optimization:

| num_layers | num_nodes | optim | activation | lr   | xi   | eps  | batch_size | Elapsed | Score | train.lkh |
|------------|-----------|-------|------------|------|------|------|------------|---------|-------|-----------|
| 9          | 8         | 1     | 2          | 0.08 | 0.35 | 0.03 | 1226       | 6094.91 | -6.24 | 6.28      |
| 9          | 2         | 2     | 1          | 0.47 | 0.50 | 0.10 | 3915       | 7307.31 | -7.27 | 7.30      |
| 9          | 9         | 2     | 1          | 0.40 | 0.49 | 0.18 | 196        | 6719.70 | -5.98 | 5.97      |
| 8          | 8         | 1     | 2          | 0.03 | 0.23 | 0.01 | 4508       | 8893.46 | -7.39 | 7.41      |
| 9          | 7         | 2     | 1          | 0.13 | 0.13 | 0.12 | 900        | 3189.15 | -6.21 | 6.23      |

We select the final combination that minimizes the negative (partial) likelihood, i.e. the combination with the highest value in the `Score` column.

## Extreme gradient boosting utilizing parallel computing

In a similar fashion, we can optimize the gradient boosting parameters. We first set the hyperparameter bounds.

```{r eval=FALSE, include=TRUE}
bounds <- list(
  eta = c(0, 1),
  max_depth = c(1L, 25L),
  min_child_weight = c(0, 50),
  subsample = c(0.51, 1),
  lambda = c(0, 15),
  alpha = c(0, 15)
)
```

Secondly, we define an objective function.

```{r eval=FALSE, include=TRUE}
obj_func <- function(eta, max_depth, min_child_weight, subsample, lambda, alpha) {
  xgbcv <- ReSurvCV(
    IndividualDataPP = individual_data,
    model = "XGB",
    hparameters_grid = list(
      booster = "gbtree",
      eta = eta,
      max_depth = max_depth,
      subsample = subsample,
      alpha = alpha,
      lambda = lambda,
      min_child_weight = min_child_weight
    ),
    print_every_n = 1L,
    nrounds = 500,
    verbose = FALSE,
    verbose.cv = TRUE,
    early_stopping_rounds = 30,
    folds = 3,
    parallel = FALSE,
    random_seed = as.integer(Sys.time())
  )

  lst <- list(
    # The sign is flipped because bayesOpt maximizes the Score.
    Score = -xgbcv$out.cv.best.oos$test.lkh,
    train.lkh = xgbcv$out.cv.best.oos$train.lkh
  )

  return(lst)
}
```

Lastly, we perform the optimization in a parallel setting using the `doParallel` package, as recommended in the `bayesOpt` documentation.

```{r eval=FALSE, include=TRUE}
library(doParallel)

# Set up a cluster and make ReSurv available on every worker.
cl <- makeCluster(parallel::detectCores())
registerDoParallel(cl)
clusterEvalQ(cl, {
  library("ReSurv")
})

bayes_out <- bayesOpt(
  FUN = obj_func,
  bounds = bounds,
  initPoints = length(bounds) + 20,
  iters.n = 1000,
  iters.k = 50,
  otherHalting = list(timeLimit = 18000),
  parallel = TRUE
)

stopCluster(cl)
```

# Bibliography