Perception of English Voice Quality

Random Forest Analysis

Author

Mykel Brinkerhoff

Published

August 27, 2025

Data

The voice quality and classification data comes from VoicesVQ_data and b-c_voice_cluster_labels.

The dataset VoicesVQ_data contains acoustic measures extracted from the audio files used in the perception experiment. These measures were extracted using VoiceSauce.

The dataset b-c_voice_cluster_labels contains the classification labels that participants assigned to each speaker.

These data were combined so that each row of VoicesVQ_data contained the classification label, matched on subject ID. The result was stored as a new dataframe called vq_raw, which was then cleaned of outliers following Brinkerhoff & McGuire 2025.

Data was then standardized to make comparisons easier. Energy was first log-transformed to account for its right skew before being z-scored. SoE was normalized following Garellek et al. 2020. H1* was normalized following Chai & Garellek 2022, and this normalized measure replaced the z-scored H1*.
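The log-transform and z-scoring can be sketched in a few lines of base R; the column names here (Energy, H2c_mean) and the toy values are stand-ins for illustration, not the pipeline's actual code:

```r
# Minimal sketch of the standardization step, using toy values and
# hypothetical column names (Energy, H2c_mean); not the actual pipeline code.
vq <- data.frame(
  Energy   = c(0.5, 1.2, 3.4, 0.9, 2.1),
  H2c_mean = c(10.2, 12.1, 9.4, 11.3, 13.0)
)

# log-transform Energy to reduce its right skew, then z-score it
vq$log_energy_z <- as.numeric(scale(log(vq$Energy)))

# plain z-scoring for a spectral measure such as H2c
vq$H2c_mean_z <- as.numeric(scale(vq$H2c_mean))

# after z-scoring, each column has mean 0 and standard deviation 1
round(c(mean(vq$H2c_mean_z), sd(vq$H2c_mean_z)), 10)
```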

Uncorrected acoustic measures, the non-STRAIGHT f0, and the non-Snack formants and bandwidths were removed.

This resulted in a dataframe with the following columns:

  • Gender: factor indicating the talker's gender
  • Talker: factor indicating the talker's ID
  • Vowel: factor indicating which vowel was spoken
  • Word: character indicating which word was spoken
  • Filename: character indicating the filename of the audio file
  • Label: character indicating the classification label
  • Phonation: character indicating the phonation type
  • seg_Start: numeric indicating the start time of the segment
  • seg_End: numeric indicating the end time of the segment
  • H2c_mean_z: numeric indicating the z-scored H2c mean
  • H4c_mean_z: numeric indicating the z-scored H4c mean
  • A1c_mean_z: numeric indicating the z-scored A1c mean
  • A2c_mean_z: numeric indicating the z-scored A2c mean
  • A3c_mean_z: numeric indicating the z-scored A3c mean
  • H2Kc_mean_z: numeric indicating the z-scored H2Kc mean
  • H1H2c_mean_z: numeric indicating the z-scored H1H2c mean
  • H2H4c_mean_z: numeric indicating the z-scored H2H4c mean
  • H1A1c_mean_z: numeric indicating the z-scored H1A1c mean
  • H1A2c_mean_z: numeric indicating the z-scored H1A2c mean
  • H1A3c_mean_z: numeric indicating the z-scored H1A3c mean
  • H42Kc_mean_z: numeric indicating the z-scored H42Kc mean
  • H2KH5Kc_mean_z: numeric indicating the z-scored H2KH5Kc mean
  • CPP_mean_z: numeric indicating the z-scored CPP mean
  • HNR05_mean_z: numeric indicating the z-scored HNR05 mean
  • HNR15_mean_z: numeric indicating the z-scored HNR15 mean
  • HNR25_mean_z: numeric indicating the z-scored HNR25 mean
  • HNR35_mean_z: numeric indicating the z-scored HNR35 mean
  • SHR_mean_z: numeric indicating the z-scored SHR mean
  • strF0_mean_z: numeric indicating the z-scored strF0 mean
  • sF1_mean_z: numeric indicating the z-scored sF1 mean
  • sF2_mean_z: numeric indicating the z-scored sF2 mean
  • sF3_mean_z: numeric indicating the z-scored sF3 mean
  • sF4_mean_z: numeric indicating the z-scored sF4 mean
  • sB1_mean_z: numeric indicating the z-scored sB1 mean
  • sB2_mean_z: numeric indicating the z-scored sB2 mean
  • sB3_mean_z: numeric indicating the z-scored sB3 mean
  • sB4_mean_z: numeric indicating the z-scored sB4 mean
  • epoch_mean_z: numeric indicating the z-scored epoch mean
  • log_energy_z: numeric indicating the z-scored log energy mean
  • soe_norm: numeric indicating the normalized SoE
  • resid_H1c: numeric indicating the residual H1c

Analysis

Preparing for analysis

I removed the columns Gender, Talker, Vowel, Word, Filename, Label, seg_Start, and seg_End from the dataframe and factorized Phonation for the Random Forest analysis.
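A minimal base-R sketch of this preparation step, with a toy one-row vq_raw standing in for the real dataframe (values invented for illustration):

```r
# Toy stand-in for vq_raw (column names as in the pipeline; values invented)
vq_raw <- data.frame(
  Gender = "f", Talker = "t01", Vowel = "a", Word = "bead",
  Filename = "t01_bead.wav", Label = "breathy", Phonation = "breathy",
  seg_Start = 0.10, seg_End = 0.35, CPP_mean_z = 0.42
)

# drop identifier and timing columns, keeping the acoustic measures
drop_cols <- c(
  "Gender", "Talker", "Vowel", "Word",
  "Filename", "Label", "seg_Start", "seg_End"
)
vq_resid <- vq_raw[, setdiff(names(vq_raw), drop_cols)]

# factorize the outcome for classification
vq_resid$Phonation <- factor(
  vq_resid$Phonation,
  levels = c("breathy", "creaky", "modal")
)
```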

Splitting into test and training

Show the code
# remove all columns except Phonation and the standardized data
    vq_resid <- vq_resid |>
      dplyr::select(
        Phonation,
        dplyr::ends_with("_z"),
        soe_norm,
        resid_H1c
      ) |>
      dplyr::rename_with(~ stringr::str_remove_all(., "_mean_z$|_z$|_norm$")) |>
      dplyr::select(-dplyr::matches("u$"))

# stratified sampling with the rsample package with respect to phonation
set.seed(123) # needed for reproducibility
resid_split <- rsample::initial_split(
  vq_resid,
  prop = 0.7,
  strata = "Phonation"
)
resid_train <- rsample::training(resid_split)
resid_test <- rsample::testing(resid_split)

Data was randomly split into training and testing datasets. The proportions of each phonation type were maintained in each dataset.

Show the code
table(resid_train$Phonation) |>
  prop.table()

  breathy    creaky     modal 
0.3270869 0.2095400 0.4633731 
Show the code
table(resid_test$Phonation) |>
  prop.table()

  breathy    creaky     modal 
0.3267717 0.2125984 0.4606299 

Tuning the model

Random Forest models are sensitive to their hyperparameter settings. These parameters were chosen via a grid search, and the model with the lowest out-of-bag (OOB) error was selected.

First a default model was trained to determine a baseline prediction error. Then a hypergrid of parameters was created and a model was trained for each combination of parameters. The OOB error for each model was recorded and the model with the lowest OOB error was chosen.

The parameters that were tuned were:

  • mtry: number of variables to possibly split on at each node, set to 5%, 15%, 20%, 25%, 33.3%, 40%, or 100% of the total number of features
  • min.node.size: minimal node size, set to 1, 3, 5, or 10
  • replace: whether to sample with replacement (TRUE or FALSE)
  • sample.fraction: fraction of the data to sample for each tree, set to 50%, 63%, or 80%
  • num.trees: number of trees to grow (not tuned here, but tuned in a subsequent step)

Show the code
# determine the number of features
resid_features <- length(setdiff(names(vq_resid), "Phonation"))

# Train a default Random Forest model for comparison
resid_default <- ranger::ranger(
  formula = Phonation ~ .,
  data = resid_train,
  num.trees = resid_features * 100,
  mtry = floor(sqrt(resid_features)),
  respect.unordered.factors = "order",
  seed = 123
)

# default prediction error for comparison
resid_error <- resid_default$prediction.error

# generate hypergrid parameters
hypergrid_resid <- expand.grid(
  mtry = floor(resid_features * c(.05, .15, .2, .25, .333, .4, 1)),
  min.node.size = c(1, 3, 5, 10),
  replace = c(TRUE, FALSE),
  sample.fraction = c(.5, .63, .8),
  error = NA
)

# execute full grid search
for (i in seq_len(nrow(hypergrid_resid))) {
  # fit the model with the i-th hyperparameter combination
  fit_resid <- ranger::ranger(
    formula = Phonation ~ .,
    data = resid_train,
    num.trees = resid_features * 100,
    mtry = hypergrid_resid$mtry[i],
    min.node.size = hypergrid_resid$min.node.size[i],
    replace = hypergrid_resid$replace[i],
    sample.fraction = hypergrid_resid$sample.fraction[i],
    respect.unordered.factors = "order",
    seed = 123
  )

  # record the OOB classification error
  hypergrid_resid$error[i] <- fit_resid$prediction.error
}

This results in a hypergrid with the OOB error for each combination of parameters. The top 10 models with the lowest OOB error are shown below. The percentage gain is calculated as the reduction in error compared to the default model.

Show the code
# assessing the model parameters
hypergrid_resid |>
  dplyr::arrange(error) |>
  dplyr::mutate(perc_gain = (resid_error - error) / resid_error * 100) |>
  head(10)
   mtry min.node.size replace sample.fraction      error perc_gain
1    32             1   FALSE            0.63 0.08177172  49.47368
2    12             5   FALSE            0.80 0.08177172  49.47368
3    32             3   FALSE            0.63 0.08347530  48.42105
4    32             5   FALSE            0.63 0.08347530  48.42105
5    32             3    TRUE            0.80 0.08347530  48.42105
6    32             5    TRUE            0.80 0.08347530  48.42105
7    32             1   FALSE            0.50 0.08517888  47.36842
8    12             1   FALSE            0.80 0.08517888  47.36842
9    12             3   FALSE            0.80 0.08517888  47.36842
10   32             1    TRUE            0.80 0.08688245  46.31579

Based on the results of the grid search, the following parameters were chosen for the final model:

  • mtry = 12
  • min.node.size = 5
  • replace = FALSE
  • sample.fraction = 0.8

There were two models with the same OOB error, but I chose the one with mtry = 12 instead of mtry = 32. The rationale is that with mtry = 32, each tree would consider all 32 features at every split, which would make the model equivalent to a bagging model. By choosing mtry = 12, we allow for more diversity among the trees, which can lead to better overall performance. This is further supported when tuning the number of trees for the final model, where the model with mtry = 12 performs better than the model with mtry = 32.

Tuning the number of trees

The number of trees was tuned separately from the other parameters. A hypergrid was created with the number of trees ranging from 50 to 2000 in increments of 50. The mtry parameter was set to the same values as before. A model was trained for each combination of parameters and the OOB error was recorded. The model with the lowest OOB error was chosen.

Show the code
# create a hypergrid for number of trees
hypergrid_resid_trees <- expand.grid(
  num.trees = seq(50, 2000, 50),
  mtry = floor(resid_features * c(.05, .15, .2, .25, .333, .4, 1)),
  accuracy = NA # note: this column stores the OOB prediction error
)

# perform grid search for correct number of trees
for (i in seq_len(nrow(hypergrid_resid_trees))) {
  # fit the model with the i-th hyperparameter combination
  fit_resid_trees <- ranger::ranger(
    formula = Phonation ~ .,
    data = resid_train,
    num.trees = hypergrid_resid_trees$num.trees[i],
    mtry = hypergrid_resid_trees$mtry[i],
    min.node.size = 5,
    replace = FALSE,
    sample.fraction = 0.8,
    respect.unordered.factors = "order",
    seed = 123
  )

  # record the OOB classification error
  hypergrid_resid_trees$accuracy[i] <- fit_resid_trees$prediction.error
}

In the table below, the top 10 models with the lowest OOB error are shown. The percentage gain is calculated as the reduction in error compared to the default model. This shows that the model with num.trees = 650 and mtry = 12 has the lowest OOB error.

Show the code
# assessing the model parameters
hypergrid_resid_trees |>
  dplyr::arrange(accuracy) |>
  dplyr::mutate(perc_gain = (resid_error - accuracy) / resid_error * 100) |>
  head(10)
   num.trees mtry   accuracy perc_gain
1        650   12 0.08006814  50.52632
2        850   12 0.08177172  49.47368
3       1450   12 0.08177172  49.47368
4       1650   12 0.08177172  49.47368
5        400   12 0.08347530  48.42105
6        750   12 0.08347530  48.42105
7       1150   12 0.08347530  48.42105
8       1500   12 0.08347530  48.42105
9       1550   12 0.08347530  48.42105
10      1850   12 0.08347530  48.42105

This is further visualized in the plot below. The model with num.trees = 650 and mtry = 12 has the lowest OOB error.

Show the code
# plotting the results
hypergrid_resid_trees |>
  ggplot2::ggplot(ggplot2::aes(x = num.trees, y = accuracy, color = factor(mtry))) +
  ggplot2::geom_line(linewidth = 1) +
  ggplot2::labs(
    title = "Prediction error for Random Forest Hyperparameter Tuning",
    x = "number of trees",
    y = "% incorrect",
    color = "mtry"
  ) +
  ggplot2::scale_color_manual(values = colorblind) +
  ggplot2::theme_bw()

OOB error for different numbers of trees and mtry values

Final model and variable importance

The final model was trained using the optimal hyperparameters identified during the tuning process. The variable importance was assessed using both impurity and permutation-based methods.

Show the code
# final random forest model with impurity
resid_final_impurity <- ranger::ranger(
  formula = Phonation ~ .,
  data = resid_train,
  num.trees = 650,
  mtry = 12,
  min.node.size = 5,
  replace = FALSE,
  sample.fraction = 0.8,
  respect.unordered.factors = "order",
  seed = 123,
  classification = TRUE,
  importance = "impurity",
  probability = TRUE
)

# final model with permutation
resid_final_permutation <- ranger::ranger(
  formula = Phonation ~ .,
  data = resid_train,
  num.trees = 650,
  mtry = 12,
  min.node.size = 5,
  replace = FALSE,
  sample.fraction = 0.8,
  respect.unordered.factors = "order",
  seed = 123,
  classification = TRUE,
  importance = "permutation",
  probability = TRUE
)

The variable importance scores were extracted from both models and visualized using lollipop charts.

Show the code
# Extract variable importance scores for impurity-based importance
resid_impurity_scores <- vip::vi(resid_final_impurity)

# Extract variable importance scores for permutation-based importance
resid_permutation_scores <- vip::vi(resid_final_permutation)


# Create a Lollipop chart of variable importance scores
resid_impurity_plot <- resid_impurity_scores |>
  # strip measure-name suffixes from the variable values (rename_with would
  # only touch the column names Variable and Importance, not the values)
  dplyr::mutate(
    Variable = stringr::str_remove_all(Variable, "_mean_z$|_z$|_norm$")
  ) |>
  ggplot2::ggplot(
    ggplot2::aes(x = reorder(Variable, Importance), y = Importance)
  ) +
  ggplot2::geom_segment(ggplot2::aes(xend = Variable, yend = 0)) +
  ggplot2::geom_point(size = 2) +
  ggplot2::coord_flip() +
  ggplot2::labs(
    title = "Impurity Importance",
    x = "Variable",
    y = "Importance (Impurity)"
  ) +
  ggplot2::theme_bw()

resid_permutation_plot <- resid_permutation_scores |>
  ggplot2::ggplot(
    ggplot2::aes(x = reorder(Variable, Importance), y = Importance)
  ) +
  ggplot2::geom_segment(ggplot2::aes(xend = Variable, yend = 0)) +
  ggplot2::geom_point(size = 2) +
  ggplot2::coord_flip() +
  ggplot2::labs(
    title = "Permutation Importance",
    x = "Variable",
    y = "Importance (Permutation)"
  ) +
  ggplot2::theme_bw()

resid_variable_importance_plot <- cowplot::plot_grid(
  resid_impurity_plot,
  resid_permutation_plot,
  nrow = 1
)
resid_variable_importance_plot

Variable importance scores for the final Random Forest model using impurity and permutation methods
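The held-out test set from the earlier split is not evaluated above; a hedged sketch of how that check could look, using ranger's predict method and assuming the resid_final_impurity and resid_test objects created in the analysis:

```r
# Sketch only: evaluate the tuned probability forest on the held-out test set.
# Assumes resid_final_impurity and resid_test exist as created above.
resid_pred <- predict(resid_final_impurity, data = resid_test)

# with probability = TRUE, $predictions is a matrix of class probabilities;
# take the most probable class for each row
pred_class <- colnames(resid_pred$predictions)[max.col(resid_pred$predictions)]

# confusion matrix and overall test-set accuracy
table(predicted = pred_class, observed = resid_test$Phonation)
mean(pred_class == resid_test$Phonation)
```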