Model Fitting
Load packages and cleaned data
#load required packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.0 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.1 ✔ tibble 3.1.8
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(here)
here() starts at C:/Users/Katie/Documents/2022-2023/MADA/katiewells-MADA-portfolio
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom 1.0.3 ✔ rsample 1.1.1
✔ dials 1.1.0 ✔ tune 1.0.1
✔ infer 1.0.4 ✔ workflows 1.1.3
✔ modeldata 1.1.0 ✔ workflowsets 1.0.0
✔ parsnip 1.0.4 ✔ yardstick 1.1.0
✔ recipes 1.0.5
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter() masks stats::filter()
✖ recipes::fixed() masks stringr::fixed()
✖ dplyr::lag() masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step() masks stats::step()
• Learn how to get started at https://www.tidymodels.org/start/
library(performance)
Attaching package: 'performance'
The following objects are masked from 'package:yardstick':
mae, rmse
#load data
flu2 <- readRDS(here("fluanalysis", "data", "flu2.rds"))
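Before fitting, a quick glimpse of the cleaned data is a useful sanity check on the variable types (a minimal sketch; it assumes the outcome and predictor columns used below, such as BodyTemp, RunnyNose, and Nausea, are present in flu2).
#sanity check: variable names and types in the cleaned data (sketch)
glimpse(flu2)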
Linear Modeling with Main Predictor
Here we will use tidymodels to fit a linear model to BodyTemp, first with just the main predictor RunnyNose and later with all predictors.
#specify functional form of model
lm_mod <- linear_reg() %>% set_engine("lm")
#Fit a linear model to the continuous outcome (BodyTemp) using only the main predictor of interest (RunnyNose)
lm_fit <- lm_mod %>% fit(BodyTemp ~ RunnyNose, data = flu2)
lm_fit
parsnip model object
Call:
stats::lm(formula = BodyTemp ~ RunnyNose, data = data)
Coefficients:
(Intercept) RunnyNoseYes
99.1431 -0.2926
#take a look at the results of lm_fit
glance(lm_fit)
# A tibble: 1 × 12
r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC BIC devia…⁴ df.re…⁵
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 0.0123 0.0110 1.19 9.08 0.00268 1 -1162. 2329. 2343. 1031. 728
# … with 1 more variable: nobs <int>, and abbreviated variable names
# ¹r.squared, ²adj.r.squared, ³statistic, ⁴deviance, ⁵df.residual
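glance() summarizes the model as a whole; to see the coefficient estimates with confidence intervals, broom's tidy() can be called on the parsnip fit as well. A minimal sketch (conf.int is passed through to broom's method for lm):
#coefficient-level estimates and 95% confidence intervals for the simple model (sketch)
tidy(lm_fit, conf.int = TRUE)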
Linear Modeling with all predictors
#Fit a linear model to the continuous outcome (BodyTemp) using all predictors
lm_fit2 <- lm_mod %>%
  fit(BodyTemp ~ ., data = flu2)
lm_fit2
parsnip model object
Call:
stats::lm(formula = BodyTemp ~ ., data = data)
Coefficients:
(Intercept) SwollenLymphNodesYes ChestCongestionYes
98.268642 -0.166066 0.102608
ChillsSweatsYes NasalCongestionYes SneezeYes
0.191826 -0.221568 -0.372709
FatigueYes SubjectiveFeverYes HeadacheYes
0.272509 0.436694 0.004977
Weakness.L Weakness.Q Weakness.C
0.254046 0.128981 0.023351
CoughIntensity.L CoughIntensity.Q CoughIntensity.C
0.170659 -0.166954 0.129975
Myalgia.L Myalgia.Q Myalgia.C
-0.132614 -0.131992 0.093532
RunnyNoseYes AbPainYes ChestPainYes
-0.064666 0.015375 0.100308
DiarrheaYes EyePnYes InsomniaYes
-0.152344 0.128786 -0.005824
ItchyEyeYes NauseaYes EarPnYes
-0.003889 -0.033246 0.111125
PharyngitisYes BreathlessYes ToothPnYes
0.315991 0.086379 -0.035497
VomitYes WheezeYes
0.160053 -0.034654
#take a look at the results of lm_fit2
glance(lm_fit2)
# A tibble: 1 × 12
r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC BIC devia…⁴ df.re…⁵
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 0.124 0.0853 1.14 3.19 2.47e-8 31 -1118. 2302. 2453. 914. 698
# … with 1 more variable: nobs <int>, and abbreviated variable names
# ¹r.squared, ²adj.r.squared, ³statistic, ⁴deviance, ⁵df.residual
Comparing models
I was struggling to figure out how to compare these models with code. After some searching I came across the performance package, so I'm going to use its compare_performance() function to do this.
compare_performance(lm_fit, lm_fit2)
# Comparison of Model Performance Indices
Name | Model | AIC (weights) | AICc (weights) | BIC (weights) | R2 | R2 (adj.) | RMSE | Sigma
------------------------------------------------------------------------------------------------------
lm_fit | _lm | 2329.3 (<.001) | 2329.4 (<.001) | 2343.1 (>.999) | 0.012 | 0.011 | 1.188 | 1.190
lm_fit2 | _lm | 2301.6 (>.999) | 2304.8 (>.999) | 2453.1 (<.001) | 0.124 | 0.085 | 1.119 | 1.144
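Since the single-predictor model is nested inside the full model, an F-test is another way to cross-check this comparison. A hedged sketch using parsnip's extract_fit_engine() to pull out the underlying lm objects:
#nested-model F-test as a cross-check on compare_performance() (sketch)
anova(extract_fit_engine(lm_fit), extract_fit_engine(lm_fit2))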
Logistic Modeling with Main Predictor
Here we will use tidymodels to fit a logistic model to Nausea, first with just the main predictor RunnyNose and later with all predictors.
#specify functional form of model
glm_mod <- logistic_reg() %>%
  set_engine("glm")
#Fit a logistic model to the categorical outcome (Nausea) using only the main predictor of interest (RunnyNose)
glm_fit <- glm_mod %>%
  fit(Nausea ~ RunnyNose, data = flu2)
glm_fit
parsnip model object
Call: stats::glm(formula = Nausea ~ RunnyNose, family = stats::binomial,
data = data)
Coefficients:
(Intercept) RunnyNoseYes
-0.65781 0.05018
Degrees of Freedom: 729 Total (i.e. Null); 728 Residual
Null Deviance: 944.7
Residual Deviance: 944.6 AIC: 948.6
#take a look at the results of glm_fit
glance(glm_fit)
# A tibble: 1 × 8
null.deviance df.null logLik AIC BIC deviance df.residual nobs
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 945. 729 -472. 949. 958. 945. 728 730
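The RunnyNose coefficient above is on the log-odds scale; exponentiating it gives an odds ratio, which is easier to interpret. A minimal sketch (exponentiate and conf.int are passed through to broom's method for glm):
#odds ratio and 95% CI for RunnyNose in the simple logistic model (sketch)
tidy(glm_fit, exponentiate = TRUE, conf.int = TRUE)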
Logistic modeling with all predictors
#Fit a logistic model to the categorical outcome (Nausea) using all predictors
glm_fit2 <- glm_mod %>%
  fit(Nausea ~ ., data = flu2)
glm_fit2
parsnip model object
Call: stats::glm(formula = Nausea ~ ., family = stats::binomial, data = data)
Coefficients:
(Intercept) SwollenLymphNodesYes ChestCongestionYes
0.05966 -0.24685 0.26529
ChillsSweatsYes NasalCongestionYes SneezeYes
0.29057 0.44007 0.16502
FatigueYes SubjectiveFeverYes HeadacheYes
0.23096 0.25502 0.33749
Weakness.L Weakness.Q Weakness.C
0.65600 0.31281 -0.11280
CoughIntensity.L CoughIntensity.Q CoughIntensity.C
-0.77141 -0.11999 -0.13857
Myalgia.L Myalgia.Q Myalgia.C
0.13177 -0.03511 -0.10548
RunnyNoseYes AbPainYes ChestPainYes
0.05264 0.94153 0.07153
DiarrheaYes EyePnYes InsomniaYes
1.06479 -0.32772 0.08410
ItchyEyeYes EarPnYes PharyngitisYes
-0.05269 -0.16544 0.29154
BreathlessYes ToothPnYes VomitYes
0.53063 0.48015 2.44716
WheezeYes BodyTemp
-0.27581 -0.03136
Degrees of Freedom: 729 Total (i.e. Null); 698 Residual
Null Deviance: 944.7
Residual Deviance: 752.1 AIC: 816.1
#take a look at the results of glm_fit2
glance(glm_fit2)
# A tibble: 1 × 8
null.deviance df.null logLik AIC BIC deviance df.residual nobs
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 945. 729 -376. 816. 963. 752. 698 730
Comparing models
compare_performance(glm_fit, glm_fit2)
# Comparison of Model Performance Indices
Name | Model | AIC (weights) | AICc (weights) | BIC (weights) | Tjur's R2 | RMSE | Sigma | Log_loss | Score_log | Score_spherical | PCP
----------------------------------------------------------------------------------------------------------------------------------------------
glm_fit | _glm | 948.6 (<.001) | 948.6 (<.001) | 957.8 (0.936) | 1.169e-04 | 0.477 | 1.139 | 0.647 | -107.871 | 0.012 | 0.545
glm_fit2 | _glm | 816.1 (>.999) | 819.2 (>.999) | 963.1 (0.064) | 0.246 | 0.415 | 1.038 | 0.515 | -Inf | 0.002 | 0.657
The logistic model with all predictors has a much lower AIC (816.1 vs. 948.6), so it appears to be the better model, although BIC, which penalizes the extra parameters more heavily, slightly favors the simpler model.
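As a final cross-check, because the two logistic models are nested, a likelihood-ratio test can be run on the underlying glm objects. A minimal sketch:
#likelihood-ratio test comparing the nested logistic models (sketch)
anova(extract_fit_engine(glm_fit), extract_fit_engine(glm_fit2), test = "Chisq")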