Model Fitting
Load packages and cleaned data
#load required packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.0 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.1 ✔ tibble 3.1.8
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(here)
here() starts at C:/Users/Katie/Documents/2022-2023/MADA/katiewells-MADA-portfolio
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom 1.0.3 ✔ rsample 1.1.1
✔ dials 1.1.0 ✔ tune 1.0.1
✔ infer 1.0.4 ✔ workflows 1.1.3
✔ modeldata 1.1.0 ✔ workflowsets 1.0.0
✔ parsnip 1.0.4 ✔ yardstick 1.1.0
✔ recipes 1.0.5
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter() masks stats::filter()
✖ recipes::fixed() masks stringr::fixed()
✖ dplyr::lag() masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step() masks stats::step()
• Learn how to get started at https://www.tidymodels.org/start/
library(performance)
Attaching package: 'performance'
The following objects are masked from 'package:yardstick':
mae, rmse
#load data
flu2 <- readRDS(here("fluanalysis", "data", "flu2.rds"))
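Before fitting, a quick glimpse of the cleaned data is a useful sanity check on the variable types (a minimal sketch; it assumes the outcome and predictor columns used below, such as BodyTemp, RunnyNose, and Nausea, are present in flu2).
#sanity check: variable names and types in the cleaned data (sketch)
glimpse(flu2)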
Linear Modeling with Main Predictor
Here we will use tidymodels to fit a linear model to BodyTemp, first with just the main predictor RunnyNose and later with all predictors.
#specify functional form of model
lm_mod <- linear_reg() %>% set_engine("lm")
#Fit a linear model to the continuous outcome (BodyTemp) using only the main predictor of interest (RunnyNose)
lm_fit <- lm_mod %>% fit(BodyTemp ~ RunnyNose, data = flu2)
lm_fit
parsnip model object
Call:
stats::lm(formula = BodyTemp ~ RunnyNose, data = data)
Coefficients:
(Intercept) RunnyNoseYes
99.1431 -0.2926
#take a look at the results of lm_fit
glance(lm_fit)
# A tibble: 1 × 12
r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC BIC devia…⁴ df.re…⁵
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 0.0123 0.0110 1.19 9.08 0.00268 1 -1162. 2329. 2343. 1031. 728
# … with 1 more variable: nobs <int>, and abbreviated variable names
# ¹r.squared, ²adj.r.squared, ³statistic, ⁴deviance, ⁵df.residual
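glance() summarizes the model as a whole; to see the coefficient estimates with confidence intervals, broom's tidy() can be called on the parsnip fit as well. A minimal sketch (conf.int is passed through to broom's method for lm):
#coefficient-level estimates and 95% confidence intervals for the simple model (sketch)
tidy(lm_fit, conf.int = TRUE)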
Linear Modeling with all predictors
#Fit a linear model to the continuous outcome (BodyTemp) using all predictors
lm_fit2 <- lm_mod %>%
  fit(BodyTemp ~ ., data = flu2)
lm_fit2
parsnip model object
Call:
stats::lm(formula = BodyTemp ~ ., data = data)
Coefficients:
(Intercept) SwollenLymphNodesYes ChestCongestionYes
98.268642 -0.166066 0.102608
ChillsSweatsYes NasalCongestionYes SneezeYes
0.191826 -0.221568 -0.372709
FatigueYes SubjectiveFeverYes HeadacheYes
0.272509 0.436694 0.004977
Weakness.L Weakness.Q Weakness.C
0.254046 0.128981 0.023351
CoughIntensity.L CoughIntensity.Q CoughIntensity.C
0.170659 -0.166954 0.129975
Myalgia.L Myalgia.Q Myalgia.C
-0.132614 -0.131992 0.093532
RunnyNoseYes AbPainYes ChestPainYes
-0.064666 0.015375 0.100308
DiarrheaYes EyePnYes InsomniaYes
-0.152344 0.128786 -0.005824
ItchyEyeYes NauseaYes EarPnYes
-0.003889 -0.033246 0.111125
PharyngitisYes BreathlessYes ToothPnYes
0.315991 0.086379 -0.035497
VomitYes WheezeYes
0.160053 -0.034654
#take a look at the results of lm_fit2
glance(lm_fit2)
# A tibble: 1 × 12
r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC BIC devia…⁴ df.re…⁵
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 0.124 0.0853 1.14 3.19 2.47e-8 31 -1118. 2302. 2453. 914. 698
# … with 1 more variable: nobs <int>, and abbreviated variable names
# ¹r.squared, ²adj.r.squared, ³statistic, ⁴deviance, ⁵df.residual
Comparing models
I was struggling to figure out how to compare these models with code. After some searching I came across the performance package, so I'm going to use its compare_performance() function to do this.
compare_performance(lm_fit, lm_fit2)
# Comparison of Model Performance Indices
Name | Model | AIC (weights) | AICc (weights) | BIC (weights) | R2 | R2 (adj.) | RMSE | Sigma
------------------------------------------------------------------------------------------------------
lm_fit | _lm | 2329.3 (<.001) | 2329.4 (<.001) | 2343.1 (>.999) | 0.012 | 0.011 | 1.188 | 1.190
lm_fit2 | _lm | 2301.6 (>.999) | 2304.8 (>.999) | 2453.1 (<.001) | 0.124 | 0.085 | 1.119 | 1.144
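Since the single-predictor model is nested inside the full model, an F-test is another way to cross-check this comparison. A hedged sketch using parsnip's extract_fit_engine() to pull out the underlying lm objects:
#nested-model F-test as a cross-check on compare_performance() (sketch)
anova(extract_fit_engine(lm_fit), extract_fit_engine(lm_fit2))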
Logistic Modeling with Main Predictor
Here we will use tidymodels to fit a logistic model to Nausea, first with just the main predictor RunnyNose and later with all predictors.
#specify functional form of model
glm_mod <- logistic_reg() %>%
  set_engine("glm")
#Fit a logistic model to the categorical outcome (Nausea) using only the main predictor of interest (RunnyNose)
glm_fit <- glm_mod %>%
  fit(Nausea ~ RunnyNose, data = flu2)
glm_fit
parsnip model object
Call: stats::glm(formula = Nausea ~ RunnyNose, family = stats::binomial,
data = data)
Coefficients:
(Intercept) RunnyNoseYes
-0.65781 0.05018
Degrees of Freedom: 729 Total (i.e. Null); 728 Residual
Null Deviance: 944.7
Residual Deviance: 944.6 AIC: 948.6
#take a look at the results of glm_fit
glance(glm_fit)
# A tibble: 1 × 8
null.deviance df.null logLik AIC BIC deviance df.residual nobs
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 945. 729 -472. 949. 958. 945. 728 730
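The RunnyNose coefficient above is on the log-odds scale; exponentiating it gives an odds ratio, which is easier to interpret. A minimal sketch (exponentiate and conf.int are passed through to broom's method for glm):
#odds ratio and 95% CI for RunnyNose in the simple logistic model (sketch)
tidy(glm_fit, exponentiate = TRUE, conf.int = TRUE)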
Logistic modeling with all predictors
#Fit a logistic model to the categorical outcome (Nausea) using all predictors
glm_fit2 <- glm_mod %>%
  fit(Nausea ~ ., data = flu2)
glm_fit2
parsnip model object
Call: stats::glm(formula = Nausea ~ ., family = stats::binomial, data = data)
Coefficients:
(Intercept) SwollenLymphNodesYes ChestCongestionYes
0.05966 -0.24685 0.26529
ChillsSweatsYes NasalCongestionYes SneezeYes
0.29057 0.44007 0.16502
FatigueYes SubjectiveFeverYes HeadacheYes
0.23096 0.25502 0.33749
Weakness.L Weakness.Q Weakness.C
0.65600 0.31281 -0.11280
CoughIntensity.L CoughIntensity.Q CoughIntensity.C
-0.77141 -0.11999 -0.13857
Myalgia.L Myalgia.Q Myalgia.C
0.13177 -0.03511 -0.10548
RunnyNoseYes AbPainYes ChestPainYes
0.05264 0.94153 0.07153
DiarrheaYes EyePnYes InsomniaYes
1.06479 -0.32772 0.08410
ItchyEyeYes EarPnYes PharyngitisYes
-0.05269 -0.16544 0.29154
BreathlessYes ToothPnYes VomitYes
0.53063 0.48015 2.44716
WheezeYes BodyTemp
-0.27581 -0.03136
Degrees of Freedom: 729 Total (i.e. Null); 698 Residual
Null Deviance: 944.7
Residual Deviance: 752.1 AIC: 816.1
#take a look at the results of glm_fit2
glance(glm_fit2)
# A tibble: 1 × 8
null.deviance df.null logLik AIC BIC deviance df.residual nobs
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 945. 729 -376. 816. 963. 752. 698 730
Comparing models
compare_performance(glm_fit, glm_fit2)
# Comparison of Model Performance Indices
Name | Model | AIC (weights) | AICc (weights) | BIC (weights) | Tjur's R2 | RMSE | Sigma | Log_loss | Score_log | Score_spherical | PCP
----------------------------------------------------------------------------------------------------------------------------------------------
glm_fit | _glm | 948.6 (<.001) | 948.6 (<.001) | 957.8 (0.936) | 1.169e-04 | 0.477 | 1.139 | 0.647 | -107.871 | 0.012 | 0.545
glm_fit2 | _glm | 816.1 (>.999) | 819.2 (>.999) | 963.1 (0.064) | 0.246 | 0.415 | 1.038 | 0.515 | -Inf | 0.002 | 0.657
The logistic model with all predictors has a much lower AIC (816.1 vs. 948.6), so it appears to be the better model, although BIC, which penalizes the extra parameters more heavily, slightly favors the simpler model.
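As a final cross-check, because the two logistic models are nested, a likelihood-ratio test can be run on the underlying glm objects. A minimal sketch:
#likelihood-ratio test comparing the nested logistic models (sketch)
anova(extract_fit_engine(glm_fit), extract_fit_engine(glm_fit2), test = "Chisq")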