Missing Values

Lecture 16

Dr. Eric Friedlander

College of Idaho
CSCI 2025 - Winter 2026

Introduction

Missing Values

  • Real-world data is often messy and contains missing values.
  • R represents missing values with NA.
  • There are explicit and implicit missing values.

Explicit Missing Values (NA)

is.na()

  • is.na() returns TRUE for every element that is NA and FALSE otherwise.
  • Most functions return NA if any input is NA.
  • Use na.rm = TRUE to drop NAs before the computation in functions like mean() and sum().
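
A minimal base R sketch of these points. The vector name scores and its values are made up for illustration and are not from the lecture data.

```r
# Hypothetical example vector with two missing entries
scores <- c(90, NA, 75, NA, 88)

is.na(scores)               # TRUE where a value is missing
sum(is.na(scores))          # number of missing values: 2
mean(scores)                # NA, because at least one input is NA
mean(scores, na.rm = TRUE)  # drops the NAs first, then averages 90, 75, 88
```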

tidyr::fill()

  • Fills in missing values using the last known value (last observation carried forward).
library(tidyverse)  # provides tibble(), fill(), coalesce(), na_if(), and complete()
df <- tibble(x = c(1, NA, NA, 2, NA), y = c("a", NA, "b", NA, NA))
df |> fill(x, y)
# A tibble: 5 × 2
      x y    
  <dbl> <chr>
1     1 a    
2     1 a    
3     1 b    
4     2 b    
5     2 b    
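
fill() carries values downward by default, so a leading NA stays NA. The sketch below, on a made-up one-column tibble, contrasts the default with the .direction = "up" option, which fills from the next observed value instead.

```r
library(tidyr)
library(tibble)

# Hypothetical data with a leading and a trailing NA
df <- tibble(x = c(NA, 1, NA, 2, NA))

# Default: carry the last observed value forward ("down")
down <- df |> fill(x)
# Alternative: carry the next observed value backward ("up")
up <- df |> fill(x, .direction = "up")

down$x  # NA 1 1 2 2 (the leading NA has nothing before it)
up$x    # 1 1 2 2 NA (the trailing NA has nothing after it)
```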

coalesce() and na_if()

  • dplyr::coalesce() replaces each NA with the first non-missing value at the same position in the other vectors (or with a single fallback value).
  • dplyr::na_if() replaces a specific value with NA.
x <- c(1, 2, NA, 4, -99)
coalesce(x, 0)
[1]   1   2   0   4 -99
na_if(x, -99)
[1]  1  2 NA  4 NA
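
coalesce() also accepts whole vectors as fallbacks, not just a single value. A small sketch, assuming two hypothetical phone-number vectors (mobile and home) that are not part of the lecture data:

```r
library(dplyr)

# Hypothetical contact data: prefer the mobile number, then the home number
mobile <- c("555-0101", NA, NA)
home   <- c("555-0199", "555-0142", NA)

# At each position, take the first non-missing value from left to right
coalesce(mobile, home, "unknown")
# [1] "555-0101" "555-0142" "unknown"
```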

Implicit Missing Values

Making Implicit NAs Explicit

  • An implicit missing value is an observation that is absent from the data entirely: there is no row for it, so there is no NA to see.
  • For example, if a group has no observations, it won’t appear in summaries.
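
The sketch below illustrates the summary problem on made-up data. It assumes a hypothetical sales table in which store "c" is a known store (registered as a factor level) with no recorded sales.

```r
library(dplyr)

# Hypothetical sales data: store "c" exists but recorded no sales
sales <- tibble(
  store  = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
  amount = c(10, 20, 5)
)

# Store "c" silently disappears from the summary
sales |> count(store)

# Keeping empty factor levels makes the gap visible as a count of 0
sales |> count(store, .drop = FALSE)
```

Using a factor is one way to tell R which groups should exist; without that information, R has no way to know that "c" is missing.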

tidyr::complete()

  • complete() turns implicit missing values into explicit NAs by generating every combination of the listed columns and adding a row, filled with NA, for each combination that was not observed.
df <- tibble(
  group = c("a", "a", "b"),
  year  = c(2020, 2022, 2021),
  value = 1:3
)
df |> complete(group, year)
# A tibble: 6 × 3
  group  year value
  <chr> <dbl> <int>
1 a      2020     1
2 a      2021    NA
3 a      2022     2
4 b      2020    NA
5 b      2021     3
6 b      2022    NA
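
complete() only knows about values that occur somewhere in the data. As a sketch, the same df can be completed against an explicit set of years, with the new gaps given a default through the fill argument. The choice of 2019 and of 0 as the fill value are assumptions made for illustration.

```r
library(tidyr)
library(tibble)

df <- tibble(
  group = c("a", "a", "b"),
  year  = c(2020, 2022, 2021),
  value = 1:3
)

# Supply the years explicitly (2019 never appears in the data) and
# replace the newly created NAs in `value` with 0
completed <- df |>
  complete(group, year = c(2019, 2020, 2021, 2022), fill = list(value = 0L))

completed  # 8 rows: 2 groups x 4 years
```

Filling with 0 is only appropriate when an absent row really means "none"; otherwise leaving the NA is the safer default.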

NaN

Not a Number (NaN)

  • NaN (“Not a Number”) is a special numeric value that can arise from invalid mathematical operations, like 0/0.
  • is.nan() tests for NaN.
  • NaN is also considered NA, so is.na(NaN) is TRUE.
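
A short base R sketch of the difference between the two tests, using a made-up vector x:

```r
# One ordinary value, one NA, and one NaN produced by 0 / 0
x <- c(1, NA, 0 / 0)

is.nan(x)  # FALSE FALSE  TRUE: only the NaN
is.na(x)   # FALSE  TRUE  TRUE: NA and NaN both count as missing
```

The relationship runs one way only: is.na(NaN) is TRUE, but is.nan(NA) is FALSE.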

Thinking about missing values

Missingness in Data Analysis

  • Be careful about removing missing values without understanding why they are missing.
  • Sometimes values recorded as missing aren’t missing at all (for example, an NA that really means “not applicable”).
  • Sometimes values that look non-missing are actually missing (for example, a placeholder code such as -99).
  • Sometimes there is a pattern to the missing data; this is called structured missingness.
  • Not understanding missingness can lead to biased results!
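
Before dropping anything, it helps to count what is missing and where. A minimal base R sketch on a made-up survey table, assuming that -99 was used as a "no answer" code:

```r
# Hypothetical survey data; -99 is assumed to be a "no answer" code
survey <- data.frame(
  age    = c(34, -99, 51, NA),
  income = c(52000, 61000, NA, NA)
)

# Missing values per column, taken at face value (the -99 is not counted)
colSums(is.na(survey))

# Recode the placeholder as NA, then count again
survey$age[which(survey$age == -99)] <- NA
colSums(is.na(survey))

# Rows where every column is missing hint at a pattern worth investigating
which(rowSums(is.na(survey)) == ncol(survey))
```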

Missing Data Mechanisms

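  • Missing Completely at Random (MCAR): The chance that a value is missing is unrelated to any data, observed or unobserved.
  • Missing at Random (MAR): The chance that a value is missing depends only on data that were observed, not on the missing value itself.
  • Missing Not at Random (MNAR): The chance that a value is missing depends on unobserved data, such as the missing value itself.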
  • Understanding the mechanism helps in choosing appropriate handling methods.
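
A small simulation can show why the mechanism matters. This is only a sketch on simulated data: the variable name income, the 30% missingness rate, and the seed are all arbitrary assumptions.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

income <- rnorm(1000, mean = 50, sd = 10)  # simulated "true" values

# MCAR: every value has the same 30% chance of being missing
mcar <- ifelse(runif(1000) < 0.3, NA, income)

# MNAR: the highest 30% of values are the ones that go missing
mnar <- ifelse(income > quantile(income, 0.7), NA, income)

mean(income)              # mean of the full data
mean(mcar, na.rm = TRUE)  # typically close to the full-data mean
mean(mnar, na.rm = TRUE)  # lower: the largest values were dropped
```

Under MCAR, simply dropping the NAs loses precision but does not systematically shift the mean. Under MNAR, the same na.rm = TRUE call gives a biased answer.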

Practice!

Let’s practice!

Wrap Up

Do Next

  1. Read Chapter 18: Missing Values from r4ds.
  2. Open the Recitation Gem and say “Provide me practice problems for Chapter 18” or work through some of the exercises in the text.
  3. Move on to Lecture 17!