Data Tidying

Lecture 5

Dr. Eric Friedlander

College of Idaho
CSCI 2025 - Winter 2026

This lesson

  • Define “tidy data”
  • Learn how to use tidyr to make data tidy
    • pivot_longer() for lengthening data
    • pivot_wider() for widening data

What is tidy data?

Tidy data is a consistent way of structuring datasets that makes them easier to work with. A dataset is tidy if it follows three rules:

  1. Each variable has its own column.
  2. Each observation has its own row.
  3. Each value has its own cell.

This structure is a standard in the tidyverse.
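As a small illustration of the three rules, the sketch below builds the same made-up data twice: once with year values spread across column names and once in tidy form. The countries, years, and counts are hypothetical and exist only for this example.

```r
library(tidyverse)

# Hypothetical data: cases counted per country and year.
# Untidy: 1999 and 2000 are values of a "year" variable,
# but here they are stored as column names.
untidy <- tribble(
  ~country, ~`1999`, ~`2000`,
  "A",          745,    2666,
  "B",        37737,   80488
)

# Tidy: each variable (country, year, cases) has its own column,
# and each observation (one country in one year) has its own row.
tidy <- tribble(
  ~country, ~year, ~cases,
  "A",       1999,    745,
  "A",       2000,   2666,
  "B",       1999,  37737,
  "B",       2000,  80488
)

# With tidy data, a per-year total is a plain group_by() + summarize()
yearly_totals <- tidy |>
  group_by(year) |>
  summarize(total_cases = sum(cases))

yearly_totals
```

Computing the same totals from the untidy version would require naming each year column by hand, which is the kind of friction tidy data avoids.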

Setup

We will be using functions from the tidyr package, which is part of the tidyverse.

# load packages
library(tidyverse)

Lengthening data with pivot_longer()

  • Wide Format: column names are actually values of a variable:
relig_income |> head()
# A tibble: 6 × 11
  religion  `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
  <chr>       <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
1 Agnostic       27        34        60        81        76       137        122
2 Atheist        12        27        37        52        35        70         73
3 Buddhist       27        21        30        34        33        58         62
4 Catholic      418       617       732       670       638      1116        949
5 Don’t kn…      15        14        15        11        10        35         21
6 Evangeli…     575       869      1064       982       881      1486        949
# ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#   `Don't know/refused` <dbl>

This is not tidy. The income brackets are stored as column names, but they are really the values of a single variable (income).

Using pivot_longer()

  • Goal: transform this so we have three columns: religion, income, and count.
  • Use pivot_longer() to do this.
tidy_relig_income <- relig_income |> 
  pivot_longer(
    cols = !religion, 
    names_to = "income", 
    values_to = "count"
  )

tidy_relig_income |> head()
# A tibble: 6 × 3
  religion income  count
  <chr>    <chr>   <dbl>
1 Agnostic <$10k      27
2 Agnostic $10-20k    34
3 Agnostic $20-30k    60
4 Agnostic $30-40k    81
5 Agnostic $40-50k    76
6 Agnostic $50-75k   137
  • cols: The columns to pivot into longer format. !religion means all columns except religion.
  • names_to: The name of the new column that will contain the names of the original columns.
  • values_to: The name of the new column that will contain the values from the original columns.
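One way to convince yourself of what each argument did is to compare the pivoted result against the original. The checks below are a minimal sketch and use only relig_income and the three arguments described above; the helper names expected_rows and pivoted_cols are introduced here for illustration.

```r
library(tidyverse)

tidy_relig_income <- relig_income |>
  pivot_longer(
    cols = !religion,
    names_to = "income",
    values_to = "count"
  )

# Every original row contributes one new row per pivoted column
expected_rows <- nrow(relig_income) * (ncol(relig_income) - 1)
nrow(tidy_relig_income) == expected_rows

# The pivoted column names become the values of `income`
pivoted_cols <- setdiff(names(relig_income), "religion")
setequal(unique(tidy_relig_income$income), pivoted_cols)

# The counts are only rearranged, so their total is unchanged
original_total <- sum(as.matrix(select(relig_income, !religion)))
sum(tidy_relig_income$count) == original_total
```

Each comparison should print TRUE: pivoting changes the shape of the data, not its contents.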

Practice

Let’s practice with another dataset: billboard.

Widening data with pivot_wider()

pivot_wider() is the opposite of pivot_longer(). It’s used when an observation is scattered across multiple rows.

fish_encounters |> head()
# A tibble: 6 × 3
  fish  station  seen
  <fct> <fct>   <int>
1 4842  Release     1
2 4842  I80_1       1
3 4842  Lisbon      1
4 4842  Rstr        1
5 4842  Base_TD     1
6 4842  BCE         1
  • One observation = one fish
  • fish_encounters has one row for each station where a fish was detected; stations where it was not detected have no row
  • station column contains values that will become the new column names
  • seen column contains the values for those columns (always 1 in this dataset)
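Before pivoting, it can help to check how the data are actually recorded. The sketch below counts the values of seen and the number of rows per fish; the names seen_counts and rows_per_fish (and the n_stations column) are introduced here for illustration.

```r
library(tidyverse)

# Which values of `seen` occur, and how often?
seen_counts <- fish_encounters |>
  count(seen)

seen_counts

# How many rows (stations) does each fish have?
rows_per_fish <- fish_encounters |>
  count(fish, name = "n_stations")

rows_per_fish |> head()
```

If the number of rows differs from fish to fish, some fish-station combinations are simply absent from the data rather than recorded as 0.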

Using pivot_wider()

Goal: one row per fish, with one column per station indicating whether the fish was seen there.

fish_encounters |>
  pivot_wider(
    names_from = station,
    values_from = seen
  ) |>
  head()
# A tibble: 6 × 12
  fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
  <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int> <int>
1 4842        1     1      1     1       1     1     1     1     1     1     1
2 4843        1     1      1     1       1     1     1     1     1     1     1
3 4844        1     1      1     1       1     1     1     1     1     1     1
4 4845        1     1      1     1       1    NA    NA    NA    NA    NA    NA
5 4847        1     1      1    NA      NA    NA    NA    NA    NA    NA    NA
6 4848        1     1      1     1      NA    NA    NA    NA    NA    NA    NA

Problem: the NA values mark stations where a fish was not seen, so they should be zeros. We can fix this with values_fill.

Using pivot_wider() with values_fill

fish_encounters |>
  pivot_wider(
    names_from = station,
    values_from = seen,
    values_fill = 0
  ) |>
  head()
# A tibble: 6 × 12
  fish  Release I80_1 Lisbon  Rstr Base_TD   BCE   BCW  BCE2  BCW2   MAE   MAW
  <fct>   <int> <int>  <int> <int>   <int> <int> <int> <int> <int> <int> <int>
1 4842        1     1      1     1       1     1     1     1     1     1     1
2 4843        1     1      1     1       1     1     1     1     1     1     1
3 4844        1     1      1     1       1     1     1     1     1     1     1
4 4845        1     1      1     1       1     0     0     0     0     0     0
5 4847        1     1      1     0       0     0     0     0     0     0     0
6 4848        1     1      1     1       0     0     0     0     0     0     0
  • names_from: The column to get the new column names from.
  • values_from: The column to get the cell values from.
  • values_fill: The value to use for combinations that have no row in the original data (otherwise they become NA).
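To see what values_fill adds, one option is to widen the data and then lengthen it again. This is a sketch of a round trip, not part of the original example; the names fish_wide and fish_long are introduced here for illustration.

```r
library(tidyverse)

# Widen: one row per fish, one column per station, 0 where no row existed
fish_wide <- fish_encounters |>
  pivot_wider(
    names_from = station,
    values_from = seen,
    values_fill = 0
  )

# Lengthen again: every fish-station pair now has a row, including the 0s
fish_long <- fish_wide |>
  pivot_longer(
    cols = !fish,
    names_to = "station",
    values_to = "seen"
  )

nrow(fish_encounters)  # only the pairs where the fish was seen
nrow(fish_long)        # all pairs: number of fish x number of stations

fish_long |> count(seen)
```

The round trip does not return the original table: the missing fish-station pairs that were implicit in fish_encounters are now explicit rows with seen equal to 0.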

Practice

Let’s practice with another dataset: cms_patient_care.

Summary

  • Tidy data is a standard format that makes data analysis easier.
  • Use pivot_longer() when your column names are actually values of a variable (to make data longer and narrower).
  • Use pivot_wider() when an observation is scattered across multiple rows (to make data wider and shorter).
  • You will often need to use pivot_longer() or pivot_wider() when preparing data for plotting.

These two functions are the foundation of data tidying in R.
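To illustrate the plotting point from the summary, here is a minimal sketch that lengthens relig_income before passing it to ggplot2. The choice of three religions, the dodged bar chart, and the axis labels are assumptions made for this example; fct_inorder() comes from forcats, which is loaded with the tidyverse.

```r
library(tidyverse)

plot_data <- relig_income |>
  pivot_longer(
    cols = !religion,
    names_to = "income",
    values_to = "count"
  ) |>
  # Keep the brackets in their original column order, not alphabetical order
  mutate(income = fct_inorder(income)) |>
  # A small subset keeps the example plot readable
  filter(religion %in% c("Agnostic", "Atheist", "Buddhist"))

# In long format, income and count each map directly to one aesthetic
income_plot <- ggplot(plot_data, aes(x = income, y = count, fill = religion)) +
  geom_col(position = "dodge") +
  labs(x = "Income bracket", y = "Number of respondents")

income_plot
```

In the original wide format there is no single column to map to the x-axis, which is why the pivot comes first.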

Wrap Up

Do Next

  1. Read Chapter 5: Data Tidying from r4ds.
  2. Open the Recitation Gem and say “Provide me practice problems for Chapter 5” or work through some of the exercises in the text.
  3. Complete Lecture 6.