Regular Expressions

Lecture 13

Dr. Eric Friedlander

College of Idaho
CSCI 2025 - Winter 2026

Introduction

Regular Expressions (Regex)

  • A powerful tool for describing patterns in strings.
  • Used for finding, extracting, and replacing text.
  • stringr provides a consistent interface for working with regex.

Key Functions

str_detect()

  • str_detect() returns TRUE if a pattern is found in a string, FALSE otherwise.
library(stringr)
x <- c("apple", "banana", "pear", "pineapple", "naartjie")
str_detect(x, "an")
[1] FALSE  TRUE FALSE FALSE FALSE

str_count()

  • str_count() counts the number of matches in a string.
str_count("banana", "a")
[1] 3

str_extract() and str_extract_all()

  • str_extract() extracts the first match.
  • str_extract_all() extracts all matches.
str_extract("banana", "an")
[1] "an"
str_extract_all("banana", "an")
[[1]]
[1] "an" "an"
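Both functions are vectorized over their input. As a small sketch (the vector fruits is made up for illustration), str_extract() gives one result per string, with NA where nothing matches, while str_extract_all() gives a list holding one character vector per string:

```r
library(stringr)

# Hypothetical example vector, not from the lecture data
fruits <- c("apple", "banana", "pear")

# One result per input string; NA when the pattern is absent
first_match <- str_extract(fruits, "an")

# A list with one character vector per input string
all_matches <- str_extract_all(fruits, "an")

# Number of matches per string, which agrees with str_count(fruits, "an")
n_matches <- lengths(all_matches)
n_matches
# [1] 0 2 0
```

The list shape is why str_extract_all() output is usually post-processed with lengths(), unlist(), or a map function rather than stored directly in a column.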

str_replace() and str_replace_all()

  • str_replace() replaces the first match with a new string; str_replace_all() replaces every match.
x <- c("Dr_Eric_Friedlander", "Dr_Brandy_Wiegers", "Anthony_Campitelli")
str_replace(x, "_", " ")
[1] "Dr Eric_Friedlander" "Dr Brandy_Wiegers"   "Anthony Campitelli" 
str_replace_all(x, "_", " ")
[1] "Dr Eric Friedlander" "Dr Brandy Wiegers"   "Anthony Campitelli" 

Regex

Instead of inputting literal strings, you can use regex patterns to describe more complex matches.

x <- c("Dr_Eric_Friedlander", "Dr_Brandy_Wiegers", "Anthony_Campitelli")
# remove all vowels
str_replace_all(x, "[aeiouAEIOU]", "")
[1] "Dr_rc_Frdlndr" "Dr_Brndy_Wgrs" "nthny_Cmptll" 

Pattern Components

Anchors

  • ^ matches the start of the string.
  • $ matches the end of the string.
x <- c("apple", "banana", "pear")
str_detect(x, "^a")
[1]  TRUE FALSE FALSE
str_detect(x, "a$")
[1] FALSE  TRUE FALSE
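Using both anchors together forces the pattern to cover the whole string. A minimal sketch, with a made-up vector, comparing the four combinations:

```r
library(stringr)

# Hypothetical example vector
phrases <- c("apple", "pineapple", "apple pie")

anywhere <- str_detect(phrases, "apple")    # "apple" appears somewhere
at_start <- str_detect(phrases, "^apple")   # string starts with "apple"
at_end   <- str_detect(phrases, "apple$")   # string ends with "apple"
exact    <- str_detect(phrases, "^apple$")  # string is exactly "apple"

anywhere
# [1] TRUE TRUE TRUE
at_start
# [1]  TRUE FALSE  TRUE
at_end
# [1]  TRUE  TRUE FALSE
exact
# [1]  TRUE FALSE FALSE
```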

Character Classes

  • . matches any character except a newline.
  • \d matches any digit (written "\\d" inside an R string).
  • \s matches any whitespace (written "\\s" inside an R string).
  • [abc] matches a, b, or c.
  • [^abc] matches anything except a, b, or c.
  • | is alternation: it matches either the expression before the | or the expression after it.
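Here is a short sketch of the components above in action. The vector samples is made up for illustration. Note that the backslash has to be doubled inside an R string.

```r
library(stringr)

# Hypothetical example vector; "a\tb" contains a tab character
samples <- c("room 101", "no digits", "a\tb", "cab")

has_digit <- str_detect(samples, "\\d")       # any digit
has_space <- str_detect(samples, "\\s")       # any whitespace, tabs included
only_abc  <- str_detect(samples, "^[abc]+$")  # made up of a, b, c only
first_other <- str_extract(samples, "[^abc]") # first character that is not a, b, or c
room_or_cab <- str_detect(samples, "room|cab") # either alternative

has_digit
# [1]  TRUE FALSE FALSE FALSE
has_space
# [1]  TRUE  TRUE  TRUE FALSE
only_abc
# [1] FALSE FALSE FALSE  TRUE
room_or_cab
# [1]  TRUE FALSE FALSE  TRUE
```

For first_other the results are "r", "n", a tab, and NA, since "cab" contains nothing outside the set.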

Repetition

  • ?: 0 or 1 time.
  • +: 1 or more times.
  • *: 0 or more times.
  • {n}: exactly n times.
  • {n,}: n or more times.
  • {n,m}: between n and m times.
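A quantifier applies to the single item just before it. The sketch below uses made-up vectors (spellings and ids) and anchors every pattern so the counts apply to the whole string:

```r
library(stringr)

# Hypothetical example vectors
spellings <- c("color", "colour", "colouur", "clr")
ids <- c("7", "42", "123", "2026")

u_optional  <- str_detect(spellings, "^colou?r$")  # zero or one "u"
u_one_plus  <- str_detect(spellings, "^colou+r$")  # at least one "u"
u_zero_plus <- str_detect(spellings, "^colou*r$")  # any number of "u"

exactly_3  <- str_detect(ids, "^\\d{3}$")    # exactly three digits
at_least_2 <- str_detect(ids, "^\\d{2,}$")   # two or more digits
two_to_3   <- str_detect(ids, "^\\d{2,3}$")  # two or three digits

u_optional
# [1]  TRUE  TRUE FALSE FALSE
u_one_plus
# [1] FALSE  TRUE  TRUE FALSE
u_zero_plus
# [1]  TRUE  TRUE  TRUE FALSE
exactly_3
# [1] FALSE FALSE  TRUE FALSE
at_least_2
# [1] FALSE  TRUE  TRUE  TRUE
two_to_3
# [1] FALSE  TRUE  TRUE FALSE
```

Without the anchors, "\\d{3}" would also match "2026", because three digits appear somewhere inside it.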

Practice!

What does each of these do?

  • "^b.*a$"
  • "^.{5}$"
  • "[aeiou]"
  • Create a regular expression that will match telephone numbers as commonly written in your country.

Grouping and Back References

() for Grouping

  • Parentheses create a “capturing group” to extract parts of a match.
  • str_match() returns the full match in the first column and one column per group after it.
str_match("apple, banana, pear", "([a-z]+), ([a-z]+)")
     [,1]            [,2]    [,3]    
[1,] "apple, banana" "apple" "banana"

Back References

  • \1, \2, etc. refer to previously captured groups (written "\\1", "\\2" inside an R string).
str_replace("abab", "(a)(b)", "\\2\\1")
[1] "baab"
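Back references also work inside the pattern itself, not only in the replacement. A minimal sketch, using a made-up vector words and a made-up name string:

```r
library(stringr)

# Hypothetical example vector
words <- c("apple", "banana", "pear", "naartjie")

# "(.)\\1" matches any character immediately followed by the same character
has_double <- str_detect(words, "(.)\\1")
doubled <- str_extract(words, "(.)\\1")

has_double
# [1]  TRUE FALSE FALSE  TRUE

# In the replacement, back references let you reorder captured pieces
swapped <- str_replace("Friedlander, Eric", "(\\w+), (\\w+)", "\\2 \\1")
swapped
# [1] "Eric Friedlander"
```

Here doubled holds "pp" for apple and "aa" for naartjie, with NA for the words that have no doubled letter.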

Practice!

Exercise 6 from 15.4.7.

Other Tools

tidyr::separate_wider_regex()

  • Separates a column into multiple columns using a named vector of regex patterns: named pieces become new columns, unnamed pieces are matched and dropped.
library(tibble)
library(tidyr)
df <- tibble(x = "123-abc")
df |> separate_wider_regex(x, c(num = "\\d+", "-", chr = "[a-z]+"))
# A tibble: 1 × 2
  num   chr  
  <chr> <chr>
1 123   abc  
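The new columns always come back as character. As a sketch with a made-up two-row tibble (records is a hypothetical name), the numeric piece can be converted afterwards:

```r
library(tibble)
library(tidyr)

# Hypothetical example data
records <- tibble(x = c("123-abc", "45-xyz"))

# Named patterns become columns; the unnamed "-" is matched and dropped
parts <- records |>
  separate_wider_regex(x, c(num = "\\d+", "-", chr = "[a-z]+"))

# Columns are character after splitting, so convert explicitly
parts$num <- as.integer(parts$num)

parts$num
# [1] 123  45
parts$chr
# [1] "abc" "xyz"
```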

fixed()

  • Use fixed() to match a literal string without interpreting it as a regex.
str_detect("a.b", fixed("."))
[1] TRUE
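To see why fixed() matters, compare it with the unescaped regex. In this sketch (the vector strings is made up), the regex "a.b" treats the dot as "any character", while fixed("a.b") and the escaped pattern "a\\.b" only match a literal dot:

```r
library(stringr)

# Hypothetical example vector
strings <- c("a.b", "axb")

as_regex   <- str_detect(strings, "a.b")         # . is a wildcard
as_fixed   <- str_detect(strings, fixed("a.b"))  # . is a literal dot
as_escaped <- str_detect(strings, "a\\.b")       # escaped dot, also literal

as_regex
# [1] TRUE TRUE
as_fixed
# [1]  TRUE FALSE
as_escaped
# [1]  TRUE FALSE
```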

Practice!

Let’s extract the AR codes and the county names from the Naturalization data!

Wrap-Up

Do Next

  1. Read Chapter 15: Regular Expressions from r4ds.
  2. Open the Recitation Gem and say “Provide me practice problems for Chapter 15” or work through some of the exercises in the text.
  3. That's it for tonight! See you tomorrow.