Feature Engineering with Recipes

Lecture 31

Dr. Eric Friedlander

College of Idaho
CSCI 2025 - Winter 2026

Introduction

Preprocessing Data

  • Feature Engineering: transforming variables to make them more suitable for a model (or visualization!).
  • Common tasks:
    • Converting categorical data to numeric (Dummy variables).
    • Centering and Scaling (Normalization).
    • Handling missing data.
    • Transforming skewed distributions (Log).
  • The recipes package: A tidy interface for data preprocessing.

The recipe Concept

  • A recipe is a blueprint for data processing.
  • It defines what you want to do, not when to do it.
  • Workflow:
    1. recipe(): Define the formula and data.
    2. step_*(): Add processing steps.
    3. prep(): Train the recipe (calculate means, SDs, levels, etc.).
    4. bake(): Apply the recipe to new data.

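The four-step lifecycle above can be sketched end-to-end in a few lines (the packages are loaded on the next slide; the single step here is just illustrative):

```r
library(tidymodels)
library(palmerpenguins)

rec <- recipe(species ~ ., data = penguins) |>   # 1. define formula + data
  step_normalize(all_numeric_predictors())       # 2. add processing steps

prepped <- prep(rec, training = penguins)        # 3. estimate means/SDs
baked   <- bake(prepped, new_data = NULL)        # 4. apply to the data
```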
Load Packages

library(tidymodels)
library(palmerpenguins)
library(tidyverse)

Defining a Recipe

The Basics

  • Start with recipe(formula, data).
  • Role: outcome (LHS) vs predictor (RHS).
  • Data: Used only to check variable names and types (not for training yet).
simple_rec <- recipe(species ~ ., data = penguins)
simple_rec

Adding Steps

  • Steps are added using pipes (|>).
  • Examples (order matters!):
    • Impute missing data first.
    • Individual transformations (log).
    • Discretization / Dummy Variables.
    • Normalization (Center/Scale).
    • Multivariate transformations (PCA).
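A recipe that follows the ordering above might look like this (the exact set of steps is illustrative, not a recommendation for the penguins data):

```r
ordered_rec <- recipe(species ~ ., data = penguins) |>
  step_impute_median(all_numeric_predictors()) |>  # 1. impute missing data first
  step_log(body_mass_g) |>                         # 2. individual transformations
  step_dummy(all_nominal_predictors()) |>          # 3. dummy variables
  step_normalize(all_numeric_predictors()) |>      # 4. center and scale
  step_pca(all_numeric_predictors(), num_comp = 2) # 5. multivariate (PCA)
```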

The Lifecycle: Prep & Bake

prep(): Estimating Parameters

  • prep() executes the recipe on the training data.
  • It calculates necessary statistics (means, SDs, factor levels).
# Define the full recipe
final_rec <- recipe(species ~ ., data = penguins) |>
  step_naomit(all_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

# Train it
prepped_rec <- prep(final_rec, training = penguins)
prepped_rec

bake(): Applying to Data

  • bake() applies the transformations to data.
  • Use new_data = NULL to get the processed data you used to prep the recipe.
# Process the data
processed_data <- bake(prepped_rec, new_data = NULL)

processed_data |> head()
# A tibble: 6 × 9
  bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year species
           <dbl>         <dbl>             <dbl>       <dbl> <dbl> <fct>  
1         -0.895         0.780            -1.42       -0.568 -1.28 Adelie 
2         -0.822         0.119            -1.07       -0.506 -1.28 Adelie 
3         -0.675         0.424            -0.426      -1.19  -1.28 Adelie 
4         -1.33          1.08             -0.568      -0.940 -1.28 Adelie 
5         -0.858         1.74             -0.782      -0.692 -1.28 Adelie 
6         -0.931         0.323            -1.42       -0.723 -1.28 Adelie 
# ℹ 3 more variables: island_Dream <dbl>, island_Torgersen <dbl>,
#   sex_male <dbl>

Why prep and bake?

  • Separation of concerns: You define the steps once, then apply them to:
    • Training data (data you use to create your model)
    • Testing data (data you use to evaluate your model)
    • New data in production
  • Prevents data leakage:
    • Means/SDs are calculated on the training data only.
    • The same statistics are then applied to the test data.
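A sketch of that separation: statistics are estimated on the training split, then the identical transformation is applied to the test split (the split proportion here is illustrative):

```r
set.seed(123)
split <- initial_split(penguins, prop = 0.8)
train <- training(split)
test  <- testing(split)

rec <- recipe(species ~ bill_length_mm, data = train) |>
  step_normalize(bill_length_mm) |>
  prep(training = train)                 # mean/SD estimated on train only

baked_test <- bake(rec, new_data = test) # train's mean/SD applied to test
```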

Common Steps

Imputation & Transformations

  • step_naomit(): Remove rows with NA (simple).
  • step_impute_*(): Impute missing values (mean, median, knn).
  • step_log(): Log transform skewed variables.
  • step_mutate(): General mutations (similar to dplyr::mutate).
# Example: Log transform body mass
rec_log <- recipe(species ~ body_mass_g, data = penguins) |>
  step_log(body_mass_g)
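If you would rather keep the incomplete rows, an imputation step can fill them in instead; here is a median-imputation sketch:

```r
# Example: replace missing body masses with the training-set median
rec_impute <- recipe(species ~ body_mass_g, data = penguins) |>
  step_impute_median(body_mass_g)
```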

Numeric Scaling

  • Centering: Subtract the mean (mean = 0).
  • Scaling: Divide by standard deviation (sd = 1).
  • Important for algorithms that are sensitive to scale.
  • Use step_normalize() (does both center and scale).
rec_norm <- recipe(species ~ bill_length_mm, data = penguins) |>
  step_normalize(bill_length_mm)
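You can verify that normalization worked by prepping and baking the recipe, then checking the resulting mean and SD:

```r
baked <- prep(rec_norm, training = penguins) |>
  bake(new_data = NULL)

mean(baked$bill_length_mm, na.rm = TRUE)  # approximately 0
sd(baked$bill_length_mm, na.rm = TRUE)    # approximately 1
```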

Categorical Data (Dummy Variables)

  • Many algorithms require numeric input (PCA, K-Means, Linear Regression).
  • Dummy Variables: Convert categories into binary 0/1 columns.
  • step_dummy(all_nominal_predictors()).
rec_dummy <- recipe(body_mass_g ~ species + island, data = penguins) |>
  step_dummy(all_nominal_predictors())
  • Watch out for the “dummy trap” (multicollinearity); step_dummy handles this by default by dropping one level as the reference.
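You can see the reference-level behavior directly: with three species levels, baking produces only two indicator columns (a small sketch; the dummy column names follow the recipes level_name convention):

```r
baked <- recipe(body_mass_g ~ species, data = penguins) |>
  step_dummy(species) |>
  prep(training = penguins) |>
  bake(new_data = NULL)

# Two dummy columns (Adelie, the first level, is the reference):
names(baked)
```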

Getting ready for PCA and Clustering

  • For PCA and Clustering:
    • Use step_normalize() (Variable scale influences distance).
    • Use step_dummy() (Must be numeric).
    • Often drop unwanted variables.
clustering_rec <- recipe(~ ., data = penguins) |>
  step_rm(species, year, sex) |> # Remove non-predictive/outcome vars
  step_naomit(all_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

clustering_rec

Wrap-Up

Recap

  • Recipe: A plan for data processing.
  • Steps:
    • step_dummy(): Categorical -> Numeric.
    • step_normalize(): Center and Scale.
  • Process:
    1. Define recipe().
    2. Add step_*().
    3. prep() on training data.
    4. bake() to get the result.
  • Next Time: Using this for Dimensionality Reduction (PCA)!

Do Next

  1. Read Chapter 8: Feature engineering with recipes from Tidy Modeling with R.
  2. There’s NO recitation Gem for this textbook, but I recommend creating your own and adding the textbook chapter and these slides.
  3. Move on to Lecture 32!