Lecture 33
College of Idaho
CSCI 2025 - Winter 2026
tidyclust
The tidyclust package provides a tidymodels-like interface for clustering.
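Load the USArrests data and the tidyclust package.
Create a k_means() model specification (k=2 for this example).
Set the engine to "stats".
Fit the model with tidyclust, using ~ . to specify that we want to use all variables.
Printing the fit shows the tidyclust cluster object.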
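The steps above can be sketched as follows. This is a minimal sketch rather than the exact slide code: the object names `kmeans_spec` and `kmeans_fit` and the seed value are assumptions chosen for illustration, and `num_clusters` is the tidyclust argument that plays the role of k.

```r
library(tidymodels)
library(tidyclust)

# k-means starts from random centers; the seed value is an arbitrary choice
set.seed(2025)

# Model specification: 2 clusters, fitted with the "stats" engine (stats::kmeans)
kmeans_spec <- k_means(num_clusters = 2) |>
  set_engine("stats")

# ~ . means "use every column of USArrests as a clustering variable"
kmeans_fit <- kmeans_spec |>
  fit(~ ., data = USArrests)

# Printing the fit shows the underlying k-means result
kmeans_fit
```

Cluster labels depend on the random start, so the numbering of the clusters may differ from run to run.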
K-means clustering with 2 clusters of sizes 21, 29

Cluster means:
     Murder  Assault UrbanPop     Rape
2 11.857143 255.0000 67.61905 28.11429
1  4.841379 109.7586 64.03448 16.24828

Clustering vector:
       Alabama         Alaska        Arizona       Arkansas     California
             1              1              1              1              1
      Colorado    Connecticut       Delaware        Florida        Georgia
             1              2              1              1              1
        Hawaii          Idaho       Illinois        Indiana           Iowa
             2              2              1              2              2
        Kansas       Kentucky      Louisiana          Maine       Maryland
             2              2              1              2              1
 Massachusetts       Michigan      Minnesota    Mississippi       Missouri
             2              1              2              1              2
       Montana       Nebraska         Nevada  New Hampshire     New Jersey
             2              2              1              2              2
    New Mexico       New York North Carolina   North Dakota           Ohio
             1              1              1              2              2
      Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina
             2              2              2              2              1
  South Dakota      Tennessee          Texas           Utah        Vermont
             2              1              1              2              2
      Virginia     Washington  West Virginia      Wisconsin        Wyoming
             2              2              2              2              2

Within cluster sum of squares by cluster:
[1] 41636.73 54762.30
 (between_SS / total_SS =  72.9 %)

Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"
tidy() gives information about each cluster:
# A tibble: 2 × 7
  Murder Assault UrbanPop  Rape  size withinss cluster
   <dbl>   <dbl>    <dbl> <dbl> <int>    <dbl> <fct>
1  11.9     255      67.6  28.1    21   41637. 1
2   4.84    110.     64.0  16.2    29   54762. 2
augment() gives the cluster assignment for each observation.
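To make the difference between the two helpers concrete, here is a small sketch that calls both on the same fit. The names `cluster_summary` and `assignments` and the seed are assumptions for illustration; the fit is repeated only so the block runs on its own.

```r
library(tidymodels)
library(tidyclust)

set.seed(2025)  # arbitrary seed, only for reproducibility

kmeans_fit <- k_means(num_clusters = 2) |>
  set_engine("stats") |>
  fit(~ ., data = USArrests)

# One row per cluster: centroid coordinates, size, within-cluster sum of squares
cluster_summary <- tidy(kmeans_fit)

# One row per state, with the assigned cluster in the .pred_cluster column
assignments <- augment(kmeans_fit, new_data = USArrests)

# How many states fall in each cluster
count(assignments, .pred_cluster)
```

In short, tidy() summarizes at the cluster level, while augment() works at the observation level and returns the original columns plus the assignment.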
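library(tidymodels)
library(tidyclust)
# Recipe: standardize every variable, then keep the first two principal components
pca_rec <- recipe(~., data = USArrests) |>
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = 2) |>
  prep()
# Compute the principal component scores (PC1, PC2) for each state
US_Arrests_PCA <- bake(pca_rec, new_data = USArrests)
# Attach cluster assignments and PC scores, then plot the clusters in PC space
# (kmeans_fit is the fitted k-means model from earlier)
kmeans_fit |>
  augment(USArrests) |>
  bind_cols(US_Arrests_PCA) |>
  ggplot(aes(x = PC1, y = PC2, color = .pred_cluster)) +
  geom_point(size = 3) +
  labs(title = "K-Means Clustering (K=2)") +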
  theme_minimal()
tune_cluster() can be used to choose the number of clusters. We will not cover it in this class.
tidyclust
Let’s do a hierarchical clustering with tidyclust.
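A hierarchical fit follows the same specify-then-fit pattern as k-means. The sketch below is an illustration under assumptions, not the slide's own code: the names `hc_spec`, `hc_fit`, and `hc_assignments` are made up here, and cutting the tree into 2 clusters is chosen only to mirror the k-means example.

```r
library(tidymodels)
library(tidyclust)

# Specification: agglomerative clustering, tree cut into 2 clusters
hc_spec <- hier_clust(num_clusters = 2, linkage_method = "complete")

# Fit on all USArrests variables; no seed is needed because the
# procedure is deterministic
hc_fit <- hc_spec |>
  fit(~ ., data = USArrests)

# One row per state, with the assigned cluster in the .cluster column
hc_assignments <- extract_cluster_assignment(hc_fit)

# Cluster sizes
table(hc_assignments$.cluster)
```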
factoextra package to do this.
complete: The distance between two clusters is the maximum distance between any two points in the two clusters.
average: The distance between two clusters is the average distance between all pairs of points in the two clusters.
single: The distance between two clusters is the minimum distance between any two points in the two clusters.
We can specify the linkage method in hier_clust().
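To see how the linkage choice can change the result, the following sketch fits the same data once per linkage method and compares the cluster sizes. The helper name `fit_sizes` and the choice of 2 clusters are assumptions for illustration; `linkage_method` is the hier_clust() argument that selects the linkage.

```r
library(tidymodels)
library(tidyclust)

linkages <- c("complete", "average", "single")

# Hypothetical helper: fit with one linkage method, return cluster sizes
fit_sizes <- function(method) {
  hc_fit <- hier_clust(num_clusters = 2, linkage_method = method) |>
    fit(~ ., data = USArrests)
  table(extract_cluster_assignment(hc_fit)$.cluster)
}

# Named list with one table of cluster sizes per linkage method
sizes <- lapply(linkages, fit_sizes)
names(sizes) <- linkages
sizes
```

Single linkage tends to chain points together and often produces one large cluster with a few stragglers, so it is worth comparing the sizes before settling on a method.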
tidyclust: A tidymodels-like interface for clustering.
tune_cluster() can be used to tune the number of clusters.