Exploratory Data Analysis

Lecture 8

Dr. Eric Friedlander

College of Idaho
CSCI 2025 - Winter 2026

What is EDA?

Exploratory Data Analysis

EDA is an iterative cycle. You:

Generate questions about your data.
Search for answers by visualizing, transforming, and modelling your data.
Use what you learn to refine your questions and/or generate new questions.

Fun Quotes

“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey

Guiding Questions

Two main types of questions to ask:

What type of variation occurs within my variables?
What type of covariation occurs between my variables?

Data: `tips`

We will use the tips dataset from the reshape2 package (you will likely have to install it).
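
A minimal sketch of loading the data, assuming the tidyverse and reshape2 packages are already installed (the `install.packages()` line is left commented out). Calling `data()` with a `package` argument loads just the `tips` data frame without attaching the rest of reshape2.

```r
# Assumption: run this once if the packages are missing
# install.packages(c("tidyverse", "reshape2"))
library(tidyverse)

# Load only the tips data frame from reshape2
data(tips, package = "reshape2")

# One row per restaurant bill; columns include total_bill, tip, day, and time
glimpse(tips)
```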

Variation

Visualizing Distributions

Variation is the tendency of the values of a variable to change from measurement to measurement.

Every variable has its own pattern of variation, which can reveal interesting information.
The best way to understand that pattern is to visualize the variable’s distribution.

Questions

Which values are the most common? Why?
Which values are rare? Why? Does that match your expectations?
Can you see any unusual patterns? What might explain them?

Recall: Univariate Plots

Categorical variables: bar charts
Numerical variables: histograms, boxplots
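
As a reminder of what these look like in ggplot2, here is one possible sketch using `tips`. The choice of variables and the `binwidth` value are illustrative assumptions, not the only reasonable ones.

```r
library(ggplot2)
data(tips, package = "reshape2")

# Categorical variable: bar chart of the number of bills per day
p_day <- ggplot(tips, aes(x = day)) +
  geom_bar()

# Numerical variable: histogram of tip amounts (50-cent bins, an arbitrary choice)
p_tip_hist <- ggplot(tips, aes(x = tip)) +
  geom_histogram(binwidth = 0.5)

# Numerical variable: boxplot of tip amounts
p_tip_box <- ggplot(tips, aes(x = tip)) +
  geom_boxplot()

# Print each plot
p_day
p_tip_hist
p_tip_box
```

Trying a few different `binwidth` values is worthwhile, since a histogram can hide or reveal patterns depending on the bin size.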

Practice

Let’s practice exploring tip sizes.

Unusual Values

Outliers

Outliers are observations that are unusual; points that don’t seem to fit the pattern.
Sometimes they are data entry errors; other times they are genuinely important.

Handling Outliers

If they are data entry errors, try to fix them or remove them.
If they are genuine, their presence is important information.
Neither conclusion can be reached without understanding the context of the data and looking closely at the observations.
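
One way to find candidates worth a closer look is the 1.5 × IQR rule that boxplots use. The sketch below applies it to `tip`; the rule is only a convention for flagging points, and the names `fence_low`, `fence_high`, and `tip_candidates` are made up for this example.

```r
library(dplyr)
data(tips, package = "reshape2")

# Boxplot-style fences: 1.5 * IQR beyond the first and third quartiles
q <- quantile(tips$tip, c(0.25, 0.75))
fence_low  <- q[[1]] - 1.5 * IQR(tips$tip)
fence_high <- q[[2]] + 1.5 * IQR(tips$tip)

# Keep every column so the flagged bills can be read in context
tip_candidates <- tips |>
  filter(tip < fence_low | tip > fence_high) |>
  arrange(desc(tip))

tip_candidates
```

Whether a flagged row is an error or a genuinely large tip still has to be judged from the other columns, such as `total_bill` and `size`.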

More Practice

Let’s complete exercise 5
Let’s explore some outliers in the tips dataset

Covariation

Data: `mpg`

We will also use the mpg dataset from the ggplot2 package.
It contains information about the fuel economy of different car models.

glimpse(mpg)

Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

Visualizing Relationships

Covariation: how do two variables change together? What is their relationship?
Visualizing covariation helps you understand relationships between variables.
You need to think and plan carefully about how to display multiple distributions in a single plot.

Recall: Three general cases

Categorical + Categorical
- Stacked or side-by-side bar charts
Categorical + Numerical
- Boxplots, jitter plots, faceted histograms
Numerical + Numerical
- Scatterplots, jitter plots
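
A short sketch with one plot per case, using `mpg`. The variable pairings are assumptions picked for illustration.

```r
library(ggplot2)

# Categorical + categorical: side-by-side bars of drive type within each class
p_cat_cat <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "dodge")

# Categorical + numerical: boxplot of highway mileage for each class
p_cat_num <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

# Numerical + numerical: scatterplot of engine displacement against highway mileage
p_num_num <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

p_cat_cat
p_cat_num
p_num_num
```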

Reordering Categories

You can reorder the categorical variable to make the plot easier to read.
What kind of car gets the best gas mileage?
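
One possible approach is `fct_reorder()` from forcats, which reorders the levels of a factor by a summary of another variable. The sketch below orders `class` by median `hwy`; using the median is an assumption, and another summary such as the mean would also work.

```r
library(tidyverse)

# Reorder class so its levels run from lowest to highest median highway mileage
mpg_ordered <- mpg |>
  mutate(class = fct_reorder(class, hwy, .fun = median))

# Boxes now appear in order of median hwy, which makes comparisons easier
ggplot(mpg_ordered, aes(x = class, y = hwy)) +
  geom_boxplot() +
  labs(x = "class (ordered by median hwy)", y = "hwy")
```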

Some new plots

Categorical + Categorical
- Mosaic plots, count plots, tile plots
Categorical + Numerical
- Violin plots, sina plots, ridgeline plots, frequency polygons
Numerical + Numerical
- 2D-Histogram/heatmap of counts, hexbin plots
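
A sketch of a few of these using only geoms that ship with ggplot2. As far as I know, mosaic, sina, ridgeline, and hexbin plots each need an extra package (for example ggmosaic, ggforce, ggridges, and hexbin), so they are left out here. The `binwidth` value is an arbitrary choice.

```r
library(ggplot2)

# Categorical + categorical: count plot, where point size encodes the number of cars
p_count <- ggplot(mpg, aes(x = class, y = drv)) +
  geom_count()

# Categorical + numerical: violin plot of highway mileage by class
p_violin <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_violin()

# Categorical + numerical: frequency polygons, one line per drive type
p_freqpoly <- ggplot(mpg, aes(x = hwy, colour = drv)) +
  geom_freqpoly(binwidth = 2)

# Numerical + numerical: 2D histogram, where fill encodes the count in each bin
p_bin2d <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_bin_2d()

p_count
p_violin
p_freqpoly
p_bin2d
```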

Practice

Let’s practice exploring covariation in our data sets.

Patterns

Patterns in your data hint at relationships. If you can spot a pattern, ask yourself:
- Could this pattern be due to coincidence?
- How can you describe the relationship implied by the pattern?
- How strong is the relationship implied by the pattern?
- What other variables might affect the relationship?
- Does the relationship change if you look at individual subgroups of the data?
Models can help you extract the strong patterns and leave the weaker ones.
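
To make the questions about strength and subgroups concrete, here is a rough sketch on `mpg`. It uses the correlation between `displ` and `hwy` as one measure of strength and `drv` as the subgroup; both choices are assumptions for illustration, and correlation only captures linear relationships.

```r
library(dplyr)
library(ggplot2)

# Strength of the overall linear relationship
overall_cor <- cor(mpg$displ, mpg$hwy)

# The same measure within each drive type
cor_by_drv <- mpg |>
  group_by(drv) |>
  summarise(n = n(), cor_displ_hwy = cor(displ, hwy))

overall_cor
cor_by_drv

# A simple model per subgroup: one fitted line for each drive type
p_subgroups <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p_subgroups
```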

Transformations

Transformations can help you see patterns more clearly.
Common transformations include:
- Logarithms
- Logarithms
- Logarithms
- Square roots
- Logarithms
- Standardization (z-scores)
- Logarithms
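
A minimal sketch of the three transformations applied to `total_bill` from `tips` with `mutate()`. The new column names are made up for this example.

```r
library(dplyr)
data(tips, package = "reshape2")

tips_transformed <- tips |>
  mutate(
    # Natural log; requires strictly positive values
    log_total_bill  = log(total_bill),
    # Square root; a milder transformation than the log
    sqrt_total_bill = sqrt(total_bill),
    # z-score: subtract the mean and divide by the standard deviation
    z_total_bill    = (total_bill - mean(total_bill)) / sd(total_bill)
  )

glimpse(tips_transformed)
```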

The power of logs

Power law: when \(y = ax^k\)
- Taking the log of both sides gives: \(\log(y) = \log(a) + k\log(x)\)
- Intuitively: multiplying x by a constant multiplies y by a constant.
Power laws are extremely common in nature:
- Pareto principle (80/20 rule)
- City sizes
- Earthquake magnitudes
- Wealth distributions
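- Bacterial growth (strictly exponential, \(y = ae^{kx}\), rather than a power law, but taking logs still gives a straight line: \(\log(y) = \log(a) + kx\))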
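
The log-log identity can be checked on simulated data. In the sketch below the values of `a`, `k`, the noise level, and the seed are all made up; the point is only that a straight-line fit of `log(y)` on `log(x)` should roughly recover `k` as the slope and `log(a)` as the intercept.

```r
set.seed(2025)

# Simulated power-law data: y = a * x^k with multiplicative noise (hypothetical values)
a <- 3
k <- 1.5
x <- runif(200, min = 1, max = 100)
y <- a * x^k * exp(rnorm(200, sd = 0.1))

# On the log-log scale the relationship is linear:
# log(y) = log(a) + k * log(x)
fit <- lm(log(y) ~ log(x))

# The slope estimates k; exponentiating the intercept estimates a
k_hat <- coef(fit)[["log(x)"]]
a_hat <- exp(coef(fit)[["(Intercept)"]])

c(k_hat = k_hat, a_hat = a_hat)
```

In a plot, the same idea corresponds to putting both axes on a log scale and looking for a straight line.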

Practice

Let’s practice looking for patterns in our data sets.

Wrap-Up

EDA in a nutshell

EDA is about asking questions and using visualizations to find answers.
Start by looking at the variation of each variable.
Then, explore the covariation between variables.
Use ggplot2 to create a wide range of plots to help you understand your data.
DON’T FORGET TO THINK!

Do Next

Read Chapter 10: Exploratory data analysis from r4ds.
Open the Recitation Gem and say “Provide me practice problems for Chapter 10” or work through some of the exercises in the text.
That’s it for today. See you tomorrow!
