Exploratory Data Analysis

Lecture 8

Dr. Eric Friedlander

College of Idaho
CSCI 2025 - Winter 2026

What is EDA?

Exploratory Data Analysis

EDA is an iterative cycle. You:

  1. Generate questions about your data.
  2. Search for answers by visualizing, transforming, and modelling your data.
  3. Use what you learn to refine your questions and/or generate new questions.

Fun Quotes

“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox

“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey

Guiding Questions

Two main types of questions to ask:

  1. What type of variation occurs within my variables?

  2. What type of covariation occurs between my variables?

Data: tips

  • We will use the tips dataset from the reshape2 package (you will likely have to install it).
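
A minimal sketch of getting the data into your session. The CRAN mirror URL and the install-if-missing check are assumptions about your setup, not part of the lecture.

```r
# Assumption: reshape2 may not be installed yet; install it only if it is missing.
if (!requireNamespace("reshape2", quietly = TRUE)) {
  install.packages("reshape2", repos = "https://cloud.r-project.org")
}

# Load only the tips data frame, without attaching the whole package.
data("tips", package = "reshape2")

# Check the structure: one row per restaurant bill.
str(tips)
```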

Variation

Visualizing Distributions

Variation is the tendency of the values of a variable to change from measurement to measurement.

  • Every variable has its own pattern of variation, which can reveal interesting information.
  • The best way to understand that pattern is to visualize the variable’s distribution.

Questions

  • Which values are the most common? Why?

  • Which values are rare? Why? Does that match your expectations?

  • Can you see any unusual patterns? What might explain them?
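
One possible way to start on these questions is to count values directly. This sketch assumes dplyr is installed and uses the tips data; the object names are made up for illustration.

```r
library(dplyr)
data("tips", package = "reshape2")

# Which tip amounts are most common? Sort the counts from largest to smallest.
common_tips <- tips |>
  count(tip, sort = TRUE)
head(common_tips)

# Which party sizes are rare? Sort the counts from smallest to largest.
rare_sizes <- tips |>
  count(size) |>
  arrange(n)
rare_sizes
```

The counts only tell you what is common or rare. The "Why?" still has to come from thinking about how the data were collected.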

Recall: Univariate Plots

  • Categorical variables: bar charts
  • Numerical variables: histograms, boxplots
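
As a reminder, here is a sketch of each plot type using the tips data. The choice of variables and the 50-cent bin width are assumptions made for illustration.

```r
library(ggplot2)
data("tips", package = "reshape2")

# Categorical variable: a bar chart counts the rows in each category.
p_day <- ggplot(tips, aes(x = day)) +
  geom_bar()

# Numerical variable: a histogram bins the values (binwidth is in dollars).
p_tip_hist <- ggplot(tips, aes(x = tip)) +
  geom_histogram(binwidth = 0.5)

# Numerical variable: a boxplot shows the median, quartiles, and unusual points.
p_tip_box <- ggplot(tips, aes(x = tip)) +
  geom_boxplot()

p_day
p_tip_hist
p_tip_box
```

Try several bin widths for the histogram: different widths can reveal different patterns.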

Practice

  • Let’s practice exploring tip sizes.

Unusual Values

Outliers

  • Outliers are observations that are unusual; points that don’t seem to fit the pattern.
  • Sometimes they are data entry errors; other times they are genuinely important.

Handling Outliers

  • If they are data entry errors, try to fix them or remove them.
  • If they are genuine, their presence is important information.
  • Neither conclusion can be reached without understanding the context of the data and looking closely at the individual observations.
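
A hedged sketch of that workflow on the tips data. The 1.5 × IQR fences (the same rule a boxplot uses to draw points) are an assumption about what counts as "unusual", and the replacement step is hypothetical: it would only be appropriate after confirming that the flagged values really are errors.

```r
library(dplyr)
data("tips", package = "reshape2")

# Assumption: "unusual" means more than 1.5 * IQR beyond the quartiles.
q <- quantile(tips$tip, c(0.25, 0.75))
fence_low  <- q[[1]] - 1.5 * IQR(tips$tip)
fence_high <- q[[2]] + 1.5 * IQR(tips$tip)

# Step 1: look at the flagged rows in full before deciding anything.
unusual <- tips |>
  filter(tip < fence_low | tip > fence_high) |>
  arrange(desc(tip))
unusual

# Step 2 (hypothetical, only for confirmed errors): replace the value with NA
# instead of dropping the whole row, so the other columns are kept.
tips_clean <- tips |>
  mutate(tip = if_else(tip < fence_low | tip > fence_high, NA_real_, tip))
```

Replacing a value with NA keeps the rest of the observation available for other analyses, which deleting the row would not.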

More Practice

  • Let’s complete exercise 5
  • Let’s explore some outliers in the tips dataset

Covariation

Data: mpg

  • We will also use the mpg dataset from the ggplot2 package.
  • It contains information about the fuel economy of different car models.
glimpse(mpg)
Rows: 234
Columns: 11
$ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "…
$ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "…
$ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.…
$ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200…
$ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, …
$ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto…
$ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4…
$ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1…
$ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2…
$ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p…
$ class        <chr> "compact", "compact", "compact", "compact", "compact", "c…

Visualizing Relationships

  • Covariation: how do two variables change together? What is their relationship?
  • Visualizing covariation helps you understand relationships between variables
  • You need to think and plan carefully about how to display multiple distributions in a single plot

Recall: Three general cases

  • Categorical + Categorical
    • Bar charts (stacked, dodged, or filled)
  • Categorical + Numerical
    • Boxplots, jitter plots, faceted histograms
  • Numerical + Numerical
    • Scatterplots, jitter plots
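
A sketch of one plot per case using mpg. The variable pairings and jitter amounts are assumptions chosen for illustration; a stacked bar chart is used for the two-categorical case.

```r
library(ggplot2)

# Categorical + categorical: bar chart of class, with bars split by drive type.
p_cat_cat <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar()

# Categorical + numerical: one boxplot of highway mileage per class.
p_cat_num <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_boxplot()

# Numerical + numerical: jittered scatterplot to reduce overplotting.
p_num_num <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_jitter(width = 0.05, height = 0.5)

p_cat_cat
p_cat_num
p_num_num
```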

Reordering Categories

  • You can reorder the categorical variable to make the plot easier to read.
  • What kind of car gets the best gas mileage?
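
One way to do this, sketched with base R's reorder(). Ordering by the median of hwy is an assumption; forcats::fct_reorder() is a common alternative.

```r
library(ggplot2)

# reorder() sorts the levels of class by the median hwy within each class,
# so the boxes run from lowest to highest typical highway mileage.
p_reordered <- ggplot(mpg, aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  geom_boxplot() +
  labs(x = "class (ordered by median hwy)", y = "hwy")

p_reordered
```

With the boxes sorted, the classes with the best typical mileage sit at the right-hand end of the plot.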

Some new plots

  • Categorical + Categorical
    • Mosaic plots, count plots, tile plots
  • Categorical + Numerical
    • Violin plots, sina plots, ridgeline plots, frequency polygons
  • Numerical + Numerical
    • 2D-Histogram/heatmap of counts, hexbin plots
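
A sketch of a few of these using only ggplot2. The variable choices are assumptions. The other plots need extra packages: geom_hex() requires hexbin, sina plots are in ggforce, and ridgeline plots are in ggridges.

```r
library(ggplot2)

# Categorical + categorical: point size encodes how many cars fall in each combination.
p_count <- ggplot(mpg, aes(x = class, y = drv)) +
  geom_count()

# Tile plot: compute the counts first, then map them to fill colour.
counts <- as.data.frame(table(class = mpg$class, drv = mpg$drv))
p_tile <- ggplot(counts, aes(x = class, y = drv, fill = Freq)) +
  geom_tile()

# Categorical + numerical: violins show the shape of each group's distribution.
p_violin <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_violin()

# Frequency polygons: like histograms, but drawn as lines so groups can overlap.
p_freqpoly <- ggplot(mpg, aes(x = hwy, colour = drv)) +
  geom_freqpoly(binwidth = 2)

# Numerical + numerical: 2D histogram, where fill shows the count in each rectangle.
p_bin2d <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_bin_2d()
```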

Practice

  • Let’s practice exploring covariation in our data sets

Patterns

Patterns

  • Patterns in your data hint at relationships. If you can spot a pattern, ask yourself:
    • Could this pattern be due to coincidence?
    • How can you describe the relationship implied by the pattern?
    • How strong is the relationship implied by the pattern?
    • What other variables might affect the relationship?
    • Does the relationship change if you look at individual subgroups of the data?
  • Models can help you extract the strong patterns and leave the weaker ones.
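
A rough sketch of both ideas on mpg: checking subgroups with facets, and using a model to remove a strong pattern so you can look at what is left. The straight-line model of hwy on displ is an assumption made for illustration, not a recommended final model.

```r
library(ggplot2)

# Does the relationship change within subgroups? One panel per drive type.
p_subgroups <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ drv)

# Assumption: a straight line is a reasonable first description of hwy vs. displ.
fit <- lm(hwy ~ displ, data = mpg)

# Residuals are what remains of hwy after the engine-size pattern is removed.
mpg_resid <- mpg
mpg_resid$resid <- residuals(fit)

# Which classes do better or worse than their engine size alone would suggest?
p_resid <- ggplot(mpg_resid, aes(x = class, y = resid)) +
  geom_boxplot()

p_subgroups
p_resid
```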

Transformations

  • Transformations can help you see patterns more clearly.
  • Common transformations include:
    • Logarithms
    • Logarithms
    • Logarithms
    • Square roots
    • Logarithms
    • Standardization (z-scores)
    • Logarithms
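
A sketch of these transformations applied to total_bill in the tips data. The new column names are made up for illustration.

```r
library(dplyr)
data("tips", package = "reshape2")

tips_t <- tips |>
  mutate(
    log_bill  = log(total_bill),   # natural log: compresses large values
    sqrt_bill = sqrt(total_bill),  # square root: a milder compression
    z_bill    = (total_bill - mean(total_bill)) / sd(total_bill)  # z-score
  )

# Compare the original scale with each transformed scale.
summary(tips_t[, c("total_bill", "log_bill", "sqrt_bill", "z_bill")])
```

Logs and square roots only work here because every bill is positive; the z-score column has mean 0 and standard deviation 1 by construction.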

The power of logs

  • Power law: when \(y = ax^k\)
    • Taking the log of both sides gives: \(\log(y) = \log(a) + k\log(x)\)
    • Intuitively: multiplying x by a constant multiplies y by a constant.
  • Power laws are extremely common in nature:
    • Pareto principle (80/20 rule)
    • City sizes
    • Earthquake magnitudes
    • Wealth distributions
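    • Bacteria growth (strictly exponential, \(y = ae^{kx}\), so \(\log(y)\) is linear in \(x\) rather than in \(\log(x)\))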
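
To see why taking logs helps, here is a small simulation sketch. The values a = 3 and k = 1.5, the sample size, and the noise level are all made up for illustration. Fitting a straight line on the log-log scale should roughly recover k as the slope and log(a) as the intercept.

```r
set.seed(2025)

# Hypothetical power law y = a * x^k with multiplicative noise.
a <- 3
k <- 1.5
x <- runif(200, min = 1, max = 100)
y <- a * x^k * exp(rnorm(200, sd = 0.1))

# On the log-log scale the relationship is a straight line:
# log(y) = log(a) + k * log(x)
fit <- lm(log(y) ~ log(x))

k_hat <- coef(fit)[["log(x)"]]            # estimate of k (the slope)
a_hat <- exp(coef(fit)[["(Intercept)"]])  # estimate of a (exp of the intercept)
c(k_hat = k_hat, a_hat = a_hat)

# A scatterplot of log(y) against log(x) should look roughly linear.
plot(log(x), log(y))
abline(fit)
```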

Practice

  • Let’s practice looking for patterns in our data sets.

Wrap-Up

EDA in a nutshell

  • EDA is about asking questions and using visualizations to find answers.
  • Start by looking at the variation of each variable.
  • Then, explore the covariation between variables.
  • Use ggplot2 to create a wide range of plots to help you understand your data.
  • DON’T FORGET TO THINK!

Do Next

  1. Read Chapter 10: Exploratory data analysis from r4ds.
  2. Open the Recitation Gem and say “Provide me practice problems for Chapter 10” or work through some of the exercises in the text.
  3. That’s it for today. See you tomorrow!