Data science lifecycle & Exploratory data analysis using visualization

Research Beyond the Lab: Open Science and Research Methods for a Global Engineer

Lars Schöbitz

Feb 29, 2024

Solving coding problems

Tips for search engines

  • Use actionable verbs that describe what you want to do
  • Be specific
  • Add R to the search query
  • Add the name of the R package to the search query
  • Scroll through the top 5 results (don’t just pick the first)

Example: “How to remove a legend from a plot in R ggplot2”
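A search like the one in the example typically leads to an answer along the following lines. This is a minimal sketch, not part of the original slides: it assumes the ggplot2 package is installed and uses the `mpg` data set that ships with ggplot2 purely as example data.

```r
library(ggplot2)

# mpg ships with ggplot2; it is used here only as example data
p <- ggplot(data = mpg,
            mapping = aes(x = displ, y = hwy, color = class)) +
  geom_point()

# theme(legend.position = "none") removes all legends from the plot
p_no_legend <- p + theme(legend.position = "none")
p_no_legend
```

Note how the search terms map onto the answer: the verb ("remove"), the object ("legend"), and the package ("ggplot2") all appear in the solution.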

Stack Overflow

What is it?

  • The biggest support network for (coding) problems
  • Can be intimidating at first
  • Up-vote system


  • First, briefly read the question that was posted
  • Then, read the answer marked as “correct”
  • Then, read one or two more answers with high votes
  • Then, check out the “Linked” posts
  • Always give credit for the solution

Tips for AI tools

  • Use actionable verbs that describe what you want to do
  • Be specific
  • Add R to the prompt
  • Add the name of the R package to the prompt

Example: “How to remove a legend from a plot in R ggplot2”

Other sources for help

Interaction with GitHub

on GitHub Organisation

  • Search for your username in the search bar under Repositories

on your repository

on Posit Cloud

Your turn: Introduce yourself to git
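One common way to introduce yourself to git from within R is the usethis package. The sketch below assumes usethis is installed; the name and email are placeholders, and the exercise in class may use a different route (for example the terminal).

```r
# install.packages("usethis")  # run once if usethis is not yet installed
library(usethis)

# Placeholder values: replace with your own name and the email
# address linked to your GitHub account
use_git_config(user.name = "Jane Doe",
               user.email = "jane.doe@example.org")

# Print a report of the git and GitHub settings to check the result
git_sitrep()
```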

Version Control - Terminology

remember: git commit

remember: git push

collaborate: git clone

track work: git commit

update: git ???

update: git push

git ???

get updates: git pull

Learning Objectives (for this week)

  1. Learners can identify four components of a Quarto file (YAML, code chunk, R code, markdown).
  2. Learners can list the six elements of the data science lifecycle.
  3. Learners can describe the four main aesthetic mappings that can be used to visualise data using the ggplot2 R Package.
  4. Learners can control the colour scaling applied to a plot using colour as an aesthetic mapping.
  5. Learners can compare three different geoms (bar/col, histogram, point) and their use case.

Data Science Lifecycle

Deep End

Exploratory Data Analysis with ggplot2

R Package ggplot2

  • ggplot2 is the tidyverse’s data visualization package
  • gg in ggplot2 stands for Grammar of Graphics
  • Inspired by the book Grammar of Graphics by Leland Wilkinson
  • Documentation:
  • Book:

My turn: Working with R

Code structure

  • ggplot() is the main function in ggplot2
  • Plots are constructed in layers
  • The structure of the code for plots can be summarized as:
ggplot(data = [dataset], 
       mapping = aes(x = [x-variable], 
                     y = [y-variable])) +
  geom_xxx() +
  other options

Code structure

ggplot(data = gapminder)

Code structure

ggplot(data = gapminder,
       mapping = aes()) 

Code structure

ggplot(data = gapminder,
       mapping = aes(x = continent,
                     y = lifeExp))  

Code structure

ggplot(data = gapminder,
       mapping = aes(x = continent,
                     y = lifeExp)) +  # "+" adds the next layer
  geom_boxplot()                      # one boxplot per continent
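As a sketch of where the "other options" from the code template go, the example below adds axis labels and a theme as further layers. It assumes the ggplot2 and gapminder packages are installed; the label text and the choice of `theme_minimal()` are illustrative and not taken from the slides.

```r
library(ggplot2)
library(gapminder)  # assumption: provides the gapminder data frame

p <- ggplot(data = gapminder,
            mapping = aes(x = continent,
                          y = lifeExp)) +
  geom_boxplot() +                        # geometry layer
  labs(x = "Continent",                   # axis labels (illustrative)
       y = "Life expectancy (years)") +
  theme_minimal()                         # overall look of the plot

p
```

Each layer is joined to the previous one with `+`, and the last layer has no trailing `+`.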


Poll 1: What does the thick line inside the box of a boxplot represent?

  1. I don’t know
  2. the mean of the observations
  3. the middle of the box
  4. the median of the observations

Poll 2: What percentage of observations are contained inside the box of a boxplot (interquartile range)?

  1. I don’t know
  2. 25%
  3. depends on the median
  4. 50%

Poll 3: What is the median of a set of observations?

  1. I don’t know
  2. The median is the most frequently occurring value in a dataset.
  3. The median is the sum of all values in a dataset divided by the number of observations.
  4. The median is the point above and below which half (50%) of the observations falls.

Poll 4: If you have the values: 1, 2, 3, and 10: which statistical measure best represents the “true” value?

  1. I don’t know
  2. The maximum
  3. The standard deviation
  4. The median

Boxplot, explained

Figure 1: Diagram depicting how a boxplot is created.
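The statistics behind a boxplot can also be computed directly in base R. The sketch below uses the four values from Poll 4; the variable name `values` is chosen for illustration.

```r
values <- c(1, 2, 3, 10)

# Median: half of the observations lie below it, half above
median(values)  # 2.5

# Mean: pulled upwards by the single large value
mean(values)    # 4

# Lower and upper quartile: the edges of the box,
# which together contain the middle 50% of the observations
quantile(values, probs = c(0.25, 0.75))

# Interquartile range: the height of the box
IQR(values)
```

With one unusually large value in the data, the median stays close to the bulk of the observations while the mean does not.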

Our turn: md-02-exercises

Visualizing data

Types of variables


discrete variables

  • non-negative
  • whole numbers
  • e.g. number of students, roll of a die

continuous variables

  • infinite number of values
  • also dates and times
  • e.g. length, weight, size


categorical variables

  • finite number of values
  • distinct groups (e.g. EU countries, continents)
  • ordinal if levels have natural ordering (e.g. week days, school grades)
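The variable types above correspond roughly to data types in R. The following base R sketch uses made-up example values to show one possible representation of each type.

```r
# Discrete: whole-number counts, stored as integers (note the L suffix)
n_students <- c(24L, 31L, 18L)

# Continuous: measurements that can take any value within a range
weight_kg <- c(3.75, 4.2, 5.05)

# Categorical: a finite set of distinct groups, stored as a factor
continent <- factor(c("Asia", "Europe", "Asia"))

# Ordinal: a categorical variable whose levels have a natural ordering
weekday <- factor(c("Tue", "Mon", "Wed"),
                  levels = c("Mon", "Tue", "Wed"),
                  ordered = TRUE)

class(n_students)  # "integer"
class(weight_kg)   # "numeric"
levels(continent)  # "Asia" "Europe"
weekday < "Wed"    # TRUE TRUE FALSE
```

Only the ordered factor supports comparisons such as `<`, because only its levels carry an order.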


  • for visualizing distribution of continuous (numerical) variables
ggplot(data = penguins,
       mapping = aes(x = body_mass_g)) +
  geom_histogram()


  • for visualizing distribution of categorical (non-numerical) variables
ggplot(data = penguins,
       mapping = aes(x = species)) +
  geom_bar()


  • for visualizing relationships between two continuous (numerical) variables
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap,
                     y = lifeExp,
                     size = pop,
                     color = continent)) +
  geom_point() +
  scale_color_colorblind()  # from the ggthemes package
Thanks! 🌻

