Data science lifecycle & Exploratory data analysis using visualization

Research Beyond the Lab: Open Science and Research Methods for a Global Engineer

Lars Schöbitz

Feb 29, 2024

Email from GitHub?

While we are getting ready, please check for this email from GitHub and accept the invitation to join the GitHub organisation for the course. Used Gmail to sign up? Check the folders that aren’t your primary inbox (e.g. Updates).

Solving coding problems

Tips for search engines

  • Use actionable verbs that describe what you want to do
  • Be specific
  • Add R to the search query
  • Add the name of the R package to the search query
  • Scroll through the top 5 results (don’t just pick the first)

Example: “How to remove a legend from a plot in R ggplot2”
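One answer such a search typically leads to is sketched below. It is a minimal example, not the only way to do it: the `mpg` dataset (shipped with ggplot2) and the object names `p` and `p_no_legend` are chosen here for illustration only.

```r
library(ggplot2)

# Mapping a variable to colour makes ggplot2 draw a legend automatically.
p <- ggplot(data = mpg,
            mapping = aes(x = displ,
                          y = hwy,
                          color = class)) +
  geom_point()

# Setting legend.position to "none" in theme() removes all legends.
p_no_legend <- p +
  theme(legend.position = "none")

p_no_legend
```

Note how the search query already contained the verb ("remove"), the object ("legend"), the language (R) and the package (ggplot2).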

Stack Overflow

What is it?

  • The biggest support network for (coding) problems
  • Can be intimidating at first
  • Up-vote system

Workflow

  • First, briefly read the question that was posted
  • Then, read the answer marked as “correct”
  • Then, read one or two more answers with high votes
  • Then, check out the “Linked” posts
  • Always give credit for the solution

Tips for AI tools

  • Use actionable verbs that describe what you want to do
  • Be specific
  • Add R to the prompt
  • Add the name of the R package to the prompt

Example: “How to remove a legend from a plot in R ggplot2”

Other sources for help

Interaction with GitHub

open GitHub organisation

Bookmark this link in your browser!

github.com/rbtl-fs24

on GitHub organisation

  • Search for your username in search bar under Repositories

on your repository

on Posit Cloud

Bookmark this link in your browser!

Your turn: Introduce yourself to git

  1. Open a web browser on your laptop.
  2. Navigate to the course website: rbtl-fs24.github.io/website/
  3. If you haven’t yet, bookmark the course website
  4. In the left-hand menu, click on Module 2, then select am-01: Git configuration
  5. Follow the instructions
  6. Place a yellow sticky note on your laptop when you have completed the assignment
10:00

Version Control - Terminology

remember: git commit

remember: git push

collaborate: git clone

track work: git commit

update: git ???

update: git push

git ???

get updates: git pull

Learning Objectives (for this week)

  1. Learners can identify four components of a Quarto file (YAML, code chunk, R code, markdown).
  2. Learners can list the six elements of the data science lifecycle.
  3. Learners can describe the four main aesthetic mappings that can be used to visualise data using the ggplot2 R Package.
  4. Learners can control the colour scaling applied to a plot using colour as an aesthetic mapping.
  5. Learners can compare three different geoms (bar/col, histogram, point) and their use case.

Data Science Lifecycle

Deep End

Exploratory Data Analysis with ggplot2

R Package ggplot2

  • ggplot2 is the tidyverse’s data visualization package
  • gg in ggplot2 stands for Grammar of Graphics
  • Inspired by the book The Grammar of Graphics by Leland Wilkinson
  • Documentation: https://ggplot2.tidyverse.org/
  • Book: https://ggplot2-book.org

My turn: Working with R



Sit back and enjoy!

Take a break

Please get up and move! Let your emails rest in peace.

10:00

Code structure

  • ggplot() is the main function in ggplot2
  • Plots are constructed in layers
  • Structure of the code for plots can be summarized as
ggplot(data = [dataset], 
       mapping = aes(x = [x-variable], 
                     y = [y-variable])) +
  geom_xxx() +
  other options

Code structure

ggplot()

Code structure

ggplot(data = gapminder)

Code structure

ggplot(data = gapminder,
       mapping = aes()) 

Code structure

ggplot(data = gapminder,
       mapping = aes(x = continent,
                     y = lifeExp))  

Code structure

ggplot(data = gapminder,
       mapping = aes(x = continent,
                     y = lifeExp)) +
  geom_boxplot() 

Code structure

ggplot(data = gapminder,
       mapping = aes(x = continent,
                     y = lifeExp)) +
  geom_boxplot() +
  theme_minimal()

Polls

Poll 1: What does the thick line inside the box of a boxplot represent?

  1. I don’t know
  2. the mean of the observations
  3. the middle of the box
  4. the median of the observations

Poll 2: What percentage of observations are contained inside the box of a boxplot (interquartile range)?

  1. I don’t know
  2. 25%
  3. depends on the median
  4. 50%

Poll 3: What is the median of a set of observations?

  1. I don’t know
  2. The median is the most frequently occurring value in a dataset.
  3. The median is the sum of all values in a dataset divided by the number of observations.
  4. The median is the point above and below which half (50%) of the observations falls.

Poll 4: If you have the values: 1, 2, 3, and 10: which statistical measure best represents the “true” value?

  1. I don’t know
  2. The maximum
  3. The standard deviation
  4. The median
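The values from Poll 4 can be checked directly in R. This is a small sketch; the second vector `x_extreme` is made up here to show what happens when the largest value grows.

```r
# Values from Poll 4
x <- c(1, 2, 3, 10)

mean_x <- mean(x)      # (1 + 2 + 3 + 10) / 4 = 4
median_x <- median(x)  # midpoint between 2 and 3 = 2.5

# Hypothetical variation: make the largest value even more extreme.
x_extreme <- c(1, 2, 3, 100)

mean_extreme <- mean(x_extreme)      # 26.5, pulled up by the single large value
median_extreme <- median(x_extreme)  # still 2.5

print(c(mean = mean_x, median = median_x))
print(c(mean = mean_extreme, median = median_extreme))
```

The mean follows the extreme value, while the median does not move.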

Boxplot, explained

Figure 1: Diagram depicting how a boxplot is created.
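The parts of a boxplot can also be computed by hand. The sketch below assumes the convention that geom_boxplot() uses by default (whiskers reach to the most extreme value within 1.5 times the interquartile range of the box); the vector `x` and all object names are made up for illustration.

```r
# Hypothetical observations, already sorted for readability
x <- c(2, 4, 4, 5, 7, 9, 10, 12, 30)

# Box: first quartile, median, third quartile
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
iqr <- IQR(x)  # height of the box: Q3 - Q1

# Fences: values beyond these are drawn as individual points
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
inside <- x[x >= lower_fence & x <= upper_fence]
whiskers <- range(inside)

# Everything outside the fences is shown as an outlier
outliers <- x[x < lower_fence | x > upper_fence]

print(q)
print(whiskers)
print(outliers)
```

For this vector the box spans 4 to 10 with the median at 7, the whiskers end at 2 and 12, and 30 is drawn as a single point.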

Our turn: md-02-exercises

  1. Open posit.cloud in your browser (use your bookmark).
  2. Open the rbtl-fs24 workspace for the course.
  3. Click Start next to md-02-exercises.
  4. In the File Manager in the bottom right window, locate the md-02b-data-visualization.qmd file and click on it to open it in the top left window.
20:00

Take a break

Please get up and move! Let your emails rest in peace.

10:00

Visualizing data

Types of variables

numerical

discrete variables

  • non-negative
  • whole numbers
  • e.g. number of students, roll of a die

continuous variables

  • infinite number of values
  • also dates and times
  • e.g. length, weight, size

non-numerical

categorical variables

  • finite number of values
  • distinct groups (e.g. EU countries, continents)
  • ordinal if levels have natural ordering (e.g. weekdays, school grades)
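In R, these variable types roughly correspond to different vector types. The sketch below is one possible mapping, not a strict rule; all values and object names are made up for illustration.

```r
# Discrete: whole numbers, stored as integer (note the L suffix)
n_students <- c(12L, 30L, 25L)

# Continuous: any value on a scale, stored as double
weight_kg <- c(3.75, 4.2, 5.05)

# Categorical: a finite set of distinct groups, stored as factor
continent <- factor(c("Asia", "Europe", "Asia"))

# Ordinal: a factor whose levels have a natural order
grade <- factor(c("low", "high", "medium"),
                levels = c("low", "medium", "high"),
                ordered = TRUE)

print(class(n_students))
print(class(weight_kg))
print(levels(continent))
print(grade[1] < grade[2])  # comparisons only work for ordered factors
```

Knowing the type matters for plotting, because ggplot2 picks scales and legends based on it.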

Histogram

  • for visualizing distribution of continuous (numerical) variables
library(ggplot2)
library(palmerpenguins)  # provides the penguins dataset

ggplot(data = penguins,
       mapping = aes(x = body_mass_g)) +
  geom_histogram()
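By default, geom_histogram() splits the data into 30 bins and prints a message suggesting a better value. A minimal sketch of setting the bin width explicitly follows; the value 250 (grams) and the name `p_hist` are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# binwidth is given in the units of the x variable (here: grams).
p_hist <- ggplot(data = penguins,
                 mapping = aes(x = body_mass_g)) +
  geom_histogram(binwidth = 250)

# Rows with a missing body mass are dropped with a warning when drawn.
p_hist
```

Trying a few different bin widths is worthwhile, since the shape of a histogram can change noticeably with this choice.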

Barplot

  • for visualizing distribution of categorical (non-numerical) variables
ggplot(data = penguins,
       mapping = aes(x = species)) +
  geom_bar()
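geom_bar() counts the rows per category for you. If the counts already exist as a column, geom_col() is the matching geom. The sketch below builds such a table with base R; `species_counts` and `p_col` are names chosen here for illustration.

```r
library(ggplot2)
library(palmerpenguins)

# Count penguins per species; the result has the columns species and Freq.
species_counts <- as.data.frame(table(species = penguins$species))
print(species_counts)

# geom_col() takes the bar height from the y aesthetic instead of counting.
p_col <- ggplot(data = species_counts,
                mapping = aes(x = species,
                              y = Freq)) +
  geom_col()

p_col
```

Both approaches should give the same bars; which one to use depends on whether the data is raw or already summarised.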

Scatterplot

  • for visualizing relationships between two continuous (numerical) variables
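library(ggplot2)
library(ggthemes)  # provides scale_color_colorblind()

# gapminder_2007: the gapminder data for the year 2007 only
ggplot(data = gapminder_2007,
       mapping = aes(x = gdpPercap,
                     y = lifeExp,
                     size = pop,
                     color = continent)) +
  geom_point() +
  scale_color_colorblind() +
  theme_minimal()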
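The slide does not show how gapminder_2007 is created. The sketch below assumes it is simply the gapminder data filtered to the year 2007. It also swaps in scale_color_viridis_d(), a colour scale that ships with ggplot2, and adds a log-scaled x-axis and some transparency; these three choices are variations for illustration, not part of the original plot.

```r
library(ggplot2)
library(gapminder)

# Assumption: keep only the rows for 2007 (one row per country).
gapminder_2007 <- subset(gapminder, year == 2007)

p_scatter <- ggplot(data = gapminder_2007,
                    mapping = aes(x = gdpPercap,
                                  y = lifeExp,
                                  size = pop,
                                  color = continent)) +
  geom_point(alpha = 0.7) +   # transparency helps with overlapping points
  scale_x_log10() +           # spreads out the crowded low-GDP region
  scale_color_viridis_d() +   # discrete colour scale built into ggplot2
  theme_minimal()

p_scatter
```

Changing the scale_color_*() layer is how the colour scaling of a plot is controlled once colour is used as an aesthetic mapping.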

Your turn: md-02-exercises

  1. Open posit.cloud in your browser (use your bookmark).
  2. Open the rbtl-fs24 workspace for the course.
  3. In the File Manager in the bottom right window, locate the md-02c-make-a-plot.qmd file and click on it to open it in the top left window.
  4. Follow instructions in the file
15:00

Homework assignments module 2

Module 2 documentation

Homework due date

  • Homework assignment due: Wednesday, March 6th
  • Correction & feedback phase up to: Tuesday, March 12th

Wrap-up

Thanks! 🌻

Slides created via revealjs and Quarto: https://quarto.org/docs/presentations/revealjs/

Access slides as PDF on GitHub

All material is licensed under Creative Commons Attribution Share Alike 4.0 International.