SciPy US 2021

5 minute read

Here is the recording of the workshop.

Here is the link to the repo with the setup instructions for the workshop.

Here is the proposal I submitted for this workshop.

Title

“(D)Ask me Anything About Data Analytics at Scale”

Description

Exploratory data analysis and statistical modeling are two crucial components of the data analytics cycle, and while these might be “more easily” done with small(ish) datasets, the story changes as the amount of data we want to analyze increases. With this in mind, the goal of this tutorial is to get participants comfortable with exploring, analyzing, modeling, and visualizing large datasets that don’t fit in memory.

Throughout the tutorial, participants will learn how to analyze large datasets using dask, a Python package built for data analytics at scale, and how to create data visualizations and a dashboard using datashader, panel, and holoviews. This tutorial uses a top-down approach: each section starts with the punchline (i.e. the results), and we then work our way backward with clear explanations of how and why each question was asked, each function was created, and each choice was made. One of the benefits of this top-down approach is that participants will be able to take what they learn throughout the session and use it to analyze the large datasets, and reverse engineer the results, that they might encounter in the wild.
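To give a small taste of the kind of workflow described above, here is a minimal sketch (not taken from the tutorial materials) of a lazy, out-of-core aggregation with a Dask DataFrame; the file pattern and column names are hypothetical placeholders.

```python
# A minimal sketch of out-of-core analysis with a Dask DataFrame.
# The file pattern and column names are hypothetical placeholders.
import dask.dataframe as dd

# Lazily point at many CSV files as one logical DataFrame; nothing is read yet.
df = dd.read_csv("data/trips_*.csv")

# Build a lazy aggregation: mean trip distance per pickup hour.
hourly_mean = (
    df.assign(hour=dd.to_datetime(df["pickup_datetime"]).dt.hour)
      .groupby("hour")["trip_distance"]
      .mean()
)

# Only at compute() time does Dask stream the data through memory,
# one partition at a time, and return a small in-memory result.
print(hourly_mean.compute())
```

Nothing is read from disk until `compute()` is called, which is what lets the same few lines run over datasets much larger than memory.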

This tutorial follows the session “(D)Ask me Anything about Munging, Preprocessing, and Wrangling Data at Scale” from SciPy Japan 2020.

The size of datasets in diverse industries keeps increasing by the minute and, luckily for data professionals using Python as their day-to-day tool, there is Dask, a library for large-scale analytics and data preprocessing. This tutorial takes advantage of Dask and other libraries to teach you how to tackle data analysis at scale, covering techniques for data exploration, hypothesis testing, and dashboard creation using large datasets. In addition, this tutorial uses a top-down approach, meaning that in each of the major sections you will start with the results and work your way backwards through the large-scale data analytics process.

Being able to analyze large quantities of data is useful for data professionals at all levels and across different industries: business analysts wanting to showcase some of the insights they have uncovered, developers wanting to explain the most important metrics stakeholders should pay attention to in their big data applications, and data scientists wanting to explain the results of their experiments, models, and the like. With this in mind, the goal of this tutorial is to teach data professionals from diverse fields and at diverse levels how to analyze large datasets (data that does not fit in memory) using their own machines.

This tutorial follows a top-down approach in all three of its parts. This means that each time block starts with the punchline (i.e. the results), and we will then work our way backwards with clear explanations of how and why each question was asked, each function was created, and each choice was made. One of the added benefits of this approach is that participants will be able to take what they learn throughout the session and use it to analyze the datasets, and reverse engineer the results, that they might encounter in the wild.

Audience

The target audience for this session includes analysts of all levels, developers, data scientists, and engineers wanting to learn how to analyze large amounts of data that don’t fit into the memory of their machines.

Prerequisites (P) and Good To Have’s (GTH)

This tutorial is at the intermediate level and requires that participants have at least 1 year of experience coding in Python. The following are some of the Prerequisites (P) and Good To Have’s (GTH):

  • (P) Attendees for this tutorial are expected to be familiar with Python (1 year of coding).
  • (P) Participants should be comfortable with loops, functions, list comprehensions, and if-else statements.
  • (GTH) While it is not necessary to have knowledge of dask, pandas, NumPy, datashader, and Holoviews, a bit of experience with these libraries would be very beneficial throughout this tutorial.
  • (P) Participants should have at least 6 GB of free memory on their computers.
  • (GTH) While it is not required to have experience with an integrated development environment like Jupyter Lab, this would be very beneficial for the session as it is the tool we will be using all throughout.

Outline

    • The total time budgeted, including breaks, is ~4 hours

      1. Introduction and Setup (~35 minutes)
        • Environment setup. We will be using Jupyter Lab all throughout the session, but participants experiencing difficulties will also have the option to walk through the tutorial with a smaller dataset using Binder or Google Colab
        • Quick breakdown of the entire session
        • Instructor introduction
        • Intro to the dataset
        • Main Questions/Hypotheses to be explored during the Session
        • How to approach large datasets for statistical analysis?
      2. 5-minute break
      3. Exploratory Data Analysis at Scale (~50 minutes)
        • Intro to the dataset
        • Fact-checking - Questions to explore as well as their answers
        • Breakdown of the functions used to get the results
        • Breakdown of how Dask dataframes and arrays helped us answer the exploratory questions posed
        • Exercises (10-min). For these exercises, participants will be given three exploratory data analysis-related questions to tackle with custom functions and Dask dataframes and/or arrays
      4. 10-minute break
      5. Hypothesis Testing (~50 minutes)
        • Hypotheses testing results
        • Walk-through of statistical models chosen
        • Walk-through of Hypothesis Testing and the functions used in Dask
        • Exercises (10-min). For these exercises, participants will be given 3 approaches to choose from for validating the last hypothesis posed at the beginning of the session (a minimal sketch of this reduce-then-test pattern appears after the outline)
      6. 10-minute break
      7. Interactive Dashboard Creation (~50 minutes)
        • Interactive dashboard creation and customization
        • Datashader and Holoviews overview
        • Dashboard and main functions breakdown
        • Exercises (10-min). For these exercises, participants will be given a dashboard with 3-5 visualizations as well as the dataset, and their task is to reverse engineer any of the visualizations in the dashboard using Dask, datashader, and holoviews (a minimal sketch of this visualization stack appears below)
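For readers curious what hypothesis testing at scale can look like in practice, here is a minimal sketch (not taken from the tutorial materials) of the general pattern: reduce a larger-than-memory dataset to summary statistics with Dask, then run an ordinary SciPy test on the small result. The file and column names are hypothetical placeholders.

```python
# A minimal sketch of hypothesis testing on larger-than-memory data:
# reduce with Dask, then test on the small aggregated result.
# File and column names are hypothetical placeholders.
import dask
import dask.dataframe as dd
from scipy import stats

df = dd.read_parquet("data/trips.parquet")

# Lazily split one numeric column by a binary condition.
weekday = df[df["is_weekend"] == 0]["trip_distance"]
weekend = df[df["is_weekend"] == 1]["trip_distance"]

# Reduce the two big columns to summary statistics in a single pass.
m1, s1, n1, m2, s2, n2 = dask.compute(
    weekday.mean(), weekday.std(), weekday.count(),
    weekend.mean(), weekend.std(), weekend.count(),
)

# Welch's t-test computed from the aggregated statistics only.
t, p = stats.ttest_ind_from_stats(m1, s1, n1, m2, s2, n2, equal_var=False)
print(f"t = {t:.3f}, p = {p:.3g}")
```

The raw data never has to fit in memory; only the handful of aggregates does.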
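And as a taste of the visualization stack used in the last section, the following is a minimal sketch, assuming a large point dataset, of rasterizing data with Datashader through HoloViews and serving it in a Panel layout. Again, the file and column names are hypothetical placeholders, not the tutorial's actual dashboard.

```python
# A minimal sketch of the HoloViews + Datashader + Panel stack.
# File and column names are hypothetical placeholders.
import dask.dataframe as dd
import holoviews as hv
import panel as pn
from holoviews.operation.datashader import rasterize

hv.extension("bokeh")

df = dd.read_parquet("data/trips.parquet")

# Rasterize millions of points on the server instead of sending them to the browser.
points = hv.Points(df, kdims=["dropoff_x", "dropoff_y"])
density = rasterize(points).opts(width=700, height=500, cmap="viridis")

# Wrap the plot in a simple Panel layout; serve with `panel serve <script>.py`.
dashboard = pn.Column("## Trip drop-off density", density)
dashboard.servable()
```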

Additional Notes

I work as an educator, researcher, and data scientist, and have taught Python to hundreds of students with backgrounds ranging from complete beginner to advanced user. My lessons are full of metaphors, quotes, funny pictures, and exercises as I love to make sure students leave my sessions with at least a laugh, a new concept learned, a new Python trick, or all of the above.

I have done several short tutorials at meetups and other venues on a variety of topics within the space of data analytics and Python programming. Most recently, I taught one of the 3.5-hour tutorials at SciPy Japan 2020 using a bottom-up approach, and now I am excited to do another session using the reverse approach.
