PyCon APAC 2021
Data Pipelines 4 All
Here is the link to the repo with the setup instructions for the workshop.
Description
Pipelines are useful tools for data professionals at all levels and across industries: analysts who want to automate their analyses, data engineers building extract, transform, and load (ETL) pipelines, and data scientists whose models require a series of steps to be applied to the data before making a prediction (e.g., tokenization, scaling, or one of the many feature engineering techniques available). With this in mind, the goal of this tutorial is to help data professionals from diverse fields and at diverse levels build pipelines that can move and transform data as well as make useful predictions given different sets of inputs.
The tutorial will emphasize both methodology and frameworks through a top-down approach. The open source libraries covered include Prefect, MLflow, scikit-learn, XGBoost, FastAPI, pandas, and the HoloViz suite of libraries. In addition, the tutorial covers important concepts in data engineering, data analytics, and machine learning. Participants will also learn concepts from the fields the datasets come from, and build a foundation for reverse engineering data pipelines and other processes they find in the wild.
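As a rough illustration of the idea described above (a series of steps applied to data, each feeding the next), here is a minimal sketch in plain Python. It does not use Prefect, scikit-learn, or any of the other libraries covered in the tutorial, and the function names (`extract`, `transform`, `scale`, `run_pipeline`) and the sample rows are hypothetical, chosen only for this example.

```python
from functools import reduce
from statistics import mean, pstdev


def extract(rows):
    """Extract: split raw comma-separated text rows into (name, value) pairs."""
    return [tuple(row.split(",")) for row in rows]


def transform(records):
    """Transform: drop records with a missing value and cast the rest to float."""
    return [(name.strip(), float(value)) for name, value in records if value.strip()]


def scale(records):
    """Feature engineering: standardize values to zero mean and unit variance.

    Assumes at least two distinct values, so the standard deviation is non-zero.
    """
    values = [value for _, value in records]
    mu, sigma = mean(values), pstdev(values)
    return [(name, (value - mu) / sigma) for name, value in records]


def run_pipeline(data, steps):
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda acc, step: step(acc), steps, data)


# Hypothetical input: the third row has a missing value and is dropped.
raw_rows = ["a, 1.0", "b, 3.0", "c, ", "d, 5.0"]
result = run_pipeline(raw_rows, [extract, transform, scale])
print(result)
```

Keeping each step as a small function that takes data in and returns data out is what makes the steps easy to reorder, test in isolation, or hand over to a workflow framework later.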
Audience
The target audience for this session includes analysts of all levels, developers, data scientists, and engineers who want to learn how to create data pipelines for their work.
Format
The tutorial has a setup section, three major lessons of 50 minutes each, and two 10-minute breaks, one at the end of lesson 1 and one at the end of lesson 2. In addition, each of the three major lessons contains exercises designed to help participants solidify the content taught.
Prerequisites (P) and Good-to-Haves (GTH)
- (P) Attendees for this tutorial are expected to be familiar with Python (1 year of coding).
- (P) Participants should be comfortable with loops, functions, list comprehensions, and if-else statements.
- (GTH) While it is not necessary to have knowledge of Prefect, MLflow, scikit-learn, XGBoost, FastAPI, pandas, and the HoloViz suite of libraries, a bit of experience with these libraries would be very beneficial throughout this tutorial.
- (P) Participants should have at least 5 GB of free space on their computers.
- (GTH) While experience with an interactive development environment like JupyterLab is not required, it would be very beneficial for the session.
Outline
Total time budgeted including breaks - 3.5 hours
- Introduction and Setup (~10 minutes)
  - Getting the environment set up. We will be using JupyterLab throughout, but participants who experience difficulties during the session will also have the option to walk through the tutorial using Binder.
  - Quick breakdown of the session
  - Flash instructor intro
- Data Engineering Pipelines (~40 minutes)
  - Intro to the datasets
  - ETL Pipeline Breakdown
  - Exercise (7-min)
- 10-minute break
- Data Analytics Pipelines (~50 minutes)
  - Intro to the dataset
  - Interactive dashboard creation and customization
  - Dashboard, main functions breakdown, and pipeline creation
  - Exercise (7-min)
- 10-minute break
- Machine Learning Pipelines (~50 minutes)
  - Intro to the dataset
  - ML Pipelines breakdown
  - Model development
  - Exercise (7-min)
Additional Notes
I work as an educator, researcher, and data scientist, and have taught Python to hundreds of students with backgrounds ranging from complete beginner to advanced. My lessons are full of metaphors, quotes, funny pictures, and exercises to make sure students leave my sessions with at least a laugh, a new concept learned, a new Python trick, or all of the above.
I have given several short tutorials at meetups on a variety of topics within data analytics and Python programming. Most recently, I taught one of the 3-hour tutorials at SciPy Japan 2020 using a bottom-up approach, and I am excited about the opportunity to do another session with the reverse, top-down approach.