PyCon South Africa 2022

October 12, 2022 10:30
python data devops data-engineering

Bridging the Gap Between DevOps and Data Science

Here is the link to the repo with the setup instructions for the workshop.

Elevator Pitch

If you’ve ever wondered how Data Professionals and DevOps Engineers could adopt each other’s best practices in their common language, Python, then this workshop is for you. The goal is to teach you the best automation recipes from each discipline and, ultimately, make you a better programmer.

Abstract

Programmers, regardless of their level of experience, enjoy solving increasingly complex challenges within their domain of expertise, and one of the main reasons they can spend more time on such challenges is the automation recipes they have built around their workflows. Data Analysts, Engineers and Scientists automate the initial steps of inspecting, cleaning, and analysing new data sets, while DevOps Engineers automate everything from the filesystem to the infrastructure of software products. These two groups of data and engineering professionals are not that foreign to each other, as they all speak the same language: Python. The goal of this workshop is therefore to bring the automation recipes from the data world into the DevOps world and vice versa. In other words, we will take the slang and abbreviations both groups use in code and create a dialect that welcomes both newcomers and experts in either field.

In this workshop, we’ll cover four major automation recipes from the Analytics and DevOps worlds to make you a more efficient and systematic programmer. Each section will last about 45 minutes, and the topics range from automating extract, transform, and load (ETL) pipelines to creating your own internal platform tools. By the end of the workshop, you will be able to speak some DevOps to your data professional colleagues and some analytics to your engineering team (slang words included).

Audience

The target audience for this session includes analysts of all levels, developers, data scientists, and engineers who want to learn automation and best-practice recipes that increase their productivity with Python and as programmers in general.

Format

The tutorial has a setup section, four major lessons of ~45 minutes each, and three breaks of 10 minutes each. In addition, each of the major lessons includes time for exercises designed to help solidify the content taught throughout the workshop.

Prerequisites (P) and Good To Have’s (GTH)

  • (P) Attendees for this tutorial are expected to be familiar with Python (1 year of coding).
  • (P) Participants should be comfortable with loops, functions, list comprehensions, and if-else statements.
  • (GTH) While no knowledge of data analytics libraries is required, a bit of experience with pandas, NumPy, matplotlib and scikit-learn would be very beneficial throughout this tutorial.
  • (P) Participants should have at least 5 GB of free space on their computers.
  • (GTH) While it is not required to have experience with integrated development environments like VS Code or Jupyter Lab, this would be very beneficial for the session.

Outline

Total time budgeted (including breaks) - 4 hours

  1. Introduction and Setup (~20 minutes)
    • Getting the environment set up. Participants can choose between VS Code or Jupyter Lab and those experiencing difficulties throughout the session will also have the option to walk through the workshop using an isolated environment in Binder.
    • Flash instructor intros.
    • Motivation for the workshop.
    • Analytics vs Engineering.
    • Quick breakdown of the session.
  2. Recipe 1: Automating your data cleaning pipelines (~45 minutes)
    • Intro to the datasets.
    • Cleaning Pipelines.
    • ETL pipelines with pandas.
    • Exercise (7-min).
  3. 10-minute break
  4. Recipe 2: Automating your development tools (~45 minutes)
    • Making the file system work for you, always.
    • Creating command-line tools.
    • Exercise (7-min).
  5. 10-minute break
  6. Recipe 3: Automating your infrastructure tools (~45 minutes)
    • How to think of your infrastructure in terms of blocks of Python code.
    • Automating your unit tests.
    • Building vs reusing vs buying.
    • Exercise (7-min).
  7. 10-minute break
  8. Recipe 4: Automating your analytics pipeline (~45 minutes)
    • Intro to the dataset.
    • Descriptive Statistics Pipeline.
    • Statistical Modeling Pipeline.
    • Exercise (7-min).
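
As a rough illustration of how Recipes 1 and 2 in the outline above could meet, here is a minimal sketch of a small extract-transform-load cleaning pipeline wrapped in a command-line tool. It is not taken from the workshop materials: the function names (`extract`, `transform`, `load`, `main`), the cleaning rules, and the use of the standard-library `csv` module in place of pandas are all assumptions made for the sake of a self-contained example.

```python
import argparse
import csv


def extract(stream):
    """Read CSV rows from a text stream into a list of dicts."""
    return list(csv.DictReader(stream))


def transform(rows):
    """Strip whitespace, drop rows with empty fields, remove exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Hypothetical cleaning rules; a real pipeline would define its own.
        tidy = {
            key.strip(): (value or "").strip()
            for key, value in row.items()
            if key is not None  # ignore fields beyond the header
        }
        if not all(tidy.values()):
            continue  # skip rows with missing values
        fingerprint = tuple(tidy.items())
        if fingerprint in seen:
            continue  # skip rows already seen
        seen.add(fingerprint)
        cleaned.append(tidy)
    return cleaned


def load(rows, stream):
    """Write the cleaned rows to a text stream as CSV."""
    if not rows:
        return
    writer = csv.DictWriter(stream, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


def main(argv=None):
    """Command-line entry point: clean SOURCE and write the result to TARGET."""
    parser = argparse.ArgumentParser(description="Clean a CSV file.")
    parser.add_argument("source", help="path to the raw CSV file")
    parser.add_argument("target", help="path for the cleaned CSV file")
    args = parser.parse_args(argv)

    with open(args.source, newline="") as src:
        rows = transform(extract(src))
    with open(args.target, "w", newline="") as dst:
        load(rows, dst)

    print(f"wrote {len(rows)} rows to {args.target}")
    return 0
```

Calling `main(["raw.csv", "clean.csv"])` runs the whole pipeline on two hypothetical file paths. To use it from a shell, the file would end with the usual `if __name__ == "__main__":` guard that calls `main()`. Keeping each stage as a plain function means the same steps can be tested in isolation or reused from a notebook.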