Big Data Processing Course Introduction

Guillaume Eynard-Bontemps, Emmanuelle Sarrazin, Hugues Larat, CNES (Centre National d’Etudes Spatiales - French Space Agency)

2024-02-24

Welcome

Course Overview

  • Big Data Processing

Harnessing the complexity of large amounts of data is a challenge in itself.

But Big Data processing is more than that: originally characterized by the 3 Vs of Volume, Velocity and Variety, it relies on concepts popularized by Hadoop and Google that require dedicated computing solutions (both software and infrastructure), which this module explores.

Objectives

By the end of this module, participants will be able to:

  • Understand the differences between the main distributed computing architectures (HPC, Big Data, Cloud, CPU vs GPGPU) and when to use each
  • Implement the distribution of simple operations via the Map/Reduce principle in PySpark
  • Connect to a cloud computing engine (e.g. Google Cloud Platform) and use it
  • Understand the principle of containers (through Docker) and Kubernetes
  • Deploy a Big Data Processing Platform on the Cloud
  • Implement the distribution of data wrangling/cleaning and of machine learning training using the PyData stack, Jupyter notebooks and Dask
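
To give a first idea of the Map/Reduce principle named in the objectives, here is a minimal sketch in plain Python. It is not PySpark and not taken from the course material: the word-count task, the input chunks and the function names `map_phase` and `reduce_phase` are assumptions made for illustration. Each chunk is processed independently (map), then the partial results are merged pairwise (reduce), which is the part a distributed engine would spread across workers.

```python
from collections import Counter
from functools import reduce


def map_phase(chunk):
    """Map step: count words within one chunk, independently of the others."""
    return Counter(chunk.lower().split())


def reduce_phase(left, right):
    """Reduce step: merge two partial counts into a single count."""
    return left + right


# Hypothetical input, already split into chunks the way a cluster
# would distribute it across workers.
chunks = [
    "big data needs big clusters",
    "spark and dask process big data",
]

# Map every chunk, then fold the partial counts together.
partial_counts = map(map_phase, chunks)
word_counts = reduce(reduce_phase, partial_counts, Counter())

print(word_counts["big"], word_counts["data"])
```

Because the merge is associative, the partial counts could be combined in any order, which is what lets a framework run the map step on many machines at once.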

Typical daily schedule

Time slot      Content
9:00-10:30     Slides, tutorial or exercises
10:30-10:45    Coffee break
10:45-12:15    Slides, tutorial or exercises
12:15-13:30    Lunch (I know it’s a bit short)
13:30-15:15    Slides, tutorial or exercises (no napping)
15:15-15:30    Coffee break (we may also take two breaks)
15:30-17:15    Slides, tutorial or exercises (last session, at last)

I’ll try to offer a few quizzes to make sure you’re following!

About myself

  • Guillaume
  • CNES (Centre National d’Etudes Spatiales - French Space Agency)
  • Since 2016:
    • 6 years in CNES Computing Center team
    • 1 year of holidays
    • 1 year developing image processing tools
    • 6 years of using Dask/Python on HPC and in the Cloud
    • A bit of Kubernetes and Google Cloud
  • Before that: 5 years on Hadoop and Spark
  • Originally: Software developer (a lot of Java)

About others

  • Hugues
  • CNES (Centre National d’Etudes Spatiales - French Space Agency)
  • Since 2020:
    • Cloud specialist
    • Ground Segment Engineer
  • Before that:
    • System Architect
    • 8 years as Software Engineer and Tech Lead (a lot of Java)
    • 5 years as System & Network Technician
  • Emmanuelle
  • CNES (Centre National d’Etudes Spatiales - French Space Agency)
  • Since 2013:
    • 6 years as HPC Expert
    • 5 years in image processing and 3D

About yourselves

Which courses have you already taken in this master’s program?

How familiar are you with big data, cloud, Python and machine learning?

First quiz

Let’s try the quiz mechanism.

What is the main subject of this course module?

  • Answer A: Cloud computing
  • Answer B: Big Data Processing
  • Answer C: Machine Learning
Answer

Answer link Key: ah

Program

Day 1: Big Data, Distributed Computing and Spark

Day 2: Cloud Computing and Kubernetes

Day 3 (morning): Deploy your own processing platform on Kubernetes

  • Deploy a JupyterHub on Kubernetes
    • Exercise: Zero to JupyterHub: deploy a JupyterHub on Kubernetes
  • Deploy a data processing platform on the Cloud based on Kubernetes and Dask (3h)

Day 3 (afternoon): Python ecosystem for data processing

Day 4: Python for distributed processing

Day 5: Evaluation

  • Final Evaluation
    • Prerequisite: Pangeo platform already deployed (on days 2 and 3)
    • Clean large amounts of data using Dask in the cloud (3h)
    • Train machine learning models in parallel (hyperparameter search) (3h)
    • Notebook with code cells to fill in or questions to answer

Quiz

What will we do today (multiple answers possible)?

  • Answer A: Explain what Big Data is
  • Answer B: See what Hadoop is
  • Answer C: Dig into Cloud computing
  • Answer D: Take a nap
  • Answer E: Some Spark
Answer

Answer link Key: mv