SDD DE Data Distribution Course

Guillaume Eynard-Bontemps, Hugues Larat, CNES (Centre National d’Etudes Spatiales - French Space Agency)

2025-01

Welcome

Course Overview

Harnessing the complexity of large amounts of data is a challenge in itself.

But Big Data processing is more than that: originally characterized by the 3 Vs of Volume, Velocity and Variety, the concepts popularized by Google and Hadoop call for dedicated computing solutions, both software and infrastructure, which this module explores.

Objectives

By the end of this module, participants will be able to:

  • Understand the differences between, and uses of, the main distributed computing architectures (HPC, Big Data, Cloud, CPU vs GPGPU)
  • Implement the distribution of simple operations via the Map/Reduce principle in PySpark and Dask (see the sketch after this list)
  • Understand the principles of Kubernetes
  • Deploy a Big Data processing platform on the Cloud
  • Distribute data wrangling/cleaning and machine learning training using the PyData stack, Jupyter notebooks and Dask
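
As a first taste of the Map/Reduce objective above, here is a minimal sketch using a Dask bag; the toy numbers are purely illustrative, not course data:

```python
import dask.bag as db

# Map/Reduce on a toy dataset: square each number (map),
# then sum the squares (reduce), in parallel across partitions.
numbers = db.from_sequence(range(10), npartitions=2)
total = numbers.map(lambda x: x * x).sum()  # builds a lazy task graph
print(total.compute())  # triggers the computation: 285
```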

About us

  • Guillaume
  • CNES (Centre National d’Etudes Spatiales - French Space Agency)
  • Since 2016:
    • 6 years in the CNES Computing Center team
    • 1 year of holidays
    • 2 years developing image processing tools
    • 7 years of using Dask/Python on HPC and in the Cloud
    • A bit of Kubernetes and Google Cloud
  • Before that: 5 years on Hadoop and Spark
  • Originally: Software developer (a lot of Java)
  • Hugues
  • CNES (Centre National d’Etudes Spatiales - French Space Agency)
  • Since 2020:
    • Cloud specialist
    • Ground Segment Engineer
  • Before that:
    • System Architect
    • 8 years as Software Engineer and Tech Lead (a lot of Java)
    • 6 years as System & Network Technician

First quiz

I’ll propose some quizzes along the way to make sure you’re following!

Let’s try it.

What is this course module’s main subject?

  • Answer A: Cloud computing
  • Answer B: Data Distribution
  • Answer C: Machine Learning
Answer

Answer link Key: od

Program

Big Data & Distributed Computing (3h)

Deployment & Intro to Kubernetes (3h)

Kubernetes hands on (3h)

  • Zero to JupyterHub: deploy a JupyterHub on Kubernetes
  • Deploy a DaskHub: a Dask-enabled JupyterHub (for later use)

Slides

Python Data Processing and Spark hands on (3h)

  • The rise of the Python ecosystem for Data Processing (1h)
    • Data Science programming languages
    • PyData stack (Pandas, NumPy, Matplotlib, Jupyter)
    • Distributed and scientific processing (Dask, PySpark)
    • Data visualization (Seaborn, Plotly, PyViz)
    • Machine and Deep Learning (Scikit-Learn, TensorFlow, PyTorch)
    • Jupyter notebooks, Binder, Google Colab
  • Spark Introduction (30m)
  • Play with MapReduce through Spark (Notebook on small datasets) (1.5h)
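
To set the stage for that notebook, here is the classic word count as a hedged sketch of MapReduce in PySpark; the input lines are made up, and the hands-on uses its own datasets:

```python
from pyspark.sql import SparkSession

# Classic MapReduce word count (illustrative sketch; the hands-on
# notebook works on real datasets with a preconfigured session).
spark = SparkSession.builder.appName("wordcount").getOrCreate()
lines = spark.sparkContext.parallelize(["big data", "big compute"])
counts = (
    lines.flatMap(lambda line: line.split())  # map: split lines into words
         .map(lambda word: (word, 1))         # map: emit (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)     # reduce: sum counts per word
)
print(counts.collect())  # e.g. [('big', 2), ('data', 1), ('compute', 1)]
spark.stop()
```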

Distributed Processing and Dask hands on (3h)
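
As a preview of this session, a minimal dask.dataframe sketch of the pandas-like, partitioned API it builds on; the toy sensor data is invented for illustration:

```python
import dask.dataframe as dd
import pandas as pd

# Pandas-like API, but partitioned and lazy: we build a toy DataFrame
# in memory here; the hands-on reads real files instead.
pdf = pd.DataFrame({"sensor": ["a", "b", "a", "b"],
                    "value": [1.0, 2.0, 3.0, 4.0]})
ddf = dd.from_pandas(pdf, npartitions=2)
means = ddf.groupby("sensor")["value"].mean()  # lazy task graph
print(means.compute())  # executes across partitions/workers
```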

Evaluation: DaskHub, Dask preprocessing, ML training (6h)

  • Evaluation introduction (1h)
    • Subject presentation
    • Everyone should have a DaskHub cloud platform set up, or Dask on a local computer
    • Get the data
  • A notebook with code cells to fill in or questions to answer
    • Clean large amounts of data using Dask in the cloud or on a big computer
    • Train machine learning models in parallel (hyperparameter search; see the sketch after this list)
    • Complete it with your own efforts!
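
A hedged sketch of what the parallel hyperparameter search could look like with scikit-learn and Dask; the synthetic data and parameter grid are illustrative, not the evaluation's actual subject:

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Grid search dispatched to Dask workers through the joblib backend.
client = Client()  # or point it at your DaskHub scheduler
X, y = make_classification(n_samples=1000, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=500),
                      {"C": [0.1, 1, 10]})
with joblib.parallel_backend("dask"):  # fan fits out to the cluster
    search.fit(X, y)
print(search.best_params_)
```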

Quiz

What will we do today?

  • Answer A: Explain what Big Data is
  • Answer B: Dig into Cloud computing
  • Answer C: Take a nap
  • Answer D: Play with Kubernetes
  • Answer E: Some Spark
Answer

Answer link Key: si