SDD DE Data Distribution Course

Guillaume Eynard-Bontemps, Hugues Larat, CNES (Centre National d’Etudes Spatiales - French Space Agency)

2026-01

Welcome

Course Overview

Harnessing the complexity of large amounts of data is a challenge in itself.

But Big Data processing is more than that: originally characterized by the 3 Vs of Volume, Velocity and Variety, the concepts popularized by Hadoop and Google require dedicated computing solutions (both software and infrastructure), which will be explored in this module.

We’ll also dive into the new programming and infrastructure technologies that emerged from these concepts.

Objectives

By the end of this module, participants will be able to:

  • Understand the differences and usages of the main distributed computing architectures (HPC, Big Data, Cloud, CPU vs GPGPU)
  • Implement the distribution of simple operations via the Map/Reduce principle in PySpark and Dask (see the sketch after this list)
  • Understand the principles of Kubernetes
  • Deploy a Big Data processing platform on the Cloud
  • Implement distributed data wrangling/cleaning and machine learning training using the PyData stack, Jupyter notebooks and Dask
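As a teaser for the Map/Reduce objective above, here is a minimal, illustrative sketch with a Dask bag (assuming `dask` is installed; the input numbers are made up for the demo):

```python
# Illustrative Map/Reduce with a Dask bag (assumes `pip install "dask[bag]"`).
import dask.bag as db

numbers = db.from_sequence(range(10), npartitions=2)  # partitioned input data
total = numbers.map(lambda x: x * x).sum()            # map: square, reduce: sum
print(total.compute())                                # -> 285
```

The same pattern (a map over partitioned data followed by a reduction) is what PySpark expresses with RDD transformations, as shown later in the Spark hands on.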

About us

  • Guillaume
  • CNES (Centre National d’Etudes Spatiales - French Space Agency)
  • Since 2016:
    • 6 years in CNES Computing Center team
    • 1 year of holidays
    • 3 years developing image processing tools
    • 8 years of using Dask/Python on HPC and in the Cloud
    • A bit of Kubernetes and Google Cloud
  • Before that: 5 years on Hadoop and Spark
  • Originally: Software developer (a lot of Java)
  • Hugues
  • CNES (Centre National d’Etudes Spatiales - French Space Agency)
  • Since 2020:
    • Cloud specialist
    • Ground Segment Engineer
  • Before that:
    • System Architect
    • 8 years as Software Engineer and Tech Lead (a lot of Java)
    • 6 years as System & Network Technician

First quiz

I’ll propose a few quizzes along the way to make sure you’re following!

Let’s try it.

What is this course module’s main subject?

  • Answer A: Cloud computing
  • Answer B: Data Distribution
  • Answer C: Machine Learning

Program

Big Data & Distributed Computing (3h)

Deployment & Intro to Kubernetes (3h)

MLOps: deploying your model as a Web App

Kubernetes hands on (3h)

  • Zero to JupyterHub: deploy a JupyterHub on Kubernetes
  • Deploy a DaskHub: a Dask-enabled JupyterHub (for later use); a connection sketch follows below
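Once a DaskHub is running, you would typically request Dask clusters from a notebook through Dask Gateway. A hedged sketch, assuming the DaskHub chart’s default dask-gateway setup (gateway address, authentication and worker counts vary per deployment):

```python
# Illustrative: request a Dask cluster from a DaskHub deployment via
# dask-gateway (assumes the chart's default in-cluster configuration).
from dask_gateway import Gateway

gateway = Gateway()              # picks up the environment configured by DaskHub
cluster = gateway.new_cluster()  # ask the gateway to start a new Dask cluster
cluster.scale(2)                 # request two workers
client = cluster.get_client()    # connect a client to submit computations
print(client)
```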

Slides

Python Data Processing and Spark hands on (3h)

  • The rise of the Python ecosystem for Data Processing (1h)
    • Data Science programming languages
    • PyData stack (Pandas, NumPy, Matplotlib, Jupyter)
    • Distributed and scientific processing (Dask, PySpark)
    • Data visualization (Seaborn, Plotly, PyViz)
    • Machine and Deep Learning (Scikit-Learn, TensorFlow, PyTorch)
    • Jupyter notebooks, Binder, Google Colab
  • Spark Introduction (30m)
  • Play with MapReduce using Spark (notebook on small datasets) (1.5h); a word-count sketch follows below
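To give a flavor of those notebook exercises, here is a hedged word-count sketch of the Map/Reduce principle in PySpark (assuming a local `pyspark` install; the input lines are made up):

```python
# Illustrative MapReduce word count with PySpark (assumes `pip install pyspark`).
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["big data", "data distribution", "big compute"])
counts = (
    lines.flatMap(lambda line: line.split())  # map: split lines into words
         .map(lambda word: (word, 1))         # map: emit (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)     # reduce: sum counts per word
)
print(counts.collect())  # e.g. [('big', 2), ('data', 2), ...]
spark.stop()
```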

Distributed Processing and Dask hands on (3h)

Evaluation: DaskHub, Dask preprocessing, ML training (3h + self-paced)

  • Evaluation introduction (1h)
    • Subject presentation
    • Everyone should have a DaskHub cloud platform set up, or Dask on a local computer
    • Get the data
  • Notebook with code cells to fill and answers to give
    • Clean large amounts of data using Dask in the cloud or on a big machine
    • Train machine learning models in parallel (hyperparameter search); see the sketch after this list
    • Complete it with your own efforts!
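As a preview of the parallel training part, here is a hedged sketch of a hyperparameter search fanned out to Dask workers through the joblib backend (assumes `dask[distributed]` and scikit-learn are installed; the dataset and parameter grid are illustrative):

```python
# Illustrative parallel hyperparameter search on Dask workers via joblib.
from dask.distributed import Client  # importing distributed registers the backend
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

client = Client()  # local cluster here; point at your DaskHub scheduler instead

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [50, 100], "max_depth": [5, 10]},
    cv=3,
)
with joblib.parallel_backend("dask"):  # fan the CV fits out to Dask workers
    search.fit(X, y)
print(search.best_params_)
```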

Quiz

What will we do today?

  • Answer A: Explain what Big Data is
  • Answer B: Dig into Cloud computing
  • Answer C: Take a nap
  • Answer D: Play with Kubernetes
  • Answer E: Some Spark