CNES Big Data Processing & Distribution Course

Guillaume Eynard-Bontemps, CNES (Centre National d’Etudes Spatiales - French Space Agency)

2022-04-08

Welcome

Course Overview

  • Big Data Processing & Distribution

Harnessing the complexity of large amounts of data is a challenge in itself.

But Big Data processing is more than that. Big Data was originally characterized by the 3 Vs (Volume, Velocity and Variety), and the concepts popularized by Hadoop and Google require dedicated computing and storage solutions (both software and infrastructure), which will be explored in this module.

Objectives

By the end of this module, participants will be able to:

  • Understand the differences between, and typical uses of, the main distributed computing architectures (HPC, Big Data, Cloud, CPU vs GPGPU)
  • Implement the distribution of simple operations via the Map/Reduce principle in PySpark
  • Understand the principle of Kubernetes and object storage
  • Deploy a Big Data Processing Platform (Pangeo) on the Cloud
  • Either implement distributed data wrangling/cleaning and machine learning training using Dask
  • Or implement NDVI processing at scale using the Pangeo platform

About myself

  • Guillaume Eynard-Bontemps
  • CNES (Centre National d’Etudes Spatiales - French Space Agency)
  • Since 2016:
    • 6 years in the CNES Computing Center team
    • Technical lead of the team
    • 3 years of using Dask/Python on HPC and in the Cloud
    • A bit of Kubernetes and Google Cloud
  • Before that: 5 years on Hadoop and Spark
  • Originally: Software developer (a lot of Java)

First quiz

I’ll include a few quizzes to make sure you’re following!

Let’s try it.

What is the main subject of this course module?

  • Answer A: Cloud computing
  • Answer B: Data Distribution & Processing
  • Answer C: Machine Learning

Answer

Answer link Key: nf

Program

Big Data & Distributed Computing (3.5h)

Cloud, Kubernetes & Object stores (3.5h)

Python ecosystem for data processing and Dask at scale (3.5h)

Quiz

What will we do today (multiple answers possible)?

  • Answer A: Explain what Big Data is
  • Answer B: See what Hadoop is
  • Answer C: Dig into Cloud computing
  • Answer D: Take a nap
  • Answer E: Some Spark

Answer

Answer link Key: dy