Guillaume Eynard-Bontemps, CNES (Centre National d’Etudes Spatiales - French Space Agency)
2020-11-15
What is Hadoop?
Open-source framework supported by the Apache Software Foundation:
Numerous Apache Software Foundation projects:
Hadoop distributions!
Each cluster is composed of:
Spark's first version was released in 2014.
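To make the MapReduce model concrete before the quiz, here is a toy, single-process word count in Python; a real Hadoop job runs the same map, shuffle, and reduce phases distributed across the cluster (the sample documents are made up for illustration):

```python
# A toy, single-process sketch of the MapReduce pattern
# (real Hadoop distributes these phases across the cluster).
from collections import defaultdict

documents = ["big data on hadoop", "hadoop and spark", "big clusters"]

# Map phase: emit (key, value) pairs from each input record
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group all values by key (done by the framework)
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate the values of each key
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 2, 'data': 1, 'on': 1, 'hadoop': 2, ...}
```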
What are the two building blocks of the Hadoop ecosystem (multiple choices)?
Answer link Key: cf
What does HDFS stand for?
Answer link Key: wl
What is the magical hidden step of distributed MapReduce?
Answer link Key: ec
What is the goal of a Datalake?
Answer link Key: bp
You won’t usually achieve what you want with a single MapReduce or Spark job.
Let’s say you want to train an ML model every time a text file is updated on a website, and then evaluate it.
You’ll need to:
This is called a Pipeline or a workflow.
It mainly means chaining tasks or jobs together to automatically produce a result from an input.
Tasks are typically either:
It is usually represented by Directed Acyclic Graphs (DAGs).
Plenty of others exist, from Apache or in the Python ecosystem.
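As a sketch of how the text-file example above could be expressed as a DAG, here is a minimal Apache Airflow pipeline (assuming Airflow 2.x; the dag_id, task names, and callables are made up for illustration):

```python
# A minimal sketch of the pipeline idea with Apache Airflow
# (hypothetical dag_id, task names, and callables).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def fetch_text_file():
    """Download the updated text file from the website."""

def train_model():
    """Train the ML model on the new data."""

def evaluate_model():
    """Compute evaluation metrics for the trained model."""

with DAG(
    dag_id="train_on_update",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",  # check for updates once a day
    catchup=False,
) as dag:
    fetch = PythonOperator(task_id="fetch", python_callable=fetch_text_file)
    train = PythonOperator(task_id="train", python_callable=train_model)
    evaluate = PythonOperator(task_id="evaluate", python_callable=evaluate_model)

    # The DAG itself: fetch -> train -> evaluate
    fetch >> train >> evaluate
```

The scheduler walks this graph, running each task only once its upstream tasks have succeeded.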
HPC = High Performance Computing
Video: current speed in the eNATL60 simulation with explicit tidal motion (from Océan Numérique on Vimeo).
Several components: login nodes, admin/scheduler nodes, compute resources, a parallel file system, and an RDMA network.
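Compute resources are not used interactively: jobs are submitted to the scheduler (here SLURM) through batch scripts such as the one below.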
#!/bin/bash
#SBATCH --job-name=serial_job_test # Job name
#SBATCH --ntasks=1 # Run on a single CPU
#SBATCH --mem=1gb # Job memory request
#SBATCH --time=00:05:00 # Time limit hrs:min:sec
#SBATCH --output=serial_test_%j.log # Standard output and error log
module load python
python /data/training/SLURM/plot_template.py
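The script is then submitted with `sbatch` (e.g. `sbatch serial_job.sh`, a hypothetical filename); SLURM queues the job and runs it once the requested resources are free.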
Rank | System | Cores | Rmax (PFlop/s) | Rpeak (PFlop/s) | Power (kW) |
---|---|---|---|---|---|
1 | Frontier - United States | 8,699,904 | 1,194.00 | 1,679.82 | 22,703 |
2 | Aurora - United States | 4,742,808 | 585.34 | 1,059.33 | 24,687 |
4 | Supercomputer Fugaku - Japan | 7,630,848 | 442.01 | 537.21 | 29,899 |
5 | LUMI - Finland | 2,752,704 | 379.70 | 531.51 | 7,107 |
17 | Adastra - France | 319,072 | 46.10 | 61.61 | 921 |
167 | Jean Zay - France | 93,960 | 4.48 | 7.35 | |
Hence the cloud computing model…
How does Big Data processing differ from classical HPC (multiple choices)?
Answer link Key: zl
Business intelligence (BI) comprises the strategies and technologies used by enterprises for the data analysis of business information.
BI technologies provide historical, current, and predictive views of business operations.
Wikipedia
Business Intelligence can use Big Data ecosystems, but is more commonly considered something different.
https://medium.com/doctolib/data-engineers-are-no-longer-data-folks-d10d44712580
https://alphalyr.fr/blog/difference-bi-business-intelligence-big-data/
Not quite yet. It is still used in many places.
It grew up with the web giants, producing a really rich open-source ecosystem.
But clearly its two main components (HDFS and MapReduce) are now deprecated
and have paved the way for better alternatives:
Infrastructure: Private or public cloud, and HPC(DA) in some cases.
HDFS? Object Storage!
MapReduce? Spark, Dask.
Chunked file format (SequenceFile)? Parquet, Zarr, Cloud Optimized GeoTIFF.
YARN? HPC job scheduler, or Kubernetes.
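As an illustration of this newer stack, here is a minimal sketch reading Parquet data from object storage with Dask (assuming s3fs is installed; the bucket and column names are made up for illustration):

```python
# A minimal sketch of the "post-Hadoop" stack: Dask reading Parquet
# directly from object storage (hypothetical bucket and columns).
import dask.dataframe as dd

# An s3:// URI replaces hdfs:// — object storage instead of HDFS
df = dd.read_parquet("s3://my-bucket/events/")

# Dask replaces MapReduce for the distributed computation itself
result = df.groupby("user_id")["duration"].mean().compute()
print(result)
```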
What technologies are replacing the Hadoop ecosystem (multiple choices)?
Answer link Key: mz