Introduction to Big Data and its Ecosystem

Guillaume Eynard-Bontemps, CNES (Centre National d’Etudes Spatiales - French Space Agency)

2020-11-15

What is Big Data?

Data evolution

1 ZB
1,000,000 PB
1,000,000,000,000 GB
1,000,000,000,000,000,000,000 B
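
The equivalences above follow from decimal (SI) prefixes, where each step is a factor of 1,000. A minimal sketch to check them; the `UNITS` table and the `convert` helper are illustrative names, not part of the original slides:

```python
# Decimal (SI) byte units, expressed as powers of ten.
UNITS = {"B": 0, "KB": 3, "MB": 6, "GB": 9, "TB": 12, "PB": 15, "EB": 18, "ZB": 21}


def convert(value, from_unit, to_unit):
    """Convert a quantity between decimal byte units."""
    exponent = UNITS[from_unit] - UNITS[to_unit]
    if exponent >= 0:
        # Going to a smaller unit: stay in exact integer arithmetic.
        return value * 10 ** exponent
    # Going to a larger unit: the result may be fractional.
    return value / 10 ** (-exponent)


for unit in ("PB", "GB", "B"):
    print(f"1 ZB = {convert(1, 'ZB', unit):,} {unit}")
```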

Evolution of the global datasphere size (IDC)

Some figures

Volume of data produced in a day in 2019 (source www.visualcapitalist.com)

Some figures in sciences

Earth Observation Data

Volume of data per year (source The Australian Geoscience Data Cube — Foundations and lessons learned, A. Lewis)

CERN

  • The LHC experiments produce about 90 petabytes of data per year
  • Other (non-LHC) experiments at CERN produce an additional 25 petabytes of data per year
CERN current data volumes

You can have a look

3V, 4V, 5V

What is Behind Big Data

Data

Volume, variety, multiple sources, internal, external…

Tools and technology

Store, Compute, Analyse: Calculators, Cloud, Hadoop, Spark, Dask

Visualize, Use: Applications, Web interfaces

Definition (Wikipedia)

Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software.

Big data is where parallel computing tools are needed to handle data.

Not a technology.

Quiz

What is the estimated size of the global data sphere?

  • Answer A: 175 Petabytes
  • Answer B: 175 Exabytes
  • Answer C: 175 Zettabytes
Answer

Answer link Key: ch

Quiz

Cite some V’s of Big Data (multiple choices):

  • Answer A: Validation
  • Answer B: Volume
  • Answer C: Velocity
  • Answer D: Voldemort
  • Answer E: Variety
Answer

Answer link Key: wn

Legacy “Big Data” ecosystem

Booming ecosystem

Hadoop & Map Reduce

Hadoop ecosystem
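
The MapReduce programming model can be sketched in a few lines of plain Python with the classic word count. This is only an illustration of the three phases (map, shuffle, reduce) on a single machine. It does not use the Hadoop API, and the function names and sample lines are assumptions made for the example:

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


# Each line plays the role of one record of a large text file.
lines = ["big data is big", "data is everywhere"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(counts)
```

In a real cluster, the map and reduce calls run in parallel on the nodes that hold the data, and the shuffle moves intermediate pairs across the network.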

NoSQL (Not only SQL)

SQL vs NoSQL databases model (altitudetvm)
Popular NoSQL Databases

Logs, ETL, Time series

Elastic stack
Grafana/Prometheus/InfluxDB
Dashboards

Dataviz

BI (software)

Classical BI tools

Python (libraries)

Python data visualization landscape

Data Science and Machine Learning

Machine Learning tools (Azure)

Quiz

Which technology is the most representative of the Big Data world?

  • Answer A: Spark
  • Answer B: Elasticsearch
  • Answer C: Hadoop
  • Answer D: Tensorflow
  • Answer E: MPI (Message Passing Interface)
Answer

Answer link Key: rv

Big Data use cases

Typical Dataset (originally)

Huge number of small objects:

  • Billions of records
  • KB to MB range

Think of:

  • Web pages, and the words in them
  • Tweets
  • Text files, where each line is a record
  • IoT and everyday life sensors: a record per second, minute or hour.

Cost effective storage and processing

  • Commodity hardware (standard servers, disks and network)
  • Horizontal scalability
  • Proximity of Storage and Compute
  • Secure storage (redundancy or Erasure Coding)
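
To illustrate the difference between redundancy and erasure coding, here is a minimal sketch using a single XOR parity block, the simplest possible erasure code. It is a deliberate simplification: production systems typically rely on Reed-Solomon codes, which tolerate several simultaneous losses. Block contents and names are made up for the example:

```python
def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)


# Three data blocks, as if stored on three different disks.
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_blocks)  # stored on a fourth disk

# Simulate the loss of the second block: XOR of the survivors rebuilds it.
survivors = [data_blocks[0], data_blocks[2], parity]
recovered = xor_blocks(survivors)
print(recovered == data_blocks[1])

# Storage overhead: raw bytes stored per byte of useful data.
replication_overhead = 3.0  # three full copies of every block
parity_overhead = (len(data_blocks) + 1) / len(data_blocks)
print(replication_overhead, round(parity_overhead, 2))
```

The trade-off: 3-way replication survives two lost copies but triples the storage, while this parity scheme survives only one lost block for about a third of extra storage.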

Use cases:

  • Archiving
  • Massive volume handling
  • ETL (Extract Transform Load)
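
As a small-scale illustration of ETL, the sketch below extracts records from CSV text, transforms them (drops incomplete rows, converts types) and loads them into an SQLite table. The sensor data, column names and table name are hypothetical, and the standard library stands in for the distributed tools used at scale:

```python
import csv
import io
import sqlite3

# Hypothetical raw input; the second record has a missing measurement.
RAW_CSV = """sensor_id,timestamp,temperature_c
s1,2020-11-15T10:00:00,21.5
s2,2020-11-15T10:00:00,
s1,2020-11-15T10:01:00,21.7
"""


def extract(text):
    """Extract: parse the raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop incomplete rows and convert temperatures to floats."""
    cleaned = []
    for row in rows:
        if not row["temperature_c"]:
            continue
        cleaned.append((row["sensor_id"], row["timestamp"], float(row["temperature_c"])))
    return cleaned


def load(records, connection):
    """Load: write the cleaned records into a database table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(sensor_id TEXT, timestamp TEXT, temperature_c REAL)"
    )
    connection.executemany("INSERT INTO readings VALUES (?, ?, ?)", records)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
row_count = connection.execute("SELECT COUNT(*) FROM readings").fetchone()[0]
print(row_count)
```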

Data mining, data value, data cross processing

Extract new knowledge and value from the data:

  • Statistics,
  • Find new Key Performance Indicators,
  • Explain your data with no prior knowledge (Data Mining)

Cross analysis of internal and external data, correlations:

  • Trends in news or social network streams, correlated with sales
  • Near real time updates with Stream processing
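
A correlation between an external signal and internal data can be quantified with the Pearson coefficient. A minimal sketch, assuming two aligned daily series; the numbers are invented for illustration, not real measurements:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(variance_x * variance_y)


# Hypothetical data: daily social network mentions and daily sales.
mentions = [120, 150, 90, 200, 170]
sales = [1000, 1150, 900, 1500, 1300]

r = pearson(mentions, sales)
print(f"correlation = {r:.3f}")
```

A value close to 1 suggests the two series move together. It does not by itself show that one causes the other.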

Scientific data processing

Data production or scientific exploration:

  • Stream processing, or near real time processing from sensor data
  • Distributed processing of massive volumes of incoming data on a computing farm
  • Data exploration and analysis
  • Data Science
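
Stream processing means computing results as records arrive, instead of waiting for a complete dataset. A minimal sketch with a sliding-window average over sensor readings; the generator name and the sample values are assumptions for the example:

```python
from collections import deque


def rolling_mean(stream, window_size):
    """Yield the mean of the last `window_size` values after each new reading."""
    window = deque(maxlen=window_size)  # old values drop out automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)


# Hypothetical temperature readings arriving one at a time.
readings = [20.0, 22.0, 24.0, 26.0]
means = list(rolling_mean(readings, window_size=2))
print(means)
```

Because `rolling_mean` is a generator, it only keeps the current window in memory and also works on an unbounded source.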

Gaia: 150 TB input, 6 PB generated

Iota2: 20-40 TB of input data

Other main use cases

  • Digital twins
  • Predictive maintenance
  • Smart City
  • Real time processing
Airplane Digital Twin

Quiz

What are the typical volumes of scientific datasets (multiple choices)?

  • Answer A: MBs
  • Answer B: GBs
  • Answer C: TBs
  • Answer D: PBs
  • Answer E: EBs
Answer

Answer link Key: pe

Big Data to Machine Learning

The Big Data ecosystem allows (part of) machine learning to be effective

  • More data = more precise models
  • Deep Learning is difficult without large (possibly generated) input datasets
  • Tools to collect, store, filter, index, structure data
  • Tools to analyse and visualize data
  • Real time model learning

https://blog.dataiku.com/when-and-when-not-to-use-deep-learning

Pre processing before machine learning

  • Data wrangling and exploration
  • Feature engineering: unstructured data to input features
  • Cross multiple data sources
  • Get insights on the data before processing it (statistics, visualization)
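
Feature engineering on unstructured data can be illustrated with a bag-of-words encoding, which turns free text into fixed-length numeric vectors. This is one simple technique among many, shown here as a sketch; the function name and sample documents are illustrative:

```python
from collections import Counter


def bag_of_words(documents):
    """Turn raw texts into word-count vectors over a shared vocabulary."""
    tokenized = [document.lower().split() for document in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One column per vocabulary word, in the same order for every document.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = bag_of_words(["big data tools", "data science and data mining"])
print(vocabulary)
print(vectors)
```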

Distribute datasets and algorithms

  • For preprocessing as seen above
  • Means to load and learn on large volumes by distributing storage
  • Distributed learning with data locality on big datasets
  • Distributed hyper parameter search
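
A hyperparameter search distributes well because each candidate is evaluated independently. The sketch below runs a grid search through a thread pool on a single machine; on a cluster, the same pattern would be submitted to a distributed scheduler. The `validation_loss` function is a hypothetical stand-in for training and scoring a real model:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def validation_loss(params):
    """Hypothetical objective: a real one would train a model and score it."""
    learning_rate, depth = params
    return (learning_rate - 0.1) ** 2 + (depth - 4) ** 2


# Every combination of the candidate values is an independent task.
grid = list(product([0.01, 0.1, 1.0], [2, 4, 8]))

with ThreadPoolExecutor(max_workers=4) as executor:
    # map() preserves the input order, so losses[i] belongs to grid[i].
    losses = list(executor.map(validation_loss, grid))

best_loss, best_params = min(zip(losses, grid))
print(best_params, best_loss)
```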
Lambda Architecture (Azure)

Quiz

Are Big Data and Machine Learning the same?

Answer

Answer link Key: tf