The rise of the Python ecosystem for Data Processing

Guillaume Eynard-Bontemps and Emmanuelle Sarrazin, CNES (Centre National d’Etudes Spatiales - French Space Agency)

2024-01

Data Science programming languages

R

  • Programming language and free software environment
  • Open source
  • Interactive
  • Ecosystem
    • Statistical computing
    • Graphics, visualisation
    • Data analysis
RStudio

Julia

  • Fast: designed for high performance
  • Open source
  • Dynamically typed, interactive use
  • Ecosystem
    • Scientific and parallel computing
    • Visualisation and plotting
    • Data science and machine learning

C/C++

  • Statically typed, compiled languages
  • Little built-in visualization
  • Used for the lower layers of the libraries you use
  • Easy to interface with Python (Cython, pybind11)

Lua

  • Lightweight, high-level, multi-paradigm programming language
  • Designed primarily for embedded use in applications
  • Cross-platform
  • Used for the lower layers of some libraries
  • C API

Java

  • Statically typed language
  • Little built-in visualization
  • Not fully compliant with the IEEE 754 floating-point standard

Matlab and others

Matlab (and its open-source equivalent Scilab)

  • Interactive
  • With IDE and plotting
  • Closed source, which hampers reproducibility
  • Still used by some researchers

Python

  • Created in 1991
  • Interpreted, hence interactive, language
  • Really simple, readable syntax
  • Dynamically typed and garbage-collected
  • Supports multiple programming paradigms:
    • structured (particularly procedural),
    • object-oriented and
    • functional programming

Python

  • High-level and general-purpose programming language
  • Many, many (many) libraries
    • A lot of scientific ones!
  • Ecosystem
    • Scientific and parallel computing
    • Visualisation and plotting
    • Machine Learning, Deep Learning
    • Web development

Python the most used language?

Kaggle Languages Popularity

Kaggle IDE Popularity

Quiz

What is the most used language (in Data Science)?

  • Answer A: R
  • Answer B: Go
  • Answer C: Python
  • Answer D: Matlab
Answer

Answer link Key: ay

Python scientific ecosystem

Core (Numpy, SciPy, Pandas …)

Numpy

  • Manipulate N-dimensional arrays
  • Numerical computing tools:
    • math functions
    • linear algebra
    • Fourier transform
    • random number capabilities
    • etc
  • Performant: core is well-optimized C/C++ and Fortran code
  • Easy and de facto standard syntax

Nearly every scientist working in Python draws on the power of NumPy

# The standard way to import NumPy:
import numpy as np

# Create a 2-D array, set every second element in
# some rows and find max per row:

x = np.arange(15, dtype=np.int64).reshape(3, 5)
x[1:, ::2] = -99
x
# array([[  0,   1,   2,   3,   4],
#        [-99,   6, -99,   8, -99],
#        [-99,  11, -99,  13, -99]])
x.max(axis=1)
# array([ 4,  8, 13])

# Generate normally distributed random numbers:
rng = np.random.default_rng()
samples = rng.normal(size=2500)

Scipy

  • Uses NumPy arrays as its basic data structure
  • Offers scientific functions:
    • Optimization
    • Interpolation
    • Signal processing
    • Linear algebra
    • Statistics
    • Image processing
import numpy as np
from scipy import linalg
import matplotlib.pyplot as plt

rng = np.random.default_rng()

# Noisy samples of the model y = 5*exp(-x) + 2*x
xi = 0.1 * np.arange(1, 11)
yi = 5.0 * np.exp(-xi) + 2.0 * xi
zi = yi + 0.05 * np.max(yi) * rng.standard_normal(len(yi))

# Least-squares fit of the coefficients c0, c1 in c0*exp(-x) + c1*x
A = np.concatenate((np.exp(-xi)[:, np.newaxis], xi[:, np.newaxis]), axis=1)
c, resid, rank, sigma = linalg.lstsq(A, zi)

# Evaluate the fitted model on a finer grid (e.g. for plotting)
xi2 = np.arange(0.1, 1.01, 0.01)
yi2 = c[0] * np.exp(-xi2) + c[1] * xi2

Pandas

  • Deals with Series and DataFrames (i.e. tables)
  • Data manipulation and analysis
    • Selection
    • Grouping
    • Merge
    • Statistics
    • Transformation
  • Numerical tables and time series
  • Extension to geospatial data with geopandas

import pandas as pd

# Load a CSV file into a DataFrame and show summary statistics
df = pd.read_csv('Myfile.csv')
df.describe()

Xarray

  • Manipulate N-dimensional labelled arrays and datasets
  • Introduces dimensions, coordinates and attributes on top of Numpy
  • Borrows heavily from Pandas
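
A minimal sketch of a labelled array; the dimension and coordinate names below are made up for illustration:

import numpy as np
import xarray as xr

# Hypothetical 2-D data labelled with named dimensions and coordinates
data = xr.DataArray(
    np.random.rand(3, 4),
    dims=("time", "station"),
    coords={"time": ["2024-01-01", "2024-01-02", "2024-01-03"],
            "station": ["a", "b", "c", "d"]},
    attrs={"units": "degC"},
)

data.sel(station="a")   # label-based selection, as in Pandas
data.mean(dim="time")   # reduce over a named dimension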

Quiz

Which tools allow manipulating tabular data?

  • Answer A: Numpy
  • Answer B: Xarray
  • Answer C: Pandas
  • Answer D: Scipy
Answer

Answer link Key: qa

Visualization

Landscape

Adaptation of Jake VanderPlas graphic about the Python visualization landscape, by Nicolas P. Rougier

Matplotlib

  • Base/Reference plotting library
  • For Python and Numpy
  • Static, animated, and interactive visualizations
  • Designed to be as usable as MATLAB
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.ticker import LinearLocator

fig, ax = plt.subplots(subplot_kw={"projection": "3d"})

# Make the data: a radial sine surface.
X = np.arange(-5, 5, 0.25)
Y = np.arange(-5, 5, 0.25)
X, Y = np.meshgrid(X, Y)
Z = np.sin(np.sqrt(X**2 + Y**2))

# Plot the surface.
surf = ax.plot_surface(X, Y, Z, cmap=cm.coolwarm,
                       linewidth=0, antialiased=False)

# Customize the z axis.
ax.set_zlim(-1.01, 1.01)
ax.zaxis.set_major_locator(LinearLocator(10))
# A StrMethodFormatter is used automatically
ax.zaxis.set_major_formatter('{x:.02f}')

# Add a color bar which maps values to colors.
fig.colorbar(surf, shrink=0.5, aspect=5)

plt.show()

Seaborn

  • Based on Matplotlib
  • Integrates closely with Pandas
  • Dataset oriented to produce informative plots
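
A minimal sketch, using one of Seaborn's example datasets (load_dataset fetches it online):

import seaborn as sns

# "tips" is one of Seaborn's example datasets
tips = sns.load_dataset("tips")
sns.relplot(data=tips, x="total_bill", y="tip", hue="smoker")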

Plotly

  • Interactive, publication-quality graphs
  • Build dashboards with Dash
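
A minimal sketch with Plotly Express (the iris sample dataset ships with Plotly):

import plotly.express as px

df = px.data.iris()  # sample dataset bundled with Plotly
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
fig.show()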

Bokeh

  • Interactive, publication-quality graphs
  • Build dashboards with Bokeh server (or Panel)

PyViz

  • HoloViews: Declarative objects for instantly visualizable data, building Bokeh plots from convenient high-level specifications
  • GeoViews: Visualizable geographic data that can be mixed and matched with HoloViews objects
  • Panel: Assembling objects from many different libraries into a layout or app, whether in a Jupyter notebook or in a standalone serveable dashboard
  • Datashader: Rasterizing huge datasets quickly as fixed-size images
  • hvPlot: Quickly return interactive HoloViews or GeoViews objects from your Pandas, Xarray, or other data structures
  • Param: Declaring user-relevant parameters, making it simple to work with widgets inside and outside of a notebook context
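
A minimal hvPlot sketch (the DataFrame below is made up for illustration):

import pandas as pd
import hvplot.pandas  # registers the .hvplot accessor on Pandas objects

df = pd.DataFrame({"x": range(10), "y": [i ** 2 for i in range(10)]})
df.hvplot.line(x="x", y="y")  # returns an interactive HoloViews object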

Machine and Deep Learning

Kaggle stats

Machine Learning Frameworks usage

Scikit-Learn

  • Simple and efficient tools for predictive data analysis
  • Built on NumPy, SciPy, and matplotlib
  • All the classical ML algorithms
  • Standard interface with Pipelines, estimators, transformers
  • No GPU support (so not good for Deep Learning)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=0)
X = [[ 1,  2,  3],  # 2 samples, 3 features
     [11, 12, 13]]
y = [0, 1]  # classes of each sample
clf.fit(X, y)

Scikit-Learn

TensorFlow, Keras

  • Deep Learning on GPUs without any GPU-programming knowledge
  • Keras: high-level API on top of TensorFlow
  • TensorFlow: a complete platform, with TensorBoard and other tools
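
A minimal Keras sketch; the layer sizes and input shape are made up for illustration:

from tensorflow import keras

# Hypothetical small classifier: 20 input features, 10 output classes
model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=5)  # assuming X_train / y_train exist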

Pytorch

  • Deep Learning on GPUs without any GPU-programming knowledge
  • Endless trolling about Keras/TF vs PyTorch
  • Additional libraries:
    • pytorch-lightning
    • pytorch3d
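
A minimal PyTorch sketch; the tensor and layer sizes are made up for illustration:

import torch
import torch.nn as nn

# Hypothetical small network: 20 input features, 10 output classes
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

x = torch.randn(8, 20)   # a batch of 8 random samples
logits = model(x)        # forward pass
print(logits.shape)      # torch.Size([8, 10])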

Gradient boosting algorithms

XGBoost

  • Distributed gradient boosting library
  • Efficient, flexible and portable
  • XGBoost provides a parallel tree boosting
  • Runs on major distributed environments (Hadoop, SGE, MPI, Spark)
  • Can solve problems beyond billions of examples
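
A minimal sketch using XGBoost's scikit-learn style API; the synthetic dataset and hyperparameters are arbitrary:

import xgboost as xgb
from sklearn.datasets import make_classification

# Arbitrary synthetic classification problem
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

model = xgb.XGBClassifier(n_estimators=100, max_depth=3)
model.fit(X, y)
print(model.predict(X[:5]))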

LightGBM

  • Distributed gradient boosting framework
  • Efficient: faster training, lower memory usage, better accuracy
  • Support of parallel, distributed, and GPU learning
  • Capable of handling large-scale data
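
The scikit-learn style API looks much the same; a minimal sketch on the same arbitrary synthetic data as above:

import lightgbm as lgb
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

model = lgb.LGBMClassifier(n_estimators=100)
model.fit(X, y)
print(model.predict(X[:5]))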

Data Version Control (DVC)

  • Version your data and models: Store them in your cloud storage but keep their version info in your Git repo.
  • Track experiments in your local Git repo (no servers needed).
  • Share experiments and automatically reproduce anyone’s experiment.

MLflow

  • Tracking experiments to record and compare parameters and results (MLflow Tracking).
  • Packaging ML code in a reusable, reproducible form in order to share with other data scientists or transfer to production (MLflow Projects).
  • Managing and deploying models from a variety of ML libraries to a variety of model serving and inference platforms (MLflow Models).
  • Providing a central model store to collaboratively manage the full lifecycle of an MLflow Model, including model versioning, stage transitions, and annotations (MLflow Model Registry).
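
A minimal MLflow Tracking sketch; the parameter and metric names are made up, and by default runs are logged to a local ./mlruns folder:

import mlflow

with mlflow.start_run():
    mlflow.log_param("learning_rate", 0.01)   # hypothetical hyperparameter
    mlflow.log_metric("accuracy", 0.93)       # hypothetical result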

MLflow

Quiz

Which is the best Deep Learning library in Python?

  • Answer A: Scikit-Learn
  • Answer B: Keras
  • Answer C: TensorFlow
  • Answer D: PyTorch
  • Answer E: XGBoost
Answer

Answer link Key: ca

Other scientific libraries

Sympy

  • Library for symbolic mathematics
  • Simplification, Calculus, Solvers
from sympy import symbols
x, y = symbols('x y')
expr = x + 2*y

Shapely

  • Library for manipulation and analysis of planar geometric objects
import numpy as np
import shapely
from shapely import Point

geoms = np.array([Point(0, 0), Point(1, 1), Point(2, 2)])
polygon = shapely.box(0, 0, 2, 2)

# Vectorized predicate: which points does the box contain?
shapely.contains(polygon, geoms)

Pandas Extension

GeoPandas

  • Makes manipulating geospatial data in Python easier
  • Provides geospatial operations on pandas objects:
    • Measure areas and distances
    • Compute intersections/unions
    • Make maps and plots
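
A minimal sketch; the file name "countries.geojson" and the column added below are hypothetical:

import geopandas as gpd

# Hypothetical vector file with one polygon per country
gdf = gpd.read_file("countries.geojson")

gdf["area"] = gdf.geometry.area        # measure areas
gdf.plot(column="area", legend=True)   # make a simple map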

Text Extensions for Pandas

  • Adds NLP-specific data types, operations, and library integrations to Pandas
  • Makes it easier to manipulate and analyze NLP-related data with Pandas

Development Tools

Jupyter (Lab and Notebook)

  • Open source web application
  • Create and share documents that contain live code
  • Equations, visualizations and narrative text
  • Interactive programming and visualizing
  • Usage:
    • data cleaning and transformation,
    • numerical simulation,
    • statistical modeling,
    • data visualization,
    • machine learning
  • Used by Google Colab and Kaggle

VSCode

  • Source-code editor developed by Microsoft for Windows, Linux and macOS.
  • Features include support for
    • debugging,
    • syntax highlighting,
    • intelligent code completion,
    • snippets,
    • code refactoring,
    • testing and
    • embedded Git.
  • Lots of extensions that add functionality.

PyCharm

  • IDE used for programming in Python
  • Cross-platform, working on Microsoft Windows, macOS and Linux
  • Features include support for
    • code analysis,
    • graphical debugger,
    • integrated unit tester,
    • integration with version control systems

Packaging

Pip / Conda

  • Package libraries
  • Make them available on repositories
  • Build environments automatically

Packaging: Pip / Conda

Difference between Conda and Pip according to Anaconda.

                        conda            pip
manages                 binaries         wheel or source
can require compilers   no               yes
package types           any              Python-only
create environment      yes, built-in    no, requires virtualenv or venv
dependency checks       yes              no

Others

Binder

Turn a Git repo into a collection of interactive notebooks

Exercises

Tutorials

Pandas tutorial

Follow this first tutorial at least up to chapter 6.

Scikit-learn tutorial

See the instructions to run the notebooks locally here. If you have time, go through the part “The predictive modeling pipeline” with notebooks 01 to 03. You can also use the online book.