# Methods for data-driven modelling

Modern physics is characterized by the increasing complexity of the systems under investigation, in domains as diverse as condensed matter, astrophysics, and biophysics. With the growing availability of experimental data, data-driven modelling is emerging as a powerful way to model these systems. The objective of the course is to provide the theoretical concepts and practical tools necessary to understand and use these approaches.

Complex systems, characterized by pervasive, non-homogeneous, strong interactions and out-of-equilibrium dynamical effects, are extremely challenging to model. Determining the relevant degrees of freedom, how they interact, and how they shape the collective behaviour of these systems is often out of reach with first-principle approaches. Data-driven modelling is emerging as an alternative approach in many fields, at the origin of recent breakthroughs, in protein folding for instance. Its use raises many questions, from the statistical point of view (the quality and quantity of data necessary for reaching good results), the physical one (the interpretability of these models, and how they reveal relevant mechanisms), and the computational one (the efficiency and complexity of the algorithms).

The objectives of this course are two-fold. First, we will provide statistical inference and machine learning tools to extract information and learn models from data. The lectures will start from the basics in Bayesian inference, and then present important concepts and tools in unsupervised and supervised learning. The emphasis will be put on the connections with statistical physics.

Second, each theoretical lecture will be followed by a tutorial illustrating the concepts with practical applications borrowed from various domains of physics or data science. We will focus on methods and on the interpretation of the results, not on programming and heavy numerics!

Week 1: **What is Bayesian inference?**

*Bayes' rule, notions of prior, likelihood and posterior, two historical illustrations: the German Tank and the Boy/Girl Birth Rate problems*

Tutorial: Diffusion coefficient from single-particle tracking
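As a small taste of this first lecture, here is a minimal sketch of the German Tank problem (all serial numbers are made up): with a flat prior on the fleet size N, the likelihood of observing k distinct serial numbers is 1/C(N, k) for every N at least as large as the biggest serial seen.

```python
import math
import numpy as np

serials = [61, 19, 56, 24, 14]        # hypothetical observed serial numbers
k, m = len(serials), max(serials)
Ns = np.arange(m, 2001)               # flat prior over plausible fleet sizes N
# likelihood of k distinct serials out of N tanks: 1 / C(N, k)
post = np.array([1.0 / math.comb(int(N), k) for N in Ns])
post /= post.sum()                    # normalize to a posterior over N

map_N = int(Ns[np.argmax(post)])      # posterior mode: the largest serial seen
mean_N = float((Ns * post).sum())     # posterior mean lies above the mode
```

Note how the posterior mode and the posterior mean disagree: the mode is the largest serial observed, while the mean accounts for the fleet plausibly extending beyond it.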

Week 2: **Asymptotic inference**

*Rate of convergence, Kullback-Leibler divergence, Fisher information, variational inference, illustration: mean field in stat. mech.*

Tutorial: Counting photons in a QED cavity from quantum trajectories of atoms (1)
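To illustrate the central quantity of this lecture, a few lines computing the Kullback-Leibler divergence between two made-up discrete distributions, showing that it is positive and asymmetric:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])          # two hypothetical distributions
q = np.array([1/3, 1/3, 1/3])

# D_KL(p||q) = sum_i p_i log(p_i / q_i): expected log-likelihood ratio under p
kl_pq = np.sum(p * np.log(p / q))
kl_qp = np.sum(q * np.log(q / p))
```

D_KL is non-negative, vanishes only when the two distributions coincide, and is not symmetric in its arguments, so it is not a distance in the metric sense.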

Week 3: **Entropy and information - application to dimensional reduction**

*Shannon's entropy, principle of maximum entropy, mutual information, principal and independent component analysis*

Tutorial: Counting photons in a QED cavity from quantum trajectories of atoms (2)
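A minimal numerical illustration of the information-theoretic quantities of this lecture, using a made-up joint distribution of two correlated binary variables:

```python
import numpy as np

# hypothetical joint distribution p(x, y) of two correlated binary variables
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

H = lambda p: -np.sum(p * np.log2(p))  # Shannon entropy, in bits
mi = H(px) + H(py) - H(pxy)            # mutual information I(X;Y)
```

The mutual information is non-negative and vanishes exactly when the joint distribution factorizes, i.e. when the two variables are independent.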

Week 4: **Phase transition in high-dimensional settings: principal component analysis**

*Spiked covariance model, large dimensional setting & spectrum of random correlation matrices, the phase transition, when is learning retarded?*

Tutorial: Replay of neural activity during sleep following task learning
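A small simulation of the phase transition discussed in this lecture (all parameters are made up): the top eigenvalue of a pure-noise sample covariance matrix sticks to the Marchenko-Pastur edge, while a rank-one spike of strength above the threshold sqrt(gamma) detaches from the bulk:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 400                       # samples x dimensions, gamma = p/n = 0.2
X = rng.normal(size=(n, p))

lam_noise = np.linalg.eigvalsh(X.T @ X / n).max()
edge = (1 + np.sqrt(p / n)) ** 2       # Marchenko-Pastur upper edge, ~2.09 here

# plant a rank-one spike of strength beta > sqrt(gamma): the samples now have
# covariance I + beta * u u^T, and one eigenvalue pops out of the bulk
beta = 1.0
u = np.ones(p) / np.sqrt(p)
Xs = X + (np.sqrt(1 + beta) - 1) * np.outer(X @ u, u)
lam_spike = np.linalg.eigvalsh(Xs.T @ Xs / n).max()
# theory: lam_spike -> (1 + beta)(1 + gamma / beta) = 2.4, above the edge
```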

Week 5: **Phase transition in high-dimensional settings: regression**

*Linear regression, L2 prior, cross-validation, harmful and benign overfitting in high-dimensional inference*

Tutorial: Characterization of colliding supernovae from gravitational waves (1)
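A minimal sketch of ridge regression on synthetic data (ground truth and parameters are made up), showing the closed-form estimator and the shrinkage induced by the L2 prior:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])    # hypothetical ground-truth weights
y = X @ w_true + 0.1 * rng.normal(size=n)

lam = 0.1                              # strength of the L2 (ridge) prior
# ridge estimator: argmin_w ||y - Xw||^2 + lam ||w||^2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
w_ols = np.linalg.solve(X.T @ X, X.T @ y)   # lam = 0 recovers least squares
```

The regularizer biases the estimate toward zero: the ridge solution always has a smaller norm than the ordinary least-squares one, trading a little bias for variance.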

Week 6: **Priors: sparsity and beyond**

*L1 prior, conjugated priors and pseudo-counts, shrinkage, universal priors*

Tutorial: Characterization of colliding supernovae from gravitational waves (2)
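A few lines illustrating the shrinkage operator at the heart of L1-regularized inference, the soft-thresholding function (the proximal operator of the L1 prior):

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of lam * |x|: shrink toward zero, kill small values."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
```

Unlike the L2 prior, which shrinks all coefficients multiplicatively, soft thresholding sets small coefficients exactly to zero, which is why the L1 prior produces sparse solutions.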

Week 7: **Graphical models: learning many interactions**

*Boltzmann Machines (BM), Monte Carlo sampling, Convexity of log-likelihood, BM Learning, Mean-field inference, Pseudo-likelihood method*

Tutorial: Inferring structural contacts from protein sequences (1)
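A minimal sketch of the key quantity in Boltzmann machine learning, for a made-up 3-spin model small enough to enumerate exactly: the gradient of the log-likelihood with respect to a coupling J_ij is the difference between the data correlation and the model correlation computed below.

```python
import itertools
import numpy as np

J = np.array([[0.0, 0.5, -0.2],        # hypothetical symmetric couplings
              [0.5, 0.0, 0.3],
              [-0.2, 0.3, 0.0]])
states = np.array(list(itertools.product([-1, 1], repeat=3)))   # all 8 states

E = -0.5 * np.einsum('si,ij,sj->s', states, J, states)   # energy of each state
p = np.exp(-E)
p /= p.sum()                           # exact Boltzmann distribution

# model correlations <s_i s_j>; BM learning moves J_ij along
# <s_i s_j>_data - <s_i s_j>_model (for larger systems the model side is
# estimated by Monte Carlo sampling rather than enumeration)
model_corr = np.einsum('s,si,sj->ij', p, states, states)
```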

Week 8: **Unsupervised learning: representations and generation**

*Notion of representation, Autoencoders, restricted Boltzmann machines, Auto-regressive models*

Tutorial: Inferring structural contacts from protein sequences (2)

Week 9: **Supervised learning: support vector machines**

*Linear classifiers, enumeration of dichotomies, perceptron learning algorithm, Kernel methods*

Tutorial: Interpretable representations of 2D disks by auto-encoders
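A minimal version of the perceptron learning algorithm on made-up linearly separable data; a margin is enforced on the training set so that convergence is guaranteed by the perceptron convergence theorem:

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0])          # hypothetical teacher vector
X = rng.normal(size=(200, 2))
margin = X @ w_true
keep = np.abs(margin) > 0.5             # enforce a margin: convergence is
X, y = X[keep], np.sign(margin[keep])   # then guaranteed in finitely many steps

w = np.zeros(2)
for _ in range(100):                    # epochs over the training set
    mistakes = 0
    for xi, yi in zip(X, y):
        if yi * (w @ xi) <= 0:          # misclassified (or on the boundary)
            w += yi * xi                # perceptron update
            mistakes += 1
    if mistakes == 0:                   # all points correct: stop
        break
```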

Week 10: **Supervised learning: learning curves and multilayer nets**

*Statistical mechanics of one- and two-layer neural nets*

Tutorial: Classification of MNIST digits

Week 11: **Learning from streaming data**

*On-line classification, on-line PCA (Oja's rule) and sparse PCA*

Tutorial: TBA
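A minimal sketch of on-line PCA via Oja's rule, on a stream of samples drawn from a made-up 2x2 covariance matrix: the weight vector converges to the leading eigenvector without ever storing the data.

```python
import numpy as np

rng = np.random.default_rng(1)
C = np.array([[3.0, 1.0],               # hypothetical true covariance
              [1.0, 2.0]])
L = np.linalg.cholesky(C)               # to draw samples with covariance C

w = rng.normal(size=2)
w /= np.linalg.norm(w)
eta = 0.01                              # learning rate
for _ in range(20000):                  # one sample at a time, never stored
    x = L @ rng.normal(size=2)
    y = w @ x
    w += eta * y * (x - y * w)          # Oja's rule: Hebbian term + decay

top = np.linalg.eigh(C)[1][:, -1]       # leading eigenvector of C
alignment = abs(w @ top) / np.linalg.norm(w)
```

The decay term -eta * y**2 * w keeps the weight vector normalized, which is what turns the plain Hebbian rule into a principal-component extractor.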

Week 12: **Time series analysis (1): hidden Markov models**

*Markov and hidden Markov processes, Transfer matrix calculations, Viterbi algorithm, Expectation-Maximization procedure*

Tutorial: Identification of recombination events in SARS-CoV-2
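A compact implementation of the Viterbi algorithm for a made-up two-state hidden Markov model, computing the most likely hidden path behind an observed sequence by dynamic programming in log space:

```python
import numpy as np

A = np.array([[0.7, 0.3], [0.3, 0.7]])   # hypothetical transition matrix
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # emission probabilities B[k, o]
pi = np.array([0.5, 0.5])                # initial state distribution
obs = [0, 0, 1, 1, 1]                    # observed symbols

T, K = len(obs), len(pi)
logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
delta = np.zeros((T, K))                 # best log-prob of a path ending in k
psi = np.zeros((T, K), dtype=int)        # back-pointers
delta[0] = logpi + logB[:, obs[0]]
for t in range(1, T):
    scores = delta[t - 1][:, None] + logA
    psi[t] = scores.argmax(axis=0)
    delta[t] = scores.max(axis=0) + logB[:, obs[t]]

# backtrack the most likely hidden path
path = [int(delta[-1].argmax())]
for t in range(T - 1, 0, -1):
    path.append(int(psi[t][path[-1]]))
path.reverse()
```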

Week 13: **Time series analysis (2): recurrent neural nets**

*Universal approximation theorem, low-rank recurrent nets: justification and analysis, some applications*

Tutorial: TBA

Prerequisites: a basic level in statistical physics. The programming language we use is Python 3, but no previous programming experience is required.

From a practical point of view, make sure your computer is properly set up for the course. Here are brief instructions for what you should do:

1. Install Anaconda, following the instructions here: https://www.anaconda.com/download

2. Then install PyTorch. To do this, open a terminal and run:

```
conda install pytorch torchvision torchaudio cpuonly -c pytorch
```

Make sure you are able to load all packages. To test this, you can start Python, and run:

```
>>> import torch, numpy, scipy, matplotlib
```

This should not produce any errors.

Homework at the end of October

Written exam, at the beginning of January, covering both the theoretical part of the course and the practical (computational and statistical inference) aspects.