Data Sciences Track

The proliferation of data management systems, together with the considerable progress made over the past decade in computing power, has contributed to the creation of a new discipline at the intersection of computer science and applied mathematics: data science.

The proposed program aims to teach new scientific methods for analyzing, processing and interpreting "big data" while meeting the challenges of innovation. To that end, students will develop their skills and know-how in disciplines such as machine learning, large-scale optimization and distributed computing.

Track objective :

The main objective is to develop mathematical models and their computational solutions, capable of analyzing and interpreting massive quantities of data.

Exploiting, understanding and interpreting these data allows companies to innovate by developing new content, products and services. Google, Facebook and Amazon rely on these technologies to develop their offerings efficiently and in step with market changes.

The three major challenges of data science :

  • The quantity of data, which are often collected continuously and yield a colossal mass of information.
  • The great diversity and heterogeneity of the information, which leads to high-dimensional problems.
  • The nature of rare or critical events, which must be detected even though they occur infrequently and non-uniformly.

This program aims to introduce new scientific methods for processing, interpreting and understanding "big data" while addressing the challenges listed above. To this end, machine learning, large-scale optimization and distributed computing are the core disciplines of scientific training in the era of digital innovation.

Student testimonials :

Afshine and Shervine Amidi

"We are constantly pushed to seek excellence in what we undertake."

Rudy Bunel

"This track really launched me into data science."

Jules L

"It all started with algorithmic trading."

Clément Nicolle

"Motivated by applications in health, travel and culture."

Ian Cherabier

"I really felt my skills in mathematics and data science improve."

Teaching staff :

  • Nikos Paragios - Full professor at the department of applied mathematics of Ecole Centrale de Paris and affiliated research scientist at Inria
  • Matthew Blaschko - Affiliated associate professor at the department of applied mathematics of Ecole Centrale de Paris.
  • Frédéric Cazals - Professor of applied mathematics
  • Lionel Gabet - Professor at the department of applied mathematics of Ecole Centrale de Paris
  • Iasonas Kokkinos - Associate professor at the department of applied mathematics of Ecole Centrale de Paris and affiliated research scientist at Inria.
  • Pawan Kumar - Associate professor at the department of applied mathematics of Ecole Centrale de Paris and affiliated research scientist at Inria.
  • Steve Oudot - Permanent research scientist at Inria and affiliated adjunct professor at the department of computer science at Ecole Polytechnique and at the department of applied mathematics of Ecole Centrale de Paris.
  • Jean-Christophe Pesquet - Full professor at the department of computer science of the University of Paris-East and affiliated adjunct professor at the department of applied mathematics of Ecole Centrale de Paris.
  • Émilie Chouzenoux - Assistant Professor with the University of Paris-East, Champs-sur-Marne, France (LIGM, UMR CNRS 8049).

Course offering :

The Data Sciences track offered by Ecole Centrale Paris starts in the 2nd year of the engineering curriculum. It is attached to the Applied Mathematics option and is built around the following courses :

Foundations of Deep Learning :

The advent of big data and powerful computers has made deep learning algorithms the current method of choice for a host of machine learning problems. Over the last few years, deep learning systems have been beating the previous state-of-the-art systems by a large margin in tasks as diverse as speech recognition, image classification, and object detection.
Deep architectures are composed of multiple levels of non-linear operations, such as in neural nets with many hidden layers. Searching the parameter space of deep architectures is a difficult task, but learning algorithms such as those for Deep Belief Networks have recently been proposed to tackle this problem with notable success. This course will discuss the motivations and principles regarding learning algorithms for deep architectures, starting from the unsupervised learning of single-layer models such as Restricted Boltzmann Machines, and moving on to learning deeper models such as Deep Belief Networks.
The course consists of eight lectures (3h each) and a lab section.

Foundations of Machine Learning (M1) :

Machine learning is essentially the intersection between statistics and computation, though its principles have been rediscovered from many different traditions, including artificial intelligence, Bayesian statistics, and frequentist statistics. This course gives an overview of the most important trends in machine learning, with a particular focus on statistical risk and its minimization with respect to a prediction function. A substantial lab section involves group projects on data science competitions and gives students the ability to apply the course theory to real-world problems.
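As a rough illustration of minimizing risk with respect to a prediction function, the sketch below fits a linear predictor by gradient descent on the empirical squared-loss risk. The function names, the toy sample, the learning rate and the choice of squared loss are all assumptions made for this example; the course description does not specify them.

```python
def empirical_risk(w, b, data):
    """Mean squared loss of the predictor f(x) = w * x + b on the sample."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def minimize_risk(data, lr=0.05, steps=2000):
    """Plain gradient descent on the empirical risk over the parameters (w, b)."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        # Gradients of the mean squared loss with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Hypothetical toy sample lying exactly on the line y = 2x + 1.
sample = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w_hat, b_hat = minimize_risk(sample)
```

On this noiseless sample the minimizer of the empirical risk is the generating line itself, so `w_hat` and `b_hat` should end up close to 2 and 1.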

Foundations of Signal Processing & Sparse Coding (M1) :

This class will introduce the mathematical concepts and techniques to achieve a solid understanding of the fundamental principles of linear signal processing, as well as recent research on nonlinear signal processing, with a focus on sparse coding. Starting with the fundamentals of linear signal processing, we will see how the main notions of Fourier transforms can be understood in terms of a change of basis, and use this intuition to present both continuous- and discrete-time signal processing. Moving on from the harmonic basis we will then cover the basics of overcomplete bases, time-frequency analysis and wavelets. This will lead us to techniques developed around sparse coding with overcomplete dictionaries, involving optimization with sparsity-inducing norms and dictionary learning.
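The change-of-basis view of the Fourier transform can be sketched in a few lines for the discrete case. The example below builds the orthonormal harmonic basis explicitly, computes the coordinates of a signal in that basis (the unitary discrete Fourier transform), and rebuilds the signal from them. The helper names and the sample signal are assumptions chosen for illustration, not material from the course.

```python
import cmath
import math


def dft_basis(n):
    """Return the n orthonormal harmonic vectors e_k[t] = exp(2*pi*i*k*t/n) / sqrt(n)."""
    return [
        [cmath.exp(2j * math.pi * k * t / n) / math.sqrt(n) for t in range(n)]
        for k in range(n)
    ]


def inner(u, v):
    """Complex inner product <u, v> = sum over t of u[t] * conj(v[t])."""
    return sum(a * b.conjugate() for a, b in zip(u, v))


def analyze(signal, basis):
    """Coordinates of the signal in the harmonic basis (unitary DFT)."""
    return [inner(signal, e) for e in basis]


def synthesize(coeffs, basis):
    """Rebuild the signal as a linear combination of the basis vectors."""
    n = len(basis)
    return [sum(c * e[t] for c, e in zip(coeffs, basis)) for t in range(n)]


# Hypothetical four-sample signal.
signal = [1.0, 2.0, 0.0, -1.0]
basis = dft_basis(len(signal))
coeffs = analyze(signal, basis)
rebuilt = synthesize(coeffs, basis)
```

Because the basis is orthonormal, analysis followed by synthesis returns the original signal, and the energy of the coefficients equals the energy of the signal (Parseval's identity).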

Foundations of Discrete Optimization (M1) :

Discrete optimization is concerned with the subset of optimization problems where some or all of the variables are confined to take a value from a discrete set. In this course, we will study the fundamental concepts of discrete optimization such as greedy algorithms, dynamic programming and min-max relationships. Each concept will be illustrated using well-known problems such as shortest paths, minimum spanning tree, min-cut, max-flow and bipartite matching. We will also identify which problems are easy and which problems are hard, and briefly discuss how to obtain an approximate solution to hard problems.
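To make the link between greedy algorithms and shortest paths concrete, here is a minimal sketch of Dijkstra's algorithm, a standard greedy method for graphs with non-negative edge weights. The course description does not say which algorithms are taught, so this particular choice, the function name and the toy graph are assumptions.

```python
import heapq


def shortest_paths(graph, source):
    """Dijkstra's algorithm: greedily settle the closest unsettled vertex.

    `graph` maps each vertex to a list of (neighbor, non-negative weight) pairs.
    Returns a dict of shortest distances from `source` to every reachable vertex.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry: u was already settled with a shorter distance
        for v, w in graph.get(u, []):
            candidate = d + w
            if candidate < dist.get(v, float("inf")):
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist


# Hypothetical directed graph with four vertices.
graph = {
    "a": [("b", 4), ("c", 1)],
    "b": [("d", 1)],
    "c": [("b", 2), ("d", 7)],
    "d": [],
}
distances = shortest_paths(graph, "a")
```

In this graph the direct edge from `a` to `b` costs 4, but the detour through `c` costs 3, so the greedy choice of settling `c` first leads to the shorter route.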

Foundations of Neural Information Processing :

Neural information processing is the study of computational systems for data understanding. It covers a range of techniques including statistical learning theory, information theory, graphical models, and non-linear and discrete optimization, as well as their application to important prediction problems facing science and industry. Summarizing some of the major results of the machine learning research community of the past few decades, as well as their interrelationships, this course covers fundamental techniques that can be applied to a wide variety of real-world problems.

Foundations of Geometric Methods in Data Analysis :

Data analysis is the process of cleaning, transforming, modeling or comparing data, in order to infer useful information and gain insights into complex phenomena. From a geometric perspective, when an instance (a physical phenomenon, an individual, etc.) is given as a fixed-size collection of real-valued observations, it is naturally identified with a geometric point having these observations as coordinates. This course reviews fundamental constructions related to the manipulation of such point clouds, mixing ideas from computational geometry and topology, statistics, and machine learning. The emphasis is on methods that not only come with theoretical guarantees, but also work well in practice. In particular, software references and example datasets will be provided to illustrate the constructions.

Foundations of Polyhedral Combinatorial Optimization :

Polyhedral techniques have emerged as one of the most powerful tools to analyse and solve combinatorial optimization problems. Broadly speaking, combinatorial optimization problems can be formulated as integer linear programs. In this course, we will study the fundamental concepts of polyhedral techniques such as totally unimodular matrices, matroids and submodular functions. Each concept will be illustrated using well-known problems such as bipartite matching, min-cut, max-flow and minimum spanning tree. The course is divided into two parts. In the first part, we will study easy problems (those that admit efficient optimal algorithms). We will use polyhedral techniques to explain why these problems are easy. In the second part, we will study hard problems (specifically, NP-hard problems). We will use polyhedral techniques to obtain provably accurate approximate solutions for various hard problems.
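One way to get a feel for total unimodularity is to test it by brute force on tiny matrices: a matrix is totally unimodular when every square submatrix has determinant -1, 0 or 1. The sketch below does exactly that, which is exponential and only meant for illustration. The function names and the two example matrices are assumptions; the incidence matrix of a bipartite graph is a classical totally unimodular example, while the incidence matrix of a triangle is not.

```python
from itertools import combinations


def det(m):
    """Integer determinant by Laplace expansion along the first row (tiny matrices only)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, entry in enumerate(m[0]):
        if entry == 0:
            continue
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * entry * det(minor)
    return total


def is_totally_unimodular(a):
    """Check that every square submatrix of `a` has determinant in {-1, 0, 1}."""
    rows, cols = len(a), len(a[0])
    for k in range(1, min(rows, cols) + 1):
        for r in combinations(range(rows), k):
            for c in combinations(range(cols), k):
                sub = [[a[i][j] for j in c] for i in r]
                if det(sub) not in (-1, 0, 1):
                    return False
    return True


# Vertex-edge incidence matrix of the complete bipartite graph K_{2,2}.
# Rows: u1, u2, v1, v2. Columns: edges u1v1, u1v2, u2v1, u2v2.
bipartite = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]

# Vertex-edge incidence matrix of a triangle (an odd cycle).
triangle = [
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
]

bipartite_is_tu = is_totally_unimodular(bipartite)
triangle_is_tu = is_totally_unimodular(triangle)
```

When the constraint matrix of a linear program is totally unimodular and the right-hand side is integral, the vertices of the feasible polyhedron are integral, which is one standard explanation of why problems such as bipartite matching can be solved exactly through linear programming.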

Foundations of Large Scale & Distributed Optimization :

In a wide range of application fields (inverse problems, machine learning, computer vision, data analysis, networking...), large scale optimization problems need to be solved. The objective of this course is to introduce the theoretical background which makes it possible to develop efficient algorithms to successfully address these problems by taking advantage of modern multicore or distributed computing architectures. This course will be mainly focused on nonlinear optimization tools for dealing with convex problems. Proximal tools, splitting techniques and Majorization-Minimization strategies, which are now very popular for processing massive datasets, will be presented. Illustrations of these methods on various application examples will be provided.
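As a small, serial illustration of a proximal method, the sketch below runs the proximal gradient iteration (often called ISTA) on an l1-regularized least-squares problem. The problem instance, the function names and the step size are assumptions made for the example, and nothing here is distributed or tuned for large scale; it only shows the gradient step followed by the proximal (soft-thresholding) step.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ista(a, b, lam, step, iters=200):
    """Proximal gradient for 0.5 * ||a x - b||^2 + lam * ||x||_1.

    `a` is a list of rows, `b` a list of targets. `step` should be at most
    1 / L, where L is the largest eigenvalue of a^T a.
    """
    m, n = len(a), len(a[0])
    x = [0.0] * n
    for _ in range(iters):
        # Gradient of the smooth term: a^T (a x - b).
        residual = [sum(a[i][j] * x[j] for j in range(n)) - b[i] for i in range(m)]
        grad = [sum(a[i][j] * residual[i] for i in range(m)) for j in range(n)]
        # Gradient step on the smooth term, then proximal step on the l1 term.
        x = [soft_threshold(x[j] - step * grad[j], step * lam) for j in range(n)]
    return x


# Hypothetical instance with an identity design matrix, so that the exact
# minimizer is simply the soft-thresholding of b.
a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b = [3.0, -0.2, 1.5]
x_hat = ista(a, b, lam=0.5, step=1.0)
```

With an identity design matrix the solution is known in closed form, which makes the result easy to check: entries of `b` larger than `lam` in magnitude are shrunk by `lam`, and the small entry is set exactly to zero, illustrating how the l1 term induces sparsity.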

Career prospects :

Data science has a very wide range of application areas :

  • digital sector
  • health and biotechnology
  • finance
  • marketing
  • robotics
  • insurance

Contact :

Guillemette Breysse

Coordinator