2021-2022 / INFO8002-1

Large-scale data systems

Duration

25 h theory, 10 h practicals, 45 h project

Number of credits

 Master in data science, with a focus: 5 credits
 Master in computer science and engineering, with a focus: 5 credits
 Master in computer science and engineering, with a focus (double degree with HEC): 5 credits
 Master in data science and engineering, with a focus: 5 credits
 Master in computer science, with a focus: 5 credits
 Master in computer science, with a focus (double degree with HEC): 5 credits

Lecturer

Gilles Louppe

Language(s) of instruction

English

Organisation and examination

Teaching in the first semester, exam in January

Schedule

Online schedule

Prerequisite and corequisite units

Prerequisite or corequisite units are presented within each programme

Course contents

In the modern data landscape, large-scale data systems have become a critical component of the data science analysis pipeline. They are essential for both the reliable storage and the analysis of the ever-larger volumes of data encountered in web-based applications, cloud computing centres and networks of connected objects.
However, large-scale data systems remain notoriously difficult to build: they need to scale to hundreds or thousands of machines, tolerate crashes, cope with concurrent execution and ensure the consistency of the data they store.
In this context, the course will cover elements of systems for data science in a bottom-up fashion. We will first cover the foundational abstractions at the core of distributed systems, including basic abstractions and system assumptions, reliable broadcast, shared memory and consensus. We will then study data computing systems that are built on top of those components, including MapReduce and computational graph systems (Spark). Similarly, we will study distributed storage systems, including distributed file systems, distributed key-value stores and blockchains.
Topics to be covered (tentative and subject to change):

  • Data deluge
  • Basic distributed abstractions
  • Reliable broadcast
  • Shared memory
  • Consensus
  • Blockchain
  • Distributed hash tables
  • Cloud computing
  • Distributed file systems

Learning outcomes of the course

At the end of the course, the student will have understood the core building blocks of reliable distributed systems. He/she will also have become acquainted with industrial data systems and their inner workings. Finally, he/she will have developed critical thinking regarding the benefits and limitations of these systems in the context of data science needs.

This course contributes to learning outcomes I.1, I.2, I.3, II.1, II.2, III.1, III.2, III.3, IV.1, VI.1, VI.2, VII.1, VII.2, VII.4, VII.5 of the programme in data science and engineering.


This course contributes to learning outcomes I.1, I.2, II.1, II.2, III.1, III.2, III.3, IV.1, IV.3, VI.1, VI.2, VII.1, VII.2, VII.4, VII.5 of the programme in computer science and engineering.

Prerequisite knowledge and skills

Programming experience. Basic knowledge of computer networks.

Planned learning activities and teaching methods

  • Theoretical lectures
  • Exercise sessions
  • Reading assignment
  • Programming project (e.g., implement a simple data system)

Mode of delivery (face-to-face, distance learning, hybrid)

Lectures will be taught face-to-face. Projects will be carried out remotely.

Recommended or required readings and course notes

Slides will be made publicly available on GitHub during the semester.
Part of the course will be based on "Introduction to Reliable and Secure Distributed Programming" by Christian Cachin, Rachid Guerraoui and Luis Rodrigues (Springer). This book is recommended.

Assessment methods and criteria

The evaluation is divided into the following units:

  • Oral exam (50%)
  • Reading assignment (10%)
  • Programming project (40%)
Both the reading assignment and the programming project must be completed in order to sit the exam.

Internship(s)

Organisational remarks

The website for the course is https://github.com/glouppe/info8002-large-scale-database-systems

Contacts

  • Teacher: Prof. Gilles Louppe (g.louppe@uliege.be)
  • Assistant: Joeri Hermans (joeri.hermans@uliege.be)