2021-2022 / INFO2049-1

Web and Text Analytics


30h Th

Number of credits

 Master of Science (MSc) in Data Science: 5 credits
 Master of Science (MSc) in Computer Science and Engineering: 5 credits
 Master of Science (MSc) in Computer Science and Engineering (double degree with HEC): 5 credits
 Master of Science (MSc) in Data Science and Engineering: 5 credits
 Master of Science (MSc) in Computer Science: 5 credits
 Master of Science (MSc) in Computer Science (double degree with HEC): 5 credits
 Master in business engineering (120 ECTS): 5 credits
 Master in business engineering (120 ECTS) (Digital Business): 5 credits
 Master in linguistics (120 ECTS) (double degree): 5 credits


Ashwin Ittoo

Language(s) of instruction

English language

Organisation and examination

Teaching in the first semester, review in January


Schedule online

Prerequisite and corequisite units

Prerequisite or corequisite units are presented within each program

Learning unit contents

This course covers basic and advanced algorithms and techniques in natural language processing (NLP) and text mining/analytics. The algorithms covered have a machine learning underpinning.
The topics include general machine learning methods (e.g. decision trees, Bayesian learning, Markov chains, neural networks) as well as methods specific to NLP (e.g. language modelling, neural language models).
Deep learning techniques for NLP will also be covered, including RNNs, Seq2Seq, Transformer models and RoBERTa (from Facebook Research).
Besides the theoretical aspects, the use of these techniques in applications such as machine translation and chatbots will be discussed.
This course has strong theoretical and practical components. The practicals will be done mostly in Python (PyTorch, scikit-learn). Participants may use other languages with which they feel more comfortable (e.g. C, C++, C#, Java).
Students will also be asked to present recent scientific articles on deep learning and deep learning for NLP.

Take this course only if you have very good mathematical and programming skills.

Course Structure
1. Introduction
  • Introduction to machine learning
  • Decision tree classifiers
2. Vector Space Model and Information Retrieval
  • Representing words as vectors
  • Measuring text similarity (Levenshtein distance, cosine similarity)
3. Feature Selection
  • Tf-idf (Term Frequency-Inverse Document Frequency)
  • Chi-squared measure
  • Mutual information
4. Naïve Bayes for Text Classification
  • Bayesian theory revision
  • Multinomial vs. Bernoulli Naïve Bayes
  • Parameter estimation
5. Evaluating Models
  • Bootstrapping, cross-validation
  • Metrics: precision, recall, F-score
  • Metrics (Machine Translation): BLEU
  • Spearman rank correlation, Wilcoxon test (if time permits)
6. Language Models
  • Markov models
  • n-gram (trigram) models
  • Parameter estimation
  • Perplexity metric
  • Discounting methods and Katz back-off
7. Revision: Artificial Neural Networks (ANN)
  • ANN architecture
  • Activation functions
  • Stochastic Gradient Descent
8. Neural Network Language Models
  • Distributional semantics and distributed word representation
  • Word2Vec (see Distributed Word Representation, Mikolov et al., 2013)
9. Deep Learning for NLP
  • Introducing Recurrent Networks for NLP
10. Machine Translation (Statistical & Deep Learning)
  • Methods for Statistical Machine Translation
  • Seq2Seq Model for Machine Translation
  • Evaluation
11. Language Models (part ii)
  • Transformer Model
  • RoBERTa Model
12. GANs & Chatbots (if time permits)
  • Overview: GANs for language generation

Practical Sessions
Course participants are expected to work on a practical project, which will count for ~30-40% of the final grade. These projects will be comprehensive in the sense that they will encompass many of the different aspects taught in the lectures/practicals. Sample project topics include text classification, opinion mining, machine translation and language generation. Projects will be carried out in groups of 3 students.

Students are expected to have programming skills and implement the projects on their own.

Learning outcomes of the learning unit

  • Understand the underlying principles and algebraic formulations of machine learning models
  • Apply these models to information extraction from text and to text classification
  • Synthesize the principles and algorithms introduced in the course to develop a full-fledged text analytics application (as part of the course project)
  • Implement text analytics solutions to support an organization's business intelligence activities
  • Formulate a strategy based on the acquired text analytics skills to optimize the value of an organization
  • Research and understand advanced topics in the field, and stay informed of recent developments in order to adapt easily to changing requirements
  • Appreciate how the algorithms studied could solve real-life managerial issues
  • Communicate appropriately about text analytics projects/applications to various stakeholders

Prerequisite knowledge and skills

Students are expected to have reasonable maths/stats and programming skills. Appropriate guidance and support will be offered to students:

  • Lecture notes
  • Online references
  • Consultation (if time permits)

Planned learning activities and teaching methods

The course carries 5 credits and therefore requires 150 hours of work (1 credit = 30 hours).

  • Theory lectures = 18-22 hours
  • Self-study for exam = approx. 70 hours
  • Practical lectures = 9-12 hours
  • Working on practical exercises and projects = approx. 80 hours
  • Total = 150 hours (5 credits)

Mode of delivery (face to face, distance learning, hybrid learning)

  • Lectures 
  • Practicals (during lectures and as homework)

Recommended or required readings

Topics covered in the course will come from different textbooks:
Deep Learning (2016) by Goodfellow, Bengio and Courville (soft copy available)
Speech and Language Processing (2017) by Jurafsky and Martin (soft copy available)
Neural Network Methods for Natural Language Processing (2017) by Goldberg (soft copy available)

Assessment methods and criteria

Exam(s) in session

Any session

- In-person

written exam (open-ended questions)

- Remote

oral exam

Written work / report

Continuous assessment

Additional information:

Final written exam: 50%
Final practical project: 35%
Paper presentation: 15%
(May be adjusted during the course)

Work placement(s)

Organizational remarks


Ashwin Ittoo, ashwin.ittoo@uliege.be

Items online

Lecture Notes