Duration
30h Th
Number of credits
Lecturer
Language(s) of instruction
English language
Organisation and examination
Teaching in the first semester, review in January
Schedule
Units courses prerequisite and corequisite
Prerequisite or corequisite units are presented within each program
Learning unit contents
Objective
This course covers basic and advanced algorithms and techniques in natural language processing (NLP) and text mining/analytics. The algorithms covered are grounded in machine learning.
The topics include general machine learning methods (e.g. decision trees, Bayesian learning, Markov chains, neural networks) as well as methods specific to NLP (e.g. language modelling, neural language models).
Deep learning techniques for NLP will also be covered, including RNNs, Seq2Seq, Transformer models and RoBERTa (from Facebook Research).
This course has strong theoretical and practical components. The practicals will be done mostly using R and scikit-learn (Python). Participants can use other languages with which they feel more comfortable (e.g. C, C++, C#, Java).
Students will also be asked to read and discuss recent scientific articles in Deep Learning & Deep Learning for NLP.
Take this course only if you have very good mathematical and programming skills.
Course Structure
1. Introduction
- Introduction to machine learning
- Decision tree classifiers
- Representing words as vectors
- Measuring text similarity (Levenshtein distance, Cosine Similarity)
- Tf-idf (Term Frequency-Inverse Document Frequency)
- Chi-squared measure
- Mutual information
- Bayesian theory revision
- Multinomial vs. Bernoulli Naïve Bayes
- Parameter estimation
- Bootstrapping, cross-validation
- Metrics: precision, recall, F-score
- Metrics (Machine Translation): BLEU
- Spearman rank correlation, Wilcoxon test (if time permits)
- Markov models
- n-gram (tri-gram) models
- Parameter estimation
- Perplexity metric
- Discounting methods and Katz Back-off
- ANN architecture
- Activation functions
- Stochastic Gradient Descent
- Distributational semantics and distributed word representation
- Word2Vec (see Distributed Word Representation, Mikolov et al., 2013)
- Introducing Recurrent Networks for NLP
- Methods for Statistical Machine Translation
- Seq2Seq Model for Machine Translation
- Evaluation
- Transformer Model
- RoBERTa Model
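As an illustration of the text-similarity topics listed above (Levenshtein distance, tf-idf, cosine similarity), here is a minimal sketch in plain Python. It is not taken from the course material: the function names, the whitespace tokenizer, the unsmoothed weighting tf * log(N / df) and the three example sentences are all assumptions chosen for brevity. Libraries such as scikit-learn use slightly different (smoothed) tf-idf formulas.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document (assumed scheme: tf * log(N / df))."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stock markets fell sharply",
]
vecs = tfidf_vectors(docs)
print(levenshtein("kitten", "sitting"))  # 3
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

The first two sentences share several terms and therefore have a positive cosine similarity, while the third shares none with the first and scores exactly zero.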
Students are expected to have programming skills and implement the projects on their own.
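To give a rough idea of the language-modelling topics in the outline (n-gram models, parameter estimation, perplexity), the following is a minimal sketch of a bigram model. It is an illustration under stated assumptions, not the course's reference implementation: it uses add-one (Laplace) smoothing as a simpler stand-in for the discounting and Katz back-off methods named in the outline, it does not handle out-of-vocabulary words, and all function names and example sentences are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their histories over whitespace-tokenized sentences."""
    histories, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        histories.update(tokens[:-1])            # every token that precedes another
        bigrams.update(zip(tokens, tokens[1:]))
    # Words the model can predict: all training words plus the end marker.
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return histories, bigrams, vocab


def bigram_prob(prev, word, histories, bigrams, vocab):
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (histories[prev] + len(vocab))


def perplexity(sentences, histories, bigrams, vocab):
    """exp of the average negative log-probability per predicted token."""
    log_prob, n = 0.0, 0
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            log_prob += math.log(bigram_prob(prev, word, histories, bigrams, vocab))
            n += 1
    return math.exp(-log_prob / n)


# Hypothetical toy training data, for illustration only.
model = train_bigram(["the cat sat", "the dog sat"])
seen_order = perplexity(["the cat sat"], *model)
odd_order = perplexity(["sat cat the"], *model)
print(seen_order < odd_order)  # True
```

A word order seen in training receives a lower perplexity than a scrambled one, which is the behaviour the perplexity metric is meant to capture.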
Learning outcomes of the learning unit
- Understand the underlying principles and algebraic formulations of machine learning models
- Apply these models to the tasks of information extraction from text and text classification
- Synthesize the principles and algorithms introduced in the course to develop a full-fledged text analytics application (as part of the course project)
- Implement text analytics solutions to support an organization's business intelligence activities
- Formulate a strategy based on the acquired text analytics skills to optimize the value of an organization
- Research and understand advanced topics in the field, and keep informed of recent developments in order to adapt easily to changing requirements
- Appreciate how the algorithms studied could solve real-life managerial issues
- Communicate appropriately about text analytics projects/applications to various stakeholders
Prerequisite knowledge and skills
It is very important for course participants to have a very good background in:
- Calculus (e.g. partial derivatives, chain rule)
- Vector/matrix algebra
- Statistical methods (e.g. probabilities, regressions)
- Programming
- Some knowledge of mathematical optimization
Support will be offered to students
- Lecture notes
- Online references
Planned learning activities and teaching methods
The course carries 5 credits and therefore requires 150 hours of work (1 credit = 30 hours).
- Theory lectures = 18-22 hours
- Self-study for exam = approx. 70 hours
- Practical lectures = 9-12 hours
- Working on practical exercises and projects = approx. 80 hours
- Total = 150 hours (5 credits)
Mode of delivery (face-to-face; distance-learning)
- Lectures
- Practical (during lectures and as homework)
Recommended or required readings
Topics covered in the course will come from different textbooks:
Deep Learning Textbook (2016) by Goodfellow, Bengio and Courville (soft copy available)
Speech and Language Processing (2017) by Jurafsky and Martin (soft copy available)
Neural Network Methods for Natural Language Processing (2017) by Goldberg (soft copy available)
Assessment methods and criteria
Final written exam: 70%
Final practical project: 30%
(May be adjusted during the course)
Work placement(s)
Organizational remarks
Contacts
Ashwin Ittoo, ashwin.ittoo@uliege.be
Adaptation of teaching commitments following the COVID-19 pandemic for the May-June 2020 session
Teaching methods implemented: distance-learning
Assessment subjects
Assessment methods
Contacts
Adaptation of teaching commitments following the COVID-19 pandemic for the Aug-Sept 2020 session
Assessment subjects
Same as 1st session.
Assessment methods
Distance/Online.
Details will be communicated to the students concerned well before the exams.
Contacts
ashwin.ittoo@uliege.be
Items online
Lecture Notes