Study Programmes 2016-2017
INFO2049-1  
Web and Text Analytics
Duration :
30h Th
Number of credits :
Master in computer science and engineering (120 ECTS): 5 credits
Master in computer science (120 ECTS): 5 credits
Master in management (120 ECTS): 5 credits
Master in business engineering (120 ECTS): 5 credits
Lecturer :
Ashwin Ittoo
Language(s) of instruction :
English language
Organisation and examination :
Teaching in the first semester, review in January
Units courses prerequisite and corequisite :
Prerequisite or corequisite units are presented within each program
Learning unit contents :
In recent years, we have witnessed the proliferation of data in text format. 
The challenge now lies in automatically processing these huge text collections in order to detect and extract meaningful information from their contents. For example, business organizations can extract user opinions from online reviews to improve product or service quality.
Objective
This course will therefore introduce students to key techniques/algorithms in text/data mining and machine learning. Some topics of natural language processing will also be covered.
This course has strong theoretical and practical components. The practicals will be done mostly using R and scikit-learn (Python).
Structure
1. Vector Space Model and Information Retrieval 
  • Vector space model
  • Measuring text similarity (Levenshtein distance, Cosine Similarity)
  • Information Retrieval (inverted index, ranked and boolean searches)
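As an illustration of the topics in this chapter (not taken from the course material), the following minimal Python sketch computes the Levenshtein distance between two strings, the cosine similarity between two documents represented as raw term-count vectors, and a boolean AND query over an inverted index. The function names and the whitespace tokenization are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine of the angle between the term-count vectors of two documents."""
    vec_a = Counter(doc_a.lower().split())
    vec_b = Counter(doc_b.lower().split())
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document has no direction
    return dot / (norm_a * norm_b)


def build_inverted_index(documents):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, terms):
    """Ids of the documents that contain every query term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()
```

A ranked search could be sketched on top of this by sorting the documents returned by the boolean query by their cosine similarity to the query string.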
 
2. Basic Text Pre-processing 
  • Bag-of-words models
  • Stemming
  • Lemmatization
  • Part-of-speech tagging
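As a rough illustration of the bag-of-words model (an example added here, not part of the course material), the sketch below turns a small collection of documents into a vocabulary and a document-term count matrix. The regular-expression tokenizer is an assumption; stemming, lemmatization and part-of-speech tagging would normally rely on a dedicated NLP library and are not shown.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(documents):
    """Return the sorted vocabulary and one row of term counts per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({token for doc in tokenized for token in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix
```

Word order is discarded in this representation: each row only records how often each vocabulary term occurs in the document.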
 
3. Feature Selection 
  • Tf-idf (Term Frequency-Inverse Document Frequency) 
  • Chi-squared measure
  • Mutual information
  • Notion of n-grams
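To make two of these notions concrete, here is a minimal sketch (an illustration, not the course's own code) of n-gram extraction and of one common tf-idf variant, in which the term frequency is the relative frequency of a term in a document and the inverse document frequency is log(N / df). Other variants exist (for example with smoothing), so the exact weights depend on the chosen definition.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous sequences of n tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(tokenized_docs):
    """One {term: weight} dictionary per document, with weight = tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    document_frequency = Counter()
    for doc in tokenized_docs:
        document_frequency.update(set(doc))  # count each term once per document
    weights = []
    for doc in tokenized_docs:
        term_counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / document_frequency[term])
            for term, count in term_counts.items()
        })
    return weights
```

With this definition, a term that occurs in every document receives a weight of zero, which is why tf-idf is often used to filter out uninformative terms.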
 
4. Decision Trees for Text Classification
  • Decision trees revision 
  • Information Gain and Entropy for constructing trees
  • Overfitting and other issues
  • Pruning
  • Application for text classification
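As an illustration of how entropy and information gain guide tree construction, the following minimal sketch (an added example, with hypothetical function names) computes the entropy of a list of class labels and the information gain obtained by splitting on a discrete feature, such as the presence or absence of a word in a document.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of the class distribution in labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, labels):
    """Entropy reduction obtained by partitioning labels on a discrete feature."""
    total = len(labels)
    partitions = {}
    for value, label in zip(feature_values, labels):
        partitions.setdefault(value, []).append(label)
    # Weighted average entropy of the partitions created by the split.
    remainder = sum(
        (len(part) / total) * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - remainder
```

A tree-building procedure of this kind would pick, at each node, the feature with the highest information gain.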
 
5. Naïve-Bayes for Text Classification
  • Bayesian theory revision
  • Multinomial vs. Bernoulli Naïve-Bayes
  • Naïve assumptions of Naïve-Bayes
  • Parameter estimation
  • Strengths and weaknesses
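The multinomial variant can be sketched in a few lines. The example below is an illustration under stated assumptions (add-one smoothing, test tokens outside the training vocabulary are ignored, hypothetical class name), not the course's reference implementation.

```python
import math
from collections import Counter, defaultdict


class MultinomialNaiveBayes:
    """Multinomial Naive Bayes text classifier with add-one (Laplace) smoothing."""

    def fit(self, tokenized_docs, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        for tokens, label in zip(tokenized_docs, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {
            token for counts in self.word_counts.values() for token in counts
        }
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            # Log prior plus the sum of log likelihoods (naive independence).
            score = math.log(n_label / n_docs)
            denominator = sum(self.word_counts[label].values()) + len(self.vocabulary)
            for token in tokens:
                if token not in self.vocabulary:
                    continue  # assumption: unseen tokens are skipped
                score += math.log((self.word_counts[label][token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Working with log probabilities avoids numerical underflow when many small likelihoods are multiplied together.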
 
6. Support Vector Machines (SVM) for Text Classification
  • Linear classifiers
  • Notion of vectors, planes, margins and support vectors
  • Algebraic formulation
  • Non-linear cases and kernel transformations
  • Strengths and weaknesses
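For reference, the standard soft-margin formulation of a linear SVM can be written as follows. This is the usual textbook form, given here as an illustration; the notation (training pairs $(\mathbf{x}_i, y_i)$ with $y_i \in \{-1, +1\}$, slack variables $\xi_i$, penalty parameter $C$, kernel $K$ with feature map $\phi$, dual coefficients $\alpha_i$) is an assumption and may differ from the one used in the lectures.

```latex
% Soft-margin primal problem: maximize the margin 2 / \lVert \mathbf{w} \rVert
% while penalizing margin violations through the slack variables \xi_i.
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \quad
  \frac{1}{2} \lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\qquad \text{subject to} \qquad
  y_i \left( \mathbf{w}^{\top} \mathbf{x}_i + b \right) \ge 1 - \xi_i,
  \quad \xi_i \ge 0, \quad i = 1, \dots, n

% Kernelized decision function for the non-linear case.
f(\mathbf{x}) = \operatorname{sign}\!\left(
  \sum_{i=1}^{n} \alpha_i \, y_i \, K(\mathbf{x}_i, \mathbf{x}) + b \right),
\qquad
K(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^{\top} \phi(\mathbf{x}')
```

Only the training points with $\alpha_i > 0$ (the support vectors) contribute to the decision function.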
 
7. Language Models
  • Markov models
  • n-gram (tri-gram) models
  • Parameter estimation
  • Perplexity metric
  • Discounting methods and Katz Back-off 
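The sketch below illustrates n-gram parameter estimation and the perplexity metric with a bigram model. To keep the example short it uses add-one smoothing rather than discounting with Katz back-off, which is a deliberate simplification; the class name and the sentence boundary markers `<s>` and `</s>` are assumptions of this example.

```python
import math
from collections import Counter


class BigramModel:
    """Bigram language model with add-one smoothing (not Katz back-off)."""

    def __init__(self, sentences):
        self.context_counts = Counter()  # how often each token is a context
        self.bigram_counts = Counter()
        for sentence in sentences:
            tokens = ["<s>"] + sentence + ["</s>"]
            self.context_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))
        # Tokens that can be predicted: every training word plus the end marker.
        self.vocabulary = {t for s in sentences for t in s} | {"</s>"}

    def probability(self, previous, word):
        """Smoothed estimate of P(word | previous)."""
        return (self.bigram_counts[(previous, word)] + 1) / (
            self.context_counts[previous] + len(self.vocabulary)
        )

    def perplexity(self, sentence):
        """2 to the power of the average negative log2 probability per token."""
        tokens = ["<s>"] + sentence + ["</s>"]
        log_prob = sum(
            math.log2(self.probability(prev, word))
            for prev, word in zip(tokens, tokens[1:])
        )
        return 2 ** (-log_prob / (len(tokens) - 1))
```

Lower perplexity means the model finds the sentence less surprising. Note that words never seen in training still receive a small probability here, without the distribution being renormalized; a fuller treatment would map them to a dedicated unknown-word token.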

8. Conditional Random Fields (CRF) for Sequence Data
  • Problem of sequence data labeling
  • Generative vs. discriminative models
  • CRF vs. Markov model
  • Linear-chain CRF
  • Algebraic formulation and parameter estimation (MLE, Stochastic Gradient Descent)
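For reference, the usual textbook form of a linear-chain CRF is given below as an illustration. The notation is an assumption and may differ from the lectures: $\mathbf{x}$ is the observed sequence, $\mathbf{y}$ the label sequence of length $T$, $f_k$ the feature functions and $\lambda_k$ their weights.

```latex
% Conditional distribution of a label sequence given the observations.
p(\mathbf{y} \mid \mathbf{x}) =
  \frac{1}{Z(\mathbf{x})}
  \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \right),
\qquad
Z(\mathbf{x}) =
  \sum_{\mathbf{y}'} \exp\!\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \right)

% Gradient of the log-likelihood for one training pair (x, y):
% observed feature counts minus expected feature counts under the model.
\frac{\partial \log p(\mathbf{y} \mid \mathbf{x})}{\partial \lambda_k} =
  \sum_{t=1}^{T} f_k(y_{t-1}, y_t, \mathbf{x}, t)
  - \mathbb{E}_{\mathbf{y}' \sim p(\cdot \mid \mathbf{x})}
    \left[ \sum_{t=1}^{T} f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \right]
```

Maximum likelihood estimation sets this gradient to zero over the training set; stochastic gradient descent follows it one training sequence (or mini-batch) at a time.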
 
9. Introduction to Neural Language Models and Deep Learning 
  • Introduction to Neural Networks
  • Deep Learning architectures and application to Machine Translation
  • Word2Vec: Skip-gram and Continuous Bag-of-Words
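The two Word2Vec training schemes differ in what is predicted from what. The sketch below (an added illustration with hypothetical function names) only generates the training examples for each scheme from a token sequence and a context window; the neural network that learns the embeddings is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: Skip-gram predicts each context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """(context words, center) examples: CBOW predicts the center word from its context."""
    examples = []
    for i, center in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + window + 1]
        if context:  # skip tokens that have no neighbours at all
            examples.append((context, center))
    return examples
```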
 
Depending on the time available (and overall class progress), we may only be able to cover one or two of chapters 7, 8 and 9.
Practical Sessions
In the practical sessions, students will be taught how to perform basic text mining tasks programmatically, including:
  • Crawling and downloading text data
  • Streaming tweets using the Twitter API
  • Finding the most relevant words/topics in text data and tweets
  • Generating and visualizing statistics from the data (e.g. histograms)
Sample code will be provided by the professor and demos will be shown during the lectures.
Furthermore, the students are expected to work on a practical project, which will count for ~30-40% of the final grade. These projects will be comprehensive in the sense that they will encompass many of the different aspects taught in the lectures/practicals. Sample project topics include text classification or opinion mining. Projects will be carried out in groups of 3 students.
Learning outcomes of the learning unit :
  • Understand the underlying principles and algebraic formulations of machine learning models
  • Apply these models to information extraction from text and to text classification
  • Synthesize the various principles and algorithms introduced in the course and develop a full-fledged text analytics application (as part of the course project)
  • Implement text analytics solutions to support an organization's business intelligence activities
  • Formulate a strategy based on the acquired text analytics skills to optimize the value of an organization
  • Research and understand advanced topics in the field, and keep informed of recent developments in order to adapt easily to changing requirements
  • Appreciate how the algorithms studied could solve real-life managerial issues
  • Communicate appropriately about text analytics projects/applications to various stakeholders
Prerequisite knowledge and skills :
Students should have a good background in:
  • Mathematics/statistics (vector algebra, probability, numerical optimization)
  • Programming
Support will be offered to students in the form of:
  • Lecture notes
  • Sample programs
  • Online references
  • Manual for software
Planned learning activities and teaching methods :
The course carries 5 credits and therefore requires 150 hours of work (1 credit = 30 hours).
  • Theory lectures = 18-22 hours
  • Self-study for exam = approx. 70 hours
  • Practical lectures = 9-12 hours
  • Working on practical exercises and projects = approx. 80 hours
  • Total = 150 hours (5 credits)
Mode of delivery (face-to-face ; distance-learning) :
  • Lectures 
  • Practicals (during lectures and as homework)
Recommended or required readings :
  • Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, by Bing Liu, 2013 edition
  • Selected scientific articles (to be provided during the course)
Assessment methods and criteria :
Final written exam: 55%
Final practical project: 35%
Practical Exercise: 10%
Work placement(s) :
Organizational remarks :
Contacts :
Ashwin Ittoo
ashwin.ittoo@ulg.ac.be
Items online :
Lecture Notes