Study Programmes 2015-2016
INFO2049-1  
Web and Text Analytics
Duration :
30h Th
Number of credits :
Master in computer science and engineering (120 ECTS): 5
Master in computer science (120 ECTS): 5
Master in management (120 ECTS): 5
Master in business engineering (120 ECTS): 5
Lecturer :
Ashwin Ittoo
Language(s) of instruction :
English language
Organisation and examination :
Teaching in the second semester
Units courses prerequisite and corequisite :
Prerequisite or corequisite units are presented within each program
Course contents :
In recent years, we have witnessed the proliferation of data in text format. Typical examples include messages on social networks like Facebook, tweets on Twitter about important topics and events (such as political campaigns and elections) and users' opinions on products, brands and companies available from review sites (such as Amazon).
Buried within these huge volumes of text are meaningful information nuggets which, if detected and extracted, can be exploited to support a wide range of activities, especially Business Intelligence. For instance, business organizations use the opinions of users on products from review sites to improve their brand image, products and services.
However, the challenge lies in automatically processing these huge text collections in order to detect and extract meaningful information from their contents.
Objective
This course is intended to address the aforementioned challenge in text processing and analysis. To this end, it will introduce students to key techniques and algorithms in text/data mining and machine learning. Some topics in natural language processing will also be covered.
This course has both strong theoretical and strong practical components. The practicals will be done mostly using R. The NLTK (Python) or Weka (Java) toolkits may also be used, depending on the task at hand.
Structure
1. Text Data Representation


  • Vector space model
  • Term-document/document-term matrices
  • Measuring text similarity (Levenshtein distance, Cosine Similarity)
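As an illustration of the topics in this unit, the sketch below builds a small document-term matrix and compares documents with cosine similarity, then computes a Levenshtein distance between two strings. It is only a minimal example in plain Python, not the course's sample code: the function names, the whitespace tokenization and the example sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def document_term_matrix(docs):
    """Return (vocabulary, matrix) with one row of raw term counts per document."""
    # Assumption: lowercase + whitespace splitting is enough as a tokenizer here.
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical example documents (short product reviews).
docs = ["the phone is great", "the phone is terrible", "great battery life"]
vocabulary, matrix = document_term_matrix(docs)
sim_01 = cosine_similarity(matrix[0], matrix[1])  # share "the", "phone", "is"
sim_02 = cosine_similarity(matrix[0], matrix[2])  # share only "great"
distance = levenshtein("kitten", "sitting")
```

Each row of the matrix is one document in the vector space model. The first two reviews share three of their four terms, so they score higher than the first and third, which share only one. Cosine similarity compares documents as bags of words, whereas the Levenshtein distance compares two strings character by character.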
2. Regular Expressions (RegEx)


  • Introduction to basic operators
  • Application for data cleaning
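To give an idea of how regular expressions can be applied to data cleaning, here is a minimal sketch using Python's standard `re` module. The patterns, the order in which they are applied and the example message are assumptions chosen for illustration; a real cleaning pipeline would depend on the data at hand.

```python
import re

# Illustrative patterns (assumptions, not an exhaustive cleaning recipe).
URL_PATTERN = re.compile(r"https?://\S+")        # web links
MENTION_PATTERN = re.compile(r"@\w+")            # user mentions such as @name
NON_ALPHANUMERIC = re.compile(r"[^a-z0-9\s]")    # punctuation, symbols, '#'
WHITESPACE = re.compile(r"\s+")                  # runs of spaces, tabs, newlines


def clean_text(text):
    """Lowercase the text, then strip URLs, mentions and punctuation."""
    text = text.lower()
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    text = NON_ALPHANUMERIC.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


# Hypothetical raw message.
raw = "Loving the new phone!!! @shop_support see http://example.com/review #happy"
cleaned = clean_text(raw)
```

URLs and mentions are removed before punctuation because stripping punctuation first would break them into fragments that the first two patterns could no longer recognize.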
3. Basic Text Pre-processing 


  • Stemming
  • Lemmatization
  • Stemming vs. Lemmatization
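The difference between the two approaches can be sketched with a deliberately simplified example. The suffix list and the lemma dictionary below are toy assumptions made for illustration only; they do not reproduce the Porter stemmer, the NLTK lemmatizer or any other real implementation.

```python
# Toy suffix list, checked in this order (an assumption for illustration).
SUFFIXES = ("ing", "ed", "es", "s")

# Toy lemma dictionary (an assumption); real lemmatizers rely on a full lexicon.
TOY_LEMMAS = {
    "studies": "study",
    "studying": "study",
    "better": "good",
    "mice": "mouse",
}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the dictionary and fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)


words = ["studies", "studying", "better", "mice"]
stems = [toy_stem(w) for w in words]        # rule-based truncation
lemmas = [toy_lemmatize(w) for w in words]  # dictionary lookup
```

In this sketch the stemmer maps "studies" to the non-word "studi" and leaves "better" and "mice" unchanged, while the lemmatizer returns dictionary forms ("study", "good", "mouse") at the cost of needing a lexicon.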
4. Feature Selection


  • Term Frequency-Inverse Document Frequency
  • Other statistical metrics (e.g. Chi-squared, log-likelihood, mutual information)
  • Notion of n-grams
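The sketch below illustrates n-gram extraction and one common variant of TF-IDF weighting. It assumes that term frequency is the relative frequency of a term within a document and that inverse document frequency is ln(N / df); other variants (smoothing, different logarithm bases) exist, and the function names and example documents are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    document_frequency = Counter(term for tokens in docs for term in set(tokens))
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / document_frequency[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical tokenized documents.
docs = [
    ["the", "phone", "is", "good"],
    ["the", "battery", "is", "bad"],
    ["the", "phone", "screen"],
]
weights = tf_idf(docs)
bigrams = ngrams(docs[0], 2)
```

Under this variant a term that occurs in every document, such as "the", receives a weight of zero, whereas a term confined to one document, such as "good", receives the highest weight in its document. This is why TF-IDF can be used to select discriminative features.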
One topic from the list below will be covered, subject to time constraints:
A. Introduction to Opinion Mining and Sentiment Analysis
B. Part-of-Speech Tagging and Syntactic Parsing
C. Text Classification
Practical Sessions
In the practical sessions, students will be taught how to perform basic text mining tasks programmatically, including:


  • Crawling and downloading text data
  • Streaming tweets using the Twitter API
  • Finding the most relevant words/topics in text data and tweets
  • Generating and visualizing statistics from the data (e.g. histograms)
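As a rough idea of the last two tasks, the sketch below counts the most frequent non-stopword terms in a handful of messages and renders them as a plain-text histogram. The stopword list, the example messages and the function names are assumptions made for illustration. The practicals themselves are expected to rely mostly on R, and no crawling or Twitter API call is shown here.

```python
from collections import Counter

# Small illustrative stopword list (an assumption, far from complete).
STOPWORDS = {"the", "is", "a", "and", "of", "to"}


def top_terms(texts, k=3):
    """Return the k most frequent non-stopword terms as (term, count) pairs."""
    counts = Counter(
        token
        for text in texts
        for token in text.lower().split()
        if token not in STOPWORDS
    )
    return counts.most_common(k)


def text_histogram(term_counts):
    """Render (term, count) pairs as one bar of '#' characters per line."""
    if not term_counts:
        return ""
    width = max(len(term) for term, _ in term_counts)
    return "\n".join(
        f"{term.ljust(width)} {'#' * count}" for term, count in term_counts
    )


# Hypothetical messages standing in for downloaded tweets.
tweets = [
    "the election debate is tonight",
    "debate of the year",
    "election results and debate reactions",
]
top = top_terms(tweets, k=2)
chart = text_histogram(top)
print(chart)
```

Raw counts are the simplest relevance measure. Weighting schemes such as TF-IDF, listed under Feature Selection, refine the ranking by discounting terms that are frequent everywhere.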
Sample code will be provided by the professor and demos will be shown during the lectures.
Furthermore, students are expected to work on a practical project, which will count for approximately 30-40% of the final grade. The projects will be comprehensive, in the sense that they will cover many of the aspects taught in the lectures and practical sessions. Sample project topics include text classification and opinion mining. Projects will be carried out in groups of 3 students.
Learning outcomes of the course :
  • Understand the differences between traditional structured data and unstructured text data
  • Appreciate that analyzing text data requires specific techniques to convert the data into a format amenable to processing
  • Understand the use of web crawlers
  • Justify the use of different types of crawlers for different situations/applications
  • Understand the ethical, moral and legal issues associated with web crawling
  • Understand the Vector Space Model for text representation
  • Appreciate the need for text cleaning and pre-processing techniques
  • Apply text cleaning and pre-processing techniques to a set of text documents
  • Apply the cosine similarity measure to estimate the similarity between text documents
  • Explain how general search engines work and their architecture
  • Understand the terms opinion mining, sentiment analysis and subjectivity analysis
  • Appreciate the need for document level, sentence level and feature level sentiment analysis
  • Develop (and implement) basic algorithms for opinion mining
  • Discuss recent trends in opinion mining research
  • Understand the terms prestige and centrality in social networks
  • Appreciate the strengths and weaknesses of different techniques
  • Acquire hands-on experience in developing and implementing basic web and text analytics applications
 
Prerequisite knowledge and skills :
Any course in graduate-level mathematics (vector algebra, probability/statistics)
Programming experience
Support will be offered to students
  • Lecture notes
  • Sample programs
  • Online references
  • Manual for software
Planned learning activities and teaching methods :
The course carries 5 credits and therefore requires 150 hours of work (1 credit = 30 hours).
The course will be organized over 10 sessions of 3 hours:
  • 6 (or 7) will be "theory lectures"
  • 4 (or 3) will be practical sessions. Since no practical rooms are reserved for this course, the practical sessions will be held in the lecture halls. More details are given in the Practical Sessions section of this document. The practical sessions will culminate in a project counting for 30% of the grade (the weighting may change)
In terms of hours:
  • Theory lectures = 18-22 hours
  • Self-study for exam = approx. 70 hours
  • Practical lectures = 9-12 hours
  • Working on practical exercises and projects = approx. 80 hours
  • Total = 150 hours (5 credits)
Mode of delivery (face-to-face ; distance-learning) :
  • Lectures 
  • Practical (during lectures and as homework)
Recommended or required readings :
Textbook for the course
  • Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, by Bing Liu, 2013 edition
Assessment methods and criteria :
Final written exam: 55%
Final practical project: 35%
Practical Exercise: 10%
Work placement(s) :
Organizational remarks :
Contacts :
Dr Ashwin Ittoo
ashwin.ittoo@ulg.ac.be
Items online :
Lecture Notes