INFO2049-1 : Web and Text Analytics

Study Programmes 2015-2016

Duration :

30h Th

Number of credits :

Master in computer science and engineering (120 ECTS)		5
Master in computer science (120 ECTS)		5
Master in management (120 ECTS)		5
Master in business engineering (120 ECTS)		5

Lecturer :

Ashwin Ittoo

Language(s) of instruction :

English language

Organisation and examination :

Teaching in the second semester

Units courses prerequisite and corequisite :

Prerequisite or corequisite units are presented within each program

Course contents :

In recent years, we have witnessed the proliferation of data in text format. Typical examples include messages on social networks like Facebook, tweets on Twitter about important topics and events (such as political campaigns and elections), and user opinions on products, brands and companies available from review sites (such as Amazon).
Buried within these huge volumes of text are meaningful information nuggets which, if detected and extracted, can be exploited to support a wide range of activities, especially Business Intelligence. For instance, business organizations use user opinions on products from review sites to improve their brand image, products and services.
The challenge, however, lies in how to automatically process these huge text collections in order to detect and extract meaningful information from their contents.
Objective
This course addresses the aforementioned challenge in text processing and analysis. To this end, it will introduce students to key techniques/algorithms in text/data mining and machine learning. Some topics in natural language processing will also be covered.
This course has both strong theoretical and practical components. The practicals will be done mostly using R. The NLTK (Python) or Weka (Java) toolkits may also be used, depending on the task at hand.
Structure
1. Text Data Representation

Vector space model
Term-document/document-term matrices
Measuring text similarity (Levenshtein distance, Cosine Similarity)
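
As an illustration of the two similarity measures named above, the following is a minimal Python sketch, not course material. It assumes whitespace tokenization and raw term counts as the document vectors; the function names `bag_of_words`, `cosine_similarity` and `levenshtein` are hypothetical and chosen for this example only.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a document to a sparse term -> count vector (whitespace tokens)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn s into t (row-by-row dynamic programming)."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]
```

Cosine similarity compares whole documents through their term vectors, whereas the Levenshtein distance compares two strings character by character, so the two are typically used at different granularities (documents versus individual words).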

2. Regular Expressions (RegEx)

Introduction to basic operators
Application for data cleaning
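
To give an idea of what regular-expression-based cleaning can look like, here is a small Python sketch using the standard `re` module. The cleaning rules (dropping URLs, @-mentions and all non-letter characters) and the function name `clean` are assumptions made for this example; the appropriate rules depend on the task at hand.

```python
import re

URL = re.compile(r"https?://\S+")      # http:// or https:// up to the next space
MENTION = re.compile(r"@\w+")          # @username
NON_LETTER = re.compile(r"[^a-z\s]")   # anything that is not a-z or whitespace
WHITESPACE = re.compile(r"\s+")        # runs of whitespace


def clean(text):
    """Lowercase a short text and strip URLs, mentions and non-letters."""
    text = text.lower()
    text = URL.sub(" ", text)
    text = MENTION.sub(" ", text)
    text = NON_LETTER.sub(" ", text)   # also drops digits, punctuation and '#'
    return WHITESPACE.sub(" ", text).strip()
```

The order of the substitutions matters in this sketch: URLs and mentions are removed before the non-letter rule, otherwise their letters would survive as ordinary words.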

3. Basic Text Pre-processing

Stemming
Lemmatization
Stemming vs. Lemmatization
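
The contrast between the two can be sketched with a deliberately naive Python toy. This is not the Porter algorithm or any toolkit's implementation: the suffix list, the tiny lemma dictionary and the names `naive_stem` and `naive_lemmatize` are all assumptions invented for this illustration.

```python
# Toy suffix list, checked in this order (an assumption for this example).
SUFFIXES = ("ing", "ies", "ed", "es", "s")

# Toy lemma dictionary; real lemmatizers rely on a full lexicon.
LEMMAS = {"studies": "study", "studying": "study",
          "better": "good", "went": "go"}


def naive_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def naive_lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    return LEMMAS.get(word, word)
```

With these toy rules, `naive_stem` maps "studies" and "studying" to two different stems and leaves "went" untouched, while `naive_lemmatize` maps both forms to "study" and "went" to "go". Rule-based stemming is cheap but may produce non-words, whereas lemmatization returns dictionary forms at the cost of needing lexical knowledge.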

4. Feature Selection

Term Frequency-Inverse Document Frequency
Other statistical metrics (e.g. Chi-squared, log-likelihood, mutual information)
Notion of n-grams
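
As a rough illustration of the first and last items above, the Python sketch below computes n-grams and one common TF-IDF variant. It assumes term frequency is the count divided by the document length and inverse document frequency is log(N / df); other weighting variants exist. The function names `ngrams` and `tf_idf` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """All contiguous sequences of n tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(documents):
    """documents: list of token lists.
    Returns one {term: weight} dict per document, where
    weight = (count / document length) * log(N / document frequency)."""
    n_docs = len(documents)
    # Number of documents in which each term occurs at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

Under this variant, a term that occurs in every document receives weight 0, which is why TF-IDF can serve as a simple filter for uninformative terms.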

One topic from the list below, subject to time constraints
A. Introduction to Opinion Mining and Sentiment Analysis
B. Part-of-Speech Tagging and Syntactic Parsing
C. Text Classification
Practical Sessions
In the practical sessions, students will be taught how to perform basic text mining tasks programmatically, including:

Crawling and downloading text data
Streaming tweets using the Twitter API
Finding the most relevant words/topics in text data and tweets
Generating and visualizing statistics from the data (e.g. histograms)
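
The last two tasks can be sketched in a few lines of Python. This is an assumption-laden toy, not the sample code provided in the course: relevance is approximated by raw frequency after removing a tiny hand-made stopword list, and the "histogram" is plain text. The names `STOPWORDS`, `top_terms` and `text_histogram` are hypothetical.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption for this example).
STOPWORDS = {"the", "a", "is", "and", "of", "to", "this"}


def top_terms(texts, k=3):
    """Return the k most frequent non-stopword terms as (term, count) pairs."""
    counts = Counter(
        word
        for text in texts
        for word in text.lower().split()
        if word not in STOPWORDS
    )
    return counts.most_common(k)


def text_histogram(pairs):
    """Render (term, count) pairs as one line of '#' marks per term."""
    return "\n".join(f"{term:<10} {'#' * count}" for term, count in pairs)
```

For example, `print(text_histogram(top_terms(texts)))` prints one bar per term for a list of strings `texts`; a plotting library would replace `text_histogram` in practice.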

Sample code will be provided by the professor and demos will be shown during the lectures.
Furthermore, students are expected to work on a practical project, which will count for ~30-40% of the final grade. These projects will be comprehensive in the sense that they will encompass many of the aspects taught in the lectures/practicals. Sample project topics include text classification or opinion mining. Projects will be carried out in groups of 3 students.

Learning outcomes of the course :

Understand the differences between traditional structured data and unstructured text data
Appreciate that analyzing text data requires specific techniques to convert the data into a format amenable to processing
Understand the use of web crawlers
Justify the use of different types of crawlers for different situations/applications
Understand the ethical, moral, legal issues associated with web crawling
Understand the Vector Space Model for text representation
Appreciate the need for text cleaning and pre-processing techniques
Apply text cleaning and pre-processing techniques to a set of text documents
Apply the cosine similarity measure to estimate the similarity between text documents

Explain how general search engines work and their architecture
Understand the terms opinion mining, sentiment analysis and subjectivity analysis
Appreciate the need for document level, sentence level and feature level sentiment analysis
Develop (and implement) basic algorithms for opinion mining
Discuss recent trends in opinion mining research
Understand the terms prestige and centrality in social networks
Appreciate the strengths and weaknesses of different techniques
Acquire hands-on experience in developing and implementing basic web and text analytics applications

Prerequisite knowledge and skills :

Any course in graduate-level mathematics (vector algebra, probability/statistics)
Programming experience
Support will be offered to students

Lecture notes
Sample programs
Online references
Manual for software

Planned learning activities and teaching methods :

The course carries 5 credits and therefore requires 150 hours of work (1 credit = 30 hours).
The course will be organized over 10 sessions of 3 hours each:

6 (or 7) will be "theory lectures"
4 (or 3) will be practical sessions. Since no practical rooms are reserved for this course, the practicals will be held in the lecture halls. More details are given in the Practical Sessions section of this document. The practicals will culminate in a project counting for 30% of the grade (the weighting may change)

In terms of hours:

Theory lectures = 18-21 hours
Self-study for exam = approx. 70 hours
Practical lectures = 9-12 hours
Working on practical exercises and projects = approx. 80 hours
Total = 150 hours (5 credits)

Mode of delivery (face-to-face ; distance-learning) :

Lectures
Practical (during lectures and as homework)