INFO2049-1 : Web and Text Analytics

University of Liege

Academic year 2014-2015

Value date : 12/05/2015

Duration :

15h Th, 5h Pr, 90h Proj.

Number of credits :

Master of science in computer science and engineering, research focus, 1st year		5

Master of science in computer science and engineering, research focus, 2nd year		5

Master in Computer science, Research Focus, 1st year		5

Master in Computer science, Research Focus, 2nd year		5

Master of science in computer science and engineering, professional focus in management, 1st year		5

Master in Computer Science, Professional Focus (Management), 1st year		5

Master in Management Sciences, specialized focus in digital marketing and sales management, 2nd year		5

Lecturer :

Ashwin Ittoo

Language(s) of instruction :

English language

Organisation and examination :

Teaching in the second semester

Course contents :

In recent years, we have witnessed the proliferation of data in text format. This proliferation has been substantially fuelled by the emergence of social media networks and online communities, which enable users to express themselves in natural language. Typical examples include messages on social networks like Facebook, tweets on Twitter about important topics and events (such as political campaigns and elections), and users' opinions on products, brands and companies available from review sites (such as Amazon). It is also worth mentioning that unstructured data (text, image, video) constitute one of the "Vs" commonly used to describe Big Data, namely Variety. In this course, however, our focus will be on text.
Buried within these huge volumes of text are meaningful information nuggets which, if detected and extracted, can be exploited to support a wide range of activities, especially Business Intelligence. For instance, business organizations use users' opinions on products from review sites to improve their brand image, products and services.
The challenge, however, lies in how to automatically process these huge text collections in order to detect and extract meaningful information from their contents.
Objective
This course aims to address the aforementioned challenge in text processing and analysis.
Structure
The course will be organized over 10 sessions of 3 hours

6 (or 7) will be "theory lectures"
4 (or 3) will be practical sessions

1. Introduction to Web Crawling

Crawlers and Search Engines
Types of crawlers (universal, topical)
Ethics, issues/concerns with crawlers

2. Text Data Representation

Comparison with traditional structured data
Vector space representation

3. Data Cleaning and Pre-Processing

Classical techniques for data cleaning, normalizing processes (numerical data)
Text data cleaning and pre-processing (feature selection using tf-idf)

4. Information Retrieval (Search Engines)

Basic principles of search engines
Cosine similarity measure between queries and documents
Google PageRank

5. Opinion Mining and Sentiment Analysis

Sentiment analysis, opinion mining, subjectivity analysis overview
Document level, sentence level, feature level analysis
Lexicons

6. Social Network Analysis

Defining and formalizing social networks
Centrality and prestige
Techniques for social network analysis
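
As an illustration of topics 2 to 4, the following is a minimal sketch of the vector space representation with tf-idf weighting and of the cosine similarity measure. It is written in Python for illustration only (the practical sessions use R). The tokenizer, the weighting variant (raw term frequency multiplied by log(N/df)), the function names and the three sample documents are assumptions made for this sketch, not material taken from the course.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tf_idf_vectors(documents):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample documents, invented for this sketch.
docs = [
    "the camera takes great pictures",
    "great pictures and a great screen",
    "the battery died after one week",
]
vectors = tf_idf_vectors(docs)
sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
print(sim_01, sim_02)
```

In this sketch, the first two documents share the terms "great" and "pictures", so their similarity is higher than that of the first and third documents, which share only "the". A query can be scored against documents in the same way by treating it as a short document.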

Practical Sessions
To be done in teams of at most 3 students.

Vector Space Model and Representation using R
Windows version: http://cran.r-project.org/bin/windows/base/ (also install the tm package; instructions will be given in the lecture)
Text cleaning and pre-processing using R
Lemmatization and parts-of-speech tagging using the Stanford NLP Tools (http://nlp.stanford.edu/software/)
Basic opinion mining application (final project)
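
To give an idea of what a basic opinion mining application might look like, here is a minimal sketch of lexicon-based sentence-level polarity classification. It is written in Python for illustration only (the practical sessions use R and the Stanford NLP Tools). The tiny lexicon, the negation list, the one-word negation rule and the function names are all assumptions made for this sketch; they do not describe the project specification.

```python
# Hypothetical toy lexicon: word -> polarity (+1 positive, -1 negative).
LEXICON = {
    "great": 1, "good": 1, "love": 1, "excellent": 1,
    "bad": -1, "poor": -1, "terrible": -1, "hate": -1,
}
# Hypothetical negation words; each one flips the polarity of the next word only.
NEGATIONS = {"not", "never", "no"}


def sentence_polarity(sentence):
    """Sum the polarities of lexicon words, flipping a word preceded by a negation."""
    score = 0
    negate = False
    for token in sentence.lower().split():
        word = token.strip(".,!?;:")
        if word in NEGATIONS:
            negate = True
            continue
        if word in LEXICON:
            polarity = LEXICON[word]
            score += -polarity if negate else polarity
        # The negation affects only the immediately following word.
        negate = False
    return score


def classify(sentence):
    """Map the polarity score to a label."""
    score = sentence_polarity(sentence)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for s in ["The screen is great!", "The battery is not good.", "It arrived on Monday."]:
    print(s, "->", classify(s))
```

A real application would rely on a larger published lexicon and on proper tokenization and lemmatization; the sketch only shows the overall shape of the approach.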

Learning outcomes of the course :

Understand the differences between traditional structured data and unstructured text data.
Appreciate that analyzing text data requires specific techniques to convert the data into a format amenable to processing
Understand the use of web crawlers
Justify the use of different types of crawlers for different situations/applications
Understand the ethical, moral, legal issues associated with web crawling
Understand the Vector Space Model for text representation
Appreciate the need for text cleaning and pre-processing techniques
Apply text cleaning and pre-processing techniques on a set of text documents
Apply the cosine similarity measure to estimate the similarity between text documents

Explain how general search engines work and their architecture
Understand the terms opinion mining, sentiment analysis and subjectivity analysis
Appreciate the need for document level, sentence level and feature level sentiment analysis
Develop (and implement) basic algorithms for opinion mining
Discuss recent trends in opinion mining research
Understand the terms prestige and centrality in social networks
Appreciate the strengths and weaknesses of different techniques
Acquire hands-on experience in developing and implementing basic web and text analytics applications
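
As an illustration of link-based prestige, the following is a minimal sketch of PageRank computed by power iteration on a small directed graph. It is written in Python for illustration only. The damping factor of 0.85, the even redistribution of rank from nodes without outgoing links, the function name and the four-node graph are assumptions made for this sketch, not material taken from the course.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict {node: [nodes it links to]}."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same baseline share from random jumps.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank evenly among the nodes it links to.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Node without outgoing links: spread its rank over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: A links to B and C, B to C, C to A, D to C.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(score, 3))
```

In this graph, node C receives links from three other nodes and ends up with the highest score, while node D, which nobody links to, keeps only the baseline share.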

Prerequisites and co-requisites/ Recommended optional programme components :

Any course in graduate-level mathematics (vector algebra, probability/statistics)
Programming experience
Support will be offered to students

Lecture notes
Sample programs
Online references
Manual for software

Planned learning activities and teaching methods :

The course carries 5 credits and therefore requires 150 hours of work (1 credit = 30 hours).
The course will be organized over 10 sessions of 3 hours

6 (or 7) will be "theory lectures"
4 (or 3) will be practical sessions. Since no practical rooms are reserved for this course, the practical sessions will be held in the lecture halls; see the Practical Sessions section of this document for details. The practical sessions will culminate in a project counting for 30% of the grade (the weighting may change).

In terms of hours:

Theory lectures = 18-22 hours
Self-study for exam = approx. 70 hours
Practical lectures = 9-12 hours
Working on practical exercises and projects = approx. 80 hours
Total = 150 hours (5 credits)

Mode of delivery (face-to-face ; distance-learning) :

Lectures
Practical (during lectures and as homework)