 |  |  |
| INFO2049-1 | Web and Text Analytics
|

 |
| Duration : | 15h Th, 5h Pr, 90h Proj. |
 |
| Number of credits : |
| Master of science in computer science and engineering, research focus, 1st year |  | 5 |
 |
| Master of science in computer science and engineering, research focus, 2nd year |  | 5 |
 |
| Master in Computer science, Research Focus, 1st year |  | 5 |
 |
| Master in Computer science, Research Focus, 2nd year |  | 5 |
 |
| Master of science in computer science and engineering, professional focus in management, 1st year |  | 5 |
 |
| Master in Computer Science, Professional Focus (Management), 1st year |  | 5 |
 |
| Master in management, professional focus in digital marketing and sales management, 2nd year |  | 5 |
 |
|
 |
| Lecturer : | Ashwin Ittoo |
 |
Language(s) of instruction :
 |
| English language |
 |
Organisation and examination :
 |
| Teaching in the second semester |
 |
Course contents :
 |
| In recent years, we have witnessed the proliferation of data in text format. This proliferation has been substantially fuelled by the emergence of social media networks and online communities, which enable users to express themselves in natural language text. Typical examples include messages on social networks like Facebook, tweets on Twitter about important topics and events (such as political campaigns and elections), and users' opinions on products, brands and companies available from review sites (such as Amazon). It is also worth mentioning that unstructured data (text, image, video) constitute one of the "Vs" that are commonly used to describe Big Data, namely Variety. (In this course, however, our focus will be on text.)
Buried within these huge volumes of text are meaningful information nuggets which, if detected and extracted, can be exploited to support a wide range of activities, especially Business Intelligence. For instance, business organizations use the opinions that users post about products on review sites to improve their brand images, products and services.
The challenge, however, lies in how to automatically process these huge text collections in order to detect and extract meaningful information from their contents.
Objective
This course introduces techniques for addressing the aforementioned challenge in text processing and analysis.
Structure
The course will be organized over 10 sessions of 3 hours
- 6 (or 7) will be "theory lectures"
- 4 (or 3) will be practical sessions
1. Introduction to Web Crawling
- Crawlers and Search Engines
- Types of crawlers (universal, topical)
- Ethics, issues/concerns with crawlers
2. Text Data Representation
- Comparison with traditional structured data
- Vector space representation
3. Data Cleaning and Pre-Processing
- Classical techniques for data cleaning, normalizing processes (numerical data)
- Text data cleaning and pre-processing (feature selection using tf-idf)
4. Information Retrieval (Search Engines)
- Basic principles of search engines
- Cosine similarity measure between queries and documents
- Google PageRank
5. Opinion Mining and Sentiment Analysis
- Sentiment analysis, opinion mining, subjectivity analysis overview
- Document level, sentence level, feature level analysis
- Lexicons
6. Social Network Analysis
- Defining and formalizing social networks
- Centrality and prestige
- Techniques for social network analysis
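As an illustration of topics 2 to 4, the vector space representation, tf-idf weighting and cosine similarity can be sketched in a few lines of Python. This is a minimal sketch and not official course material: the function names, the whitespace tokenizer, the tf-idf variant (raw term frequency multiplied by log(N/df)) and the three sample documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace. A real pipeline would
    # also handle punctuation, stop words and stemming.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting: raw term frequency times log(N / df).
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical sample documents (short product reviews).
docs = [
    "great phone great battery",
    "poor battery life",
    "great camera",
]
vecs = tfidf_vectors(docs)
sim_01 = cosine_similarity(vecs[0], vecs[1])  # share the term "battery"
sim_12 = cosine_similarity(vecs[1], vecs[2])  # share no terms
```

In this sketch, documents that share no terms have similarity 0, and a term that occurs in every document receives weight 0 because log(N/df) vanishes, which is one reason tf-idf can serve as a simple feature selection signal.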
Practical Sessions
To be done in teams of at most 3 students.
|
 |
Learning outcomes of the course :
 |
|
- Understand the differences between traditional structured data and unstructured text data.
- Appreciate that analyzing text data requires specific techniques to convert the data into a format amenable to processing
- Understand the use of web crawlers
- Justify the use of different types of crawlers for different situations/applications
- Understand the ethical, moral, legal issues associated with web crawling
- Understand the Vector Space Model for text representation
- Appreciate the need for text cleaning and pre-processing techniques
- Apply text cleaning and pre-processing techniques to a set of text documents
- Apply the cosine similarity measure to estimate the similarity between text documents
- Explain how general search engines work and their architecture
- Understand the terms opinion mining, sentiment analysis and subjectivity analysis
- Appreciate the need for document level, sentence level and feature level sentiment analysis
- Develop (and implement) basic algorithms for opinion mining
- Discuss recent trends in opinion mining research
- Understand the terms prestige and centrality in social networks
- Appreciate the strengths and weaknesses of different techniques
- Acquire hands-on experience in developing and implementing basic web and text analytics applications
|
 |
Prerequisites and co-requisites/ Recommended optional programme components :
 |
| Any course in graduate-level mathematics (vector algebra, probability/statistics)
Programming experience
Support will be offered to students
- Lecture notes
- Sample programs
- Online references
- Manual for software
|
 |
Planned learning activities and teaching methods :
 |
| The course carries 5 credits and therefore requires 150 hours of work (1 credit = 30 hours).
The course will be organized over 10 sessions of 3 hours
- 6 (or 7) will be "theory lectures"
- 4 (or 3) will be practical sessions. Since no practical rooms are reserved for this course, the practical sessions will be held in the lecture halls. More details are given in the Practical section of this document. The practical sessions will culminate in a project, counting for 30% of the grade (the weighting may change)
In terms of hours:
- Theory lectures = 18-22 hours
- Self-study for exam = approx. 70 hours
- Practical lectures = 9-12 hours
- Working on practical exercises and projects = approx. 80 hours
- Total = 150 hours (5 credits)
|
 |
Mode of delivery (face-to-face ; distance-learning) :
 |
|
- Lectures
- Practical (during lectures and as homework)
|
 |
Recommended or required readings :
 |
| Textbook for the course
- Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, by Bing Liu, 2nd edition, 2011
|
 |
Assessment methods and criteria :
 |
| Final written exam: 55%
Final practical project: 35%
Practical Exercise: 10% |
 |
Work placement(s) :
 |
| |
 |
Organizational remarks :
 |
| |
 |
Contacts :
 |
| Dr Ashwin Ittoo
ashwin.ittoo@ulg.ac.be |
 |

 |
| Items online : |
|
| http://lola.hec.ulg.ac.be/ |
|
|

|
|  |