Media and Text Mining

Name of the subject in Hungarian: Média- és szövegbányászat

Last updated: 21 June 2017

Budapest University of Technology and Economics
Faculty of Electrical Engineering and Informatics

Business Information Systems   
Specialization Analytical Business Intelligence   
MSc program
   

Course ID   Semester   Assessment   Credit   Subject semester
VITMM275               3/0/1/v      5
3. Course coordinator and department: Dr. Gábor Szűcs
4. Instructors

 

 Name              Position              Department
 Gábor Szűcs PhD   associate professor   TMIT

 

5. Required knowledge

Calculus, Algebra, Probability theory

 

 

6. Pre-requisites
Recommended:

There is no obligatory prerequisite subject.

However, we recommend taking Data Mining Techniques (BMEVISZM185) before this subject.

 

7. Objectives, learning outcomes and obtained knowledge

The course introduces students to the identification, assessment and analysis of intelligent information search systems and multimedia retrieval systems. It also covers content handling techniques, where the content may be text, media, or both.

 

 

8. Synopsis
  1. Problems in Analytical Business Intelligence at multinational companies. Metadata systems and standards: DC, RDF, MPEG-7.
  2. Typical task types in Media and Text Mining. Search, classification, clustering, forecasting and their combinations.
  3. Methods for media and text analysis, search techniques, indexing, ranking procedures. Bag of words model.
  4. Searching on the Web, Web Mining. PageRank, webgraph methods, HITS, Boolean search, weighting schemes (tf-idf, etc.), cosine distance.
  5. Dimension reduction methods, feature extraction and feature selection techniques, chi-square, eigenvalue based methods, independent component analysis (ICA).
  6. Classification of pictures and videos. Discretization. Types and methods of media classification. Support vector machines for media classification.
  7. Text analysis. Stemming algorithms, Porter stemmer, Lovins stemmer. Language detection, language dependency. Shallow and deep parsing. POS tagging. Syntax tree parsers, dependency graph parser. Stanford tools.
  8. Text classification. Types and methods of text classification. Gini index. C4.5, C5.0, Random Forest. Automatic text processing at enterprises.
  9. Text and media clustering. Various distance measures. Agglomerative and divisive clustering. Hierarchical clustering (bottom-up and top-down), k-means clustering, density-based clustering.
  10. Relation extraction from text. Co-occurrence, pattern-matching and supervised learning methods. Convolution kernels with SVM in relation extraction. Gathering business news, information extraction from the news.
  11. Hierarchical taxonomy systems, catalogue search, thesaurus. Folksonomy, multi-user methods. Concept mining. Annotation. Sentiment analysis.
  12. Content-Based Image Retrieval. Line detection, skeletonization. Images and time series in multimedia.
  13. Media indexing. Probability models in video and audio searches. Applications of Hidden Markov Models.
  14. Developing media retrieval and search systems in enterprises. Marketing applications, online media applications.
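
As an illustration of the bag-of-words model, tf-idf weighting and cosine-based ranking listed in items 3 and 4 above, the following is a minimal sketch and not part of the official course material. The function names (`tokenize`, `tfidf_vectors`, `cosine_similarity`, `rank`), the whitespace tokenizer and the particular weighting formula (raw term frequency times log(N/df)) are assumptions chosen for brevity; other tf-idf variants are equally common.

```python
import math
from collections import Counter


def tokenize(text):
    # Bag of words: lowercase and split on whitespace; word order is discarded.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Assumed weighting: raw term frequency times log(N / df).
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is zero."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return document indices sorted by decreasing similarity to the query.

    Simplification (an assumption of this sketch): the query is weighted as
    if it were one more document in the corpus.
    """
    vectors = tfidf_vectors(docs + [query])
    query_vec = vectors[-1]
    scores = [cosine_similarity(query_vec, v) for v in vectors[:-1]]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    corpus = [
        "text mining and search",
        "image retrieval and video search",
        "stock market news",
    ]
    print(rank("image search", corpus))
```

Cosine distance, as named in item 4, is one minus the similarity computed here. A document sharing no terms with the query receives a similarity of zero and is ranked last.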

Laboratory:

  1. The tasks are to be solved with data mining and text mining software (e.g. SAS software modules).
  2. Searching techniques in a predefined corpus.
  3. Media classification exercises.
  4. Picture clustering exercises.
  5. Text analysis.
  6. Text categorization.
  7. Content-Based Image Retrieval with a large set of pictures.
  8. Lift diagram analysis.

 

9. Method of instruction

Lectures and laboratory exercises.

 

10. Assessment

a. During the teaching period there is one in-class test (ZH).


b. In the examination period: a homework assignment must be written and defended orally at the examination. The examination also has a written part.


c. A condition for the signature is a pass mark on the in-class test (ZH), i.e. 40% or above. The in-class test can be retaken once during the teaching period, and there is one further (final) retake in the repeat period.


d. Another condition for the signature is attendance at no fewer than 5 laboratory exercises.

 

11. Retakes

The test can be retaken once during the teaching period, and there is one final opportunity in the official repeat period. Missed laboratory exercises cannot be made up. The condition for the signature is passing one of the tests and successfully completing at least 5 laboratory exercises.

 

 

12. Consultations

Consultation with the lecturers of the subject is available by appointment.

 

 

13. References, textbooks and resources
  1. Blanken, de Vries, Blok, Feng (eds.): Multimedia Retrieval. Springer, 2007.
  2. Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze: Introduction to Information Retrieval. Cambridge University Press, 2008.
  3. Ronen Feldman, James Sanger: The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, 2007.

 

14. Required learning hours and assignment
Lessons                                                              56
Preparation for lessons (10 for lectures and 8 for laboratories)     18
Preparation for test                                                 20
Homework                                                             16
Study of assigned material                                            0
Preparation for exam                                                 40
Total                                                               150
15. Syllabus prepared by

 Name                      Position                 Department
 Zsolt T. Kardkovács PhD   assistant professor      TMIT
 Gábor Szűcs PhD           associate professor      TMIT
 Domonkos Tikk PhD         senior research fellow   TMIT