Budapest University of Technology and Economics, Faculty of Electrical Engineering and Informatics


    Media and Text Mining

    Name of the subject in Hungarian: Média- és szövegbányászat

    Last updated: 21 June 2017


    Business Information Systems   
    Specialization Analytical Business Intelligence   
    MSc program

    Course ID  Semester  Assessment  Credit  Subject semester
    VITMM275   3/0/1/v 5  
    3. Course coordinator and department: Dr. Gábor Szűcs
    4. Instructors


     Name: Position: Department:
     Gábor Szűcs PhD associate professor  TMIT


    5. Required knowledge

    Calculus, Algebra, Probability theory



    6. Pre-requisites

    There is no obligatory prerequisite subject.

    However, we recommend taking Data Mining Techniques (BMEVISZM185) before this subject.


    7. Objectives, learning outcomes and obtained knowledge

    The course introduces students to the identification, assessment and analysis of intelligent information search systems and multimedia retrieval systems. It also covers content handling techniques, where the content may be text, media, or both.



    8. Synopsis
    1. Problems in Analytical Business Intelligence at multinational companies. Metadata systems and standards: DC, RDF, MPEG-7.
    2. Typical task types in Media and Text Mining. Search, classification, clustering, forecasting and their combinations.
    3. Methods for media and text analysis, search techniques, indexing, ranking procedures. Bag of words model.
    4. Searching on the Web, Web Mining. PageRank, webgraph methods, HITS, Boolean search, weighting schemes (tf-idf, etc.), cosine distance.
    5. Dimension reduction methods, feature extraction and feature selection techniques, chi-square, eigenvalue based methods, independent component analysis (ICA).
    6. Classification of pictures and videos. Discretization. Types and methods of media classification. Support vector machine for media classification.
    7. Text analysis. Stemming algorithms, Porter stemmer, Lovins stemmer. Language detection, language dependency. Shallow and deep parsing. POS tagging. Syntax tree parsers, dependency graph parser. Stanford tools.
    8. Text classification. Types and methods of text classification. Gini index. C4.5, C5.0, Random Forest. Automatic text processing at enterprises.
    9. Text and media clustering. Various distance measures. Agglomerative and divisive clustering. Hierarchical clustering (bottom-up and top-down), k-means clustering, density-based clustering.
    10. Relation extraction from text. Co-occurrence, pattern-matching and supervised learning methods. Convolution kernels with SVM in relation extraction. Gathering business news, information extraction from the news.
    11. Hierarchical taxonomy systems, catalogue search, thesaurus. Folksonomy, multi-user methods. Concept mining. Annotation. Sentiment analysis.
    12. Content-Based Image Retrieval. Line detection, skeletonization. Image and time series in multimedia.
    13. Media-indexing. Probability models in video and audio searches. Applications of Hidden Markov Models.
    14. Developing media retrieval and search systems in enterprises. Marketing applications, online media applications.
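As an illustration of topics 3 and 4 above (bag-of-words model, tf-idf weighting, cosine distance), the following is a minimal sketch in plain Python. The toy corpus, the whitespace tokenizer and the function names are assumptions made for this example only; the particular tf-idf variant (relative term frequency times the natural log of N/df) is one common choice among several and is not necessarily the one used in the course.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace only.
    return text.lower().split()


def tf_idf_vectors(docs):
    """Return one sparse tf-idf vector (dict term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all-zero."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy corpus, invented for illustration.
corpus = [
    "stock market news",
    "market prices rise",
    "image retrieval system",
]
vectors = tf_idf_vectors(corpus)
print(cosine_similarity(vectors[0], vectors[1]))  # documents share "market"
print(cosine_similarity(vectors[0], vectors[2]))  # no shared terms
```

Cosine distance is then 1 minus the similarity. Documents that share no terms have similarity 0, and a term that occurs in every document receives weight 0 under this idf definition.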


    Laboratory exercises:
    1. The tasks should be solved using data mining and text mining software (e.g. SAS software modules).
    2. Searching techniques in a predefined corpus.
    3. Media classification exercises.
    4. Picture clustering exercises.
    5. Text analysis.
    6. Text categorization.
    7. Content-Based Image Retrieval with a large set of pictures.
    8. Lift diagram analysis.
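
To make the last exercise in the list above more concrete, here is a minimal sketch of how the points of a cumulative lift diagram can be computed. The scores, labels, bin count and function name are invented for this example; the course itself works with data mining software such as SAS, so this only illustrates the underlying calculation.

```python
def cumulative_lift(scores, labels, n_bins=10):
    """Cumulative lift after each of n_bins equal-sized slices of the ranking.

    scores: classifier scores (higher means more likely positive)
    labels: 1 for a positive case, 0 otherwise
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    # Rank cases from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    total_positives = sum(label for _, label in ranked)
    if n == 0 or total_positives == 0:
        raise ValueError("need at least one case and one positive label")
    base_rate = total_positives / n
    lifts = []
    for k in range(1, n_bins + 1):
        # Number of top-ranked cases covered after k bins (rounded up).
        cutoff = (k * n + n_bins - 1) // n_bins
        hits = sum(label for _, label in ranked[:cutoff])
        lifts.append((hits / cutoff) / base_rate)
    return lifts


# Hypothetical scores and labels for ten cases.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
for k, lift in enumerate(cumulative_lift(scores, labels, n_bins=5), start=1):
    print(f"top {k * 20}%: lift = {lift:.2f}")
```

A lift above 1 in the top slices means the model ranks positives ahead of a random ordering. The cumulative lift over the whole data set is always 1.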


    9. Method of instruction

    lecture and laboratory


    10. Assessment

    a. During the teaching period there is one in-class test (ZH).

    b. In the examination period: students write a homework assignment and defend it orally at the examination. The examination also has a written part.

    c. One condition for the signature is passing the ZH test (40% or above). The in-class test (ZH) can be retaken once. In the repeat period there is one more (final) opportunity to retake it.

    d. Another condition for the signature is attendance at at least 5 laboratory exercises.


    11. Recaps

    The test can be retaken once during the teaching period, and a final retake is available in the official repeat period. Missed laboratory exercises cannot be made up. The signature requires passing one of the tests and completing at least 5 laboratory exercises successfully.



    12. Consultations

    Consultations with the lecturers of the subject are available by prior appointment.



    13. References, textbooks and resources
    1. Blanken, de Vries, Blok, Feng (eds.): Multimedia Retrieval. Springer, 2007.
    2. Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze: Introduction to Information Retrieval. Cambridge University Press, 2008.
    3. Ronen Feldman, James Sanger: The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, 2007.


    14. Required learning hours and assignment
    Lessons                                                             56
    Preparation for lessons (10 for lectures and 8 for laboratories)    18
    Preparation for test                                                20
    Homework                                                            16
    Study of assigned materials                                          0
    Preparation for exam                                                40
    15. Syllabus prepared by

     Name: Position: Department:
     Zsolt T. Kardkovács PhD assistant professor  TMIT
     Gábor Szűcs PhD associate professor  TMIT
     Domonkos Tikk PhD senior research fellow  TMIT