Data Analysis

Name of the subject in Hungarian: Adatelemzés

Last updated: 23 June 2017

Budapest University of Technology and Economics
Faculty of Electrical Engineering and Informatics

B.Sc. in Engineering Information Technology

Course ID   Semester   Assessment   Credit   Subject semester
VISZAC00    6          2/1/0/v      4
3. Course coordinator and department

Dr. Csima Judit

4. Instructors

László Ketskeméty, associate professor, Department of Computer Science and Information Theory
Bálint Daróczy, PhD, MTA SZTAKI
Gábor Szűcs, associate professor, Department of Telecommunications and Media Informatics

5. Required knowledge

Algorithms

6. Pre-requisites
Mandatory:
(Szakirany("AMINvallinfrendETT", _) VAGY
Szakirany("AMINvallinfrendSZIT", _) VAGY
Szakirany("AMINvallinfrendTMIT", _) )

VAGY Training.code=("5NAA8")

The form above is the Neptun system's own notation (VAGY means OR, Szakirany means specialization); for technical reasons we have left it unchanged.

The mandatory prerequisite order can be found on the website and in the curriculum of the given programme.

7. Objectives, learning outcomes and obtained knowledge

The lectures aim to teach master's students the fundamental methods of statistics and business data mining over the semester. The practice sessions present application examples drawn from the analysis and solution of real problems with data-intensive computing support.

Gained skills and abilities: students will be able to recognise problems that can be solved with business intelligence, to solve them with statistical and data mining tools at a level usable in the corporate sector, and to access a corporate customer's data and integrate it with other corporate data in order to plan and deliver profit-oriented analytical solutions.

8. Synopsis

Schedule of lectures

  1. Basic concepts of hypothesis testing: null and alternative hypotheses, test statistic, acceptance and critical regions, type I and type II errors, power function, significance level, power of the test, unbiasedness, consistency. Distributions derived from the normal distribution: chi-squared, F, t. Lukacs's theorem. Parameters, parametric tests.
  2. Parametric tests for the parameters of the normal distribution: one-sample t- and u-tests, two-sample independent u- and t-tests, paired two-tailed t-test, F-test, Welch's t-test, Bartlett's test.
  3. Non-parametric tests I. Fundamental theorem of chi-square tests. Pure and estimated chi-square goodness-of-fit tests. Chi-square test of independence. Chi-square test for the homogeneity of two independent samples.
  4. Non-parametric tests II. Gnedenko-Korolyuk theorem. Order statistics. Goodness-of-fit testing with the one-sample Kolmogorov-Smirnov test. Testing homogeneity with the two-sample Kolmogorov-Smirnov test.
  5. Tests of homogeneity. Checking the homogeneity of two independent samples with the Mann-Whitney test, of several independent samples with the Kruskal-Wallis test, of two paired samples with the Wilcoxon test, and of several paired samples with the Friedman test.
  6. Bivariate regression models. Theoretical background: the conditional expectation. Types of bivariate regression: linear regression, polynomial regression, regressions that can be reduced to two-parameter linear regression. Logistic regression. The least squares method. Analysis of variance (ANOVA) for deciding the validity of the model. Coefficient of determination (R-squared). Nadaraya method.
  7. Multivariate linear regression. Techniques of model construction. Correlation coefficients: total, multiple, part. The beta coefficients. The adjusted coefficient of determination. Multicollinearity. Heteroskedasticity. Outlier detection and analysis.
  8. Introduction to business intelligence and practical considerations. Data-driven solutions, CRISP-DM (Cross-Industry Standard Process for Data Mining). Customer relationship management (CRM) analytics.
  9. Preparation for enterprise data analytics. Frequent item sets, market basket analysis, association rules and their application in practice.
  10. Supervised machine learning. Evaluation metrics. Profit matrix. Simple algorithms: kNN, naive Bayes, and useful metrics.
  11. Decision trees (DT) and their application to decision making. Some DT algorithms (C4.5, purity measures, splitting, pre- and post-pruning), applications.
  12. Classification and regression. Customer value and churn prediction. Credibility prediction. Prediction in direct marketing campaigns.
  13. Customer segmentation and other unsupervised learning problems. Various models for clustering with k-means (bisecting and adaptive). Density-based clustering algorithms (DBSCAN, OPTICS), hierarchical clustering, and their application.
  14. Open and/or widely used data mining software, model building and practical constraints.
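
As an illustration of one item in the schedule, Welch's t-test from lecture 2 can be sketched in a few lines. This is a minimal sketch and not course material: the function name `welch_t` and the sample values are hypothetical, and only the Python standard library is assumed.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(x), len(y)
    # Estimated variance of each sample mean: s^2 / n (sample variance uses n - 1).
    v1, v2 = variance(x) / n1, variance(y) / n2
    t = (mean(x) - mean(y)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Made-up illustrative samples (not course data).
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(a, b)
print(t_stat, dof)
```

The statistic is compared with a t distribution with `dof` degrees of freedom. Unlike the pooled two-sample t-test, it does not assume equal variances in the two samples.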

 

Practical lessons

  1. Introduction to statistical software packages. Definition of basic statistics and their interpretation. Plots: scatter plot, box plot, pie chart, histogram, P-P and Q-Q plots, dot plot. Confidence intervals and parametric tests on economic data sets.
  2. Statistical hypothesis testing on enterprise and business data sets. Tests of independence and homogeneity, significance testing. Graphical and statistical analysis.
  3. Regression analysis on business data sets. Model building and constraints. Evaluation: multicollinearity, heteroscedasticity, sensitivity and outliers.
  4. Statistical data mining. Decision making with basic algorithms for credibility analysis.
  5. Frequent item sets on a bookstore data set. Various quality measures of association rules (“lift”) and their connection to decision trees.
  6. Modeling user behavior in a webshop and predicting whether the user will make a purchase.
  7. Segmentation of customers according to past activities, group work with available tools.
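
The association-rule practical (lesson 5) relies on rule quality measures. Support, confidence and lift are three commonly used ones, and the sketch below computes them for a single rule. The function name `rule_measures` and the toy bookstore baskets are hypothetical and are not the course data set.

```python
def rule_measures(baskets, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent.

    Assumes both the antecedent and the consequent occur in at least one
    basket; otherwise the divisions below would fail.
    """
    n = len(baskets)
    a, c = set(antecedent), set(consequent)
    n_a = sum(1 for b in baskets if a <= b)         # baskets containing the antecedent
    n_c = sum(1 for b in baskets if c <= b)         # baskets containing the consequent
    n_ac = sum(1 for b in baskets if (a | c) <= b)  # baskets containing both
    support = n_ac / n
    confidence = n_ac / n_a
    lift = confidence / (n_c / n)
    return support, confidence, lift


# Made-up baskets for illustration only.
baskets = [
    {"novel", "bookmark"},
    {"novel", "atlas"},
    {"novel", "bookmark", "atlas"},
    {"atlas"},
]
support, confidence, lift = rule_measures(baskets, {"novel"}, {"bookmark"})
print(support, confidence, lift)
```

A lift above 1 suggests that the two item sets appear together more often than they would if they were independent.
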
9. Method of instruction

 3 lectures/week + 1 computer practice (in 7 double hours)

10. Assessment

a.) During the academic term: attendance is mandatory at all computer seminars. Students should arrive prepared, based on the lectures and the previously handed-out schedule. A test at the beginning of each laboratory assesses their preparation. Records of the work must be kept and handed in at the end of the laboratory session. Participation and the records are marked. Requirement for obtaining a signature: participation in at least 70% of the seminars. Course work is optional; additional points can be awarded when it reaches a sufficient level.

b.) In the exam period: written final exam. Determination of the exam mark: 50% is the average of the top 70% of the laboratory marks and 50% is the written exam mark, provided the written exam is passed.
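
The exam-mark rule can be read as a small calculation, sketched below under explicit assumptions. The syllabus does not say how "top 70%" is rounded, so the sketch rounds the number of counted laboratory marks up. It also assumes a 1 to 5 mark scale with 2 as the pass mark. The function name `exam_mark` and the example marks are hypothetical.

```python
def exam_mark(lab_marks, written_mark, pass_mark=2):
    """Combine laboratory marks and the written exam mark.

    Assumptions (not stated in the syllabus): the number of counted labs
    is 70% of all labs rounded up, and marks below `pass_mark` fail.
    Returns None when the written exam is not passed.
    """
    if written_mark < pass_mark:
        return None
    # ceil(0.7 * n) in integer arithmetic to avoid floating-point rounding.
    keep = (7 * len(lab_marks) + 9) // 10
    best = sorted(lab_marks, reverse=True)[:keep]
    lab_average = sum(best) / len(best)
    return 0.5 * lab_average + 0.5 * written_mark


# Seven made-up laboratory marks and a written exam mark of 4.
mark = exam_mark([5, 4, 3, 2, 5, 1, 4], 4)
failed = exam_mark([5, 5, 5, 5, 5, 5, 5], 1)
print(mark, failed)
```

With seven laboratories the sketch counts the best five marks. How the resulting number is rounded to a final grade is left open here, since the syllabus does not specify it.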

11. Recaps

Missed computer seminars cannot be made up.

12. Consultations

Students can consult the lecturers during their office hours.

13. References, textbooks and resources

- P. Tan, M. Steinbach, V. Kumar: Introduction to Data Mining, Addison-Wesley, 2006, 769 pp., ISBN-10: 0321321367, ISBN-13: 9780321321367
  http://www-users.cs.umn.edu/~kumar/dmbook/index.php

- J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets
  http://infolab.stanford.edu/~ullman/mmds.html

14. Required learning hours and assignment

Contact hours                   42
Preparation for lectures         8
Preparation for practices       14
Preparation for midterm test    16
Preparation for exam            40
Total                          120

15. Syllabus prepared by

László Ketskeméty, Bálint Daróczy, Gábor Szűcs