Data Mining: 3. Data Preparation (Persiapan Data)

ABM



Course Outline

1. Introduction to Data Mining
2. The Data Mining Process
3. Data Preparation
4. Classification Algorithms
5. Clustering Algorithms
6. Association Algorithms
7. Estimation and Forecasting Algorithms
8. Text Mining

3. Data Preparation

3.1 Data Preprocessing
3.2 Data Cleaning
3.3 Data Reduction
3.4 Data Transformation and Data Discretization
3.5 Data Integration

3.1 Data Preprocessing

CRISP-DM

[Figure: the CRISP-DM process model]

Why Preprocess the Data?

Measures for data quality: a multidimensional view
• Accuracy: correct or wrong, accurate or not
• Completeness: not recorded, unavailable, …
• Consistency: some values modified but others not, …
• Timeliness: is the data updated in a timely way?
• Believability: how much can the data be trusted to be correct?
• Interpretability: how easily can the data be understood?

Major Tasks in Data Preprocessing

1. Data cleaning
   • Fill in missing values
   • Smooth noisy data
   • Identify or remove outliers
   • Resolve inconsistencies
2. Data reduction
   • Dimensionality reduction
   • Numerosity reduction
   • Data compression
3. Data transformation and data discretization
   • Normalization
   • Concept hierarchy generation
4. Data integration
   • Integration of multiple databases or files

3.2 Data Cleaning

Data Cleaning

Data in the real world is dirty: there is much potentially incorrect data, e.g., from instrument faults, human or computer error, or transmission errors
• Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  • e.g., Occupation = “ ” (missing data)
• Noisy: containing noise, errors, or outliers
  • e.g., Salary = “−10” (an error)
• Inconsistent: containing discrepancies in codes or names
  • e.g., Age = “42” but Birthday = “03/07/2010”
  • e.g., rating was “1, 2, 3”, now rating is “A, B, C”
  • discrepancies between duplicate records
• Intentional (e.g., disguised missing data)
  • Jan. 1 as everyone’s birthday?

Incomplete (Missing) Data

• Data is not always available
  • e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing data may be due to
  • equipment malfunction
  • values inconsistent with other recorded data being deleted
  • data not entered due to misunderstanding
  • certain data not being considered important at the time of entry
  • failure to register the history or changes of the data
• Missing data may need to be inferred

Missing Data Example

• Dataset: MissingDataSet.csv

MissingDataSet.csv

• Jerry is the marketing manager for a small Internet design and advertising firm
• Jerry’s boss asks him to develop a data set containing information about Internet users
• The company will use this data to determine what kinds of people are using the Internet and how the firm may be able to market its services to this group of users
• To accomplish his assignment, Jerry creates an online survey and places links to the survey on several popular Web sites
• Within two weeks, Jerry has collected enough data to begin analysis, but he finds that his data needs to be denormalized
• He also notes that some observations in the set are missing values or appear to contain invalid values
• Jerry realizes that some additional work on the data is needed before analysis begins

Relational Data

[Figure: the survey data stored in relational tables]

View of Data (Denormalized Data)

[Figure: the denormalized view of the survey data]


How to Handle Missing Data?

• Ignore the tuple
  • Usually done when the class label is missing (in classification); not effective when the percentage of missing values per attribute varies considerably
• Fill in the missing value manually
  • Tedious + infeasible?
• Fill it in automatically (sketched in code below) with
  • a global constant, e.g., “unknown” (a new class?!)
  • the attribute mean
  • the attribute mean for all samples belonging to the same class (smarter)
  • the most probable value: inference-based, e.g., a Bayesian formula or a decision tree
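A minimal pandas sketch of the automatic fill-in strategies above (not part of the original slides; the file is the course’s MissingDataSet.csv, but the column names used here are assumptions for illustration):

```python
import pandas as pd

# Load the survey data (the column names below are hypothetical examples).
df = pd.read_csv("MissingDataSet.csv")

# Ignore the tuple: drop rows whose class label is missing.
df = df.dropna(subset=["class_label"])

# Global constant: treat missing occupation as the new value "unknown".
df["occupation"] = df["occupation"].fillna("unknown")

# Attribute mean: fill missing income with the overall mean income.
df["income"] = df["income"].fillna(df["income"].mean())

# Attribute mean per class (smarter): use the mean within the same class.
df["income"] = df.groupby("class_label")["income"].transform(
    lambda s: s.fillna(s.mean())
)
```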

Exercise

• Perform the experiment following Matthew North, Data Mining for the Masses, 2012, Chapter 3 Data Preparation, pp. 30-46 (Handling Missing Data)
• Dataset: MissingDataSet.csv
• Analyze which preprocessing methods are used and why they need to be applied to this dataset!

Missing Value Detection
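A minimal sketch of detecting missing values before handling them (an illustration added here, assuming the same MissingDataSet.csv; the course itself works in RapidMiner):

```python
import pandas as pd

df = pd.read_csv("MissingDataSet.csv")

# Count missing values per attribute to see which columns need attention.
print(df.isnull().sum())

# Inspect the records that contain at least one missing value.
print(df[df.isnull().any(axis=1)])
```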


Noisy Data

• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
  • faulty data collection instruments
  • data entry problems
  • data transmission problems
  • technology limitations
  • inconsistent naming conventions
• Other data problems that require data cleaning
  • duplicate records
  • incomplete data
  • inconsistent data

How to Handle Noisy Data?

• Binning (see the sketch after this list)
  • First sort the data and partition it into (equal-frequency) bins
  • Then smooth by bin means, bin medians, bin boundaries, etc.
• Regression
  • Smooth by fitting the data to regression functions
• Clustering
  • Detect and remove outliers
• Combined computer and human inspection
  • Detect suspicious values and have a human check them (e.g., to deal with possible outliers)
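A minimal sketch of smoothing by bin means with equal-frequency binning (the toy values are assumed for illustration):

```python
import pandas as pd

# Toy sorted attribute values (assumed for illustration).
prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Equal-frequency binning: three bins with three values each.
bins = pd.qcut(prices, q=3)

# Smoothing by bin means: replace each value with the mean of its bin.
smoothed = prices.groupby(bins).transform("mean")
print(smoothed.tolist())  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```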

Data Cleaning as a Process

• Data discrepancy detection
  • Use metadata (e.g., domain, range, dependency, distribution)
  • Check field overloading
  • Check uniqueness rules, consecutive rules, and null rules
  • Use commercial tools
    • Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-checking) to detect errors and make corrections
    • Data auditing: analyze the data to discover rules and relationships and to detect violators (e.g., use correlation and clustering to find outliers)
• Data migration and integration
  • Data migration tools: allow transformations to be specified
  • ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
• Integration of the two processes
  • Iterative and interactive (e.g., Potter’s Wheel)

Exercise

• Perform the experiment following Matthew North, Data Mining for the Masses, 2012, Chapter 3 Data Preparation, pp. 50-52 (Handling Inconsistent Data)
• Dataset: MissingDataSet.csv
• Analyze which preprocessing methods are used and why they need to be applied to this dataset!


Exercise

• Perform the experiment following Matthew North, Data Mining for the Masses, 2012, Chapter 8 Estimation, pp. 127-140 (Estimation)
• Dataset: HeatingOil.csv
• Analyze which preprocessing methods are used and why they need to be applied to this dataset!

3.3 Data Reduction

Data Reduction Strategies

• Data reduction: obtain a reduced representation of the data set that is much smaller in volume yet produces the same analytical results
• Why data reduction?
  • A database or data warehouse may store terabytes of data
  • Complex data analysis takes a very long time to run on the complete dataset
• Data reduction strategies
  1. Dimensionality reduction
     1. Feature extraction
     2. Feature selection
  2. Numerosity reduction (sometimes simply called data reduction)
     • Regression and log-linear models
     • Histograms, clustering, sampling

1. Dimensionality Reduction

• Curse of dimensionality
  • When dimensionality increases, data becomes increasingly sparse
  • Density and the distances between points, which are critical to clustering and outlier analysis, become less meaningful
  • The number of possible subspace combinations grows exponentially
• Dimensionality reduction
  • Avoids the curse of dimensionality
  • Helps eliminate irrelevant features and reduce noise
  • Reduces the time and space required for data mining
  • Allows easier visualization
• Dimensionality reduction techniques
  1. Feature extraction: wavelet transforms, Principal Component Analysis (PCA)
  2. Feature selection: filter, wrapper, embedded

Principal Component Analysis (Steps)

• Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data (a code sketch follows this list)
  1. Normalize the input data, so that each attribute falls within the same range
  2. Compute k orthonormal (unit) vectors, i.e., the principal components
  3. Each input data vector is a linear combination of the k principal component vectors
  4. The principal components are sorted in order of decreasing “significance” or strength
  5. Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance
• Works for numeric data only
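The steps above map directly onto a short scikit-learn sketch (the toy numbers are assumed, not from the slides):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy data: N = 6 vectors in n = 3 dimensions (values assumed).
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.2],
              [2.3, 2.7, 0.6]])

# Step 1: normalize so each attribute falls within the same range.
X_scaled = StandardScaler().fit_transform(X)

# Steps 2-5: compute orthonormal components sorted by variance and
# keep only the k = 2 strongest ones.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)  # strength of each kept component
print(X_reduced.shape)                # (6, 2): the reduced representation
```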

Exercise

• Perform the experiment following Markus Hofmann, RapidMiner: Data Mining Use Cases, Chapter 4 (k-Nearest Neighbor Classification II), pp. 45-51
• Dataset: glass.data
• Analyze which preprocessing methods are used and why they need to be applied to this dataset!
• Compare the accuracy of k-NN and PCA + k-NN


Exercise

• Replace PCA with another dimension reduction method
• Check in RapidMiner which operators can be used to reduce the dimensionality of a dataset

Feature/Attribute Selection

• Another way to reduce the dimensionality of data
• Redundant attributes
  • Duplicate much or all of the information contained in one or more other attributes
  • e.g., the purchase price of a product and the amount of sales tax paid
• Irrelevant attributes
  • Contain no information that is useful for the data mining task at hand
  • e.g., a student’s ID is often irrelevant to the task of predicting the student’s GPA

Feature Selection Approaches

Proposed approaches for feature selection can broadly be categorized into three classes: wrapper, filter, and hybrid (Liu & Tu, 2004)
1. The filter approach requires statistical analysis of the feature set, without utilizing any learning model (Dash & Liu, 1997)
2. The wrapper approach assumes a predetermined learning model and selects the features that justify the learning performance of that particular model (Guyon & Elisseeff, 2003)
3. The hybrid approach attempts to utilize the complementary strengths of the wrapper and filter approaches (Huang, Cai, & Xu, 2007)

Wrapper Approach vs Filter Approach

[Figure: comparison of the wrapper and filter feature selection processes]

Feature Selection Approaches

1. Filter approach (sketched in code after this list):
   • information gain
   • chi square
   • log likelihood ratio
2. Wrapper approach:
   • forward selection
   • backward elimination
   • randomized hill climbing
3. Embedded approach:
   • decision tree
   • weighted naïve Bayes
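A minimal sketch of the filter approach (features scored without any learning model), using scikit-learn’s univariate selectors with a chi-square and an information-gain-style criterion; the Iris data here is only a stand-in example:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Chi-square filter: keep the 2 features with the highest scores.
chi2_selector = SelectKBest(chi2, k=2).fit(X, y)
print(chi2_selector.scores_)

# Information-gain-style filter via mutual information with the class.
mi_selector = SelectKBest(mutual_info_classif, k=2).fit(X, y)
print(mi_selector.scores_)

# The reduced feature matrix keeps only the selected columns.
print(chi2_selector.transform(X).shape)  # (150, 2)
```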

Exercise

• Perform the experiment following Markus Hofmann, RapidMiner: Data Mining Use Cases, Chapter 4 (k-Nearest Neighbor Classification II)
• Replace PCA with a feature selection (filter) method, for example:
  • Information Gain
  • Chi Squared
  • etc.
• Check in RapidMiner which operators can be used to reduce or weight the attributes of a dataset!


Exercise

• Perform the experiment following Markus Hofmann, RapidMiner: Data Mining Use Cases, Chapter 4 (k-Nearest Neighbor Classification II)
• Replace PCA with a feature selection (wrapper) method, for example:
  • Backward Elimination
  • Forward Selection
  • etc.
• Replace the validation method with 10-fold cross-validation
• Compare the accuracy of k-NN against BE + k-NN or FS + k-NN


Exercise: Predicting Student Graduation

1. Train on the student data (datakelulusanmahasiswa.xls) using a Decision Tree (DT)
2. Perform feature selection with Forward Selection for the DT algorithm (DT+FS)
3. Perform feature selection with Backward Elimination for the DT algorithm (DT+BE)
4. Test using 10-fold cross-validation
5. Run a t-test to determine the best model (DT vs DT+FS vs DT+BE)

Example results:

           DT     DT+FS  DT+BE
Accuracy   91.29  92.63  91.81
AUC        0.893  0.919  0.906

Exercise: Predicting Student Graduation

1. Train on the student data (datakelulusanmahasiswa.xls) using DT, NB, and k-NN
2. Perform dimension reduction with Forward Selection for the three algorithms above
3. Test using 10-fold cross-validation
4. Run a t-test to determine the best model

To be filled in:

           DT  NB  K-NN  DT+FS  NB+FS  K-NN+FS
Accuracy
AUC

Exercise

• Train on the eReader Adoption data (eReader-Training.csv) using DT with 3 alternative split criteria (Gain Ratio, Information Gain, and Gini Index)
• Perform feature selection with Forward Selection for the three algorithms above
• Test using 10-fold cross-validation
• From the best model, determine which factors (attributes) influence the eReader adoption rate

Example results:

           DTGR   DTIG   DTGI   DTGR+FS  DTIG+FS  DTGI+FS
Accuracy   58.39  51.01  31.01  61.41    56.73    31.01


2. Numerosity Reduction

Reduce data volume by choosing alternative, smaller forms of data representation
1. Parametric methods (e.g., regression)
   • Assume the data fit some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
   • Ex.: log-linear models obtain the value at a point in m-D space as the product of the appropriate marginal subspaces
2. Non-parametric methods
   • Do not assume models
   • Major families: histograms, clustering, sampling, …

Parametric Data Reduction: Regression and Log-Linear Models

• Linear regression
  • The data are modeled to fit a straight line
  • Often uses the least-squares method to fit the line
• Multiple regression
  • Allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
• Log-linear models
  • Approximate discrete multidimensional probability distributions

Regression Analysis

• Regression analysis: a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (also called the response variable or measurement) and of one or more independent variables (also called explanatory variables or predictors)
• The parameters are estimated so as to give a “best fit” of the data
• Most commonly the best fit is evaluated using the least-squares method, but other criteria have also been used
• Used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships

[Figure: scatter plot with fitted line y = x + 1; for input X1, the fitted value Y1′ approximates the observed value Y1]

Regression Analysis and Log-Linear Models

• Linear regression: Y = wX + b (see the sketch after this list)
  • Two regression coefficients, w and b, specify the line and are estimated using the data at hand
  • Apply the least-squares criterion to the known values of Y1, Y2, …, X1, X2, …
• Multiple regression: Y = b0 + b1·X1 + b2·X2
  • Many nonlinear functions can be transformed into the above
• Log-linear models:
  • Approximate discrete multidimensional probability distributions
  • Estimate the probability of each point (tuple) in a multidimensional space for a set of discretized attributes, based on a smaller subset of dimensional combinations
  • Useful for dimensionality reduction and data smoothing
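A minimal least-squares sketch of the parametric idea: store only the coefficients w and b instead of the raw points (the toy values are assumed):

```python
import numpy as np

# Toy 1-D data (values assumed for illustration).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Least-squares estimate of the regression coefficients w and b.
w, b = np.polyfit(X, Y, deg=1)
print(f"Y ~ {w:.3f} * X + {b:.3f}")

# The two stored parameters reproduce approximate values for any X.
print(w * X + b)
```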

Histogram Analysis

• Divide the data into buckets and store the average (or sum) for each bucket
• Partitioning rules (see the sketch after this list):
  • Equal-width: equal bucket range
  • Equal-frequency (or equal-depth): equal number of values per bucket

[Figure: histogram of values from 10,000 to 100,000, with bucket frequencies on the y-axis (0-40)]
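A minimal sketch of the two partitioning rules (synthetic values assumed, roughly matching the range in the slide’s histogram):

```python
import numpy as np
import pandas as pd

# Synthetic values between 10,000 and 100,000 (assumed for illustration).
rng = np.random.default_rng(0)
values = pd.Series(
    rng.normal(loc=55_000, scale=15_000, size=1_000).clip(10_000, 100_000)
)

# Equal-width buckets: each bucket spans the same value range.
equal_width = pd.cut(values, bins=9)
print(equal_width.value_counts().sort_index())

# Equal-frequency (equal-depth) buckets: each holds roughly the same count.
equal_depth = pd.qcut(values, q=9)
print(equal_depth.value_counts().sort_index())

# Numerosity reduction: store only the mean per bucket, not the raw values.
print(values.groupby(equal_width).mean())
```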

Clustering

• Partition the data set into clusters based on similarity, and store only the cluster representations (e.g., centroid and diameter)
• Can be very effective if the data are clustered, but not if the data are “smeared”
• Clustering can be hierarchical and stored in multi-dimensional index tree structures
• There are many choices of clustering definitions and clustering algorithms

Sampling

• Sampling: obtaining a small sample s to represent the whole data set N
• Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data
• Key principle: choose a representative subset of the data
  • Simple random sampling may perform very poorly in the presence of skew
  • Adaptive sampling methods, e.g., stratified sampling, address this
• Note: sampling may not reduce database I/Os (which happen a page at a time)

Types of Sampling

• Simple random sampling
  • There is an equal probability of selecting any particular item
• Sampling without replacement (see the sketch after this list)
  • Once an object is selected, it is removed from the population
• Sampling with replacement
  • A selected object is not removed from the population
• Stratified sampling
  • Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
  • Used in conjunction with skewed data
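A minimal numpy sketch contrasting sampling with and without replacement (the population and sample sizes are assumed):

```python
import numpy as np

rng = np.random.default_rng(42)
population = np.arange(100)  # assumed population of 100 object IDs

# Without replacement: a selected object leaves the population,
# so no ID can appear twice in the sample.
srswor = rng.choice(population, size=10, replace=False)

# With replacement: a selected object stays in the population,
# so duplicates are possible.
srswr = rng.choice(population, size=10, replace=True)

print(sorted(srswor))
print(sorted(srswr))
```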

Sampling: With or Without Replacement

[Figure: samples drawn from the raw data with and without replacement]

Sampling: Cluster or Stratified Sampling

[Figure: raw data and the corresponding cluster/stratified sample]

Page 56: Data Mining: 3. Persiapan Data - amutiara.staff.gunadarma ...amutiara.staff.gunadarma.ac.id/Downloads/files/66343/03-persiapan.pdf · Uji beda dengan t-Test untuk mendapatkan model

Stratified Sampling

• Stratification is the process of dividing members of the population into homogeneous subgroups before sampling
• Suppose that a company has the following staff:
  • Male, full-time: 90
  • Male, part-time: 18
  • Female, full-time: 9
  • Female, part-time: 63
  • Total: 180
• We are asked to take a sample of 40 staff, stratified according to the above categories
• An easy way to calculate each group's share is to multiply the group size by the sample size and divide by the total population:
  • Male, full-time: 90 × (40 ÷ 180) = 20
  • Male, part-time: 18 × (40 ÷ 180) = 4
  • Female, full-time: 9 × (40 ÷ 180) = 2
  • Female, part-time: 63 × (40 ÷ 180) = 14
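A minimal Python sketch of this proportional allocation, reproducing the numbers above (the group keys are illustrative names):

```python
# Proportional stratified allocation: each stratum contributes
# group_size * sample_size / total samples.
strata = {
    "male_full_time":   90,
    "male_part_time":   18,
    "female_full_time":  9,
    "female_part_time": 63,
}
sample_size = 40
total = sum(strata.values())    # 180

allocation = {g: round(size * sample_size / total) for g, size in strata.items()}
print(allocation)
# {'male_full_time': 20, 'male_part_time': 4,
#  'female_full_time': 2, 'female_part_time': 14}
```

In practice one would then draw a simple random sample of the allocated size within each stratum.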

Exercise

• Carry out the experiment following Matthew North, Data Mining for the Masses, 2012, Chapter 7 Discriminant Analysis, pp. 105-125
• Datasets: SportSkill-Training.csv and SportSkill-Scoring.csv
• Analyze which preprocessing methods are used and why they need to be applied to these datasets!

Exercise

• Carry out the experiment following Matthew North, Data Mining for the Masses, 2012, Chapter 3 Data Preparation, pp. 46-50 (Data Reduction)
• Analyze which preprocessing methods are used and why they need to be applied to the dataset

3.4 Data Transformation and Data Discretization

Data Transformation

• A function that maps the entire set of values of a given attribute to a new set of replacement values, such that each old value can be identified with one of the new values
• Methods:
  • Smoothing: remove noise from the data
  • Attribute/feature construction: new attributes constructed from the given ones
  • Aggregation: summarization, data cube construction
  • Normalization: scale values to fall within a smaller, specified range
    • min-max normalization
    • z-score normalization
    • normalization by decimal scaling
  • Discretization: concept hierarchy climbing

Normalization

• Min-max normalization: maps a value v of attribute A to the range [new_minA, new_maxA]:

  v' = ((v − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA

  • Ex.: Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716

• Z-score normalization (μA: mean, σA: standard deviation of A):

  v' = (v − μA) / σA

  • Ex.: Let μA = 54,000 and σA = 16,000. Then $73,600 is mapped to (73,600 − 54,000) / 16,000 = 1.225

• Normalization by decimal scaling:

  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
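A minimal Python sketch of the three methods, reproducing the income example above (the function names are illustrative):

```python
# Min-max, z-score, and decimal-scaling normalization.
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    """Z-score normalization of v."""
    return (v - mu) / sigma

def decimal_scaling(values):
    """Divide by 10^j, where j is the smallest integer with max(|v'|) < 1."""
    m = max(abs(v) for v in values)
    j = 0
    while m / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

print(round(min_max(73_600, 12_000, 98_000), 3))   # 0.716
print(round(z_score(73_600, 54_000, 16_000), 3))   # 1.225
print(decimal_scaling([12_000, 73_600, 98_000]))   # [0.12, 0.736, 0.98]
```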

Discretization

• Three types of attributes:
  • Nominal: values from an unordered set, e.g., color, profession
  • Ordinal: values from an ordered set, e.g., military or academic rank
  • Numeric: real numbers, e.g., integer or real values
• Discretization: divide the range of a continuous attribute into intervals
  • Interval labels can then be used to replace actual data values
  • Reduces data size
  • Supervised vs. unsupervised
  • Split (top-down) vs. merge (bottom-up)
  • Can be performed recursively on an attribute
  • Prepares data for further analysis, e.g., classification

Data Discretization Methods

Typical methods (all can be applied recursively):

• Binning: top-down split, unsupervised
• Histogram analysis: top-down split, unsupervised
• Clustering analysis: unsupervised, top-down split or bottom-up merge
• Decision-tree analysis: supervised, top-down split
• Correlation (e.g., χ²) analysis: supervised, bottom-up merge

Simple Discretization: Binning

• Equal-width (distance) partitioning
  • Divides the range into N intervals of equal size: a uniform grid
  • If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B − A)/N
  • The most straightforward approach, but outliers may dominate the presentation
  • Skewed data is not handled well
• Equal-depth (frequency) partitioning
  • Divides the range into N intervals, each containing approximately the same number of samples
  • Gives good data scaling
  • Managing categorical attributes can be tricky
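As an illustration of equal-width partitioning, here is a minimal Python sketch (the helper name is illustrative; the price data comes from the next slide, so the uneven bin occupancy is visible):

```python
# Equal-width binning: N intervals of width W = (B - A) / N.
def equal_width_bins(values, n_bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in values:
        # Index of the interval containing v; clamp the maximum into the last bin.
        i = min(int((v - lo) / width), n_bins - 1)
        bins[i].append(v)
    return bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))
# [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]
```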

Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34

• Partition into equal-frequency (equi-depth) bins:
  • Bin 1: 4, 8, 9, 15
  • Bin 2: 21, 21, 24, 25
  • Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
  • Bin 1: 9, 9, 9, 9
  • Bin 2: 23, 23, 23, 23
  • Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
  • Bin 1: 4, 4, 4, 15
  • Bin 2: 21, 21, 25, 25
  • Bin 3: 26, 26, 26, 34
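A minimal Python sketch reproducing the bins and both smoothing rules above (the helper names are illustrative; values equidistant from both boundaries are sent to the lower one):

```python
# Equal-depth binning plus smoothing by bin means and by bin boundaries.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

def equal_depth_bins(sorted_values, n_bins):
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    # Replace every value in a bin by the (rounded) bin mean.
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # Replace each value by the closer of the bin's min and max.
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
            for b in bins]

bins = equal_depth_bins(prices, 3)
print(bins)                        # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```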

Discretization Without Using Class Labels (Binning vs. Clustering)

(Figure: the same data discretized by equal interval width (binning), equal frequency (binning), and K-means clustering; K-means clustering leads to better results.)

Discretization by Classification & Correlation Analysis

• Classification (e.g., decision-tree analysis)
  • Supervised: given class labels, e.g., cancerous vs. benign
  • Uses entropy to determine the split point (discretization point)
  • Top-down, recursive split
• Correlation analysis (e.g., ChiMerge: χ²-based discretization)
  • Supervised: uses class information
  • Bottom-up merge: find the best neighboring intervals (those having similar distributions of classes, i.e., low χ² values) and merge them
  • Merging is performed recursively until a predefined stopping condition is met
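As a rough illustration of the entropy-based approach, the sketch below scans candidate midpoints and keeps the split that minimizes the weighted class entropy of the two resulting intervals. The data, labels, and helper names are made-up assumptions; a full discretizer would recurse on each interval and apply a stopping rule:

```python
# Entropy-based (supervised) split-point selection, simplified.
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(labels.count(c) / n * log2(labels.count(c) / n)
                for c in set(labels))

def best_split(values, labels):
    """Return the midpoint split minimizing weighted class entropy."""
    pairs = sorted(zip(values, labels))
    best_entropy, best_point = float("inf"), None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                        # no boundary between equal values
        split = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [c for v, c in pairs if v <= split]
        right = [c for v, c in pairs if v > split]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(pairs)
        if weighted < best_entropy:
            best_entropy, best_point = weighted, split
    return best_point

ages = [23, 25, 31, 35, 42, 46, 51, 60]                      # hypothetical attribute
cls  = ["no", "no", "no", "yes", "yes", "yes", "yes", "no"]  # class labels
print(best_split(ages, cls))   # 33.0
```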

Exercise

• Carry out the experiment following Markus Hofmann (RapidMiner: Data Mining Use Cases), Chapter 5 (Naïve Bayes Classification I)
• Dataset: crx.data
• Analyze which preprocessing methods are used and why they need to be applied to this dataset!
• Compare the model's accuracy when the filter and discretization steps are not used
• Also compare the result when feature selection (wrapper) with Backward Elimination is applied


Results

Model                                                  Accuracy   AUC
NB                                                     85.79
NB + Filter                                            86.26
NB + Discretization
NB + Filter + Discretization
NB + Filter + Discretization + Backward Elimination

3.5 Data Integration


Data Integration

• Data integration: combines data from multiple sources into a coherent store
• Schema integration: integrate metadata from different sources, e.g., A.cust-id ≡ B.cust-#
• Entity identification problem: identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
• Detecting and resolving data value conflicts
  • For the same real-world entity, attribute values from different sources may differ
  • Possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundancy in Data Integration

• Redundant data often occur when multiple databases are integrated
  • Object identification: the same attribute or object may have different names in different databases
  • Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
• Redundant attributes may be detected by correlation analysis and covariance analysis
• Careful integration of data from multiple sources can help reduce or avoid redundancies and inconsistencies and improve mining speed and quality

Correlation Analysis (Nominal Data)

• χ² (chi-square) test:

  χ² = Σ (Observed − Expected)² / Expected

• The larger the χ² value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual count differs most from the expected count
• Correlation does not imply causality
  • The number of hospitals and the number of car thefts in a city are correlated
  • Both are causally linked to a third variable: population

Chi-Square Calculation: An Example

                            Play chess   Not play chess   Sum (row)
  Like science fiction      250 (90)     200 (360)        450
  Not like science fiction  50 (210)     1000 (840)       1050
  Sum (col.)                300          1200             1500

• Numbers in parentheses are expected counts, calculated from the marginal distributions of the two categories (e.g., 90 = 300 × 450 / 1500)

  χ² = (250 − 90)²/90 + (50 − 210)²/210 + (200 − 360)²/360 + (1000 − 840)²/840 = 507.93

• The large χ² value shows that like_science_fiction and play_chess are correlated in the group
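A minimal NumPy sketch reproducing this calculation, deriving the expected counts from the row and column totals:

```python
# Chi-square statistic for the 2x2 contingency table above.
import numpy as np

observed = np.array([[250, 200],      # like science fiction
                     [50, 1000]])     # does not like science fiction

row = observed.sum(axis=1, keepdims=True)   # row totals: 450, 1050
col = observed.sum(axis=0, keepdims=True)   # column totals: 300, 1200
expected = row * col / observed.sum()       # [[90, 360], [210, 840]]

chi2 = ((observed - expected) ** 2 / expected).sum()
print(expected)
print(round(chi2, 2))   # 507.93
```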

Correlation Analysis (Numeric Data)

• Correlation coefficient (also called Pearson's product-moment coefficient):

  r(A,B) = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / ((n − 1) σA σB) = (Σᵢ aᵢbᵢ − n Ā B̄) / ((n − 1) σA σB)

  where n is the number of tuples, Ā and B̄ are the respective means of A and B, σA and σB are the respective standard deviations of A and B, and Σ aᵢbᵢ is the sum of the AB cross-products
• If r(A,B) > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation
• r(A,B) = 0: uncorrelated (no linear relationship); r(A,B) < 0: negatively correlated
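A minimal NumPy sketch of the cross-product form (the two series reuse the stock values from the covariance example a few slides below):

```python
# Pearson correlation via the cross-product formula, checked against NumPy.
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])
n = len(a)

# r = (sum(a_i * b_i) - n * mean(a) * mean(b)) / ((n - 1) * s_a * s_b),
# using the sample standard deviation (ddof=1) to match the n - 1 factor.
r = (np.sum(a * b) - n * a.mean() * b.mean()) / (
    (n - 1) * a.std(ddof=1) * b.std(ddof=1))
print(round(r, 4))                         # 0.9407
print(round(np.corrcoef(a, b)[0, 1], 4))   # same value, via NumPy
```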

Visually Evaluating Correlation

(Figure: scatter plots showing similarity values ranging from −1 to 1.)

Correlation

• Correlation measures the linear relationship between objects
• To compute correlation, standardize the data objects A and B, then take their dot product:

  a'ₖ = (aₖ − mean(A)) / std(A)
  b'ₖ = (bₖ − mean(B)) / std(B)

  correlation(A, B) = A' · B'

Covariance (Numeric Data)

• Covariance is similar to correlation:

  Cov(A, B) = E[(A − Ā)(B − B̄)] = Σᵢ (aᵢ − Ā)(bᵢ − B̄) / n

  Correlation coefficient: r(A,B) = Cov(A, B) / (σA σB)

  where n is the number of tuples, Ā and B̄ are the respective means or expected values of A and B, and σA and σB are the respective standard deviations of A and B
• Positive covariance: if Cov(A,B) > 0, then A and B both tend to be larger than their expected values
• Negative covariance: if Cov(A,B) < 0, then when A is larger than its expected value, B is likely to be smaller than its expected value
• Independence: if A and B are independent, Cov(A,B) = 0, but the converse is not true
  • Some pairs of random variables may have a covariance of 0 yet not be independent; only under additional assumptions (e.g., the data follow multivariate normal distributions) does a covariance of 0 imply independence

Covariance: An Example

• The computation can be simplified as:

  Cov(A, B) = E(A·B) − mean(A) × mean(B)

• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5, 10), (4, 11), (6, 14)
• Question: if the stocks are affected by the same industry trends, will their prices rise or fall together?
  • E(A) = (2 + 3 + 5 + 4 + 6)/5 = 20/5 = 4
  • E(B) = (5 + 8 + 10 + 11 + 14)/5 = 48/5 = 9.6
  • Cov(A, B) = (2×5 + 3×8 + 5×10 + 4×11 + 6×14)/5 − 4 × 9.6 = 42.4 − 38.4 = 4
• Thus, A and B rise together, since Cov(A, B) > 0
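A minimal NumPy sketch checking this example with the simplified formula:

```python
# Population covariance of the two stock series, two ways.
import numpy as np

a = np.array([2.0, 3.0, 5.0, 4.0, 6.0])     # stock A
b = np.array([5.0, 8.0, 10.0, 11.0, 14.0])  # stock B

cov = np.mean(a * b) - a.mean() * b.mean()  # E(A*B) - E(A)E(B)
print(round(cov, 2))                        # 4.0 -> the stocks rise together
print(round(np.cov(a, b, bias=True)[0, 1], 2))  # same value (bias=True: divide by n)
```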

Summary

1. Data quality: accuracy, completeness, consistency, timeliness, believability, interpretability
2. Data cleaning: e.g., missing/noisy values, outliers
3. Data reduction
   • Dimensionality reduction
   • Numerosity reduction
4. Data transformation and data discretization
   • Normalization
5. Data integration from multiple sources
   • Entity identification problem
   • Remove redundancies
   • Detect inconsistencies

References

1. Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 3rd Edition, Elsevier, 2012
2. Ian H. Witten, Eibe Frank, and Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition, Elsevier, 2011
3. Markus Hofmann and Ralf Klinkenberg, RapidMiner: Data Mining Use Cases and Business Analytics Applications, CRC Press / Taylor & Francis Group, 2014
4. Daniel T. Larose, Discovering Knowledge in Data: An Introduction to Data Mining, John Wiley & Sons, 2005
5. Ethem Alpaydin, Introduction to Machine Learning, 3rd Edition, MIT Press, 2014
6. Florin Gorunescu, Data Mining: Concepts, Models and Techniques, Springer, 2011
7. Oded Maimon and Lior Rokach, Data Mining and Knowledge Discovery Handbook, 2nd Edition, Springer, 2010
8. Warren Liao and Evangelos Triantaphyllou (eds.), Recent Advances in Data Mining of Enterprise Data: Algorithms and Applications, World Scientific, 2007