Introduction to Data Mining
7/17/2019 Pendahuluan Data Mining
http://slidepdf.com/reader/full/pendahuluan-data-mining 1/32
Data Mining: Concepts and Techniques
October 10, 2012 Data Mining: Konsep Dan Teknik 11
Chapter 1
Syahril Efendi, S.Si., MIT
Department of Mathematics & Department of Computer Science
FMIPA USU
Chapter 1. Introduction
Why Data Mining?
What Is Data Mining?
A Multi-Dimensional View of Data Mining
What Kinds of Data Can Be Mined?
What Kinds of Patterns Can Be Mined?
What Technologies Are Used?
What Kinds of Applications Are Targeted?
Major Issues in Data Mining
A Brief History of Data Mining and the Data Mining Society
Summary
Why Data Mining?
The explosive growth of data: from terabytes to petabytes
Data collection and data availability
Automated data collection tools, database systems, the Web, computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks, …
Science: remote sensing, bioinformatics, scientific simulation, …
Society: news, digital cameras, YouTube
We are drowning in data but starving for knowledge
"Necessity is the mother of invention"
Data mining: automated analysis of massive data sets
Evolution of Science
Before 1600: empirical science
1600-1950: theoretical science
Each discipline has grown a theoretical component. Theoretical models are often motivated by experiments and generalize our understanding.
1950-1990: computational science
Over the last 50 years, most disciplines have grown a third, computational branch (e.g., empirical, theoretical, and computational ecology, or physics, or linguistics).
Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
1990-now: data science
The flood of data from new scientific instruments and simulations
The ability to economically store and manage petabytes of data online
The Internet and computing networks that make all these archives universally accessible
Scientific information management, acquisition, organization, query, and visualization tasks scale almost linearly with data volume. Data mining is a major new challenge!
Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
1990s:
Data mining, data warehousing, multimedia databases, and Web databases
2000s:
Stream data management and mining
Data mining and its applications
Web technology (XML, data integration) and global information systems
What Is Data Mining?
Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Watch out: is everything "data mining"?
Simple search and query processing
(Deductive) expert systems
Knowledge Discovery (KDD) Process
This is the typical view of the database systems and data warehousing communities
Data mining plays an essential role in the knowledge discovery process
[Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Example: A Web Mining Framework
Web mining usually involves
Data cleaning
Data integration from multiple sources
Warehousing the data
Data cube construction
Data selection for data mining
Data mining
Presentation of the mining results
Patterns and knowledge to be used or stored in a knowledge base
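The steps listed above can be sketched as a small pipeline. The following is only a minimal illustration, assuming hypothetical page-visit records and hypothetical function names (clean, integrate, select, mine); none of these come from the original slides, and the "mining" step is a toy frequency count.

```python
from collections import Counter

# Hypothetical raw records from two sources: (user, page); None marks a missing value.
source_a = [("u1", "home"), ("u2", None), ("u1", "cart")]
source_b = [("u3", "home"), ("u1", "home"), ("u3", "cart")]

def clean(records):
    """Data cleaning: drop records with missing fields."""
    return [r for r in records if all(field is not None for field in r)]

def integrate(*sources):
    """Data integration: merge several sources and remove exact duplicates."""
    merged = []
    for src in sources:
        for rec in src:
            if rec not in merged:
                merged.append(rec)
    return merged

def select(records, pages):
    """Selection: keep only the task-relevant pages."""
    return [r for r in records if r[1] in pages]

def mine(records):
    """Data mining (toy version): count visits per page."""
    return Counter(page for _, page in records)

warehouse = integrate(clean(source_a), clean(source_b))
relevant = select(warehouse, {"home", "cart"})
patterns = mine(relevant)
print(patterns.most_common())
```

Each stage takes the output of the previous one, which mirrors the order of the steps on the slide; a real system would replace each toy function with a far more elaborate component.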
Data Mining in Business Intelligence
[Figure: pyramid of layers; the potential to support business decisions increases toward the top]
Layers, from top to bottom:
Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
Example: Mining vs. Data Exploration
Business intelligence view
Warehouse, data cube, reporting, but not much mining
Business objects vs. data mining tools
Supply chain example: tools
Data presentation
Exploration
KDD Process: A Typical View from ML and Statistics
[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing]
Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
Data Mining: pattern discovery, association and correlation, classification, clustering, outlier analysis, …
Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
This is the typical view of the machine learning and statistics communities
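Two of the pre-processing steps named in this view, normalization and feature selection, can be illustrated with a short sketch. It assumes a hypothetical three-column data set, min-max scaling as the normalization method, and a variance threshold as the selection criterion; the slides do not prescribe any of these choices.

```python
from statistics import pvariance

# Hypothetical data set: each row is one object, each column one feature.
rows = [
    [1.0, 200.0, 5.0],
    [2.0, 400.0, 5.0],
    [3.0, 600.0, 5.0],
]

def min_max_normalize(rows):
    """Rescale every column to [0, 1]; constant columns become 0.0."""
    cols = list(zip(*rows))
    scaled_cols = []
    for col in cols:
        lo, hi = min(col), max(col)
        span = hi - lo
        scaled_cols.append([(v - lo) / span if span else 0.0 for v in col])
    return [list(r) for r in zip(*scaled_cols)]

def select_features(rows, min_variance=1e-9):
    """Feature selection: keep the indices of columns whose variance exceeds a threshold."""
    cols = list(zip(*rows))
    return [i for i, col in enumerate(cols) if pvariance(col) > min_variance]

normalized = min_max_normalize(rows)
kept = select_features(normalized)
reduced = [[row[i] for i in kept] for row in normalized]
print(kept)
print(reduced)
```

The constant third column carries no information and is dropped, which is the simplest possible form of reducing the number of features before mining.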
Example: Medical Data Mining
Health care and medical data mining often adopts this view from statistics and machine learning
Pre-processing of the data (including feature selection and dimension reduction)
Classification and/or clustering processes
Post-processing for presentation
Multi-Dimensional View of Data Mining
Data to be mined
Database data (extended-relational, object-oriented, heterogeneous, legacy), data warehouse, transactional data, stream, spatiotemporal, time-series, sequence, text and Web, multimedia, graphs, and social and information networks
Knowledge to be mined (or: data mining functions)
Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
Descriptive vs. predictive data mining
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern recognition, visualization, high-performance, etc.
Applications
Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
Data Mining: On What Kinds of Data?
Database-oriented data sets and applications
Relational database, data warehouse, transactional database
Advanced data sets and advanced applications
Data streams and sensor data
Time-series data, temporal data, sequence data (incl. bio-sequences)
Structure data, graphs, social networks, and multi-linked data
Object-relational databases
Heterogeneous databases and legacy databases
Spatial data and spatiotemporal data
Multimedia databases
Text databases
The World-Wide Web
Data Mining Function: (1) Generalization
Information integration and data warehouse construction
Data cleaning, transformation, integration, and multidimensional data model
Data cube technology
Computing multidimensional aggregates
OLAP (online analytical processing)
Multidimensional concept description: characterization and discrimination
Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
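To make the idea of multidimensional aggregates concrete, here is a minimal sketch that sums a measure over every subset of dimensions, as a data cube does. The fact table, the dimension names, and the function build_cube are all hypothetical and chosen only for illustration; real OLAP engines use far more efficient methods.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sales facts: (region, product, year, amount).
facts = [
    ("dry", "umbrella", 2011, 10),
    ("wet", "umbrella", 2011, 90),
    ("dry", "sunscreen", 2011, 70),
    ("wet", "sunscreen", 2012, 20),
]
dimensions = ("region", "product", "year")

def build_cube(facts, dimensions):
    """Sum the measure for every subset of dimensions (every cuboid)."""
    cube = {}
    for k in range(len(dimensions) + 1):
        for dims in combinations(range(len(dimensions)), k):
            cuboid = defaultdict(int)
            for fact in facts:
                key = tuple(fact[i] for i in dims)
                cuboid[key] += fact[-1]
            cube[tuple(dimensions[i] for i in dims)] = dict(cuboid)
    return cube

cube = build_cube(facts, dimensions)
print(cube[()])                      # apex cuboid: grand total
print(cube[("region",)])             # roll-up to region (dry vs. wet)
print(cube[("region", "product")])   # region-by-product cuboid
```

Comparing the "dry" and "wet" entries of the region cuboid is a tiny instance of the characterization and discrimination described above.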
Data Mining Function: (2) Association and Correlation Analysis
Frequent patterns (or frequent itemsets)
What items are frequently purchased together in a shopping center?
Association, correlation vs. causality
A typical association rule is annotated with two measures (support, confidence)
Are strongly associated items also strongly correlated?
How can such patterns and rules be mined efficiently in large data sets?
How can such patterns be used for classification, clustering, and other applications?
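The support and confidence of an association rule, and the question of whether association implies correlation, can be illustrated with a short calculation. The transactions below are hypothetical, and lift is used here as one common correlation measure; the slide itself does not name a specific measure.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) = support(A and C together) / support(A)."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence divided by the consequent's support; above 1 suggests positive correlation."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
rule_lift = lift({"diapers"}, {"beer"}, transactions)
print(f"diapers -> beer: support={rule_support:.2f}, "
      f"confidence={rule_confidence:.2f}, lift={rule_lift:.2f}")
```

A rule can have high confidence simply because its consequent is common, which is why a correlation measure is usually checked alongside support and confidence.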
Data Mining Function: (3) Classification
Classification and label prediction
Construct models (functions) based on some training examples
Describe and distinguish classes or concepts for future prediction
E.g., classify countries based on climate, or classify cars based on gas mileage
Predict some unknown class labels
Typical methods
Decision trees, Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
Typical applications:
Credit card fraud detection, direct marketing, classifying stars, diseases, web pages, …
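As a minimal sketch of building a model from training examples and then predicting an unknown label, the code below learns a one-level decision tree (a decision stump) on a single numeric feature. The mileage values, the labels "low" and "high", and the function names are hypothetical; practical decision-tree learners handle many features and use criteria such as information gain.

```python
# Hypothetical training examples: (miles per gallon, class label).
training = [(12, "low"), (15, "low"), (18, "low"), (27, "high"), (31, "high"), (35, "high")]

def train_stump(examples):
    """Pick the threshold that misclassifies the fewest training examples.

    The model predicts "high" when the value exceeds the threshold, else "low".
    """
    values = sorted(v for v, _ in examples)
    # Candidate thresholds: midpoints between consecutive sorted values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]

    def errors(threshold):
        return sum(1 for v, label in examples
                   if ("high" if v > threshold else "low") != label)

    return min(candidates, key=errors)

def predict(threshold, value):
    """Classify an unseen value with the learned threshold."""
    return "high" if value > threshold else "low"

threshold = train_stump(training)
print(threshold)
print(predict(threshold, 30), predict(threshold, 14))
```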
Data Mining Function: (4) Cluster Analysis
Unsupervised learning (i.e., the class label is unknown)
Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
Principle: maximize intra-class similarity and minimize inter-class similarity
Many methods and applications
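The clustering principle above can be illustrated with k-means, which is one of many possible methods and is chosen here only as an example. The points and the initial centroids are hypothetical; the initial centroids are supplied by the caller so that the run is deterministic.

```python
# Hypothetical 2-D points (e.g., house coordinates).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))

def k_means(points, centroids, max_iter=100):
    """Lloyd's algorithm with caller-supplied initial centroids."""
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [mean_point(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters

centroids, clusters = k_means(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

Points that end up in the same cluster are close to each other (high intra-class similarity) and far from the other cluster's centroid (low inter-class similarity).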
Data Mining Function: (5) Outlier Analysis
Outlier analysis
Outlier: a data object that does not comply with the general behavior of the data
Noise or exception? One person's garbage could be another person's treasure
Useful in fraud detection and rare events analysis
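A very simple way to flag objects that deviate from the general behavior of the data is a z-score test. This is only one possible method, assumed here for illustration; the transaction amounts and the threshold of 2.5 standard deviations are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical daily card transaction amounts; one value is far from the rest.
amounts = [20, 22, 19, 21, 23, 20, 250, 18, 22, 21]

def z_score_outliers(values, threshold=2.5):
    """Return values whose distance from the mean exceeds threshold standard deviations."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

print(z_score_outliers(amounts))
```

Whether the flagged value is noise to discard or an exception worth investigating (for example, a fraudulent charge) depends on the application.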
Time and Ordering: Sequential Pattern, Trend, and Evolution Analysis
Sequence, trend, and evolution analysis
Trend, time-series, and deviation analysis: e.g., regression and value prediction
Sequential pattern mining
E.g., first buy a digital camera, …
Periodicity analysis
Motifs and biological sequence analysis
Approximate and consecutive motifs
Similarity-based analysis
Mining data streams
Ordered, time-varying, potentially infinite data streams
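Trend analysis by regression and value prediction can be sketched with an ordinary least-squares line fitted to a time series. The series below is hypothetical and perfectly linear so the result is easy to check by hand; real series need noise handling and model checking.

```python
# Hypothetical monthly sales series: (time index, value).
series = [(1, 10.0), (2, 12.0), (3, 14.0), (4, 16.0), (5, 18.0)]

def fit_line(series):
    """Ordinary least-squares fit of value = slope * t + intercept."""
    n = len(series)
    mean_t = sum(t for t, _ in series) / n
    mean_y = sum(y for _, y in series) / n
    cov = sum((t - mean_t) * (y - mean_y) for t, y in series)
    var = sum((t - mean_t) ** 2 for t, _ in series)
    slope = cov / var
    return slope, mean_y - slope * mean_t

slope, intercept = fit_line(series)
forecast = slope * 6 + intercept  # predicted value for the next period
print(slope, intercept, forecast)
```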
Structure and Network Analysis
Graph mining
Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)
Information network analysis
Social networks: actors (objects, nodes) and relationships (edges)
E.g., author networks in CS, terrorist networks
Multiple heterogeneous networks
A person could be in multiple information networks: friends, family, classmates, …
Links carry a lot of semantic information: link mining
Web mining
The Web is a big information network: from PageRank to Google
Analysis of Web information networks
Web community discovery, opinion mining, usage mining, …
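Since PageRank is mentioned as an example of mining the Web as an information network, a minimal power-iteration sketch may help. The four-page link graph is hypothetical, the damping factor 0.85 is the commonly quoted default, and the code assumes every page has at least one outgoing link; it is a teaching sketch, not how a search engine computes ranks at scale.

```python
# Hypothetical link graph: page -> pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(links, damping=0.85, iterations=50):
    """Power iteration for PageRank; assumes every page has at least one out-link."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Every page receives a base share, then shares passed along its in-links.
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, targets in links.items():
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page C receives links from three pages and ends up with the highest score, while page D, which nobody links to, keeps only the base share.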
Evaluation of Knowledge
Is all mined knowledge interesting?
One can mine a tremendous amount of patterns and knowledge
Some may fit only certain dimension space (time, location, …)
Some may not be representative, may be transient, …
Evaluation of mined knowledge → directly mine only
interesting knowledge?
Descriptive vs. predictive
Coverage
Typicality vs. novelty
Accuracy
Timeliness
…
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Machine Learning, Statistics, Pattern Recognition, Applications, Algorithm, High-Performance Computing, Visualization, and Database Technology]
Why Confluence of Multiple Disciplines?
Tremendous amount of data
Algorithms must be highly scalable to handle data volumes such as terabytes
High-dimensionality of data
Micro-array may have tens of thousands of dimensions
Data streams and sensor data
Time-series data, temporal data, sequence data
Structure data, graphs, social networks and multi-linked data
Heterogeneous databases and legacy databases
Spatial, spatiotemporal, multimedia, text and Web data
Software programs, scientific simulations
New and sophisticated applications
Applications of Data Mining
Web page analysis: from web page classification, clustering to
PageRank & HITS algorithms
Collaborative analysis & recommender systems
Basket data analysis to targeted marketing
Biological and medical data analysis: classification, cluster analysis
(microarray data analysis), biological sequence analysis, biological
network analysis
Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)
From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining
Major Issues in Data Mining (1)
Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multi-dimensional space
Data mining: An interdisciplinary effort
Handling noise, uncertainty, and incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining
User Interaction
Interactive mining
Incorporation of background knowledge
Presentation and visualization of data mining results
Major Issues in Data Mining (2)
Efficiency and Scalability
Efficiency and scalability of data mining algorithms
Parallel, distributed, stream, and incremental mining methods
Diversity of data types
Mining dynamic, networked, and global data repositories
Data mining and society
Social impacts of data mining
Privacy-preserving data mining
Invisible data mining
A Brief History of Data Mining Society
1989 IJCAI Workshop on Knowledge Discovery in Databases
Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley,
1991)
1991-1994 Workshops on Knowledge Discovery in Databases
Advances in Knowledge Discovery and Data Mining (U. Fayyad, G.
Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
1995-1998 International Conferences on Knowledge Discovery in Databases
and Data Mining (KDD’95-98)
Journal of Data Mining and Knowledge Discovery (1997)
ACM SIGKDD conferences since 1998 and SIGKDD Explorations
More conferences on data mining
PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM
(2001), etc.
ACM Transactions on KDD starting in 2007
Conferences and Journals on Data Mining
KDD Conferences
ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
SIAM Data Mining Conf. (SDM)
(IEEE) Int. Conf. on Data Mining (ICDM)
European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery and Data Mining (ECML-PKDD)
Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
Int. Conf. on Web Search and Data Mining (WSDM)
Other related conferences
DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, …
Web and IR conferences: WWW, SIGIR, WSDM
ML conferences: ICML, NIPS
PR conferences: CVPR, …
Journals
Data Mining and Knowledge Discovery (DAMI or DMKD)
IEEE Trans. on Knowledge and Data Eng. (TKDE)
KDD Explorations
ACM Trans. on KDD
Where to Find References? DBLP, CiteSeer, Google
Data mining and KDD (SIGKDD: CDROM)
Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
Database systems (SIGMOD: ACM SIGMOD Anthology, CD ROM)
Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
AI & Machine Learning
Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
Web and IR
Conferences: SIGIR, WWW, CIKM, etc.
Journals: WWW: Internet and Web Information Systems,
Statistics
Conferences: Joint Stat. Meeting, etc.
Journals: Annals of Statistics, etc.
Visualization
Conference proceedings: CHI, ACM-SIGGraph, etc.
Journals: IEEE Trans. Visualization and Computer Graphics, etc.
Recommended Reference Books
S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and
Data Mining. AAAI/MIT Press, 1996
U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge
Discovery, Morgan Kaufmann, 2001
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd ed. (3rd ed., 2011)
D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference,
and Prediction, 2nd ed., Springer-Verlag, 2009
B. Liu, Web Data Mining, Springer 2006.
T. M. Mitchell, Machine Learning, McGraw Hill, 1997
G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java
Implementations, Morgan Kaufmann, 2nd ed. 2005
Summary
Data mining: Discovering interesting patterns and knowledge from
massive amount of data
A natural evolution of database technology, in great demand, with
wide applications
A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
Mining can be performed in a variety of data
Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
Data mining technologies and applications
Major issues in data mining