Text summarization
TRANSCRIPT
Automatic Text Summarization
Masayu Leylia Khodra – ITB
Workshop INACL, 20 May 2016
2
Outline
• Introduction
• Manual summarization approaches
• Types of automatic summarization
• Automatic summarization approaches
• Identifying important text units
3
Introduction
• Summarization helps cope with information overload
• Abstracts of scientific papers (indicative summaries)
• Related-work sections of scientific papers (summaries of the supporting papers)
• Snippets in search engines
• Summaries of a set of news articles in one cluster
• Summaries of social media posts
• Summaries of product reviews / opinions
• Tailored summaries
4
Example of a Review Summary (Google)
5
Summary of a News Collection
List of news articles in one cluster
5W1H information
6
Example of a Summary of a Collection of Papers
7
Automatic Text Summarization Machine
[Figure: a summarization machine ("?") that takes DOC, MULTIDOCS, and QUERY as input and produces EXTRACTS or ABSTRACTS. Option labels shown: Extract, Abstract; Indicative, Informative; Generic, Query-oriented; Background, Update; Headline, Very Brief, Brief, Long; 10%, 50%, 100%.]
Source: http://www.isi.edu/natural-language/people/{hovy.html,marcu.html}
8
Introduction (continued)
• Summarization was a track at the Document Understanding Conference (2001-2007) and the Text Analysis Conference (2008-2011, 2014)
  • Multi-document summarization
  • Query-based multi-document summarization
  • Update summarization
  • Guided summarization (update and aspect-based multi-document summarization)
  • Biomedical summarization (PubMed)
9
Summary
• A summary is:
  • short (<50% of the source),
  • an accurate representation of the important content of the document,
  • suited to the user's needs
• A good summary:
  • Is one that everyone agrees on. If each person produces a different summary, they do not know what is needed.
  • Follows the rules of grammar and is cohesive and coherent
    • Coherent: unity of meaning. The main idea is supported by every sentence
    • Cohesive: unity of form
10
Single-Document, Extractive, Generic Summary
Source document
To summarize is to reduce in complexity, and hence in length, while retaining some of the essential qualities of the original. Titles, keywords, tables-of-contents and abstracts might all be considered as forms of summary, however a document summary conventionally refers to an abstract-like condensation of a full-text document. Traditionally, document summaries are provided by the author. This paper focusses on document extracts, a particular kind of computed document summary. ...
Document extracts consisting of roughly 20% of the original can be as informative as the full text of a document, which suggests that even shorter extracts may be useful indicative summaries. However, other studies [12, 2] suggest that the optimal extract can be far from unique. Numerous heuristics have been proposed to guide the selection of document extracts [7, 4, 17, 14], yet no clear criterion has been proposed to choose among them. Existing evidence [4] suggests that combinations of individual heuristics have the best performance.
Extractive summary (Kupiec, 1995)
To summarize is to reduce in complexity, and hence in length, while retaining some of the essential qualities of the original. This paper focusses on document extracts, a particular kind of computed document summary. Document extracts consisting of roughly 20% of the original can be as informative as the full text of a document, which suggests that even shorter extracts may be useful indicative summaries.
Transforming a source text into a shorter text by selecting the important information
11
Multi-Document, Abstractive, Generic Summary
12
Multi-Document, Extractive, Tailored Summary
13
Summarization Approaches: Manual (Sherrard, 1989)
• Immature strategy: deleting and copying parts of the original text
  • Deletion rule: delete unimportant propositions
    Text: "Mary played with a ball. The ball was blue." Summary: "Mary played with a ball."
  • Selection rule: delete propositions that can be inferred from the remaining propositions, based on knowledge of normal situations and conditions
    Text: "I went to Paris. So, I went to the station, bought a ticket, took the train ..." Summary: "I went to Paris."
14
Summarization Approaches: Manual (continued)
• Mature strategy: generating new sentences
  • Generalization rule: substitute a general term for a set of specific terms
    Text: "Mary played with a doll. Mary played with blocks." Summary: "Mary played with toys."
  • Construction rule: construct a new proposition from which the deleted propositions can be inferred
    Text: "I went to the station, bought a ticket ..." Summary: "I went by train."
15
Summarization Approaches: Manual
Immature and Mature Strategies (Sherrard, 1989)
• Deletion rule
• Selection rule
• Generalization rule
• Construction rule
Macro Rules (Brown and Day)
• DELETE trivial and redundant information
  • Already done by 10-year-old children
• SELECT a topic sentence already in the text
• SUBSTITUTE a general term for a list of objects or a sequence of actions
• INVENT a topic sentence, if one does not already appear in the text
  • Done only by experts
16
Automatic Summarization Approaches (Mani & Maybury, 1999)
• Classical approach: basic features are weighted manually, and the text units with the highest weights are selected
• Corpus-based approach: feature weights are learned automatically from a collection
• Discourse-based approach: exploits the discourse structure of the text
• Knowledge-based approach: structures the text through template-based information extraction, then infers knowledge to produce the summary
  • Focuses on the transformation and synthesis stages
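The classical approach above can be sketched in a few lines of Python. This is only an illustration: the feature names, feature values, and weights below are assumptions made up for the example, not values from the slides or from any cited system.

```python
# Classical approach: each text unit gets a score that is a manually
# weighted sum of basic feature values; the top-scoring units are kept.

# Hypothetical hand-picked weights (assumed for illustration only).
WEIGHTS = {"position": 1.0, "title": 0.5, "cue": 1.5, "frequency": 0.5}


def score(features, weights=WEIGHTS):
    """Weighted linear combination of the feature values of one unit."""
    return sum(weights[name] * value for name, value in features.items())


def select_top(units, n):
    """Return the n highest-scoring units, restored to document order."""
    ranked = sorted(range(len(units)), key=lambda i: score(units[i][1]), reverse=True)
    return [units[i][0] for i in sorted(ranked[:n])]


# Hypothetical sentences with hand-assigned feature values.
units = [
    ("S1", {"position": 1.0, "title": 1.0, "cue": 0.0, "frequency": 0.4}),
    ("S2", {"position": 0.0, "title": 0.0, "cue": 0.0, "frequency": 0.2}),
    ("S3", {"position": 0.0, "title": 1.0, "cue": 1.0, "frequency": 0.6}),
]
summary = select_top(units, 2)
```

In a corpus-based approach the `WEIGHTS` table would be learned from a collection instead of being set by hand; the selection step could stay the same.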
17
Issue: Identifying Important Text Units [Marcu, 2002]
• Position-based methods
• Title- or query-based methods
• Cue-phrase-based methods
• Word-frequency-based methods
• Cohesion-based methods
  • Word-based
  • Lexical-chain-based
  • Connectedness-based
• Discourse-based methods
• Integration or combination
18
Position-based Methods
• Text structure is regular, so sentences at certain locations tend to contain important information
• Single-document: the best method around 1995
  • Leading text outperforms ANES, which is based on word co-occurrence
• Multi-document (DUC 2007):
  • Baseline: take sentences in their original order from each document
    • ROUGE-2: 0.06039; ROUGE-SU4: 0.10507; Content Responsiveness: 1.87 (rank 30 of 32)
  • MASC (Multiple Alternative Sentence Compressions)
    • Compresses only the first 5 sentences of each document
• Strengths: simple; identifies 33% of important sentences [1][2]; better than word-occurrence methods on a collection of encyclopedia articles [1]
• Weakness: the informative positions depend on the domain
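A position-based summarizer can be as simple as taking the leading sentences of each document. The sketch below assumes a naive punctuation-based sentence splitter and invented example documents; it is not the exact DUC 2007 baseline, only an illustration of the idea.

```python
import re


def split_sentences(text):
    """Naive splitter (an assumption): break after ., ! or ? plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def lead_summary(documents, k=1):
    """Take the first k sentences of every document, in document order."""
    summary = []
    for doc in documents:
        summary.extend(split_sentences(doc)[:k])
    return summary


# Hypothetical two-document cluster.
docs = [
    "The flood hit the city on Monday. Thousands were evacuated. Rain continues.",
    "Officials opened ten shelters. Schools are closed until Friday.",
]
lead = lead_summary(docs, k=1)
```

Because the method relies only on position, it works well for news-style text that front-loads key facts and poorly for genres that do not.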
19
Title- or Query-based Methods
• Sentences are scored by the number of desirable words they contain (from the title, headings, or query)
• Strengths:
  • Improves the performance of a position-based summarizer (8%) and a cue-phrase-based summarizer (3%) [1]
• DUC 2007:
  • CLASSY: number of query terms
  • SVR-based: size of the overlap between the named entities of the query and of the sentence
  • LCC’s GISTexter: relevance score from a sentence retrieval engine
  • NUS: similarity(sentence, query)
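Counting desirable words might look like the sketch below. The tokenizer, the query, and the sentences are assumptions for illustration; real systems would typically add stemming and stopword removal first.

```python
import re


def tokens(text):
    """Lowercased alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def query_score(sentence, desirable_words):
    """Number of sentence tokens that appear among the desirable words."""
    wanted = set(desirable_words)
    return sum(1 for t in tokens(sentence) if t in wanted)


# Hypothetical query and candidate sentences.
desirable = tokens("flood evacuation shelters")
sentences = [
    "The flood forced a mass evacuation.",
    "Schools are closed until Friday.",
    "Shelters opened after the flood.",
]
scores = [query_score(s, desirable) for s in sentences]
```

The same function covers the title-based variant: pass the tokens of the title or headings as `desirable_words` instead of the query.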
20
Cue-phrase-based Methods
• Important sentences contain “bonus phrases” such as significantly, in this paper we show, in conclusion
• Unimportant sentences contain “stigma phrases” such as hardly, impossible
• Strengths:
  • identifies 55% of important sentences in scientific articles using 1423 phrases
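A cue-phrase scorer can be sketched with the example phrases from the slide. The +1 / -1 weighting and the plain substring matching are assumptions; a real system would use a much larger phrase list and match on word boundaries.

```python
# Example phrases taken from the slide; the scoring scheme is assumed.
BONUS = ("significantly", "in this paper we show", "in conclusion")
STIGMA = ("hardly", "impossible")


def cue_score(sentence, bonus=BONUS, stigma=STIGMA):
    """+1 for each bonus phrase present, -1 for each stigma phrase present."""
    text = sentence.lower()
    return sum(p in text for p in bonus) - sum(p in text for p in stigma)


# Hypothetical sentences.
sentences = [
    "In conclusion, the method significantly improves recall.",
    "It is hardly possible to evaluate this.",
    "We collected 200 documents.",
]
scores = [cue_score(s) for s in sentences]
```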
21
Word-frequency-based Methods
• Important sentences contain frequently occurring words
• Used by statistical models
• Weakness:
  • Reduces performance by 2%-7% relative to other systems
• DUC 2007:
  • CLASSY, LCC’s GISTexter: number of signature terms
  • SVR-based: tf.idf
  • IIIT Hyderabad: probability that the words of a sentence appear in the term cluster of the summary (NB model)
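A minimal frequency-based scorer, assuming a tiny hand-made stopword list and scoring each sentence by the average document frequency of its content words. This is one simple variant, not the signature-term or tf.idf scoring used by the DUC systems listed above.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, far from complete).
STOPWORDS = {"the", "a", "of", "is", "in", "and", "to", "are"}


def content_words(text):
    """Lowercased alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def frequency_scores(sentences):
    """Score each sentence by the mean corpus count of its content words."""
    counts = Counter(w for s in sentences for w in content_words(s))
    scores = []
    for s in sentences:
        words = content_words(s)
        scores.append(sum(counts[w] for w in words) / len(words) if words else 0.0)
    return scores


# Hypothetical sentences.
sentences = [
    "Summaries reduce text.",
    "Extracts are summaries of text.",
    "Cats sleep.",
]
scores = frequency_scores(sentences)
```

Averaging rather than summing keeps long sentences from winning on length alone.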
22
Cohesion-based Methods
• Important sentences are the most connected entities in the semantic structure
  • Word-based (such as LSA)
  • Lexical-chain-based (with WordNet)
  • Connectedness-based
• DUC 2007:
  • GISTexter, SVR-based: number of named entities in the sentence
  • NUS: time-stamped graph (sentence, similarity)
  • IIIT Hyderabad: term co-occurrence
  • IS-SUM: multi-document lexical chain
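The connectedness-based idea can be illustrated with a small sentence graph: sentences are nodes, an edge links two sentences whose word overlap is high enough, and a sentence's score is its number of edges. Jaccard similarity and the 0.2 threshold are assumptions chosen for the example.

```python
import re


def word_set(sentence):
    """Set of lowercased alphabetic tokens in the sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def connectedness(sentences, threshold=0.2):
    """Degree of each sentence in the similarity graph."""
    sets = [word_set(s) for s in sentences]
    degrees = [0] * len(sentences)
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                degrees[i] += 1
                degrees[j] += 1
    return degrees


# Hypothetical sentences.
sentences = [
    "The storm flooded the city",
    "The city opened shelters after the storm",
    "Shelters opened in schools",
    "Football results were announced",
]
degrees = connectedness(sentences)
```

Here the second sentence links to both the first and the third, so it is the most connected and would be selected first.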
23
Discourse-based Methods
• The hierarchy of a text's discourse structure can be used to determine which sentences are important
• Strengths: performance nearly as good as humans on a collection of Scientific American texts
• DUC 2007:
  • CLASSY, SVR, GISTexter: sentence position
24
Integrating or Combining Methods
• No single scoring method for extraction performs as well as humans, so methods are combined
• Experiment with a Bayesian classifier using the features paragraph position, cue-phrase indicators, word frequency, upper-case words, and sentence length:
  • Precision, paragraph position: 0.33
  • Precision, cue-phrase: 0.29
  • Precision, position + cue: 0.42
  • Precision, all features combined: 0.42
• Decision tree: C4.5 > single method
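Combining features with a Bayesian classifier can be sketched as a small naive Bayes model over binary sentence features. The two features (is the sentence in a leading paragraph, does it contain a cue phrase), the training examples, and the Laplace smoothing are all assumptions for illustration; they do not reproduce the experiment or the precision figures above.

```python
from math import log


def train(examples):
    """examples: list of (tuple of 0/1 features, in_summary flag)."""
    n_features = len(examples[0][0])
    model = {}
    for label in (True, False):
        rows = [f for f, y in examples if y == label]
        prior = (len(rows) + 1) / (len(examples) + 2)
        # Laplace-smoothed estimate of P(feature_j = 1 | label).
        likes = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                 for j in range(n_features)]
        model[label] = (prior, likes)
    return model


def log_score(model, features, label):
    """log P(label) + sum of log P(feature_j | label)."""
    prior, likes = model[label]
    total = log(prior)
    for value, p in zip(features, likes):
        total += log(p if value else 1 - p)
    return total


def in_summary(model, features):
    """True when the 'in summary' class is the more probable one."""
    return log_score(model, features, True) > log_score(model, features, False)


# Hypothetical training data: (lead_paragraph, has_cue_phrase) -> in summary?
examples = [
    ((1, 1), True), ((1, 0), True), ((0, 1), True),
    ((0, 0), False), ((0, 0), False), ((1, 0), False),
    ((0, 0), False), ((0, 1), False),
]
model = train(examples)
decision_both = in_summary(model, (1, 1))
decision_none = in_summary(model, (0, 0))
```

With this toy data a sentence that has both features is classified as a summary sentence, while a sentence with neither is not, which mirrors the idea that combined evidence is stronger than any single feature.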
25
Extractive Framework
Pipeline: query + set of docs → text preprocessing → sentence scoring → select n-top → remove redundancy → reordering → text postprocessing → summary
Techniques listed on the slide:
• sentence splitting (tokenization, tagging), document filtering, sentence filtering, clustering, router: 1/n event
• POS tagging, lemmatization/stemming, stopword removal, named entity recognition, phrase elimination, sentence compression
• query term expansion
• MMR, distance(query, doc)
• WordNet + ontology
26
Removing Redundancy
• Maximal Marginal Relevance
  • maximizes relevance to the query while maximizing the difference from the text units already selected
• Greedy selection: CLASSY 2001
  • makes the locally best choice at each step
• Singular Value Decomposition subset: CLASSY 2002
• Pivoted QR factorization: CLASSY 2001-2006
• LSI + L1-norm pivoted QR: CLASSY 2007
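Maximal Marginal Relevance picks, at each step, the candidate that maximizes lam * sim(candidate, query) - (1 - lam) * max sim(candidate, already selected). The sketch below assumes Jaccard word overlap as the similarity function and lam = 0.5; both are illustrative choices, as are the example sentences.

```python
import re


def word_set(text):
    """Set of lowercased alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mmr_select(sentences, query, n, lam=0.5):
    """Greedily pick n sentences that are relevant to the query but not redundant."""
    q = word_set(query)
    sets = [word_set(s) for s in sentences]
    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < n:
        def mmr(i):
            relevance = jaccard(sets[i], q)
            redundancy = max((jaccard(sets[i], sets[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]


# Hypothetical sentences: the second is a near-duplicate of the first.
sentences = [
    "Flood forces evacuation of the city",
    "Flood forces evacuation of the whole town",
    "Evacuation shelters opened in schools",
    "Football results were announced",
]
picked = mmr_select(sentences, "flood evacuation", 2)
```

The near-duplicate second sentence is more relevant to the query than the third, but its overlap with the already selected first sentence pushes its MMR score below the third's, so the third is chosen instead.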
27
Post Processing
• Coreference resolution
  • replaces short forms/abbreviations and pronouns with the long name. For example: “King Norodom Sihanouk” for “Norodom Sihanouk”, “Sihanouk”, “the king”
  • CLASSY 2004: coreference resolution improves ROUGE
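The substitution step can be illustrated once the coreferent mentions are known. The sketch below assumes the alias list is supplied by hand, which sidesteps the hard part (actually resolving coreference); the function name and example text are hypothetical.

```python
import re


def expand_mentions(text, full_name, aliases):
    """Replace every listed alias with full_name, matching longer forms first."""
    forms = sorted([full_name, *aliases], key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(f) for f in forms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(full_name, text)


# Aliases from the slide's example; the summary text is invented.
aliases = ["Norodom Sihanouk", "Sihanouk", "the king"]
summary = "Sihanouk returned on Monday. The king met the prime minister."
expanded = expand_mentions(summary, "King Norodom Sihanouk", aliases)
```

Including the full name among the matched forms keeps an existing "King Norodom Sihanouk" from being rewritten into "King King Norodom Sihanouk". In practice a system might expand only the first mention to avoid repetitive text.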
28
References on Indonesian Text Summarization
• Budhi et al.: Dijkstra and Steepest Ascent Hill Climbing Algorithm.
• A. Mirna et al., 2006: Cue phrases and TF.idf
• G. Yapinus, A. Erwin, M. Galinium, W. Muliady (2014), Automatic Multi-Document Summarization for Indonesian Documents Using Hybrid Abstractive-Extractive Summarization Technique, 2014 6th International Conference on Information Technology and Electrical Engineering (ICITEE), Yogyakarta, Indonesia
• P. P. Tardan, A. Erwin, K.I.Eng, W. Muliady (2013), Automatic Text Summarization Based on Semantic Analysis Approach for Documents in Indonesian Language,
• Aristoteles, Y. Herdiyeni, A. Ridha, J. Adisantoso (2012), Text Feature Weighting for Summarization of Document In Bahasa Indonesia Using Genetic Algorithm, IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 3, No 1, May 2012
29
References
• Carol Sherrard (1989), Teaching Students To Summarize: Applying Textlinguistics. System. International Journal of Educational Technology and Applied Linguistics. Volume 17, Issue 1, Pages 1-163.
• Das, D., Martins, A.F.T. (2007): A Survey on Automatic Text Summarization. Literature Survey for the Language and Statistics II Course at CMU.
• Edmundson, H.P. (1969): New Methods in Automatic Extracting. Journal of the Association for Computing Machinery.
• Hovy & Marcu (1998), Automated Text Summarization Tutorial, COLING/ACL’98. http://www.isi.edu/natural-language/people/{hovy.html,marcu.html}
• Hovy, E. (2003): Text Summarization, chapter 32 of The Oxford Handbook of Computational Linguistics.
• Jones, K. S. (2007): Automatic Summarising: The state of the art. Information Processing and Management 43 (2007) 1449-1481, Elsevier.
• Kupiec, J., et al. (1995): A Trainable Document Summarizer. ACM SIGIR
• Marcu, D. (2003), Automatic Abstracting. Encyclopedia of Library and Information Science 2003; 245-256.
30
Thank you for your attention