
Automatic Text Summarization (MEAD)
Instructor: Prof. 柯皓仁
Team members: 陳志華 9634501
陳昌民 9634508
許家偉 9634524
盧盈蓉 9634528
Overview

• Types of text summarization
• MEAD Installation
• Process of text summarization
– Preprocess
– Feature Selection
– Classifier
– Reranker
– Summary
• Application
Types of text summarization
• Single document vs. Multiple document
• Domain specific vs. General
• Knowledge poor vs. Knowledge rich
• Extract vs. Abstract
MEAD Installation

• Check the kernel version of the OS
• Check the Perl Environment
• Download MEAD
• Install External Software
• Install MEAD
• Set Environment Variables
MEAD Installation (cont.)

• MEAD Package
– MEAD 3.11 (newest)
– http://www.summarization.com/mead/

• Operating Environment
– Solaris 5.7
– Linux (kernel 2.2, 2.4 and 2.6)

• Programming Language
– Perl 5.5 or above
MEAD Installation (cont.)

• Perl 5.5 Installation
MEAD Installation (cont.)

• External Software (Required)
– Expat
– XML::Parser
– XML::Writer
– XML::TreeBuilder
– Text::Iconv
MEAD Installation (cont.)

• Expat Installation (Required)
– http://sourceforge.net/projects/expat/
• Command
– ./configure
– make install
MEAD Installation (cont.)

• XML::Parser Installation (Required)
– http://www.cpan.org/authors/id/C/CO/COOPERCL/XML-Parser.2.30.tar.gz
• Command
– perl Makefile.PL
– make install
MEAD Installation (cont.)

• XML::Writer Installation (Required)
– http://www.cpan.org/authors/id/DMEGG/XML-Writer-0.4.tar.gz
• Command
– perl Makefile.PL
– make install
MEAD Installation (cont.)

• XML::TreeBuilder Installation (Required)
– http://www.cpan.org/authors/id/S/SB/SBURKE/XML-TreeBuilder-3.08.tar.gz
• Command
– perl Makefile.PL
– make install
MEAD Installation (cont.)

• Text::Iconv Installation (Required)
– http://search.cpan.org/CPAN/authors/id/M/MP/MPOITR/Text-Iconv-1.3.tar.gz
• Command
– perl Makefile.PL
– make install
MEAD Installation (cont.)

• External Software (Optional)
– Support Vector Machines (SVM)
– LT-XML
MEAD Installation (cont.)

• SVM Installation (Optional)
– http://www.cs.cornell.edu/People/tj/svm_light
• Command
– make all
MEAD Installation (cont.)

• LT-XML Installation (Optional)
– http://www.ltg.ed.ac.uk/software/xml/index.html
• Command
– su
– ./configure
– make install
MEAD Installation (cont.)
• Update /etc/environment
• Set Environment Variables
– $MEAD_DIR = the directory created by unpacking the MEAD archive
– $BIN_DIR = $MEAD_DIR/bin
– $SCRIPTS_DIR = $BIN_DIR/feature-scripts
– $LIB_DIR = $MEAD_DIR/lib
– $DOCS_DIR = $MEAD_DIR/docs
– $DATA_DIR = $MEAD_DIR/data
– $DTD_DIR = $MEAD_DIR/dtd
– $ETC_DIR = $MEAD_DIR/etc
– $USER_DIR = $MEAD_DIR/user
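The directory layout listed above can be derived from a single root, which helps when writing the entries for /etc/environment. The following is a minimal Python sketch, not part of MEAD; the function name `mead_directories` and the example root `/mead` are assumptions made for illustration. It prints fully expanded paths, because /etc/environment is usually read as plain KEY=value pairs without `$VAR` expansion.

```python
from pathlib import PurePosixPath


def mead_directories(mead_dir):
    """Return the MEAD variables from the slide, expanded from one root directory."""
    root = PurePosixPath(mead_dir)
    bin_dir = root / "bin"
    return {
        "MEAD_DIR": str(root),
        "BIN_DIR": str(bin_dir),
        "SCRIPTS_DIR": str(bin_dir / "feature-scripts"),
        "LIB_DIR": str(root / "lib"),
        "DOCS_DIR": str(root / "docs"),
        "DATA_DIR": str(root / "data"),
        "DTD_DIR": str(root / "dtd"),
        "ETC_DIR": str(root / "etc"),
        "USER_DIR": str(root / "user"),
    }


if __name__ == "__main__":
    # "/mead" is a hypothetical install location; replace it with the real one.
    for name, path in mead_directories("/mead").items():
        print(f'{name}="{path}"')
```

The printed lines can be reviewed and then appended to /etc/environment by hand.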
Process of text summarization
• Preprocess
• Feature Selection
• Classifier
• Reranker
• Summary
• Evaluation
Process of text summarization (cont.)
• Open a terminal

• Go to $BIN_DIR
– /mead/bin/

• Execute mead.pl to extract the summary from the default document (GA3)
– ./mead.pl GA3
Process – Preprocess

• Convert the original documents into the MEAD format
• Define the cluster of documents to be used
Process – Preprocess (cont.)

• Document in MEAD format
Process – Preprocess (cont.)

• Command
– perl mead/bin/addons/formatting/text2cluster.pl Document_file
Process – Feature Selection

• Describe the weight of the sentences in each document
• Calculate the feature vector of each sentence
Process – Feature Selection (cont.)

• Related Work
– Term Frequency: 15 papers
• [1, 2, 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20]
– Centrality: 9 papers
• [3, 4, 5, 6, 8, 12, 18, 19, 20]
– Position: 9 papers
• [1, 3, 7, 8, 9, 12, 16, 19, 20]
– Sentence Length: 6 papers
• [1, 2, 3, 7, 9, 14]
Process – Feature Selection (cont.)

• Related Work
– Cue-Phrase: 5 papers
• [2, 7, 12, 13, 14]
– Resemblance to Title: 5 papers
• [8, 9, 12, 16, 19]
– Negative Words: 4 papers
• [8, 10, 12, 19]
– Uppercase Words: 3 papers
• [2, 7, 14]
Process – Feature Selection (cont.)

• Features of MEAD
– Length (default)
– Centroid (default)
– Position (default)
– QueryCosine
– QueryCosineNoIDF
– QueryWordOverlap
– SimWithFirst
– QueryPhraseMatch
– LexRank
Process – Feature Selection (cont.)

• How to code?
– /mead/bin/feature-scripts/
– Add new feature scripts in Perl
– Modify existing feature scripts in Perl
• Feature Example
– Position (Old): score = sqrt(1 / T)
– Position (New): score = (N – T + 1) / N
– Arguments
• N: the number of sentences in the document
• T: the position of the sentence in the document (the T-th sentence)
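To compare the two Position formulas above, here is a minimal Python sketch. It is an illustration only; the actual MEAD feature scripts are written in Perl, and the function names are assumptions. T is 1-based, as on the slide.

```python
import math


def position_score_old(t):
    """Old Position score: sqrt(1 / T), where T is the 1-based sentence index."""
    if t < 1:
        raise ValueError("T must be at least 1")
    return math.sqrt(1.0 / t)


def position_score_new(t, n):
    """New Position score: (N - T + 1) / N, decreasing linearly from 1 to 1/N."""
    if not 1 <= t <= n:
        raise ValueError("T must satisfy 1 <= T <= N")
    return (n - t + 1) / n


if __name__ == "__main__":
    n = 5  # a hypothetical five-sentence document
    for t in range(1, n + 1):
        print(t, round(position_score_old(t), 3), round(position_score_new(t, n), 3))
```

Both formulas give the first sentence a score of 1. The old score depends only on T, while the new score also depends on the document length N, so the last sentence always receives 1/N.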
Process – Feature Selection (cont.)

• Coding
Process – Feature Selection (cont.)

• Coding

← Insert the code

← Modify the code

Process – Feature Selection (cont.)

• Feature
Process – Classifier

• Select features
• Set the weight of those features
• Combine those features and their weights to calculate the score of each sentence
Process – Classifier (cont.)

• Related Work
– Bayesian Rule: 4 papers
• [7, 11, 13, 15]
– Genetic Algorithm: 2 papers
• [8, 19]
– Decision Tree: 2 papers
• [10, 15]
– Neural Networks: 1 paper
• [9]
– Support Vector Machine: 1 paper
• [10]
Process – Classifier (cont.)

• Command
– ./mead.pl -classifier "classifier_type feature1_type feature1_weight feature2_type feature2_weight …" Document_Name
– ./mead.pl -classifier "./default-classifier.pl Length 12 Centroid 1 Position 0.5" -scores GA3
Process – Classifier (cont.)

• Command
– ./mead.pl -classifier "./default-classifier.pl Length 12 Centroid 1 Position 0.5" -scores GA3
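The scoring step behind the classifier command can be sketched as a weighted sum of feature values. The Python sketch below is an illustration, not MEAD's Perl code; the function name and the feature values are hypothetical. It also assumes that the Length argument acts as a minimum sentence length (shorter sentences score 0) rather than as a weight. That is how the default classifier is understood here, and it should be checked against the MEAD documentation.

```python
def classifier_score(features, weights, min_length=0):
    """Weighted linear combination of feature values for one sentence.

    features: feature name -> value for this sentence
    weights: feature name -> weight (e.g. Centroid 1, Position 0.5)
    min_length: assumed Length cutoff; shorter sentences score 0
    """
    if features.get("Length", 0) < min_length:
        return 0.0
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


if __name__ == "__main__":
    weights = {"Centroid": 1.0, "Position": 0.5}
    long_sentence = {"Length": 20, "Centroid": 0.8, "Position": 0.5}
    short_sentence = {"Length": 5, "Centroid": 0.9, "Position": 1.0}
    print(classifier_score(long_sentence, weights, min_length=12))
    print(classifier_score(short_sentence, weights, min_length=12))
```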
Process – Reranker

• Judge the correlation between sentences
• Decrease redundancy
• Set the compression rate
Process – Reranker (cont.)

• Command
– ./mead.pl -reranker "reranker_type SimFunction ThresholdValue IDFName" Document_name
– ./mead.pl -reranker "./default-reranker.pl MEAD-cosine 0.7 enidf" -scores GA3
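The idea behind the reranker's similarity threshold can be sketched as follows: take sentences in descending score order and skip any sentence that is too similar to one already chosen. This Python sketch is a simplified illustration, not MEAD's reranker. It uses plain term-frequency cosine similarity without IDF weighting (so it is not the MEAD-cosine function with the enidf database), and the function names and sample sentences are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Term-frequency cosine similarity between two whitespace-tokenized sentences."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def rerank(scored, threshold=0.7, max_sentences=2):
    """Greedy selection: best score first, dropping near-duplicates of chosen sentences."""
    selected = []
    for sentence, _score in sorted(scored, key=lambda pair: pair[1], reverse=True):
        if len(selected) >= max_sentences:
            break
        if all(cosine(sentence, chosen) <= threshold for chosen in selected):
            selected.append(sentence)
    return selected


if __name__ == "__main__":
    scored = [
        ("heavy rain flooded the town", 0.9),
        ("heavy rain flooded the whole town", 0.8),
        ("rescue teams arrived at night", 0.5),
    ]
    print(rerank(scored, threshold=0.7, max_sentences=2))
```

In the sample, the second sentence is nearly identical to the first, so it is skipped and the third sentence fills the remaining slot.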
Process – Reranker (cont.)

• Extract
Process – Summary

• Get the .extract from the Reranker
• Map the extract to the summary
Process – Summary (cont.)

• Command
– ./extract-to-summary.pl cluster_file docsent_dir extract_file
– ./extract-to-summary.pl ../data/GA3/GA3.cluster ../data/GA3/docsent ../data/GA3/GA3.20.extract
Process – Summary (cont.)

• Operation
Process – Evaluation

• Measure the performance of a system
– Effectiveness of results
– Satisfaction of the user
Process – Evaluation (cont.)

• Related Work
– Recall: 13 papers
• [1, 2, 4, 7, 8, 10, 11, 13, 14, 15, 17, 18, 19]
– Precision: 13 papers
• [1, 2, 4, 8, 10, 11, 12, 13, 14, 15, 17, 18, 19]
– F-Measure: 7 papers
• [2, 8, 10, 11, 14, 15, 19]
– ROUGE: 2 papers
• [6, 20]
Process – Evaluation (cont.)

• Evaluations of MEAD
– MEAD Eval
• Precision
• Recall
– ROUGE
– Lexical Similarity
Process – Evaluation (cont.)

• ROUGE
• Generate auto_temp.xml
– ./rouge.pl summary manual_summary1 manual_summary2
– ./rouge.pl example.summary D1101.M.100.T.W D1101.M.100.T.X D1101.M.100.T.Y D1101.M.100.T.Z
Process – Evaluation (cont.)

• ROUGE
• Use ROUGE-1.5.5.pl and auto_temp.xml to evaluate
– ./ROUGE-1.5.5.pl -e $DATA -n 2 -a auto_temp.xml
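What ROUGE-N measures can be sketched as n-gram recall: the share of the reference summary's n-grams that also occur in the system summary. The Python sketch below is a simplified illustration, not the ROUGE-1.5.5 implementation. It handles a single reference, uses whitespace tokenization, and applies no stemming or stop-word handling; the function names and sample sentences are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=1):
    """Clipped n-gram matches divided by the number of n-grams in the reference."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


if __name__ == "__main__":
    system = "the cat sat"
    manual = "the cat sat on the mat"
    print(rouge_n_recall(system, manual, n=1))
    print(rouge_n_recall(system, manual, n=2))
```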
Process – Evaluation (cont.)

• ./meadeval.pl MEAD_Extract Manual_Extract

• ./meadeval.pl $DATA_DIR/GA3/GA3.10.extract $DATA_DIR/GA3/GA3.20.extract

• Precision and Recall
R ll
Process – Feature Selection (cont.)

• Coding
Process – Evaluation (cont.)

• F-Measure
= 2 / [(1/precision) + (1/recall)]
= (2*precision*recall) / (precision + recall)
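Precision, recall and the F-Measure can be computed from two extracts as in the minimal Python sketch below. It is an illustration only, not the MEAD evaluation script. Each extract is represented as a set of sentence identifiers, and the function name and sample sets are assumptions.

```python
def precision_recall_f(system, reference):
    """Return (precision, recall, F-measure) for two sets of sentence identifiers."""
    matched = len(system & reference)
    precision = matched / len(system) if system else 0.0
    recall = matched / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    system_extract = {1, 2, 3, 4}  # sentences chosen by the system
    manual_extract = {2, 4, 6}     # sentences chosen by a human judge
    print(precision_recall_f(system_extract, manual_extract))
```

In the sample, two of the four system sentences are correct (precision 1/2) and two of the three manual sentences are found (recall 2/3), giving an F-Measure of 4/7.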
Process – Feature Selection (cont.)

• Coding

← Insert the code

← Modify the code

← Insert the code
Process – Feature Selection (cont.)

• Coding
Process – Other

• Command-line options to mead.pl
• -help
– Print help information and exit
– ./mead.pl -help
• -summary (this is the default)
– Produce a summary
• -extract
– Produce an extract
• -output_mode mode
– Mode can be either "summary" or "extract"
Process – Other (cont.)

• -sentences (this is the default)
– Measure length (the compression basis) in sentences
• -word
– Measure length (the compression basis) in words
• -percent num (this is the default)
– Produce a summary whose length is num% of the length of the original cluster
• -absolute num
– Produce a summary whose length is num (words/sentences) regardless of the size of the original cluster
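The difference between -percent and -absolute can be sketched as a small length calculation. The Python sketch below is an illustration, not MEAD's code; the function name is an assumption. Rounding up and capping at the cluster size are also assumptions, since the slides do not say how MEAD rounds.

```python
import math


def target_length(cluster_units, percent=None, absolute=None):
    """Summary length in units (sentences or words) for a cluster of cluster_units."""
    if (percent is None) == (absolute is None):
        raise ValueError("give exactly one of percent or absolute")
    if absolute is not None:
        wanted = absolute
    else:
        wanted = math.ceil(cluster_units * percent / 100)  # assumed rounding rule
    return min(wanted, cluster_units)  # a summary cannot exceed its source


if __name__ == "__main__":
    print(target_length(50, percent=20))  # 20% of a 50-sentence cluster
    print(target_length(50, absolute=5))  # fixed length of 5 sentences
```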
Process – Other (cont.)

• -system RANDOM
– Produce a random summary
• -system LEADBASED
– Produce a lead-based summary
• -feature_cache_policy policy
– Policy can be either "keep" or "delete"
– The "delete" option forces MEAD to recompute all features
– "keep" is the default setting
Process – Other (cont.)

• -feature name [-recompute] commandline
• -classifier commandline
• -reranker commandline
• -meadrc rcfile
Process – Other (cont.)

• Set rcfile or config
• Sample.rc
– feature Length ./feature-scripts/Length.pl
– feature Position ./feature-scripts/Position.pl
– feature Centroid ./feature-scripts/Centroid.pl
– classifier ./default-classifier.pl Length 9 Position 1 Centroid 1
– reranker ./default-reranker.pl MEAD-cosine 0.9 enidf
– percent 20
Process – Other (cont.)

• Sample.config
Application
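• Situation-Aware U-Learning System (SAULS)

• Ubiquitous Learning
– Diverse learning equipment and devices
– Learning anytime, anywhere

• Institute of Slopeland Disaster Prevention and Water Resources Engineering (坡地防災及水資源工程研究所)
– Disasters in Taiwan are becoming increasingly severe (natural and man-made disasters)
– In 2006, the heavy rains of May and typhoons (Bilis, Kaemi, Bopha) devastated areas across Pingtung County
– In 2005, the June 12 (0612) torrential rain flooded townships across Pingtung County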
Application (cont.)

[Figure: SAULS system architecture. Recoverable labels: Mobile users (Notebook, PDA, Smart Phone, PC, Tablet PC); Intelligent agent (user interface agent, WSN monitor agent, Rainfall Info. agent, Coordinating agent, Push agent, Web GIS agent, Text Summarization); Model base server (PageRank, Bayesian Classification, Euclidean Distance, Convex Hull, Spatial Interpolation, The-shortest Path, Smooth Escape Path); Multimedia server; Reasoning Engine; Active database server (Real-time Ecology Disaster info., Geographical info., Environment info., Learning content, Situation-Aware System info., Multimedia file); Heterogeneous Networks (Mobile Networks, Internet, WSN Gateway, RFID Reader, GPS Module, Satellite, Router, Firewall, Network IP Camera); Monitoring Sensors (Temperature, Humidity, Rainfall, Soil moisture, Illumination); E-Learning Management Platform (Learning Content Modules, Text Summarization, Mobile Learning Map, Knowledge-based Blog, Interactive Response System, Real-Time Multicast Chat Room).]
Application (cont.)

• Process of Text Summarization
[Figure: data flow among Mobile Users, the Multimedia Server (Text Summarization Agent, Java Socket Client), the MEAD Server (Java Socket Server, MEAD module, Summary module) and the Database Server.]
– Numbered flows: 1. Original Document, 2. Secondary Document, 3. Original Document, 4. Original Summary, 5. Secondary Summary, 6. Secondary Summary
– Code process: ` → ', “ → ", SQL Injection
– MEAD module steps: Preprocess, Feature Selection, Classifier, Reranker, Summary
Future Work
• Ontology
– Domain Knowledge
– Semantic (NLP)

• Web Mining
– Query based
– Topic Oriented

• Multimedia
– Voice
– Video
References
• [1] C. Aone, M. E. Okurowski, J. Gorlinsky, and B. Larsen (1999), “A trainable Summarizer with Knowledge Acquired from Robust NLP Techniques”, In I. Mani and M. Maybury (eds), Advances in Automated Text Summarization, MIT Press, pp. 71-80, 1999.
• [2] C. W. Wu and C. L. Liu (2003), “Ontology-based text summarization for business news articles”, Proceedings of the ISCA Eighteenth International Conference on Computers and Their Applications (CATA'03), Honolulu, Hawaii, USA, pp. 389-392, 2003.
• [3] D. R. Radev, A. Winkel, and M. Topper (2002), “Multi Document Centroid-based Text Summarization”, In ACL 2002, Philadelphia, PA, 2002.
• [4] D. R. Radev, H. Jing, M. Budzikowska (2000), “Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies”, ANLP/NAACL 2000 Workshop, pp. 21-29, 2000.
• [5] D. R. Radev, V. Hatzivassiloglou, and K. R. McKeown (1999), “A description of the CIDR system as used for TDT-2”, In DARPA Broadcast News Workshop, Herndon, VA, 1999.
• [6] G. Erkan and D. R. Radev (2004), “LexPageRank: Prestige in Multi-Document Text Summarization”, Proceedings of EMNLP 2004, Barcelona, Spain, pp. 365-371, 2004.
• [7] J. Kupiec, J. Pedersen, and F. Chen (1995), “A Trainable Document Summarizer”, In SIGIR, ACM, Seattle WA, USA, 1995.
• [8] J. Y. Yeh, H. R. Ke, W. P. Yang, and I. H. Meng (2005), “Text Summarization Using a Trainable Summarizer and Latent Semantic Analysis”, Information Processing and Management, Vol. 41, No. 1, pp. 75-95, 2005.
• [10] L. W. Ku, Y. T. Liang, and H. H. Chen (2006), “Opinion extraction, summarization and tracking in news and blog corpora”, AAAI-2006 Spring Symposium on Computational Approaches to Analyzing Weblogs, pp. 100-107, 2006.
• [11] M. R. Amini (2000), “Interactive Learning for Text Summarization”, Proceedings of the PKDD'2000 Workshop on Machine Learning and Textual Information Access, 2000.
• [13] V. T. Esaú, V. P. Luis, and M. y. G. Manuel (2006), “Using Word Sequences for Text Summarization”, Speech and Dialogue, Vol. 4188, pp. 293-300, 2006.
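• [14] 吳家威, 劉昭麟 (2002), “應用本體論設計與建置摘要系統” [Designing and building a summarization system using ontologies], Proceedings of the 2002 Workshop on Consumer Electronics (WCE'02), Hsinchu, Taiwan, pp. 41-46, 2002.
• [15] 吳家威, 劉昭麟 (2003), “基於主題資訊賦予特徵不同比重之摘要系統” [A summarization system that weights features differently based on topic information], Proceedings of the 2003 National Computer Symposium (NCS'03), Taichung, Taiwan, 2003.
• [16] 黃純敏, 吳郁瑩 (1999), “網路中文文件自動摘要” [Automatic summarization of Chinese Web documents], TANET'99 Internet Conference, Kaohsiung, Taiwan, 1999.
• [17] 黃純敏, 楊存一, 邱立豐 (2002), “TFIDF觀念於自動摘要實作評估” [Implementation and evaluation of the TFIDF concept in automatic summarization], 13th International Conference on Information Management, Taipei, Taiwan, 2002.
• [18] 黃純敏, 楊存一, 邱立豐 (2001), “中英文網路文件自動摘要之研究” [A study of automatic summarization of Chinese and English Web documents], 7th Conference on Information Management Research and Practice, Taipei, Taiwan, 2001.
• [19] 葉鎮源 (2002), “文件自動化摘要方法之研究及其在中文文件的應用” [A study of automatic document summarization methods and their application to Chinese documents], Master's thesis, Institute of Computer and Information Science, National Chiao Tung University, Hsinchu, 2002.
• [20] 劉政璋, 葉鎮源, 柯皓仁, 楊維邦 (2005), “以概念分群為基礎之新聞事件自動摘要” [Automatic summarization of news events based on concept clustering], 17th Conference on Computational Linguistics and Speech Processing, Tainan, Taiwan, 2005.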
The End
