Year: 2010 | Volume: 27 | Issue: 2 | Page: 167-178
Automatic Multi-document Summarization Based on Clustering and Nonnegative Matrix Factorization
Sun Park1, ByungRea Cha2, Dong Un An3
1 Advanced Graduate Education Center of Jeonbuk for Electronics and Information Technology-BK21, Chonbuk National University, South Korea
2 SCENT Center, GIST, South Korea
3 Division of Electronics and Information Engineering, Chonbuk National University, South Korea
Date of Web Publication: 27-Feb-2010
Abstract
In this paper, a novel summarization method that uses nonnegative matrix factorization (NMF) and the clustering method is introduced to extract meaningful sentences relevant to a given query. The proposed method decomposes a sentence into the linear combination of sparse nonnegative semantic features so that it can represent a sentence as the sum of a few semantic features that are comprehensible intuitively. It can improve the quality of document summaries because it can avoid extracting those sentences whose similarities with the query are high but that are meaningless by using the similarity between the query and the semantic features. In addition, the proposed approach uses the clustering method to remove noise and avoid the biased inherent semantics of the documents being reflected in summaries. The method can ensure the coherence of summaries by using the rank score of sentences with respect to semantic features. The experimental results demonstrate that the proposed method has better performance than other methods that use the thesaurus, the latent semantic analysis (LSA), the K-means, and the NMF.
Keywords: Clustering, Multidocument summarization, Nonnegative matrix factorization, LSA, Semantic feature, Semantic variable.
1. Introduction
The explosive growth of internet access has produced a vast amount of information and, with it, the problem that many documents on the same or similar topics are duplicated. This duplication increases the need for effective document summarization.
Document summarization is the process of reducing the size of documents while maintaining their basic outlines; that is, the process should distill the most important information from the document. Document summarization can produce either generic summaries or query-based summaries. A generic summary distills an overall sense of a document's contents, whereas a query-based summary distills only the contents of a document that are relevant to a user's query. Document summarization is further divided into single-document and multidocument summarization according to the scope of the summary target. The purpose of multidocument summarization is to produce a single summary from a set of related documents, whereas single-document summarization is intended to summarize only one document.
Radev et al. [3] suggested three points to be considered for multidocument summarization: (i) recognizing and removing redundancy, (ii) identifying important differences between documents, and (iii) ensuring the coherence of summaries. Redundancy represents how many terms or concepts are repeated across documents, while diversity or difference represents how many terms or concepts differ among the summarized documents. Coherence represents how readable the summaries are and how relevant they are to the user.
Current multidocument summarization methods use various sentence clustering methods. The problem is that these methods ignore subtopics within a cluster.
In this paper, we propose a new query-based multidocument summarization method using nonnegative semantic features and the clustering method. Nonnegative matrix factorization (NMF) can represent individual objects as nonnegative linear combinations of partial information extracted from a large volume of objects. It has been observed that humans use only additive combinations of nonnegative data when they recognize an object as the combination of partial information. NMF can deal with a large volume of information efficiently, since the original nonnegative matrix is decomposed into a sparsely distributed representation of two nonnegative matrices [7],[8],[9],[10].
The proposed method has the following advantages. First, the semantic features are sparse with nonnegative values. Sentences can be decomposed into intuitively comprehensible semantic features having a few terms. The inherent structure of documents can be analyzed into a linear combination of semantic features. Therefore, the proposed method can select meaningful sentences that are more relevant to the query, and the extracted sentences are well connected with major topics and subtopics in the cluster. Second, it can find important sentences by using semantic features. Third, it can remove the noise in given documents by using clustering methods. Thus, the method can improve the quality of document summarization since the clustering of sentences helps us to remove redundant information easily and to avoid the biased inherent semantics of documents being reflected in the summarization. Finally, it can enhance the coherence of summaries by sorting extracted sentences in the order of their rank.
The rest of the paper is organized as follows: Section 2 describes the related studies regarding document summarization. In Section 3, we describe the NMF algorithm in detail. In Section 4, the new multidocument summarization method is introduced. Section 5 shows the evaluation and experimental results. Finally, we conclude in Section 6.
2. Related Works
Recent studies on document summarization are as follows. Gong and Liu [11] proposed a method using the latent semantic analysis (LSA) technique to semantically identify important sentences for summary creation.
Goldstein et al. proposed a method using the maximal marginal relevance (MMR) approach. This method summarizes documents by calculating the cosine similarity between a given query and a document, and between the currently selected sentence and previously selected sentences. Because it uses only the cosine similarity, it cannot distinguish sentences containing polysemous or homonymous terms, even if their relevance to the query is high.
Hachey et al. [13] proposed a query-based multidocument summarization method using the MMR and LSA. The shortcoming of this method is that it may select less meaningful sentences when the term weight of the semantic feature vector in the latent semantic space is negative. Harabagiu and Lacatusu [14] proposed various multidocument summarization methods using sentence extraction based on both themes and sentence ordering. Their methods summarize documents by theme selection based on natural language processing; they incur high computational costs because theme selection requires many steps.
Sassion [15] proposed a topic-based multidocument summarization method. His method summarizes documents by removing irrelevant sentences one by one from a set of candidate sentences until a user-specified compression ratio is met. It shares the weakness of not being able to distinguish sentences containing polysemous or homonymous terms, because it relies on the cosine similarity between candidate sentences, though it also considers the n-gram overlap between them.
Sakurai and Utsumi [1] proposed a query-based summarization method using a thesaurus. Their method generates the core part of the summary from the document most relevant to the query using a thesaurus, and then generates the additional part of the summary, which elaborates upon the query, from the other documents. This works well for long summaries, but its performance is not satisfactory for short summaries.
Park et al. [16] proposed a query-based document summarization method using NMF, and later a multidocument summarization method based on clustering using NMF [17]. The latter clusters the sentences and extracts sentences using the cosine similarity between a topic and the semantic features. It improves the quality of summaries and keeps the topic from being deflected in the sentence structure by clustering sentences and removing noise, but it may still extract superficially similar yet meaningless sentences from documents, and it does not consider the coherence of the extracted sentences. Park et al. also proposed multidocument summarization methods using weighted NMF and clustering methods [18],[19].
3. Nonnegative Matrix Factorization
In this paper, we define the matrix notation as follows: let X*j be the j'th column vector of matrix X, Xi* the i'th row vector, and Xij the element in the i'th row and j'th column. Nonnegative matrix factorization (NMF) decomposes a given m × n matrix A into a nonnegative semantic feature matrix (NSFM) W of size m × r and a nonnegative semantic variable matrix (NSVM) H of size r × n, as shown in Equation (1):

A ≈ WH.   (1)
We use the objective function that minimizes the Euclidean distance between each column of A and its approximation Ã = WH, which was proposed by Lee and Seung [7],[8]. As the objective function, the Frobenius norm is used:

E(W, H) = ‖A − WH‖² = Σi,j (Aij − (WH)ij)²,

where r is usually chosen to be smaller than m or n, so that the total size of W and H is smaller than that of matrix A.

This objective is bounded below by zero and vanishes if and only if A = WH. W and H are updated iteratively until E(W, H) converges below a predefined threshold or the number of iterations exceeds its limit. The update rules are as follows:

Haj ← Haj (WᵀA)aj / (WᵀWH)aj,   Wia ← Wia (AHᵀ)ia / (WHHᵀ)ia.
A column vector corresponding to the j'th sentence, A*j, can be represented as a linear combination of the semantic feature vectors W*l and the semantic variables Hlj:

A*j = Σl=1..r W*l Hlj.
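To make the update rules concrete, here is a minimal NumPy sketch of the Lee-Seung multiplicative updates described above. It is an illustrative implementation, not the authors' code; the initialization, tolerance, and iteration cap are arbitrary choices.

```python
import numpy as np

def nmf(A, r, max_iter=50, tol=1e-3, seed=0):
    """Lee-Seung multiplicative updates minimizing ||A - WH||_F^2.
    A must be nonnegative; returns W (m x r) and H (r x n)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, r)) + 1e-4   # nonnegative random initialization
    H = rng.random((r, n)) + 1e-4
    prev_err = np.inf
    for _ in range(max_iter):
        H *= (W.T @ A) / (W.T @ W @ H + 1e-9)   # semantic variable update
        W *= (A @ H.T) / (W @ H @ H.T + 1e-9)   # semantic feature update
        err = np.linalg.norm(A - W @ H, "fro") ** 2
        if prev_err - err < tol:                # E(W, H) has converged
            break
        prev_err = err
    return W, H
```

Because the updates are multiplicative, W and H stay nonnegative throughout, which is what produces the sparse, parts-based semantic features discussed below.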
Example (1). [Table 1] shows some of the sentences extracted from a document on the topic "Native American reservation system: pros and cons" in the Document Understanding Conference (DUC) data set. We generate the term-by-sentence matrix A by preprocessing the set of sentences in [Table 1]; matrix A is composed of 541 terms and 54 sentences. [Table 2] illustrates the result of applying NMF to matrix A, and [Figure 1] shows an interpretation of the basis vectors and feature vectors.
[Table 2] illustrates the 10 semantic feature vectors, W*1, …, W*10, obtained from the NMF decomposition of matrix A; the weight values, H1,20, …, H10,20, of the semantic feature vectors with respect to sentence S20; the original sentence vector; and the sentence vector reconstructed from the weight values and the semantic feature vectors.
There are no negative values and many zero values in [Table 2]. That is, the semantic feature vectors obtained by NMF are sparse, so NMF can obtain semantic features having a small range of semantics. This indicates that a method using NMF has better power to identify subtopics of documents than methods using decomposition approaches such as principal component analysis (PCA) and vector quantization (VQ) [7]. Besides, the semantic feature vectors in [Table 2] intuitively make more sense because NMF represents a sentence as the linear combination of a few intuitive semantic feature vectors with nonnegative values.
3.1. Comparison of Latent Semantic Analysis and Nonnegative Matrix Factorization
Latent semantic analysis (LSA) is a decomposition method based on singular value decomposition (SVD). It decomposes matrix A into three matrices, U, D, and Vᵀ:

A = UDVᵀ,

where U is an m × n matrix of orthonormal eigenvectors of AAᵀ (the left singular vectors), V is an n × n matrix of orthonormal eigenvectors of AᵀA (the right singular vectors), and D = diag(σ1, σ2, …, σn) is an n × n diagonal matrix whose diagonal elements are the nonnegative singular values sorted in descending order. For a rank-r approximation with r < n, Û is the m × r matrix with Ûij = Uij for 1 ≤ i ≤ m and 1 ≤ j ≤ r, and D̂ is the corresponding r × r leading submatrix of D.
In the method using LSA, the i'th column vector A*i of matrix A is the weight vector of the i'th sentence and is represented as a linear combination of the left singular vectors U*j, which serve as semantic feature vectors, as shown in Equation (6). That is, the weight of the j'th semantic vector U*j in the sentence vector A*i is σj Vij:

A*i = Σj U*j σj Vij.   (6)
Example (2). We illustrate an example using LSA and NMF. In LSA, let r be 3. LSA decomposes matrix A into U, D, and V, as shown in [Figure 2]a. [Figure 2]b shows an example of sentence representation using LSA: the column vector A*3 corresponding to the third sentence is represented as a linear combination of the feature vectors U*j and their weights σj Vij. In NMF, let r be 2, the number of repetitions 50, and the tolerance 0.001. When the initial elements of the W and H matrices are 0.5, the nonnegative matrix A is decomposed into two nonnegative matrices, W and H, as shown in [Figure 2]c. [Figure 2]d shows an example of sentence representation using NMF: the column vector A*3 is represented as a linear combination of the semantic feature vectors W*l and the semantic variable column vector H*3.
Example (3). We analyzed examples of semantic features obtained by LSA and NMF from "Tourism in Britain" in the DUC data set. [Figure 3] shows the term weights of the semantic features U*1 (LSA) and W*1 (NMF). The total number of terms is 168. The numbers of nonzero weights in the semantic feature vectors from LSA and NMF are 168 and 21, respectively, and the average term weights are 0.11 and 0.71, respectively (taking absolute values for U*1). Moreover, the maximum term weight in the semantic feature vector from NMF (1.6) is much larger in magnitude than that from LSA (−0.49). These results indicate that the semantic features from NMF consist of very few terms that carry important meanings. In other words, NMF finds more intuitively comprehensible semantic features, which map more appropriately onto subtopics of the target documents.
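The sparsity contrast of Example (3) can be checked numerically. The sketch below counts near-zero weights in the leading LSA and NMF semantic features; the random stand-in matrix, the threshold, and the scikit-learn NMF solver are all assumptions for illustration (real term-by-sentence matrices, being sparse, show the effect far more strongly).

```python
import numpy as np
from sklearn.decomposition import NMF

# Stand-in for a 168-term by 54-sentence matrix (values illustrative).
A = np.random.default_rng(1).random((168, 54))

# LSA: the leading left singular vector is dense and mixed in sign.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print("LSA nonzero weights:", int(np.sum(np.abs(U[:, 0]) > 1e-3)))

# NMF: nonnegative semantic features; on sparse term-sentence data,
# most term weights shrink toward zero.
W = NMF(n_components=10, init="random", random_state=0,
        max_iter=500).fit_transform(A)
print("NMF nonzero weights:", int(np.sum(W[:, 0] > 1e-3)))
```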
4. Multidocument Summarization Based on the Semantic Feature and Clustering Methods
In this paper, we propose a multidocument summarization method using the nonnegative semantic feature and clustering methods. The proposed method consists of the preprocessing phase, the clustering phase, and the summary generation phase, which are mapped in [Figure 4] and explained in full below.
4.1 Preprocessing Phase

In the preprocessing phase, the given documents are decomposed into individual sentences, stop-words are removed, and word stemming is performed. We then construct the weighted term-frequency vector for each sentence in the documents using Equation (5). Let A be the m × n matrix, where m is the number of terms and n is the total number of sentences in the collected documents, and let element Aij be the weighted term frequency of term i in sentence j:

Aij = Lij × G(i),   (5)

where Lij is the local weight (term frequency) of term i in sentence j, and G(i) is the global weight (inverse document frequency) of term i in the collected documents. That is,

G(i) = log(N / N(i)),

where N is the total number of sentences in the collected documents, and N(i) is the number of sentences that contain term i.
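A minimal sketch of this preprocessing output follows, assuming G(i) takes the standard log inverse-sentence-frequency form given above and that tokenization, stop-word removal, and stemming have already been applied; the sample sentences are illustrative.

```python
import math
import numpy as np
from collections import Counter

def term_sentence_matrix(sentences):
    """A[i, j] = Lij * G(i): term frequency times log inverse sentence
    frequency (the assumed form of the global weight)."""
    vocab = sorted({t for s in sentences for t in s})
    index = {t: i for i, t in enumerate(vocab)}
    N = len(sentences)
    df = Counter(t for s in sentences for t in set(s))  # N(i) per term
    A = np.zeros((len(vocab), N))
    for j, sent in enumerate(sentences):
        for t, L in Counter(sent).items():
            A[index[t], j] = L * math.log(N / df[t])
    return A, vocab

# sentences: stemmed, stop-word-filtered token lists from the preprocessing phase
A, vocab = term_sentence_matrix([["native", "american", "reservation"],
                                 ["reservation", "system", "pro"],
                                 ["native", "land", "system"]])
```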
4.2 Clustering Phase
The clustering phase consists of applying the clustering methods and computing the number of sentence extractions for each cluster.
4.2.1 Clustering methods
In this paper, we use the K-means clustering method [23] and Xu et al.'s [10] document clustering method.
K-means clustering is a partitioning algorithm that splits a given set of n objects into K clusters [23]. We perform K-means clustering using the cosine similarity measure on matrix A, as shown in Equation (7):

sim(A*a, A*b) = (A*a · A*b) / (‖A*a‖ ‖A*b‖),   (7)

where A*a and A*b denote the weight vectors of the a'th and b'th sentences, respectively. Since the weights are nonnegative, 0 ≤ sim() ≤ 1; hence, for the corresponding distance d() = 1 − sim(), 0 ≤ d() ≤ 1.
Xu's document clustering method is as follows. We decompose the documents into individual sentences, let k be the number of cluster labels, and perform the preprocessing phase. We then apply NMF to A to obtain the two nonnegative matrices W and H, and use matrix H to determine the cluster label of each sentence: sentence A*j is assigned to cluster x if x = arg maxi Hij.
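Both clustering options reduce to a few lines. The sketch below shows the cosine similarity and distance of Equation (7) and the argmax assignment rule of Xu's method as described above; it is illustrative rather than the authors' implementation.

```python
import numpy as np

def cosine_sim(a, b):
    """Equation (7); lies in [0, 1] for nonnegative vectors."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def cosine_dist(a, b):
    """d() = 1 - sim(), so 0 <= d() <= 1, as noted in the text."""
    return 1.0 - cosine_sim(a, b)

def nmf_cluster_labels(H):
    """Xu-style assignment: sentence j joins cluster x = argmax_x H[x, j]."""
    return np.argmax(H, axis=0)
```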
The weighted matrix of the i'th cluster of sentences, Ci, is a subset of the column vectors of matrix A. Ci and Cj are disjoint, satisfying Ci ∩ Cj = ∅ for i ≠ j.
4.2.2 Computing the number of sentences extracted
The number of sentences extracted from cluster Cc, denoted ec, is computed by Equation (10), where f is the number of summary sentences, N is the total number of sentences, sc is the number of sentences in Cc, sim() is the cosine similarity function, and q is the query.
4.3 Summary Generation
The summary generation phase consists of sentence extraction and sentence ranking.
4.3.1 Sentence extraction using semantic features
The sentence extraction process is as follows. We construct the matrices Wc and Hc by applying the NMF algorithm to each Cc after removing the noise clusters, as shown in Equation (11), where k is the number of cluster labels and knoise is the number of noise cluster labels. We calculate the number of sentences to extract from each cluster by Equation (10). We then select the semantic feature having the largest similarity to the query by Equation (8), extract the sentence having the largest weight with respect to this semantic feature, and add the extracted sentence to the candidate sentence set.
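A sketch of this extraction step follows, under one reading of the repetition rule in Section 4.4 (the query-relevant semantic feature is fixed and previously selected sentences are excluded); the function name and the handling of ties are assumptions, and noise removal is presumed to have happened upstream.

```python
import numpy as np

def extract_from_cluster(Wc, Hc, query_vec, e_c):
    """Select e_c candidate sentences from one cluster, given its NMF
    factors Wc (terms x features) and Hc (features x sentences)."""
    def sim(a, b):  # cosine similarity, Equation (7)
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
    # Equation (8) analogue: similarity of the query to each semantic feature.
    sims = np.array([sim(Wc[:, l], query_vec) for l in range(Wc.shape[1])])
    p = int(np.argmax(sims))            # most query-relevant semantic feature
    weights = Hc[p].astype(float).copy()
    chosen = []
    for _ in range(min(e_c, Hc.shape[1])):
        j = int(np.argmax(weights))     # sentence with largest weight on W*p
        chosen.append(j)
        weights[j] = -np.inf            # exclude previously selected sentences
    return chosen
```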
A weighted column vector for the j'th sentence of matrix Cc is represented as a linear combination of the semantic feature vectors W*l of Wc and the semantic variables Hlj of Hc; the weight of the l'th semantic feature vector in sentence j is Hlj.
The power of the two nonnegative matrices Wc and Hc is as follows. The semantic variables Hlj describe how the j'th sentence is structured from the semantic features. Wc and Hc are represented sparsely. Intuitively, it makes more sense for each sentence to be associated with some small subset of a large array of topics, rather than with just one topic or with all the topics. In each semantic feature W*l, semantically related terms are grouped together by NMF. In addition to grouping semantically related terms into semantic features, NMF can represent the multiple meanings of the same term in different contexts.
4.3.2 Sentence ranking
Summary generation arranges the ranked sentences from the candidate sentence set by using Equation (13), which defines the rank score rj of the j'th sentence in terms of its semantic variables and the rank weight Rweight(). The rank weight Rweight() denotes the relative relevance of the l'th semantic feature among all semantic features; it indicates how strongly a sentence reflects the major topics, which are represented as semantic features.
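The exact forms of Equations (13) and (14) did not survive into this text, so the sketch below encodes only one plausible reading consistent with the description above: Rweight(l) is taken as the relative mass of the l'th semantic feature among all features, and a sentence's rank score sums its semantic variables weighted by Rweight. Treat it as a labeled assumption, not the paper's formula.

```python
import numpy as np

def rank_scores(W, H):
    """Hedged reconstruction: score_j = sum_l Rweight(l) * H[l, j]."""
    rweight = W.sum(axis=0) / W.sum()   # relative relevance of each feature
    return H.T @ rweight                # one score per sentence

# Arrange candidate sentences in descending rank-score order:
# ordering = np.argsort(-rank_scores(W, H))
```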
4.4 Multidocument Summarization Algorithm
The proposed multidocument summarization algorithm using both the nonnegative semantic feature method and the clustering method is as follows:
1. Perform the preprocessing phase and clustering methods.
2. Construct the clusters Cc, where c = 1, …, k', from matrix A, remove the noise clusters, and then compute the number of sentences, ec, to extract from each cluster Cc.
3. Obtain the nonnegative matrices Wc and Hc from each matrix Cc.
4. Perform the following steps for each cluster Cc:
(a) Select the semantic feature vector W*p of Wc that is most similar to the query q, using Equation (8).
(b) Select the sentence whose semantic variable Hpj is the largest in the p'th row of Hc.
(c) Put the selected sentence into the candidate sentence set.
(d) Repeat steps (a) to (c) until ec sentences have been selected; on each repetition, choose the largest value excluding the previously selected ones.
In step 2, we gather redundant sentences into the same cluster and remove the noise clusters. In step 4(a), we select the semantic feature vector most similar to the query q. In step 4(b), we select the j'th column of Hc having the largest value Hpj in the p'th row, in order to choose the sentence that has the largest weight with respect to the most relevant semantic feature vector.

Example (4). We illustrate sentence extraction with respect to step 4 of the multidocument summarization algorithm. [Table 3] shows five sentences and a query. We generate matrix A by preprocessing the set of sentences in [Table 3] and decompose matrix A into a semantic feature matrix and a semantic variable matrix by using NMF. [Figure 5] illustrates the sentence extraction process. In [Figure 5](1), we calculate the similarity between the query and the semantic feature vectors and select the semantic feature vector W*3 having the largest similarity value (0.68). In [Figure 5](2), we select the semantic variable vector H*3 that corresponds to W*3. In [Figure 5](3), we extract sentence S3, which corresponds to the semantic variable having the largest value (0.83) in H*3.
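The steps of Example (4) map directly onto the extract_from_cluster() sketch given earlier; the toy matrices below are illustrative and do not reproduce the values of [Table 3] or [Figure 5].

```python
import numpy as np

rng = np.random.default_rng(2)
Wc = rng.random((12, 3))     # 12 terms, 3 semantic features
Hc = rng.random((3, 5))      # 5 sentences
query_vec = rng.random(12)   # preprocessed query vector

# Picks the semantic feature most similar to the query, then the
# top-weighted sentence(s) on that feature.
print(extract_from_cluster(Wc, Hc, query_vec, e_c=1))
```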
5. Experiments
5.1 Data set
The DUC is an international conference for performance evaluation in the area of text summarization. As our experimental data, we use the query-relevant multidocument summarization task from DUC 2006, whose test set is composed of 50 topics, each with 25 relevant documents drawn from the AQUAINT corpus [21].
5.2 Performance Evaluation Measure
To compare the performances, we used the ROUGE evaluation software package, which compares summaries produced by summarization methods against summaries generated by humans; ROUGE has been adopted by DUC for performance evaluation. ROUGE includes five automatic evaluation methods: ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S, and ROUGE-SU [24]. Each method estimates the recall, precision, and F-measure between the experts' reference summaries and the candidate summaries of the proposed system. ROUGE-N uses the n-gram recall between a candidate summary and a set of reference summaries and is computed as follows:

ROUGE-N = ΣS∈References Σgramn∈S Countmatch(gramn) / ΣS∈References Σgramn∈S Count(gramn),   (15)
where n is the length of the n-gram gramn, and Countmatch(gramn) is the maximum number of n-grams co-occurring in a candidate summary and a set of reference summaries [24]. ROUGE-L computes the ratio between the length of the longest common subsequence (LCS) of the summaries and the length of the reference summary, as delineated by Equation (16):

Rlcs = LCS(X, Y) / m,   Plcs = LCS(X, Y) / n,   Flcs = (1 + β²) Rlcs Plcs / (Rlcs + β² Plcs),   (16)

where m is the length of the reference summary sentence X, n is the length of the candidate sentence Y, and LCS(X, Y) is the length of the LCS of X and Y. Rlcs is the recall of LCS(X, Y), Plcs is its precision, and β = Plcs / Rlcs [24]. ROUGE-W uses a weighted LCS that favors LCSs with consecutive matches. ROUGE-S uses the overlap ratio of skip-bigrams between a candidate summary and a set of reference summaries, as given by Equation (17):

Rskip2 = SKIP2(X, Y) / C(m, 2),   Pskip2 = SKIP2(X, Y) / C(n, 2),   Fskip2 = (1 + β²) Rskip2 Pskip2 / (Rskip2 + β² Pskip2),   (17)

where SKIP2(X, Y) is the number of skip-bigram matches between X and Y, β is the relative importance of Pskip2 and Rskip2, Pskip2 is the precision of SKIP2(X, Y), Rskip2 is its recall, and C() is the combination function [24]. ROUGE-SU is an extension of ROUGE-S that adds unigrams as a counting unit [24].
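Equation (15) is simple enough to compute directly. The following sketch implements clipped n-gram co-occurrence for a single candidate summary against multiple references; it is a simplified stand-in for, not a replacement of, the ROUGE package.

```python
from collections import Counter

def rouge_n_recall(candidate, references, n=1):
    """ROUGE-N (Equation (15)): clipped n-gram matches over reference n-grams."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand = ngrams(candidate)
    match = total = 0
    for ref in references:
        r = ngrams(ref)
        total += sum(r.values())
        match += sum(min(c, r[g]) for g, c in cand.items() if g in r)
    return match / total if total else 0.0

# Bigram example: 3 of the 5 reference bigrams also occur in the candidate.
print(rouge_n_recall("the cat sat on the mat".split(),
                     ["the cat was on the mat".split()], n=2))  # 0.6
```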
5.3 Results and Discussion
We implemented seven multidocument summarization methods (THESAURUS, LSA, KMEAN, NMF, KMEAN+NMF, WeightNMF, and NC+NMF) on the DUC 2006 data set. Our proposed methods are NMF, KMEAN+NMF, WeightNMF, and NC+NMF.
THESAURUS denotes Sakurai and Utsumi's [1] method, which uses a thesaurus; we modified it to use the Moby Thesaurus II [25]. LSA denotes Gong and Liu's [11] method using latent semantic analysis. KMEAN denotes the multidocument summarization method using K-means clustering, which clusters the sentences and extracts from each cluster the sentence whose similarity to the given query is the largest. NMF denotes the query-based document summarization method using NMF [16]. KMEAN+NMF denotes the multidocument summarization method using NMF and K-means clustering [17]. WeightNMF denotes the multidocument summarization method using weighted NMF and K-means clustering [18],[19]. NC+NMF denotes the multidocument summarization method using nonnegative semantic features and NMF clustering [20].
[Figure 6] illustrates how the evaluations in this experiment are performed. As a test data set, we randomly selected 50 documents from the 50 document collections in the DUC 2006 data set; each document has a human-produced summary. Our methods (NMF, KMEAN+NMF, WeightNMF, NC+NMF) and the three other methods were used to produce summaries from the test documents, and these summaries were input to the ROUGE software package to produce the ROUGE evaluation values.
Experiment 1. We compared the ROUGE results of the seven summarization methods: THESAURUS, LSA, KMEAN, NMF, KMEAN+NMF, WeightNMF, and NC+NMF. [Figure 7] shows the ROUGE results for average recall; the results for average precision and F-measure are shown in [Figure 8] and [Figure 9].
In [Figure 7], the average recall of NC+NMF is approximately 45.37% higher than that of THESAURUS, 14.84% higher than that of LSA, 20.48% higher than that of KMEAN, 16.27% higher than that of NMF, 11.65% higher than that of KMEAN+NMF, and 5.53% higher than that of WeightNMF.

In [Figure 8], the average precision of NC+NMF is approximately 24.13% higher than that of THESAURUS, 41.38% higher than that of LSA, 45.85% higher than that of KMEAN, 25.56% higher than that of NMF, 16.04% higher than that of KMEAN+NMF, and 5.14% higher than that of WeightNMF.

In [Figure 9], the average F-measure of NC+NMF is approximately 23.41% higher than that of THESAURUS, 27.41% higher than that of LSA, 30.28% higher than that of KMEAN, 22.42% higher than that of NMF, 13.22% higher than that of KMEAN+NMF, and 6.35% higher than that of WeightNMF.
Experiment 2. We conducted a performance evaluation (t-test) using the ROUGE measures with respect to four summarization methods (THESAURUS, LSA, KMEAN, and NC+NMF). For the t-test we established several hypotheses, for example: "our proposed method (NC+NMF) is superior to THESAURUS in ROUGE-1 recall" and "our proposed method (NC+NMF) is superior to LSA in ROUGE-W F-measure." The significance level is 5% and the number of samples is 50, giving the acceptance region t > t0.05(50) = 1.676.
If t is larger than 1.676, the hypothesis is accepted; otherwise, it is rejected. The t-test results are presented in [Table 4], [Table 5], and [Table 6]. In this experiment, all hypotheses were accepted. The F-measure results are the most telling, since the F-measure is the synthesis of recall and precision.
The LSA method shows poor performance because it uses only the latent semantic structure, without considering the topic. The KMEAN method performs better than the THESAURUS method because removing redundancy and identifying subtopics by K-means clustering yields more important summaries than using a thesaurus does. The NMF approach performs better than the KMEAN approach because reflecting the inherent semantics of the documents through NMF yields more meaningful summaries. The combined KMEAN+NMF approach performs better than NMF alone because it reflects the inherent semantics of the documents without noise sentences, whereas the NMF approach may produce a summary that reflects the inherent semantics of biased documents mixed with noise. The WeightNMF method performs better than the KMEAN+NMF method because its weighted similarity measure prevents the extraction of less meaningful sentences: a semantic feature whose cosine similarity with respect to a topic is high but meaningless is not selected. The NC+NMF approach performs better than all the other methods: it generates a more meaningful summary by reflecting the inherent semantics of the documents without noise and redundancy, and, by using nonnegative semantic features and NMF clustering, it avoids selecting semantic features whose cosine similarity with respect to the query is high but meaningless.
6. Conclusions and Future Research
For effective multidocument summarization, it is important to remove noise, recognize and remove redundant information, ensure the coherence of summaries, and extract sentences that are common to the given documents. This paper presented a multidocument summarization method using nonnegative semantic features and clustering methods. The advantages of the proposed method are as follows. First, it represents documents in an intuitively comprehensible form, since it uses very sparse semantic features; it can therefore extract the sentences that are semantically closest to the query and prevent the extraction of superficially similar but meaningless sentences, while the extracted sentences cover the major topics and subtopics of the cluster well. Second, it removes the redundancy of sentences within a cluster and identifies the important differences between sentences across clusters, so it keeps the biased inherent structure of documents from being reflected in summaries. Third, it ensures the coherence of summaries by using a rank score for sentences. Experimental results show that the proposed method outperforms the other summarization methods.
In the future, we will study query expansion to summarize documents. We anticipate that query expansion will enable us to extract sentences more relevant to the query and improve the accuracy of document summarization.
Authors
Sun Park is a postdoctoral researcher at Chonbuk National University, Korea. He received the Ph.D. degree in Computer and Information Engineering from Inha University in 2007, the M.S. degree in Information and Communication Engineering from Hannam University in 2001, and the B.S. degree in Computer Engineering in 1996. Before joining Chonbuk National University, he was a professor in the Department of Computer Engineering, Honam University, Korea. His research interests include Data Mining, Information Retrieval, and Information Summarization.
ByungRae Cha is a research professor at the Super Computing and Collaboration Environment Technology (SCENT) Center, GIST, Korea. He received the Ph.D. degree in Computer Engineering from Mokpo National University and the M.S. degree in Computer Engineering from Honam University. Before joining GIST, he was a research professor in the Department of Information and Communication Engineering, Chosun University, and a professor in the Department of Computer Engineering, Honam University, Korea. His research interests include Computer Security (IDS and P2P), Neural Network Learning, and Context-Awareness in Ubiquitous Computing Systems.
Dong Un An is a professor at Chonbuk National University, Korea. He received the Ph.D. degree in Computer Engineering from KAIST in 1995, the M.S. degree in Computer Engineering from KAIST in 1987, and the B.S. degree in Electronic Engineering from Hanyang University in 1981. His research interests include Natural Language Processing, Information Retrieval, and Machine Translation.
References
1. T. Sakurai and A. Utsumi, "Query-based Multi-document Summarization for Information Retrieval," in Evaluation of Information Access Technologies: Research Infrastructure for Comparative Evaluation of Information Retrieval and Access Technologies, Tokyo, Jun. 2004.
2. I. Mani, Automatic Summarization, John Benjamins Publishing Company, 2001.
3. D.R. Radev, E. Hovy, and K. McKeown, "Introduction to the Special Issue on Summarization," Computational Linguistics, vol. 28, pp. 399-408, Dec. 2002.
4. J. Goldstein, V. Mittal, J. Carbonell, and M. Kantrowitz, "Multi-Document Summarization By Sentence Extraction," in the Applied Natural Language Processing/North American Chapter of the Association for Computational Linguistics (ANLP/NAACL) Workshop, Seattle, pp. 40-8, Apr. 2000.
5. T. Nomoto and Y. Matsumoto, "A New Approach to Unsupervised Text Summarization," in ACM SIGIR Conference on Research and Development in Information Retrieval, Louisiana, pp. 26-34, Sep. 2001.
6. K. Spärck Jones, "Automatic summarizing: The state of the art," Information Processing and Management, vol. 43, pp. 1449-81, Sep. 2007.
7. D.D. Lee and H.S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401, pp. 788-91, Oct. 1999.
8. D.D. Lee and H.S. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems, vol. 13, pp. 556-62, Apr. 2001.
9. S. Wild, J. Curry, and A. Dougherty, "Motivating Non-Negative Matrix Factorizations," in SIAM Applied Linear Algebra, Williamsburg, Jul. 2003.
10. W. Xu, X. Liu, and Y. Gong, "Document Clustering Based on Non-negative Matrix Factorization," in ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, pp. 267-73, Aug. 2003.
11. Y. Gong and X. Liu, "Generic Text Summarization Using Relevance Measure and Latent Semantic Analysis," in ACM SIGIR Conference on Research and Development in Information Retrieval, New Orleans, pp. 19-25, Sep. 2001.
12. J. Goldstein, V. Mittal, J. Carbonell, and J. Callan, "Creating and Evaluating Multi-Document Sentence Extract Summaries," in Conference on Information and Knowledge Management, McLean, VA, pp. 165-72, Nov. 2000.
13. B. Hachey, G. Murray, and D. Reitter, "The Embra System at DUC 2005: Query-oriented Multi-document Summarization with a Very Large Latent Semantic Space," in Document Understanding Conference, Vancouver, Oct. 2005.
14. S. Harabagiu and F. Lacatusu, "Topic Themes for Multi-Document Summarization," in ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, pp. 202-9, Aug. 2005.
15. H. Sassion, "Topic-based Summarization at DUC 2005," in Document Understanding Conference, Vancouver, Oct. 2005.
16. S. Park, J.H. Lee, C.M. Ahn, J.S. Hong, and S.J. Chun, "Query Based Summarization using Non-negative Matrix Factorization," in Knowledge-Based Intelligent Information and Engineering Systems, Bournemouth, UK, pp. 84-9, Oct. 2006.
17. S. Park, J.H. Lee, D.H. Kim, and C.M. Ahn, "Multi-document Summarization Based on Cluster Using Non-negative Matrix Factorization," in Conference on Current Trends in Theory and Practice of Computer Science, Harrachov, Czech Republic, pp. 761-70, Jan. 2007.
18. S. Park, J.H. Lee, D.H. Kim, and C.M. Ahn, "Multi-document Summarization Using Weighted Similarity between Topic and Clustering-based Non-negative Matrix Factorization," in Annual International Conference on Asia Pacific Web, Suzhou, China, pp. 761-70, Apr. 2007.
19. S. Park and J.H. Lee, "Topic-based Multi-document Summarization Using Non-negative Matrix Factorization and K-means," Journal of KIISE: Software and Applications, vol. 35, pp. 255-64, Apr. 2008.
20. S. Park, "Query-based Multi-document Summarization Using Non-negative Semantic Features and NMF Clustering," in International Conference on Networked Computing and Advanced Information Management, Gyeongju, Korea, pp. 609-14, Sep. 2008.
21. H.T. Dang, "Overview of DUC 2006," in Document Understanding Conference, New York City, Jun. 2006.
22. R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, ACM Press, 1999.
23. J. Han and M. Kamber, Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann, 2006.
24. C.Y. Lin, "ROUGE: A Package for Automatic Evaluation of Summaries," in Workshop on Text Summarization Branches Out, Post-Conference Workshop of the Association for Computational Linguistics, Barcelona, Jul. 2004.
25. Moby Thesaurus II, http://en.wikipedia.org/wiki/Moby_Thesaurus#Thesaurus