
DEEP LEARNING MODELS USED IN TEXT CLASSIFICATION

Name - Shwetank Dwivedi


Branch – ECE
Roll No – 41820902817
Date - 7 November 2019
Paper - 2

Abstract - Text classification is one of the most widely used natural language processing
technologies. Common text classification applications include spam identification, news text
classification, information retrieval, emotion analysis, and intention judgment. Traditional
text classifiers based on machine learning methods suffer from defects such as data sparsity,
dimension explosion, and poor generalization ability. Classifiers based on deep learning
networks, such as the convolutional neural network (CNN), greatly reduce these defects,
avoid the cumbersome feature extraction process, and offer strong learning ability and higher
prediction accuracy. This paper introduces the process of text classification and focuses on
the deep learning models used in it.

Introduction - In recent years, artificial intelligence has developed rapidly and is gradually
changing our lives. Natural Language Processing is an artificial intelligence technology full
of charm and challenge. It includes syntactic and semantic analysis, information extraction,
text classification, machine translation, information retrieval, dialogue systems, and so on.
Text classification is one of the most widespread applications, covering spam identification,
news text categorization, information retrieval, emotional analysis, and intention judgment.
The process of text classification includes text preprocessing, text feature extraction, and
classification model construction. First, because of the particular structure of text, the input
must be preprocessed, which generally requires the removal of stop words; special text also
needs additional processing, such as word segmentation for Chinese. Second, feature
engineering is applied to the preprocessed texts to extract key features that reflect their
content, so as to establish a mapping between features and categories. Finally, the
classification model is built. This paper considers this last step mainly from the deep
learning perspective.

Text classification
The purpose of text classification is to determine the category of a given document; the
result may be a binary or a multi-class classification.
The basic steps of text classification are text preprocessing, text feature extraction and
classification model construction.

2.1. Text preprocessing


Preprocessing mainly includes word segmentation (required for Chinese), removal of stop
words, and word vectorization. Word segmentation involves dictionary construction and
running a segmentation algorithm. At present, the most popular way to construct dictionaries
is the dictionary trie. Segmentation algorithms include forward maximum matching, reverse
maximum matching, bidirectional maximum matching, language-model methods, the
shortest-path algorithm, etc. Common Chinese tokenizers are jieba, SnowNLP, THULAC,
etc. The purpose of removing stop words is to filter the noise out of the segmentation results
and make text classification more accurate. Word vectorization converts words into a
numeric form that the deep learning model can process.
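The preprocessing steps above can be sketched in a few lines. This is a minimal illustration for English text, assuming simple whitespace-style tokenization and a toy stop-word list; real pipelines use larger, language-specific stop-word lists and, for Chinese, a proper segmenter such as jieba.

```python
import re

# A toy stop-word list; real systems use much larger, language-specific lists.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def build_vocab(documents):
    """Map each remaining word to an integer index for later vectorization."""
    vocab = {}
    for doc in documents:
        for token in preprocess(doc):
            vocab.setdefault(token, len(vocab))
    return vocab

docs = ["The spam filter is a classifier.", "News text is classified by topic."]
vocab = build_vocab(docs)
print(preprocess(docs[0]))  # ['spam', 'filter', 'classifier']
```

The integer indices in `vocab` are what a later embedding layer looks up, turning each word into a dense vector.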

2.2. Text feature extraction


This is a very important step in text classification. A word in a text represents the document
with a certain probability: the greater the probability, the better the word represents the
document. Feature extraction not only saves computing resources and time, but also
improves the accuracy of model prediction. Common feature representations include
bag-of-words features, embedding features, features extracted by a neural network model,
task-specific features, and topic features.
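Of the representations listed above, the bag-of-words model is the simplest to illustrate. The sketch below is a minimal count-based vectorizer over a small hypothetical vocabulary; production systems typically add TF-IDF weighting on top of raw counts.

```python
from collections import Counter

def bag_of_words(tokens, vocab):
    """Count-based feature vector: one dimension per vocabulary word."""
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocab]

# A toy fixed vocabulary; in practice it is built from the training corpus.
vocab = ["spam", "offer", "meeting", "free"]
vec = bag_of_words(["free", "spam", "offer", "free"], vocab)
print(vec)  # [1, 1, 0, 2]
```

Note how the vector length equals the vocabulary size regardless of document length; with real vocabularies of tens of thousands of words this is exactly the dimension-explosion and sparsity problem the abstract attributes to traditional methods.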

2.3. Commonly used deep learning models

2.3.1. TextCNN

CNN (convolutional neural network) is used to extract key information, similar to n-grams,
from sentences. TextCNN was proposed by Yoon Kim in 2014. The model contains the
following parts:
Part 1: Input layer. The preprocessed text data is fed into the model.
Part 2: Embedding layer. Each word is mapped to a dense vector for feature extraction.
Part 3: Convolution layer. Filters of different sizes are applied to obtain multiple feature
maps.
Part 4: Max-pooling layer. The dimension of the convolution output is reduced by keeping
the maximum of each feature map.
Part 5: Softmax layer. Outputs the probability of each category in a multi-class task.
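Parts 2 through 5 can be sketched as a single forward pass in numpy. This is an illustration only: the weights are random stand-ins for trained parameters, and the sequence length, embedding size, filter sizes, and class count are hypothetical small values chosen for readability.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, embed_dim, n_classes = 7, 8, 3
filter_sizes, n_filters = (2, 3, 4), 4

# Part 2: embedding lookup (a random table stands in for trained weights).
tokens = rng.integers(0, 100, size=seq_len)
embedding = rng.standard_normal((100, embed_dim))
x = embedding[tokens]                      # (seq_len, embed_dim)

pooled = []
for size in filter_sizes:
    # Part 3: one filter bank per size; each filter spans `size` word vectors,
    # so it detects patterns analogous to n-grams of that length.
    W = rng.standard_normal((n_filters, size, embed_dim))
    fmap = np.array([[np.sum(W[f] * x[i:i + size])
                      for i in range(seq_len - size + 1)]
                     for f in range(n_filters)])
    fmap = np.maximum(fmap, 0)             # ReLU nonlinearity
    # Part 4: max-over-time pooling keeps one value per filter.
    pooled.append(fmap.max(axis=1))
features = np.concatenate(pooled)          # (len(filter_sizes) * n_filters,)

# Part 5: dense layer + softmax over the classes.
W_out = rng.standard_normal((features.size, n_classes))
logits = features @ W_out
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs)                               # class probabilities, summing to 1
```

Max-over-time pooling is also what makes the model tolerant of variable sentence lengths: whatever the number of sliding-window positions, each filter contributes exactly one value to the feature vector.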

2.3.2. TextRNN
One of the biggest problems of CNN is that the filter sizes are fixed. On the one hand,
longer sequence information cannot be modeled; on the other hand, hyperparameter tuning of
the filter sizes is tedious.
TextRNN, i.e. a bidirectional RNN (bidirectional LSTM), can instead capture bidirectional
"n-gram" information of variable length.
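The bidirectional idea can be sketched with a plain tanh RNN run once left-to-right and once right-to-left, concatenating the two hidden states per word. This is a simplification: a real TextRNN would use LSTM cells (with their gating equations) rather than the vanilla recurrence below, and the weights here are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
embed_dim, hidden = 4, 5

def rnn_pass(x, Wx, Wh):
    """Run a simple tanh RNN over the sequence, returning all hidden states."""
    h = np.zeros(hidden)
    states = []
    for step in x:
        h = np.tanh(step @ Wx + h @ Wh)
        states.append(h)
    return np.array(states)

x = rng.standard_normal((6, embed_dim))    # a 6-word sentence, already embedded

# Separate weights for the forward and backward directions.
Wx_f, Wh_f = rng.standard_normal((embed_dim, hidden)), rng.standard_normal((hidden, hidden))
Wx_b, Wh_b = rng.standard_normal((embed_dim, hidden)), rng.standard_normal((hidden, hidden))

fwd = rnn_pass(x, Wx_f, Wh_f)                # left-to-right context
bwd = rnn_pass(x[::-1], Wx_b, Wh_b)[::-1]    # right-to-left context, realigned
states = np.concatenate([fwd, bwd], axis=1)  # (6, 2 * hidden)
print(states.shape)
```

Each row of `states` now summarizes the whole sentence up to and after that word, which is the variable-length context a fixed-size convolution filter cannot provide.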

2.3.3. TextRNN + Attention


Based on the TextRNN model, an attention mechanism is added, which can solve the
problem of long-term dependence in text, intuitively presents the contribution of each word
to the result, and forms the processing framework of the Seq2Seq model.
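A minimal sketch of the attention step follows, assuming the encoder hidden states from a TextRNN are already available; the scoring vector `u` here is a random stand-in for a learned parameter, and the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
seq_len, hidden = 6, 5

# Hidden states as produced by a (bidirectional) RNN encoder, one per word.
H = rng.standard_normal((seq_len, hidden))

# A learned context vector scores how relevant each word's state is.
u = rng.standard_normal(hidden)
scores = H @ u

# Softmax turns the scores into per-word attention weights.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# The sentence vector is the attention-weighted sum of hidden states;
# `weights` also makes each word's contribution directly inspectable.
sentence_vec = weights @ H
print(weights)  # non-negative, sums to 1
```

Because the weighted sum draws on every position directly, a distant word can influence the result without its signal decaying through many recurrence steps, which is how attention eases the long-term dependence problem noted above.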
CONCLUSION
Texts can be divided into long and short texts according to sentence length; the distinction
matters because different deep learning models have different limitations.
Text is a special kind of sequence, and its context contains semantic dependence. The RNN
model handles this dependence well and achieves a significant classification effect, but it
cannot be parallelized well, so it is more suitable for short texts; training time becomes too
long when processing texts longer than a few dozen words.
CNN has been widely used in long-text classification because it can be highly parallelized.
The model uses multiple channels and filters of different sizes, applies max pooling to select
the most influential features and reduce the dimensionality of the high-dimensional
representation, and then uses a fully connected layer with dropout on the extracted deep text
features to produce the final classification result.
However, to train a good text classification model, it is obviously not enough to rely only on
the deep learning algorithm; it is also important to understand the data. Different data and
tasks suit different models, and analysis of bad cases is valuable, so as to truly build a good
text classification model.

REFERENCES
[1] Yoon Kim, "Convolutional Neural Networks for Sentence Classification", EMNLP 2014,
pp. 1746-1751, Aug. 2014.
[2] Li Hui, Chen Ping Hua, "Improved backtracking-forward algorithm for maximum
matching Chinese word segmentation", Applied Mechanics and Materials, v 536-537,
pp. 403-406, 2014.
[3] Liyi Zhang, Yazi Li, Jian Meng, "Design of Chinese word segmentation system based on
improved Chinese converse dictionary and reverse maximum matching", Lecture Notes
in Computer Science, v 4256 LNCS, pp. 171-181, 2006.
[4] Gai Rong Li, Gao Fei Duan, Li Ming, Sun Xiao Hui, Li Hong Zheng, "Bidirectional
maximal matching word segmentation algorithm with rules", Advanced Materials
Research, v 926-930, pp. 3368-3372, 2014.
