In this post I’m going to show a simple machine learning experiment where I perform a sentiment classification task on a movie reviews dataset using WEKA, an open source data mining tool. The goal is to classify a movie review as positive or negative (for the reviewed movie).
I start by importing the reviews dataset in WEKA, then I perform some text preprocessing tasks such as word extraction, stop-words removal, stemming and term selection. Finally, I run various classification algorithms (naive bayes, k-nearest neighbors) and I compare the results, in terms of classification accuracy.
The movie reviews dataset
The dataset consists of 2000 user-created movie reviews archived on the IMDb (Internet Movie Database) web portal at http://reviews.imdb.com/Reviews and is known as “Sentiment Polarity Dataset version 2.0” (http://www.cs.cornell.edu/People/pabo/movie-review-data). The reviews are equally partitioned into a positive set and a negative set (1000+1000).
Each review consist of a plain text file (.txt) and a class label representing the overall user opinion. The class attribute has only two values: pos (positive) or neg (negative). The dataset creators explain that the class assignment to each review has been determined by using some simple rules that consider the user’s vote for the movie, as extracted from the original review. For instance, if a 1-10 scale is used, a vote greater or equal to 6 is considered as positive and anything less than 6 as negative. The authors also state that similar criteria have been used in the case of different vote scales (e.g., 1-5, A-F, etc.).
Importing the dataset in WEKA
First thing to be done is to import the dataset in the WEKA tool. Practically speaking, we have an archive containing 2000 text files partitioned in two sub-directories pos and neg (the class values). WEKA provides a simple import procedure for textual datasets, by means of the TextDirectoryLoader component. By using this loader (fig. 1), WEKA automatically creates a relation with 2 attributes: the first one contains the text data, the second is the document class, as determined by the sub-directory containing the file (pos or neg).
After the import, you can save the dataset in the ARFF format and manually edit it with a text editor to set some decent names for the relation and the attributes. Figure 2 shows the resulting relation.
As expected, we get a relation containing 2000 instances and two attributes (text and class). The histogram in the figure shows the uniform distribution of the review classes (blue = negative, red = positive).
Text preprocessing and feature extraction in WEKA
For the classification task to be done, a preliminary phase of text preprocessing and feature extraction is essential. We want to transform each text in a vector form, in which each document is represented by the presence (or frequency) of some “important” terms; this terms are the ones contained in the collection vocabulary. To build the vocabulary, various operations are typically performed (many of which are language-specific):
- Word parsing and tokenization
In this phase, each document is analyzed with the purpose of extracting the terms. Separator characters must be defined, along with a tokenization strategy for particular cases such as accented words, hyphenated words, acronyms, etc.
- Stop-words removal
A very common technique is the elimination of frequent usage words: conjunctions, prepositions, base verbs, etc. This kind of terms should be filtered as they have a poor characterizing power, making them useless for the text classification.
- Lemmatization and stemming
The lemmatization of a word is the process of determining its lemma. The lemma can be thought of as the “common root” of various related inflectional forms; for instance, the words walk, walking and walked all derive from the lemma walk.
A simple technique for approximated lemmatization is the stemming. Stemming algorithms work by removing the suffix of the word, according to some grammatical rules.
- Term selection/feature extraction
The term set resulting from the previous phases has still to be filtered, since we need to remove the terms that have poor prediction ability (w.r.t the document class) or are strongly correlated to other terms. This term selection task also leads to a simpler and more efficient classification.
To perform the preprocessing in WEKA, I used the StringToWordVector filter from the package weka.filters.unsupervised.attribute. This filter allows to configure the different stages of the term extraction (fig.3). Indeed, you can:
- Configure the tokenizer (term separators);
- Specify a stop-words list;
- Choose a stemmer.
The preprocessing operations will affect the quality of the classification, for this reason I will perform various experiments on different generated datasets.
The default text retrieval model used by the StringToWordVector filter is boolean: each document is represented with an n-dimensional boolean vector, where n is the size of the vocabulary, and each value models the presence or the absence of a vocabulary term in the document. One can also choose to use a frequency-based model such as the TF-IDF weighting model by setting to true the TFTransform and IDFTransform parameters.
You can set a stop-words list by clicking on stopwords and setting to true the useStopList parameter. In my experiments I used a 630 english stop-words list (whose origin I don’t recall) and the Porter’s stemmer (for the english language). Before you can use it, you must download and add to the classpath the snowball stemmers library.
Furthermore, you can set a maximum limit on the number of words to be extracted by changing the wordsToKeep parameter (default is 1000 words) and a minimum document frequency for each term by means of the minTermFreq parameter (default is 1). The latter parameter makes the filter drop the terms that appear in less than minTermFreq documents (of the same class – see note). NOTE: by default, these two parameters are considered with respect to the document class. To have them applied without considering the class, set DoNotOperateOnPerClassBasis to true.
After applying the StringToWordVector filter, we get the result shown in figure 4.
We get a relation containing 1251 binary attributes. The histogram shows the distribution of the term ‘bad‘ among the documents: mostly, it appears (value 1) in negative reviews (blue color).
The last preprocessing operation is the attribute selection. Like I said, eliminating the poorly characterizing attributes can be useful to get a better classification accuracy. For this task, WEKA provides the AttributeSelection filter from the weka.filters.supervised.attribute package. The filter allows to choose an attribute evaluation method and a search strategy (fig. 5).
The default evaluation method is CfsSubsetEval (Correlation-based feature subset selection). This method works by choosing attributes that are highly correlated with the class attribute while having a low correlation with other attributes.
After applying the AttributeSelection filter, we obtain the result show in figure 6.
As you can see, the number of attributes has greatly decreased, passing from more than 1000 to just 52.
The classification problem is a supervised learning task that consists in assigning a class label to an unclassified tuple according to an already classified instance set, that is used as a training set for the algorithm.
The two classification algorithms that I’m going to use are Naive Bayes and k-nearest neighbors (k = 1, 3, 5). The quality measure that will be considered is the percentage of correctly classified instances. For the validation phase I’ll use the 10-fold cross validation method.
Naive bayes classifier
The Naive Bayes classifier assigns a class to an unlabeled object according to a maximum likelihood principle. More specifically, the instance is assigned the class which maximizes the “a posteriori” probability, which is a function of the class prior probability and of the instance likelihood w.r.t to the class (fig. 7).
This classifier is called naive because it assumes the attribute indipendence hypothesis. With that assumption, the computation of the conditional probability P(d | c) becomes just a calculation of a product between the probabilities of each attribute (fig. 8).
In WEKA, the Naive Bayes classifier is implemented in the NaiveBayes component from the weka.classifiers.bayes package. The best result achieved with this classifier has shown a correctness percentage of 81,45% (fig. 9), using a dataset on which only attribute selection was performed (no stop-words removal, no stemming – see comparison table).
K-nearest neighbors classifier
The k-nearest neighbors classifier assigns to an instance the class that is prevalent among the k nearest instances. For this to be possible, a distance function between the instances must be defined. In this experiment the euclidean distance was used. The k parameter is chosen to be an odd number, so that a majority always exists. When k = 1, the classifier simply becomes a nearest neighbor classifier.
In WEKA, the k-NN classifier is implemented in the weka.classifiers.lazy.IBk component. As already said, NN, 3-NN and 5-NN were used in this experiment. The best result achieved with this kind of classifiers has shown a correctness percentage of 73,95% (fig. 10), using the 3-nearest neighbors classifier on a dataset on which stop-words removal, stemming and attribute selection were performed (see comparison table).
Comparison of results