Monthly Archives: September 2014

A data mining experiment: movie reviews classification using WEKA

In this post I’m going to show a simple machine learning experiment where I perform a sentiment classification task on a movie reviews dataset using WEKA, an open source data mining tool. The goal is  to classify a movie review as positive or negative (for the reviewed movie).

I start by importing the reviews dataset in WEKA, then I perform some text preprocessing tasks such as word extraction, stop-words removal, stemming and term selection. Finally, I run various classification algorithms (naive bayes, k-nearest neighbors) and I compare the results, in terms of classification accuracy.

The movie reviews dataset

The dataset consists of 2000 user-created movie reviews archived on the IMDb (Internet Movie Database) web portal at and is known as “Sentiment Polarity Dataset version 2.0” ( The reviews are equally partitioned into a positive set and a negative set (1000+1000).

Each review consist of a plain text file (.txt) and a class label representing the overall user opinion. The class attribute has only two values: pos (positive) or neg (negative). The dataset creators explain that the class assignment to each review has been determined by using some simple rules that consider the user’s vote for the movie, as extracted from the original review. For instance, if a 1-10 scale is used, a vote greater or equal to 6 is considered as positive and anything less than 6 as negative. The authors also state that similar criteria have been used in the case of different vote scales (e.g., 1-5, A-F, etc.).

Importing the dataset in WEKA

First thing to be done is to import the dataset in the WEKA tool. Practically speaking, we have an archive containing 2000 text files partitioned in two sub-directories pos and neg (the class values). WEKA provides a simple import procedure for textual datasets, by means of the TextDirectoryLoader component. By using this loader (fig. 1), WEKA automatically creates a relation with 2 attributes: the first one contains the text data, the second is the document class, as determined by the sub-directory containing the file (pos or neg).

Fig. 1: importing the dataset in WEKA

After the import, you can save the dataset in the ARFF format and manually edit it with a text editor to set some decent names for the relation and the attributes. Figure 2 shows the resulting relation.

Fig. 2: imported dataset and class distribution
Fig. 2: imported dataset and class distribution

As expected, we get a relation containing 2000 instances and two attributes (text and class). The histogram in the figure shows the uniform distribution of the review classes (blue = negative, red = positive).

Text preprocessing and feature extraction in WEKA

 For the classification task to be done, a preliminary phase of text preprocessing and feature extraction is essential. We want to transform each text in a vector form, in which each document is represented by the presence (or frequency) of some “important” terms; this terms are the ones contained in the collection vocabulary. To build the vocabulary, various operations are typically performed (many of which are language-specific):

  • Word parsing and tokenization
    In this phase, each document is analyzed with the purpose of extracting the terms. Separator characters must be defined, along with a tokenization strategy for particular cases such as accented words, hyphenated words, acronyms, etc.
  • Stop-words removal
    A very common technique is the elimination of frequent usage words: conjunctions, prepositions, base verbs, etc. This kind of terms should be filtered as they have a poor characterizing power, making them useless for the text classification.
  • Lemmatization and stemming
    The lemmatization of a word is the process of determining its lemma. The lemma can be thought of as the “common root” of various related inflectional forms; for instance, the words walk, walking and walked all derive from the lemma walk.
    A simple technique for approximated lemmatization is the stemming. Stemming algorithms work by removing the suffix of the word, according to some grammatical rules.
  • Term selection/feature extraction
    The term set resulting from the previous phases has still to be filtered, since we need to remove the terms that have poor prediction ability (w.r.t the document class) or are strongly correlated to other terms. This term selection task also leads to a simpler and more efficient classification.

To perform the preprocessing in WEKA, I used the StringToWordVector filter from the package weka.filters.unsupervised.attribute. This filter allows to configure the different stages of the term extraction (fig.3). Indeed, you can:

  • Configure the tokenizer (term separators);
  • Specify a stop-words list;
  • Choose a stemmer.

The preprocessing operations will affect the quality of the classification,  for this reason I will perform various experiments on different generated datasets.

Fig. 3: StringToWordVector filter configuration
Fig. 3: StringToWordVector filter configuration

 The default text retrieval model used by the StringToWordVector filter is boolean: each document is represented with an n-dimensional boolean vector, where n is the size of the vocabulary, and each value models the presence or the absence of a vocabulary term in the document. One can also choose to use a frequency-based model such as the TF-IDF weighting model by setting to true the TFTransform and IDFTransform parameters.

You can set a stop-words list by clicking on stopwords and setting to true the useStopList parameter. In my experiments I used a 630 english stop-words list (whose origin I don’t recall) and the Porter’s stemmer (for the english language). Before you can use it, you must download and add to the classpath the snowball stemmers library.

Furthermore, you can set a maximum limit on the number of words to be extracted by changing the wordsToKeep parameter (default is 1000 words) and a minimum document frequency for each term by means of the minTermFreq parameter (default is 1). The latter parameter makes the filter drop the terms that appear in less than minTermFreq documents (of the same class – see note). NOTE:  by default, these two parameters are considered with respect to the document class. To have them applied without considering the class, set DoNotOperateOnPerClassBasis to true.

After applying the StringToWordVector filter, we get the result shown in figure 4.

Fig. 4: after term extraction
Fig. 4: after term extraction

 We get a relation containing 1251 binary attributes. The histogram shows the distribution of the term ‘bad‘ among the documents:  mostly, it appears (value 1) in negative reviews (blue color).

The last preprocessing operation is the attribute selection. Like I said, eliminating the poorly characterizing attributes can be useful to get a better classification accuracy. For this task, WEKA provides the AttributeSelection filter from the weka.filters.supervised.attribute package. The filter allows to choose an attribute evaluation method and a search strategy (fig. 5).

Fig. 5: AttributeSelection filter parameters

 The default evaluation method is CfsSubsetEval (Correlation-based feature subset selection). This method works by choosing attributes that are highly correlated with the class attribute while having a low correlation with other attributes.

After applying the AttributeSelection filter, we obtain the result show in figure 6.

Fig. 6: after applying the AttributeSelection filter

 As you can see, the number of attributes has greatly decreased, passing from more than 1000 to just 52.


The classification problem is a supervised learning task that consists in assigning a class label to an unclassified tuple according to an already classified instance set, that is used as a training set for the algorithm.

The two classification algorithms that I’m going to use are Naive Bayes and k-nearest neighbors (k = 1, 3, 5). The quality measure that will be considered is the percentage of correctly classified instances. For the validation phase I’ll use the 10-fold cross validation method.

Naive bayes classifier

The Naive Bayes classifier assigns a class to an unlabeled object according to a maximum likelihood principle. More specifically, the instance is assigned the class which maximizes the “a posteriori” probability,  which is a function of the class prior probability and of the instance likelihood w.r.t to the class (fig. 7).

Fig. 7: Naive Bayes classifier (1)

This classifier is called naive because it assumes the attribute indipendence hypothesis. With that assumption, the computation of the conditional probability P(d | c) becomes just a calculation of a product between the probabilities of each attribute (fig. 8).

Fig. 8: Naive Bayes classifier (2)

In WEKA, the Naive Bayes classifier is implemented in the NaiveBayes component from the weka.classifiers.bayes package. The best result achieved with this classifier has shown a correctness percentage of 81,45% (fig. 9), using a dataset on which only attribute selection was performed (no stop-words removal, no stemming – see comparison table).

Fig. 9: best classification result with Naive Bayes

K-nearest neighbors classifier

The k-nearest neighbors classifier assigns to an instance the class that is prevalent among the k nearest instances. For this to be possible, a distance function between the instances must be defined. In this experiment the euclidean distance was used. The k parameter is chosen to be an odd number, so that a majority always exists. When k = 1, the classifier simply becomes a nearest neighbor classifier.

In WEKA, the k-NN classifier is implemented in the weka.classifiers.lazy.IBk component. As already said, NN, 3-NN and 5-NN were used in this experiment. The best result achieved with this kind of classifiers has shown a correctness percentage of 73,95% (fig. 10), using the 3-nearest neighbors classifier on a dataset on which stop-words removal, stemming and attribute selection were performed (see comparison table).

Fig. 10: best classification result with k-nearest neighbors (3-NN)

Comparison of results