Yearly Archives: 2014

A data mining experiment: movie reviews classification using WEKA

In this post I’m going to show a simple machine learning experiment where I perform a sentiment classification task on a movie reviews dataset using WEKA, an open source data mining tool. The goal is to classify a movie review as positive or negative (with respect to the reviewed movie).

I start by importing the reviews dataset into WEKA; then I perform some text preprocessing tasks such as word extraction, stop-words removal, stemming and term selection. Finally, I run various classification algorithms (Naive Bayes, k-nearest neighbors) and compare the results in terms of classification accuracy.

The movie reviews dataset

The dataset consists of 2000 user-created movie reviews archived on the IMDb (Internet Movie Database) web portal at http://reviews.imdb.com/Reviews and is known as “Sentiment Polarity Dataset version 2.0” (http://www.cs.cornell.edu/People/pabo/movie-review-data). The reviews are equally partitioned into a positive set and a negative set (1000+1000).

Each review consists of a plain text file (.txt) and a class label representing the overall user opinion. The class attribute has only two values: pos (positive) or neg (negative). The dataset creators explain that the class assigned to each review was determined by some simple rules applied to the user’s vote for the movie, as extracted from the original review. For instance, on a 1-10 scale, a vote greater than or equal to 6 is considered positive and anything below 6 negative. The authors also state that similar criteria were used for other vote scales (e.g., 1-5, A-F, etc.).

Importing the dataset in WEKA

The first thing to do is import the dataset into WEKA. Practically speaking, we have an archive containing 2000 text files partitioned into two sub-directories, pos and neg (the class values). WEKA provides a simple import procedure for textual datasets by means of the TextDirectoryLoader component. With this loader (fig. 1), WEKA automatically creates a relation with two attributes: the first one contains the text data, the second the document class, as determined by the sub-directory containing the file (pos or neg).

Fig. 1: importing the dataset in WEKA

After the import, you can save the dataset in the ARFF format and manually edit it with a text editor to set some decent names for the relation and the attributes. Figure 2 shows the resulting relation.
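If you prefer to script the import, the same result can be obtained with WEKA’s Java API. Here is a minimal sketch (the dataset path is hypothetical):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.TextDirectoryLoader;

public class ImportReviews
{
    public static void main(String[] args) throws Exception
    {
        // load the pos/neg sub-directories as a relation with 2 attributes
        TextDirectoryLoader loader = new TextDirectoryLoader();
        loader.setDirectory(new File("review_polarity/txt_sentoken")); // hypothetical path
        Instances data = loader.getDataSet();
        data.setRelationName("movie_reviews");

        // save the imported relation in the ARFF format
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("movie_reviews.arff"));
        saver.writeBatch();
    }
}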

Fig. 2: imported dataset and class distribution

As expected, we get a relation containing 2000 instances and two attributes (text and class). The histogram in the figure shows the uniform distribution of the review classes (blue = negative, red = positive).

Text preprocessing and feature extraction in WEKA

For the classification task, a preliminary phase of text preprocessing and feature extraction is essential. We want to transform each text into a vector form, in which each document is represented by the presence (or frequency) of some “important” terms; these terms are the ones contained in the collection vocabulary. To build the vocabulary, various operations are typically performed (many of which are language-specific):

  • Word parsing and tokenization
    In this phase, each document is analyzed with the purpose of extracting the terms. Separator characters must be defined, along with a tokenization strategy for particular cases such as accented words, hyphenated words, acronyms, etc.
  • Stop-words removal
    A very common technique is the elimination of frequent usage words: conjunctions, prepositions, base verbs, etc. Such terms should be filtered out, as their characterizing power is poor, which makes them useless for text classification.
  • Lemmatization and stemming
    The lemmatization of a word is the process of determining its lemma. The lemma can be thought of as the “common root” of various related inflectional forms; for instance, the words walk, walking and walked all derive from the lemma walk.
    A simple technique for approximate lemmatization is stemming. Stemming algorithms work by removing word suffixes according to grammatical rules.
  • Term selection/feature extraction
    The term set resulting from the previous phases still has to be filtered, since we need to remove the terms that have poor prediction ability (w.r.t. the document class) or are strongly correlated to other terms. This term selection task also leads to a simpler and more efficient classification.

To perform the preprocessing in WEKA, I used the StringToWordVector filter from the weka.filters.unsupervised.attribute package. This filter lets you configure the different stages of the term extraction (fig. 3). In particular, you can:

  • Configure the tokenizer (term separators);
  • Specify a stop-words list;
  • Choose a stemmer.
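For reference, the same configuration can also be set programmatically. The following is a minimal sketch against the WEKA 3.6-era API (some method names differ between WEKA versions); data is the relation imported above:

import weka.core.Instances;
import weka.core.stemmers.SnowballStemmer;
import weka.core.tokenizers.WordTokenizer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

StringToWordVector s2wv = new StringToWordVector();

// tokenizer: term separator characters
WordTokenizer tokenizer = new WordTokenizer();
tokenizer.setDelimiters(" \r\n\t.,;:'\"()?!");
s2wv.setTokenizer(tokenizer);

// stemmer and stop-words (the Snowball library must be on the classpath)
s2wv.setStemmer(new SnowballStemmer("porter"));
s2wv.setUseStoplist(true);

// term limits (see the parameters discussed below)
s2wv.setWordsToKeep(1000);
s2wv.setMinTermFreq(1);

s2wv.setInputFormat(data);
Instances vectorized = Filter.useFilter(data, s2wv);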

The preprocessing operations will affect the quality of the classification; for this reason, I will perform various experiments on differently preprocessed datasets.

Fig. 3: StringToWordVector filter configuration

The default text retrieval model used by the StringToWordVector filter is boolean: each document is represented by an n-dimensional boolean vector, where n is the size of the vocabulary, and each value models the presence or absence of a vocabulary term in the document. One can also choose a frequency-based model such as TF-IDF weighting, by setting the TFTransform and IDFTransform parameters to true.
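For reference, the classic TF-IDF weight of a term t in a document d has the following form (WEKA’s exact variant may differ in normalization details):

    w_{t,d} = \mathrm{tf}_{t,d} \cdot \log \frac{N}{\mathrm{df}_t}

where tf_{t,d} is the frequency of t in d, N is the total number of documents and df_t is the number of documents containing t.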

You can set a stop-words list by clicking on stopwords and setting the useStopList parameter to true. In my experiments I used a list of 630 English stop-words (whose origin I don’t recall) and Porter’s stemmer (for the English language). Before you can use the stemmer, you must download the Snowball stemmers library and add it to the classpath.

Furthermore, you can set a maximum limit on the number of words to be extracted through the wordsToKeep parameter (default: 1000 words) and a minimum document frequency for each term through the minTermFreq parameter (default: 1). The latter makes the filter drop the terms that appear in fewer than minTermFreq documents (of the same class – see note). NOTE: by default, these two parameters are applied with respect to the document class. To have them applied without considering the class, set DoNotOperateOnPerClassBasis to true.

After applying the StringToWordVector filter, we get the result shown in figure 4.

Fig. 4: after term extraction

We get a relation containing 1251 binary attributes. The histogram shows the distribution of the term ‘bad‘ among the documents: it mostly appears (value 1) in negative reviews (blue color).

The last preprocessing operation is attribute selection. As said before, eliminating the poorly characterizing attributes can improve the classification accuracy. For this task, WEKA provides the AttributeSelection filter from the weka.filters.supervised.attribute package. The filter lets you choose an attribute evaluation method and a search strategy (fig. 5).

Fig. 5: AttributeSelection filter parameters

The default evaluation method is CfsSubsetEval (correlation-based feature subset selection). This method works by choosing attributes that are highly correlated with the class attribute while having a low correlation with each other.
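Programmatically, the filter can be applied as in the following sketch, using its default evaluator and search strategy (vectorized is the output of the StringToWordVector step above):

import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

// the class attribute must be set before using a supervised filter
vectorized.setClassIndex(vectorized.attribute("class").index());

AttributeSelection attrSel = new AttributeSelection();
attrSel.setEvaluator(new CfsSubsetEval()); // correlation-based subset evaluation
attrSel.setSearch(new BestFirst());        // default search strategy
attrSel.setInputFormat(vectorized);
Instances reduced = Filter.useFilter(vectorized, attrSel);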

After applying the AttributeSelection filter, we obtain the result shown in figure 6.

Fig. 6: after applying the AttributeSelection filter

As you can see, the number of attributes has greatly decreased, from 1251 to just 52.

Classification

Classification is a supervised learning task that consists in assigning a class label to an unclassified tuple, based on a set of already classified instances used as the training set for the algorithm.

The two classification algorithms that I’m going to use are Naive Bayes and k-nearest neighbors (k = 1, 3, 5). The quality measure considered is the percentage of correctly classified instances. For the validation phase, I’ll use 10-fold cross-validation.

Naive Bayes classifier

The Naive Bayes classifier assigns a class to an unlabeled object according to the maximum a posteriori (MAP) principle: the instance is assigned the class which maximizes the a posteriori probability, which is a function of the class prior probability and of the instance likelihood w.r.t. the class (fig. 7).

Fig. 7: Naive Bayes classifier (1)
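In symbols, the rule shown in fig. 7 is the standard MAP decision rule: a document d is assigned the class

    c^{*} = \arg\max_{c} P(c \mid d) = \arg\max_{c} P(c)\, P(d \mid c)

where the evidence P(d) has been dropped, since it is the same for every class.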

This classifier is called naive because it assumes the attribute independence hypothesis. Under that assumption, the computation of the conditional probability P(d | c) reduces to a product of the per-attribute probabilities (fig. 8).

Fig. 8: Naive Bayes classifier (2)
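With attributes a_1, …, a_n, the independence assumption shown in fig. 8 reads:

    P(d \mid c) = \prod_{i=1}^{n} P(a_i \mid c)

so each conditional probability can be estimated separately from the training set.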

In WEKA, the Naive Bayes classifier is implemented in the NaiveBayes component from the weka.classifiers.bayes package. The best result achieved with this classifier was a correctness percentage of 81.45% (fig. 9), using a dataset on which only attribute selection was performed (no stop-words removal, no stemming – see comparison table).

Fig. 9: best classification result with Naive Bayes
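For reference, this kind of experiment can also be run from code. A minimal sketch of the 10-fold cross-validation, assuming reduced is the preprocessed relation built above:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;

// evaluate Naive Bayes with 10-fold cross-validation
Evaluation eval = new Evaluation(reduced);
eval.crossValidateModel(new NaiveBayes(), reduced, 10, new Random(1));
System.out.println(eval.pctCorrect() + "% correctly classified");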

K-nearest neighbors classifier

The k-nearest neighbors classifier assigns to an instance the class that is prevalent among the k nearest instances. For this to be possible, a distance function between instances must be defined; in this experiment, the euclidean distance was used. The k parameter is chosen to be an odd number, so that a majority always exists. When k = 1, the classifier simply becomes a nearest neighbor classifier.

In WEKA, the k-NN classifier is implemented in the weka.classifiers.lazy.IBk component. As already said, NN, 3-NN and 5-NN were used in this experiment. The best result achieved with this kind of classifier was a correctness percentage of 73.95% (fig. 10), using the 3-nearest neighbors classifier on a dataset on which stop-words removal, stemming and attribute selection were performed (see comparison table).

Fig. 10: best classification result with k-nearest neighbors (3-NN)
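The evaluation protocol is the same as for Naive Bayes; only the classifier changes (IBk uses the euclidean distance by default):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;

// evaluate 3-NN with 10-fold cross-validation; use setKNN(1) or
// setKNN(5) for the other configurations
IBk knn = new IBk();
knn.setKNN(3);
Evaluation evalKnn = new Evaluation(reduced);
evalKnn.crossValidateModel(knn, reduced, 10, new Random(1));
System.out.println(evalKnn.pctCorrect() + "% correctly classified");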

Comparison of results

The best accuracies obtained in the experiments were:

  • Naive Bayes: 81.45% correctly classified (attribute selection only; no stop-words removal, no stemming)
  • 3-NN: 73.95% correctly classified (stop-words removal, stemming and attribute selection)

On this dataset, Naive Bayes clearly outperformed the k-nearest neighbors classifiers.

Java implementation of a max flow algorithm on a graph

In this post I’ll describe a Java implementation of a fast maximum flow algorithm, known as the Ahuja-Orlin max flow algorithm. The implementation is part of a small, easy-to-use Java class library which can be used to model a flow graph, along with its nodes and edges, and to find the maximum flow that can be sent from a source node to a sink node. The library also provides a convenient way to calculate the maximum flow on a subgraph.

The Ahuja-Orlin max flow algorithm

The algorithm I chose to implement is Improved shortest augmenting path from Ahuja and Orlin. I chose it after finding that it came out as the best performing algorithm in an experimental analysis of max-flow graph algorithms from TopCoder.

This algorithm uses the concepts of shortest augmenting path and distance labels. It iteratively searches for the shortest augmenting path in the residual network of the graph, and terminates when certain conditions on the distance labels are met: these conditions indicate that the flow distribution is optimal, so no other augmenting paths exist.

In the following listing, the algorithm pseudocode is reported (source: TopCoder). The source node is called s and the sink node is called t.

[Pseudocode listing: improved shortest augmenting path algorithm]

Given an input graph, the algorithm initially builds the associated residual network. To obtain a residual network, back-edges are added to the graph: for each edge (i, j), a (j, i) edge with zero capacity and flow is created.
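A minimal sketch of this construction follows; it mirrors the addBackEdges helper called by the implementation shown later, assuming getEdges() returns the edge collection and that Edge exposes getSource/getTarget accessors:

import java.util.ArrayList;
import java.util.List;

private static List<Edge> addBackEdges(Graph g)
{
    List<Edge> backEdges = new ArrayList<Edge>();
    // iterate over a copy of the edge list, since we modify the graph
    for(Edge e : new ArrayList<Edge>(g.getEdges()))
    {
        Edge back = new Edge(e.getTarget(), e.getSource(), 0); // zero capacity
        g.addEdge(back);
        backEdges.add(back);
    }
    return backEdges;
}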

Then, an initial labelling of the nodes is performed. Each node i is associated with a distance label d[i] representing the length, in number of nodes, of the shortest path between i and the sink node in the residual network, with d[t] = 0. This labelling is computed by means of a reverse breadth-first search of the graph, starting from the sink node.
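A sketch of this initial labelling follows (it mirrors the calcDistanceLabels method called by the implementation shown later); the DistanceLabels constructor, hasLabel, setLabel and getIncidentEdges are assumptions about the library internals:

import java.util.LinkedList;
import java.util.Queue;

private static DistanceLabels calcDistanceLabels(Graph g)
{
    DistanceLabels d = new DistanceLabels(g.numNodes());
    d.setLabel(g.getSink(), 0); // d[t] = 0

    // reverse breadth-first search: follow incident (incoming) edges
    Queue<Node> queue = new LinkedList<Node>();
    queue.add(g.getSink());
    while(!queue.isEmpty())
    {
        Node j = queue.remove();
        for(Edge e : g.getIncidentEdges(j))
        {
            Node i = e.getSource();
            if(!d.hasLabel(i)) // not visited yet
            {
                d.setLabel(i, d.getLabel(j) + 1);
                queue.add(i);
            }
        }
    }
    return d;
}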

If, during the iterations of the algorithm, the distance label of node s becomes greater than or equal to the number of nodes, no more augmenting paths can exist in the residual network; this is the main termination criterion of the algorithm.
An edge (i, j) is called admissible if d[i] = d[j] + 1. The shortest augmenting path between s and t is an s -> t path consisting only of admissible edges, each one with a residual capacity rij > 0.
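This definition translates directly into the getAdmissibleEdge scan used by the main cycle below; in this sketch, getAdjacentEdges and the Edge accessors (getTarget, getCapacity, getFlow) are assumptions about the library’s API:

private static Edge getAdmissibleEdge(Graph g, Node i, DistanceLabels d)
{
    for(Edge e : g.getAdjacentEdges(i))
    {
        // admissible: d[i] = d[j] + 1 and residual capacity rij > 0
        if(e.getCapacity() - e.getFlow() > 0
            && d.getLabel(i) == d.getLabel(e.getTarget()) + 1)
        {
            return e;
        }
    }
    return null; // no admissible edge: the caller will retreat
}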

The algorithm is composed of 4 main procedures: the main cycle, the advance procedure, the retreat procedure, and the augment procedure. In the main cycle, the algorithm iteratively tries to build an augmenting path from s to t, by using admissible edges.

If an admissible edge can be taken from the current node, an advance is performed, by means of which the path and the current node are updated. If the t node is reached, the augment takes place, which calculates the flow increment of the path, updates the graph, and resets s as the current node.
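Hedged sketches of these two steps follow; the Edge accessors, setFlow and g.getEdge(from, to) are assumptions about the library internals (back-edge bookkeeping can be modeled in several equivalent ways):

// advance: extend the augmenting path along the admissible edge
private static Node advance(Edge e, LinkedList<Edge> path)
{
    path.addLast(e);
    return e.getTarget(); // the new current node
}

// augment: push the bottleneck residual capacity along the path
private static double augment(Graph g, LinkedList<Edge> path)
{
    // flow increment = minimum residual capacity along the path
    double delta = Double.MAX_VALUE;
    for(Edge e : path)
        delta = Math.min(delta, e.getCapacity() - e.getFlow());

    for(Edge e : path)
    {
        e.setFlow(e.getFlow() + delta);
        // the back-edge (j, i) gains delta units of residual capacity;
        // here this is modeled by decreasing its flow
        Edge back = g.getEdge(e.getTarget(), e.getSource());
        back.setFlow(back.getFlow() - delta);
    }
    return delta;
}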

If, conversely, there are no admissible outbound edges from the current node, the retreat operation is executed, which increments the current node’s distance label and backtracks the augmenting path to the previous node.

An additional termination criterion is also used. During the iterations, the algorithm keeps track of the distribution of the distance labels over the nodes. When a node’s distance label is incremented, the algorithm checks whether the number of nodes having the previous label value has become 0: in this case, there is a guarantee that no more augmenting paths exist, and so the algorithm terminates.
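A hedged sketch of retreat with this “gap” test follows; getAdjacentEdges, setLabel, countNodesWithLabel and the Edge accessors are assumptions about the library internals. Returning null stops the main cycle, as the getMaxFlow listing below shows:

private static Node retreat(Graph g, DistanceLabels labels, Node i,
    LinkedList<Edge> path)
{
    // relabel: 1 + minimum label among outgoing edges with residual capacity
    int newLabel = g.numNodes();
    for(Edge e : g.getAdjacentEdges(i))
    {
        if(e.getCapacity() - e.getFlow() > 0)
            newLabel = Math.min(newLabel, labels.getLabel(e.getTarget()) + 1);
    }

    int oldLabel = labels.getLabel(i);
    labels.setLabel(i, newLabel);

    // secondary criterion: no node has the old label value anymore
    if(labels.countNodesWithLabel(oldLabel) == 0)
        return null; // no augmenting path can exist

    // backtrack the augmenting path to the previous node
    if(!path.isEmpty())
        return path.removeLast().getSource();
    return i; // we are at the source: stay there
}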

Ahuja-Orlin Java implementation

To implement the solution I first had to design the graph, which is the data structure the algorithm runs on. I implemented a directed graph suitable for flow problems, with limited-capacity edges. The graph’s internal representation consists of adjacency and incidence lists; this allows an easier navigation of the graph: for a generic node, both the adjacent and incident edges can be obtained. This representation also makes it possible to optimize some operations executed by the max flow algorithm. The following diagrams show the structure and the relationships of the 5 modules of the solution: the Graph interface and the FlowGraph, Node, Edge, and MaxFlowCalculator classes.

[Class diagram: Node and Edge]

The Edge and Node classes model the basic components of the graph. The edges use double values to model the edge capacity and flow. The nodes are represented by an integer id and a label (string).

[Interface diagram: Graph]

The graph interface (of which FlowGraph is a concrete implementation) exposes the common methods for graph manipulation and navigation, and additionally:

  • a clone() method that returns a “deep copy” of the graph;
  • the setSource(Node) and setSink(Node) methods, used to specify the source and the sink nodes;
  • a getSubGraph(Set<Integer> s) method that returns the induced subgraph from a given node subset (ids) and from the source and sink nodes.

The max flow algorithm has been implemented in the MaxFlowCalculator class. This class, as represented by the following diagram, exposes a single public static method, double getMaxFlow(Graph), which calculates and returns the maximum flow value for the input graph.

[Class diagram: MaxFlowCalculator]

The various steps of the algorithm have been implemented in different private methods for clarity purposes. The distance labels are managed by a DistanceLabels class, which provides methods for obtaining and setting the node labels, and keeps track of the label distribution (this is useful to implement the algorithm’s secondary termination criterion). The getMaxFlow method implementation is reported below.

public static double getMaxFlow(Graph g)
{
    if(g.numNodes() == 0)
    {
        return 0;
    }

    DistanceLabels labels = calcDistanceLabels(g);
    double f = 0; // max flow
    int n = g.numNodes();

    // add back edges to create the residual network
    List<Edge> backEdges = addBackEdges(g);

    // current augmenting path
    LinkedList<Edge> path = new LinkedList<Edge>();

    Node i = g.getSource(); // current node

    /*
       Main termination criterion: the distance label of the source node is
       greater than or equal to the number of nodes. In that case, no more
       augmenting paths from the source to the sink can exist.
    */
    while(i != null && labels.getLabel(g.getSource()) < n)
    {
        Edge e = getAdmissibleEdge(g, i, labels);
        if(e != null)
        {
            i = advance(e, path);
            if(i.equals(g.getSink()))
            {
                double delta = augment(g, path);
                f += delta;
                i = g.getSource();
                path.clear();
            }
        }
        else i = retreat(g, labels, i, path);
    }

    removeBackEdges(g, backEdges);

    return f;
}

Usage example

Let’s show how this Java class library can be used to calculate the maximum flow on a simple test graph, reported in the following figure:

[Figure: the test graph]

This is a flow graph containing 7 nodes and 12 edges; node a is the source and node b is the sink. First of all, let’s see how to build the graph:

private static Graph buildTestGraph()
{
    Graph graph = new FlowGraph();

    Node na = new Node(0, "A");
    Node n1 = new Node(1, "N1");
    Node n2 = new Node(2, "N2");
    Node n3 = new Node(3, "N3");
    Node n4 = new Node(4, "N4");
    Node n5 = new Node(5, "N5");
    Node nb = new Node(6, "B");

    graph.addNode(na);
    graph.setSource(na);
    
    graph.addNode(n1);
    graph.addNode(n2);
    graph.addNode(n3);
    graph.addNode(n4);
    graph.addNode(n5);
    
    graph.addNode(nb);
    graph.setSink(nb);

    graph.addEdge(new Edge(na, n1, 3));
    graph.addEdge(new Edge(na, n3, 5));
    graph.addEdge(new Edge(na, n2, 3));
    graph.addEdge(new Edge(n1, n3, 4));
    graph.addEdge(new Edge(n1, n4, 3));
    graph.addEdge(new Edge(n3, n4, 2));
    graph.addEdge(new Edge(n3, nb, 1));
    graph.addEdge(new Edge(n3, n5, 4));
    graph.addEdge(new Edge(n2, n3, 2));
    graph.addEdge(new Edge(n2, n5, 2));
    graph.addEdge(new Edge(n4, nb, 4));
    graph.addEdge(new Edge(n5, nb, 4));

    return graph;
}

Now let’s calculate the max flow that can be sent from a to b on the whole graph:

// build the test graph
Graph g = buildTestGraph();
    
// calculate max flow from source to sink on g
double f = MaxFlowCalculator.getMaxFlow(g);
System.out.println("max flow on g = " + f);
// print the flow distribution on g
System.out.println("flow distribution on g = " + g.getEdges());

The output is:

max flow on g = 9.0
flow distribution on g = [(0, 1) [3.0 / 3.0], (0, 3) [5.0 / 5.0], (0, 2) [1.0 / 3.0], (1, 3) [0.0 / 4.0], (1, 4) [3.0 / 3.0], (2, 3) [0.0 / 2.0], (2, 5) [1.0 / 2.0], (3, 4) [1.0 / 2.0], (3, 6) [1.0 / 1.0], (3, 5) [3.0 / 4.0], (4, 6) [4.0 / 4.0], (5, 6) [4.0 / 4.0]]

So the maximum flow between a and b on the entire network is 9. The output also shows the flow distribution (obtained with the getEdges() method of the graph). Now let’s see how the maximum flow changes if only a part of the graph is used to route the flow from the source to the sink.

The subgraph induced by nodes 1, 3 and 4 is the following:

[Figure: subgraph induced by nodes 1, 3 and 4]

This subgraph is obtained from the original test graph by removing nodes 2 and 5 and their associated edges. The code for obtaining the subgraph and calculating the max flow follows:

// calculate the max flow from source to sink using
// the subgraph induced from nodes 1, 3, 4
Set<Integer> s134 = new HashSet<Integer>();
s134.add(1);
s134.add(3);
s134.add(4);
Graph g134 = buildTestGraph().getSubGraph(s134);
double f134 = MaxFlowCalculator.getMaxFlow(g134);
System.out.println("max flow on g[1, 3, 4] = " + f134);
System.out.println("flow distribution on g[1, 3, 4] = " +
    g134.getEdges());

To get the subgraph, you must call getSubGraph on the graph object, and pass a set of node ids (1, 3 and 4, in this case). The result we get is:

max flow on g[1, 3, 4] = 5.0
flow distribution on g[1, 3, 4] = [(0, 1) [3.0 / 3.0], (0, 3) [2.0 / 5.0], (1, 3) [0.0 / 4.0], (1, 4) [3.0 / 3.0], (3, 4) [1.0 / 2.0], (3, 6) [1.0 / 1.0], (4, 6) [4.0 / 4.0]]

As expected, using a smaller network, the max flow that can be sent is lower. Finally, let’s consider an edge case (pun intended), in which no flow at all can be sent between the two nodes. The subgraph induced by node 2 is:

[Figure: subgraph induced by node 2]

In this case, no path from the source node to the sink node exists, so we expect the flow to be 0. The code is very similar to the previous examples:

Set<Integer> s2 = new HashSet<Integer>();
s2.add(2);
Graph g2 = buildTestGraph().getSubGraph(s2);
double f2 = MaxFlowCalculator.getMaxFlow(g2);
System.out.println("max flow on g[2] = " + f2);
System.out.println("flow distribution on g[2] = " + g2.getEdges());

The output is:

max flow on g[2] = 0.0
flow distribution on g[2] = [(0, 2) [0.0 / 3.0]]

Download source code

 
