Release versioning of a web application in a Bamboo build

Automating the build and deploy workflow in Atlassian Bamboo is a useful practice to streamline the delivery of a web application. In this post I’m sharing a method for automatic release versioning of a JavaScript web application. With this method you declare the application version in just one place and have Bamboo assign it to each build, together with the build number, producing versions such as:

2.0.0.85
2.0.0.86
...

The first thing to think about is where to store the application version. In the case of a project with NodeJS dependencies, we already have a package.json file in the project root directory where we can store the current version:

{
    "name": "My Application",
    "author": "John Doe",
    "version": "2.0.0",
    "devDependencies": {
        "bower": "*",
        "grunt": "^0.4.0"
    },
    "engines": {
        "node": ">=0.10.0"
    }
}

The next step is having Bamboo read the current version from package.json during the build process. To do that, we can leverage the Inject Bamboo variables task, which reads variables from a text file in “key=value” format. So we first have to extract the version from package.json and write it to a text file, which we can do by adding a new Script task to the build job. The inline shell script uses Node to read the version from package.json, appends the Bamboo build number and writes the result to a text file in the expected format:

app_version="$(node -e 'console.log(require("./package.json").version);')"
version="${app_version}.${bamboo_buildNumber}"
echo "version=$version" > variables.txt
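For example, with the package.json above and build number 85, variables.txt will contain the single line:

version=2.0.0.85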

After the script task we can now add the Inject Bamboo variables task:

Bamboo Inject Variables from file task

The version variable will be stored in the bamboo.artifact namespace (the namespace is configured in the task), and we can access it this way:

${bamboo.artifact.version}

The last step is to configure the release versioning of the deployment plan to use the custom generated version:

Bamboo custom release versioning

 

Developing a permission-based authorization system in an AngularJS app

In this post I’m going to show an implementation of a simple authentication and authorization system in an AngularJS web application. More specifically, the solution provides a declarative way to restrict access to both views and page content. Accessing user information and permissions from controllers and templates is also made really simple. The solution uses a service, a directive and session storage to implement a permission-based access control system.

The solution expects a back-end providing user profile data and permissions in this simple format:

{
   name: "John Doe",
   permissions: ['list_orders', 'read_statistics']
 // ...
}

Restricting access to views

Access control on views is set up in the $routeProvider configuration, where you can annotate each route with two attributes: requiresAuthentication and permissions. The first is a boolean indicating whether the user has to be logged in to access the view. The permissions element is a string array listing the required permissions in a disjunctive (OR) fashion, so that the user is authorized to access the route if they have at least one of them.

angular
  .module('app', [
    'ngResource',
    'ngRoute',
    'AuthServices'
  ])
  .config(function ($routeProvider) {
    $routeProvider
      .when('/login', {
        templateUrl: 'views/login.html',
        controller: 'LoginCtrl'
      })
      .when('/home', {
        templateUrl: 'views/home.html',
        controller: 'HomeCtrl',
        requiresAuthentication: true
      })
      .when('/sales/orders/new', {
        templateUrl: 'views/sales/new_order.html',
        controller: 'OrderDetailCtrl',
        requiresAuthentication: true,
        permissions: ["administration"]
      })
      .when('/sales/orders', {
        templateUrl: 'views/sales/orders.html',
        controller: 'OrdersCtrl',
        requiresAuthentication: true,
        permissions: ["administration", "list_orders"]
      })
 });

Here we have a login view with public access, a home view accessible by all authenticated users, a new order view accessible only by users having the “administration” permission, and an order list view accessible by all users having either the “administration” or the “list_orders” permission.

Restricting access to page content

Another use case is access control on page content. To show an element only to users having a particular permission, just use the permission directive on the DOM element. In this example, we have a navigation bar where we want to show only the links to sections the user is authorized to access. As in the routes case, the permission attribute value is an array of permissions, so the bound DOM element is displayed only if the active user has at least one of them.

<div class="sidebar" ng-show="user" ng-cloak>
   <ul class="navigation">

      <li permission="['administration']">
         <span>Administration area</span>
         <ul>
           <li><a href="#/users">Users</a></li>
           <li><a href="#/settings">Settings</a></li>
         </ul>
      </li>

      <li permission="['administration', 'list_orders']">
         <span>Sales</span>
         <ul>
           <li permission="['administration']">
             <a href="#/sales/orders/new">New order</a>
           </li>
           <li permission="['administration', 'list_orders']">
              <a href="#/sales/orders">Orders list</a>
           </li>
         </ul>
      </li>

   </ul>
</div>

The Auth service and the Permission directive

The Auth service is the main component of the system and implements the various functionalities. The user profile is saved in session storage (wrapped by the ngStorage module) after login. A reference to the current user is also added automatically to the root scope of the app, to make it easy to reference from templates.

angular.module('AuthServices', ['ngResource', 'ngStorage'])
.factory('Auth', function($resource, $rootScope, $sessionStorage, $q){
    
    /**
     *  User profile resource
     */
    var Profile = $resource('/api/profile', {}, {
        login: {
            method: "POST",
            isArray : false
        }
    });
    
    var auth = {};
    
    /**
     *  Saves the current user in the root scope
     *  Call this in the app run() method
     */
    auth.init = function(){
        if (auth.isLoggedIn()){
            $rootScope.user = auth.currentUser();
        }
    };
        
    auth.login = function(username, password){
        return $q(function(resolve, reject){
            Profile.login({username:username, password:password}).$promise
            .then(function(data) {                        
                $sessionStorage.user = data;    
                $rootScope.user = $sessionStorage.user;
                resolve();
            }, function() {
                reject();
            });
        });
    };
    

    auth.logout = function() {
        delete $sessionStorage.user;
        delete $rootScope.user;
    };
    
    
    auth.checkPermissionForView = function(view) {
        if (!view.requiresAuthentication) {
            return true;
        }
        
        return userHasPermissionForView(view);
    };
    
    
    var userHasPermissionForView = function(view){
        if(!auth.isLoggedIn()){
            return false;
        }
        
        if(!view.permissions || !view.permissions.length){
            return true;
        }
        
        return auth.userHasPermission(view.permissions);
    };
    
    
    auth.userHasPermission = function(permissions){
        if(!auth.isLoggedIn()){
            return false;
        }
        
        var found = false;
        // note: angular.forEach cannot be stopped early; the return below only
        // exits the current iteration, the 'found' flag carries the result
        angular.forEach(permissions, function(permission, index){
            if ($sessionStorage.user.permissions.indexOf(permission) >= 0){
                found = true;
                return;
            }
        });
        
        return found;
    };
    
    
    auth.currentUser = function(){
        return $sessionStorage.user;
    };
    
    
    auth.isLoggedIn = function(){
        return $sessionStorage.user != null;
    };
    

    return auth;
});

The permission directive can be bound to DOM elements and directly references the Auth service:

angular.module('app')   
.directive('permission', ['Auth', function(Auth) {
   return {
       restrict: 'A',
       scope: {
          permission: '='
       },

       link: function (scope, elem, attrs) {
            scope.$watch(Auth.isLoggedIn, function() {
                // note: elem.show()/elem.hide() require jQuery; with jqLite only,
                // toggle visibility with elem.css('display', ...) instead
                if (Auth.userHasPermission(scope.permission)) {
                    elem.show();
                } else {
                    elem.hide();
                }
            });
       }
   }
}]);

 

Accessing current user data from controllers and templates

Accessing the user data and permissions is quite simple, like in this controller:

angular.module('app')
  .controller('OrdersCtrl', function ($scope, Auth, Sales) {

    if (Auth.userHasPermission(["administration"])){
        // some evil logic here
        var userName = Auth.currentUser().name;
        // ...
    }

});

All you have to do is add a dependency on the Auth service. To access the user data in a template, simply reference user on the root scope:

<div ng-show="user" ng-cloak>
   <span>{{user.name}}</span> <a ng-click="logout()">Logout</a>
</div>

 

Putting things together

Access control on views is implemented by listening to route change events. Another thing to keep in mind is that the root scope of the application is reset when the page is refreshed (e.g., after hitting F5). We handle both of these issues in the app's run method.

angular.module('app', [
    'ngResource',
    'ngRoute',
    'AuthServices'
])
.run(['$rootScope', '$location', 'Auth', function ($rootScope, $location, Auth) {
    Auth.init();
    
    $rootScope.$on('$routeChangeStart', function (event, next) {
        if (!Auth.checkPermissionForView(next)){
            event.preventDefault();
            $location.path("/login");
        }
    });
  }]);

By calling Auth.init here we re-create the reference to the user data in the root scope after a page refresh.

Logging in and out

A login controller using the Auth service:

angular.module('app').controller('LoginCtrl', function($scope, $location, Auth) {

    $scope.email = "";
    $scope.password = "";
    $scope.failed = false;

    $scope.login = function() {
        Auth.login($scope.email, $scope.password)
          .then(function() {
              $location.path("/home");
          }, function() {
              $scope.failed = true;
          });
    };

});

Handling logout in the main application controller:

angular.module('app')
  .controller('MainCtrl', function ($scope, $rootScope, $location, Auth) {
      
      $rootScope.logout = function(){
        Auth.logout();
        $location.path("/login");
      };
      
  });

Conclusions

The proposed system is a prototype solution for handling authentication and authorization in an AngularJS app with a concept of user permissions. To implement this in your application, however, some back-end-specific details regarding authentication might have to be adapted to fit your architecture.

Building large matrices faster with coo_matrix in Python

The SciPy Python library provides many sparse matrix implementations, including the dok_matrix and coo_matrix classes. These two formats are typically used for incrementally building a sparse matrix.
A dok_matrix (Dictionary Of Keys) is implemented as a hash table mapping each (row, column) key to a non-zero value. A coo_matrix (COOrdinate) consists of three arrays rows, cols and data, containing the row indices, the column indices and the values, such that for the i-th non-zero element, M[rows[i], cols[i]] = data[i].

Though slightly harder to build, a coo_matrix is much faster to construct than a dok_matrix. In this example, I show how the coo_matrix outperforms the dok_matrix when loading an 8-million-entry graph edge-list file in NCOL format, with the coo_matrix turning out to be about 7 times faster than the dok_matrix.

import csv
import numpy as np
from scipy.sparse import coo_matrix, dok_matrix
import timeit

def load_coo_matrix(file_path, separator=" ", transpose=True, default_weight=1.):
    with open(file_path) as file:
        reader = csv.reader(file, delimiter=separator)
        iterator = iter(reader)

        # data structures for the coo_matrix
        data = []
        rows = []
        cols = []

        nodes = 0
        for row in iterator:
            if row:
                source, target = int(row[0]), int(row[1])
                nodes = max(nodes, source, target)
                rows.append(source)
                cols.append(target)
                if len(row) > 2:
                    weight = float(row[2])
                else:
                    weight = default_weight

                data.append(weight)
        nodes += 1

        # create and return the coo_matrix
        data = np.array(data, copy=False)
        rows = np.array(rows, dtype=np.int32, copy=False)
        cols = np.array(cols, dtype=np.int32, copy=False)
        m = coo_matrix((data, (rows, cols) if not transpose else (cols, rows)), shape=(nodes, nodes))
        return m

def load_dok_matrix(file_path, max_dim, separator=" ", transpose=True, default_weight=1.):
    with open(file_path) as file:
        reader = csv.reader(file, delimiter=separator)
        iter_reader = iter(reader)
        m = dok_matrix((max_dim, max_dim))

        dim = 0
        for row in iter_reader:
            if row:
                source, target = int(row[0]), int(row[1])
                dim = max(dim, source, target)
                if len(row) > 2:
                    weight = float(row[2])
                else:
                    weight = default_weight

                if transpose:
                    m[target, source] = weight
                else:
                    m[source, target] = weight

        dim += 1
        return m[:dim, :dim]

def benchmark_coo_matrix():
    load_coo_matrix("C:/datasets/ff_8m")

def benchmark_dok_matrix():
    load_dok_matrix("C:/datasets/ff_8m", 8500000)

if __name__ == "__main__":
    LOOPS = 3
    coo_matrix_time = timeit.timeit("benchmark_coo_matrix()",
                                    setup="from __main__ import benchmark_coo_matrix",
                                    number=LOOPS) / LOOPS

    print("coo_matrix:", "%.2f" % coo_matrix_time, "sec")

    dok_matrix_time = timeit.timeit("benchmark_dok_matrix()",
                                    setup="from __main__ import benchmark_dok_matrix",
                                    number=LOOPS) / LOOPS

    print("dok_matrix:", "%.2f" % dok_matrix_time, "sec")

 

Output

coo_matrix: 35.93 sec
dok_matrix: 266.86 sec

 

A simple Java wrapper for Apache Lucene

In this post I’m sharing an implementation of a simple Java wrapper class for the information retrieval library Apache Lucene, along with some usage examples.

About Apache Lucene

Apache Lucene is a popular and widely used open-source information retrieval library.
Lucene is a powerful and flexible library used to build various types of search engines, and it is valued above all for its full-text indexing and querying capabilities. In Lucene, each document is modeled as a set of fields containing text. This makes Lucene independent of the specific file format: web pages, PDFs, Microsoft Word documents, etc. can all be indexed, as long as their textual content can be extracted.

Using Lucene: issues and restrictions

Correctly using the Lucene library isn’t trivial, because of the scarcity of useful documentation and also because of a number of technical issues that must be considered.

The lifecycle of the index, and of all the API components used to read and write it, must be properly managed to keep index operations consistent, efficient and reliable. Lucene uses two major components to access the index: IndexWriter and IndexSearcher. The former is used to create the index and to add and delete documents; the latter is used to run search queries against it. One important restriction is that no more than one IndexWriter at a time can be opened on a given index directory, as pointed out in the IndexWriter documentation. IndexWriter instances are thread-safe, so multiple threads can concurrently use the same instance to modify the index.

The IndexSearcher component relies on an IndexReader instance, which offers a “frozen” view of the index. This implies that whenever the index is updated, the IndexReader must be reopened to see the changes, and a new IndexSearcher must be obtained from it. This can lead to some issues, because old IndexReader instances must be closed to avoid memory leaks and an excessive number of open files on the file system. However, in a concurrent environment, closing an old IndexReader that is still referenced by another thread may prevent that thread from searching the index, since it might run into an AlreadyClosedException (see also How to safely close an IndexReader). Lucene 3.5 introduced the SearcherManager component, which automatically manages the reopening and recycling of IndexReaders in a concurrent setting.

Query construction in Lucene can be done in two different ways: by using a QueryParser object or by using Lucene’s query API. The QueryParser generates a Query object from a query string. That’s the simplest way to build a query, but it has some limitations: for instance, you cannot use it to query a numeric field.
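As a rough illustration of the difference, here is a minimal comparison; the field names are made up, and the classes are the same ones imported by the wrapper class shown further below:

// QueryParser: builds a Query from a string, but treats every field as text
QueryParser parser = new QueryParser(Version.LUCENE_4_9, "content", analyzer);
Query parsed = parser.parse("title:lucene");

// query API: term and numeric-range clauses can be combined explicitly
BooleanQuery built = new BooleanQuery();
built.add(new TermQuery(new Term("title", "lucene")), BooleanClause.Occur.MUST);
built.add(NumericRangeQuery.newIntRange("year", 2014, 2014, true, true), BooleanClause.Occur.MUST);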

Another tricky aspect of the Lucene API is that documents aren’t assigned a stable ID. Lucene does assign each document a progressive internal number, but it can change when other documents are deleted. Therefore, it’s up to the programmer to assign a unique ID to each document. Being able to access a document by its ID is a must in many application contexts.

IndexWrapper – a wrapper class for Lucene

This class provides methods for:

  • Adding new documents to the index, assigning each a unique ID
  • Searching the index, supporting boolean and phrase queries on string and integer fields
  • Deleting documents from the index
  • Getting the number of indexed documents

The query parameters are passed to the search method via a SearchParams object, which contains an arbitrary set of string and integer fields. The method generates a boolean OR query for unquoted string fields and a phrase query for double-quoted string fields. A NumericRangeQuery is generated for each integer field. The string fields are preprocessed through an analyzer, which usually includes a language-specific stop-word list and a stemmer. The default analyzer can be changed using the setAnalyzer(Analyzer) method.

This class is designed to be thread-safe, so multiple threads can use it to search the index and add or delete documents.
Moreover, the class favors strict consistency over performance: the index is committed every time a document is added or deleted, and the SearcherManager is refreshed before each query is executed.

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.UUID;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.it.ItalianAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

/**
 *
 * @author Stefano Scerra (www.stefanoscerra.it)
 *
 * A wrapper class for Apache Lucene 4.9+ index management.
 * This class is designed to work with a file system-resident document index.
 * Provides methods to add new documents, perform searching, deletion, and more.
 * It automatically manages the index lifecycle, creating the index when
 * it doesn't exist, and efficiently reusing IndexSearcher/IndexReader instances
 * by using a SearcherManager.
 *
 */
public final class IndexWrapper
{    
    private static final Map<String, IndexWrapper> instancePool = new HashMap<String, IndexWrapper>();
    
    private final String indexPath;   
    // set the default analyzer
    private Analyzer analyzer = new ItalianAnalyzer(Version.LUCENE_4_9);    
    private IndexWriter indexWriter;
    private SearcherManager searchMgr;
    private boolean isOpen = false;
    
    /**
     *  Contains a set of string and integer fields to use in a search query.
     */
    public static class SearchParams
    {
        private Map<String, String> strings = new HashMap<String, String>();
        private Map<String, Integer> integers = new HashMap<String, Integer>();        

        public String getString(String fieldName)
        {
            return strings.get(fieldName);
        }

        public int getInt(String fieldName)
        {
            return integers.get(fieldName);
        }       
        
        public void putInt(String fieldName, int value)
        {
            integers.put(fieldName, value);
        }
        
        public void putString(String fieldName, String value)
        {
            strings.put(fieldName, value);
        }
        
        public Set<Map.Entry<String, String>> stringSet()
        {
            return strings.entrySet();
        }
        
        public Set<Map.Entry<String, Integer>> intSet()
        {
            return integers.entrySet();
        }

        public boolean isEmpty()
        {
            return strings.isEmpty() && integers.isEmpty();
        }
    }

     /**
     * Returns a unique instance of IndexWrapper for a given index.
     * @param indexPath the lucene index path on the file system
     * @return
     */
    public static synchronized IndexWrapper getInstance(String indexPath) throws IOException
    {
        IndexWrapper indexWrapper = instancePool.get(indexPath);
        if(indexWrapper != null) return indexWrapper;
        indexWrapper = new IndexWrapper(indexPath);
        instancePool.put(indexPath, indexWrapper);
        return indexWrapper;
    }
    
    private IndexWrapper(String indexPath) throws IOException
    {
        this.indexPath = indexPath;
        openIndex();
        initSearcherManager();
        isOpen = true;
    }
    
    
    public boolean isOpen()
    {
        return isOpen;
    }
    
    /**
     * Sets the analyzer to be used for indexing and querying.
     * Note: the IndexWriter opened in the constructor keeps the analyzer it was
     * created with, so changing the analyzer here only affects query parsing,
     * unless the index is closed and reopened afterwards.
     * @param analyzer
     */
    public void setAnalyzer(Analyzer analyzer)
    {
        this.analyzer = analyzer;
    }   
    
     /**
     * Opens the index on the file system.
     * If the index does not exist, it is created.
     * Must be called only once.
     * @throws IOException
     */
    private void openIndex() throws IOException
    {
        Directory dir = FSDirectory.open(new File(indexPath));
        IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_4_9, analyzer);
        cfg.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        IndexWriter iw = new IndexWriter(dir, cfg);
        iw.commit();
        indexWriter = iw;
    }
    
    /**
     * Initializes the index's SearcherManager component, which
     * automatically manages the IndexSearcher instances
     * @throws IOException
     */
    private void initSearcherManager() throws IOException
    {
        searchMgr = new SearcherManager(indexWriter, true, null);
    }
    
    /**
     * Finds and returns the document with the given id.
     * @param id
     * @return
     * @throws IOException
     */
    public Document getDocumentById(long id) throws IOException
    {        
        searchMgr.maybeRefreshBlocking();        
        IndexSearcher searcher = searchMgr.acquire();
        try
        {
            NumericRangeQuery q = NumericRangeQuery.newLongRange("id", id, id, true, true);
            TopDocs topDocs = searcher.search(q, 1);
            if(topDocs.scoreDocs.length == 0) return null;
            ScoreDoc scoreDoc = topDocs.scoreDocs[0];
            Document doc = searcher.doc(scoreDoc.doc);
            return doc;
        }
        finally
        {
            searchMgr.release(searcher);
        }
    }
    
    /**
     * Generates a new unique id for a document.
     * @throws IOException
     */
    private long generateDocID() throws IOException
    {
        long id;
        do
        {
            id = Math.abs(UUID.randomUUID().getLeastSignificantBits());
        }
        while (getDocumentById(id) != null); // we want a unique id
        
        return id;
    }
    
    /**
     * Adds a document to the index.
     * @param doc
     * @throws IOException
     */
    public void addDocument(Document doc) throws IOException
    {
        // generate a unique id for the document and add it to the index
        doc.add(new LongField("id", generateDocID(), Field.Store.YES));
        indexWriter.addDocument(doc);
        indexWriter.commit();
    }
    
    /**
     * Generates a boolean or a phrase query for a given field.
     * @param fieldName
     * @param value
     */
    private static Query createStringFieldQuery(String fieldName, String value)
    {
        if (value.matches("\".*\""))
        {
            // create phrase query
            String unquotedStr = value.substring(1, value.length() - 1);
            PhraseQuery phrase = new PhraseQuery();
            for (String tok : unquotedStr.split("\\s"))
            {
                phrase.add(new Term(fieldName, tok));
            }
            return phrase;
        }
        else
        {
            // create a boolean OR query
            BooleanQuery boolQuery = new BooleanQuery();
            for (String tok : value.split("\\s"))
            {
                boolQuery.add(new TermQuery(new Term(fieldName, tok)), BooleanClause.Occur.SHOULD);
            }

            return boolQuery;
        }
    }
    
    /**
     * Performs a search on the index. The query parameters are passed via a
     * SearchParams object.
     * @param params
     * @return A list of the top-10 documents matching the query
     * @throws IOException
     * @throws ParseException
     */
    public List<Document> search(SearchParams params) throws IOException, ParseException
    {        
        searchMgr.maybeRefreshBlocking();        
        IndexSearcher searcher = searchMgr.acquire();
        try
        {
            List<Document> results = new ArrayList<Document>();
            BooleanQuery query = new BooleanQuery();

            // create query for string fields
            for (Map.Entry<String, String> stringField : params.stringSet())
            {
                query.add(createStringFieldQuery(stringField.getKey(), stringField.getValue()),
                        BooleanClause.Occur.SHOULD);
            }
            // create query for int fields
            for (Map.Entry<String, Integer> intField : params.intSet())
            {
                query.add(NumericRangeQuery.newIntRange(intField.getKey(), intField.getValue(),
                        intField.getValue(), true, true), BooleanClause.Occur.SHOULD);
            }

            QueryParser qp = new QueryParser(Version.LUCENE_4_9, "", analyzer);
            Query q = qp.parse(query.toString());
            TopDocs topDocs = searcher.search(q, 10);
            
            for (ScoreDoc doc : topDocs.scoreDocs)
            {
                results.add(searcher.doc(doc.doc));
            }

            return results;
        }
        finally
        {
            searchMgr.release(searcher);
        }
    }
    
    /**
     * Deletes the document with the given id from the index.
     * @param id
     * @throws IOException
     */
    public void deleteDocument(long id) throws IOException
    {    
        // delete from index
        Query deleteQuery = NumericRangeQuery.newLongRange("id", id, id, true, true);
        indexWriter.deleteDocuments(deleteQuery);
        indexWriter.commit();
    }
    
    /**
     * Returns the number of documents currently indexed.
     * @throws IOException
     */
    public int getNumDocs() throws IOException
    {       
        searchMgr.maybeRefreshBlocking();        
        IndexSearcher searcher = searchMgr.acquire();
        try
        {
            return searcher.getIndexReader().numDocs();
        }
        finally
        {
            searchMgr.release(searcher);
        }       
    }
    
    /**
     * Closes the index along with IndexWriter and SearcherManager components.
     * Call this method after you're done working with the index.
     * @throws IOException
     */
    public void close() throws IOException
    {
        synchronized(this)
        {
            if(searchMgr != null) searchMgr.close();
            if(indexWriter != null) indexWriter.close();
            searchMgr = null;
            indexWriter = null;
            isOpen = false;
        }
    }
    
    /**
     *  Reopens the index.
     *  close() must be called first.
     * @throws IOException
     */
    public void reopen() throws IOException
    {
        synchronized(this)
        {
            openIndex();
            initSearcherManager();
            isOpen = true;
        }
    }
    
}

Examples of usage

A stateless session bean that uses the IndexWrapper class to add a PDF document to the Lucene index

@Stateless
@TransactionAttribute(TransactionAttributeType.NOT_SUPPORTED)
public class IndexManager implements IndexManagerLocal
{
    public static final String ARCHIVE_PATH = "C:\\archive";
    public static final String INDEX_PATH = "C:\\index";
    
    private Document createDocument(String author, String title, int year, String fileName, String content)
    {
        Document doc = new Document();
        
        doc.add(new TextField("author", author, Field.Store.YES));
        doc.add(new TextField("title", title, Field.Store.YES));
        doc.add(new TextField("content", content, Field.Store.NO));
        doc.add(new IntField("year", year, Field.Store.YES));
        doc.add(new StoredField("fileName", fileName));
        
        return doc;
    }
    
    @Override
    public void addDocument(String author, String title, int year, String fileName)
    {
        try
        {
            // get an IndexWrapper instance for this index
            IndexWrapper indexWrapper = IndexWrapper.getInstance(INDEX_PATH);
            
            // get PDF text content via the PDFBox library
            File file = new File(ARCHIVE_PATH + "\\" + fileName);
            PDFTextStripper textStripper = new PDFTextStripper();
            PDDocument pdDoc = PDDocument.load(file);
            String content = textStripper.getText(pdDoc);
            pdDoc.close();

            // create the Document object and add it to the index
            Document doc = createDocument(author, title, year, fileName, content);
            indexWrapper.addDocument(doc);
        }
        catch (Exception ex)
        {
            Logger.getLogger(IndexManager.class.getName()).log(Level.SEVERE, null, ex);
        }
    }
}

Performing a search on the index

// choose some query parameters
String author, content;
int year;

[...]

IndexWrapper indexWrapper = IndexWrapper.getInstance(INDEX_PATH);
SearchParams params = new SearchParams();

params.putString("author", author);
params.putString("content", content);
params.putInt("year", year);

List<Document> docs = indexWrapper.search(params);

 

 

A data mining experiment: movie reviews classification using WEKA

In this post I’m going to show a simple machine learning experiment in which I perform a sentiment classification task on a movie reviews dataset using WEKA, an open-source data mining tool. The goal is to classify a movie review as positive or negative (towards the reviewed movie).

I start by importing the reviews dataset into WEKA, then I perform some text preprocessing tasks such as word extraction, stop-words removal, stemming and term selection. Finally, I run various classification algorithms (Naive Bayes, k-nearest neighbors) and compare the results in terms of classification accuracy.

The movie reviews dataset

The dataset consists of 2000 user-created movie reviews archived on the IMDb (Internet Movie Database) web portal at http://reviews.imdb.com/Reviews and is known as “Sentiment Polarity Dataset version 2.0” (http://www.cs.cornell.edu/People/pabo/movie-review-data). The reviews are equally partitioned into a positive set and a negative set (1000+1000).

Each review consists of a plain text file (.txt) and a class label representing the overall user opinion. The class attribute has only two values: pos (positive) or neg (negative). The dataset creators explain that the class of each review was determined using some simple rules based on the user’s vote for the movie, as extracted from the original review. For instance, if a 1-10 scale is used, a vote greater than or equal to 6 is considered positive and anything less than 6 negative. The authors also state that similar criteria were used for other vote scales (e.g., 1-5, A-F, etc.).

Importing the dataset in WEKA

The first thing to do is import the dataset into WEKA. Practically speaking, we have an archive containing 2000 text files partitioned into two sub-directories, pos and neg (the class values). WEKA provides a simple import procedure for textual datasets by means of the TextDirectoryLoader component. Using this loader (fig. 1), WEKA automatically creates a relation with 2 attributes: the first contains the text data, the second is the document class, as determined by the sub-directory containing the file (pos or neg).

TextDirectoryLoader
Fig. 1: importing the dataset in WEKA
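As a side note, the same import step can be scripted through WEKA’s Java API instead of the GUI; a minimal sketch (the dataset path is made up):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.TextDirectoryLoader;

// load the pos/ and neg/ sub-directories as a two-attribute relation (text + class)
TextDirectoryLoader loader = new TextDirectoryLoader();
loader.setDirectory(new File("C:/datasets/movie_reviews"));
Instances raw = loader.getDataSet();
raw.setClassIndex(raw.numAttributes() - 1); // the class attribute created by the loader is the last one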

After the import, you can save the dataset in the ARFF format and manually edit it with a text editor to set some decent names for the relation and the attributes. Figure 2 shows the resulting relation.

Fig. 2: imported dataset and class distribution

As expected, we get a relation containing 2000 instances and two attributes (text and class). The histogram in the figure shows the uniform distribution of the review classes (blue = negative, red = positive).

Text preprocessing and feature extraction in WEKA

For the classification task to be done, a preliminary phase of text preprocessing and feature extraction is essential. We want to transform each text into a vector form, in which each document is represented by the presence (or frequency) of some “important” terms; these terms are the ones contained in the collection vocabulary. To build the vocabulary, various operations are typically performed (many of which are language-specific):

  • Word parsing and tokenization
    In this phase, each document is analyzed with the purpose of extracting the terms. Separator characters must be defined, along with a tokenization strategy for particular cases such as accented words, hyphenated words, acronyms, etc.
  • Stop-words removal
    A very common technique is the elimination of frequently used words: conjunctions, prepositions, base verbs, etc. These kinds of terms should be filtered out, as they have poor characterizing power and are therefore useless for text classification.
  • Lemmatization and stemming
    The lemmatization of a word is the process of determining its lemma. The lemma can be thought of as the “common root” of various related inflectional forms; for instance, the words walk, walking and walked all derive from the lemma walk.
    A simple technique for approximate lemmatization is stemming. Stemming algorithms work by removing the suffix of a word according to some grammatical rules.
  • Term selection/feature extraction
    The term set resulting from the previous phases still has to be filtered, since we need to remove the terms that have poor predictive ability (w.r.t. the document class) or are strongly correlated with other terms. This term selection task also leads to a simpler and more efficient classification.

To perform the preprocessing in WEKA, I used the StringToWordVector filter from the weka.filters.unsupervised.attribute package. This filter lets you configure the different stages of the term extraction (fig. 3). In particular, you can:

  • Configure the tokenizer (term separators);
  • Specify a stop-words list;
  • Choose a stemmer.

The preprocessing operations affect the quality of the classification; for this reason, I will perform various experiments on different generated datasets.

Fig. 3: StringToWordVector filter configuration

The default text retrieval model used by the StringToWordVector filter is boolean: each document is represented by an n-dimensional boolean vector, where n is the size of the vocabulary, and each value models the presence or absence of a vocabulary term in the document. One can also choose a frequency-based model such as TF-IDF weighting by setting the TFTransform and IDFTransform parameters to true.

You can set a stop-words list by clicking on stopwords and setting the useStopList parameter to true. In my experiments I used a 630-word English stop-words list (whose origin I don’t recall) and Porter’s stemmer (for English). Before you can use the stemmer, you must download the Snowball stemmers library and add it to the classpath.

Furthermore, you can set a maximum limit on the number of words to be extracted through the wordsToKeep parameter (default: 1000 words), and a minimum document frequency for each term through the minTermFreq parameter (default: 1). The latter makes the filter drop the terms that appear in fewer than minTermFreq documents (of the same class – see note). NOTE: by default, these two parameters are applied with respect to the document class. To have them applied without considering the class, set DoNotOperateOnPerClassBasis to true.
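For reference, the whole term-extraction step can also be configured programmatically, reusing the raw dataset loaded in the sketch above; this is only an outline (stop-word configuration is omitted because its API differs between WEKA versions):

import weka.core.stemmers.SnowballStemmer;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

StringToWordVector s2wv = new StringToWordVector();
s2wv.setWordsToKeep(1000);                  // wordsToKeep
s2wv.setMinTermFreq(1);                     // minTermFreq
s2wv.setDoNotOperateOnPerClassBasis(true);  // apply the two limits globally, not per class
s2wv.setStemmer(new SnowballStemmer());     // defaults to the Porter stemmer; requires the snowball jar
s2wv.setInputFormat(raw);
Instances vectorized = Filter.useFilter(raw, s2wv);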

After applying the StringToWordVector filter, we get the result shown in figure 4.

Fig. 4: after term extraction

We get a relation containing 1251 binary attributes. The histogram shows the distribution of the term ‘bad’ among the documents: it mostly appears (value 1) in negative reviews (blue).

The last preprocessing operation is attribute selection. As mentioned earlier, eliminating poorly characterizing attributes can improve classification accuracy. For this task, WEKA provides the AttributeSelection filter from the weka.filters.supervised.attribute package. The filter lets you choose an attribute evaluation method and a search strategy (fig. 5).

Fig. 5: AttributeSelection filter parameters

The default evaluation method is CfsSubsetEval (correlation-based feature subset selection). This method works by choosing attributes that are highly correlated with the class attribute while having a low correlation with each other.
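The same attribute selection step could be applied programmatically as well; again just a sketch, continuing from the vectorized dataset of the previous snippet:

import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.filters.supervised.attribute.AttributeSelection;

AttributeSelection attSel = new AttributeSelection();
attSel.setEvaluator(new CfsSubsetEval()); // correlation-based feature subset selection
attSel.setSearch(new BestFirst());        // the default search strategy
attSel.setInputFormat(vectorized);
Instances reduced = Filter.useFilter(vectorized, attSel);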

After applying the AttributeSelection filter, we obtain the result shown in figure 6.

Fig. 6: after applying the AttributeSelection filter

As you can see, the number of attributes has decreased dramatically, from more than 1000 to just 52.

Classification

The classification problem is a supervised learning task that consists of assigning a class label to an unclassified tuple, based on an already classified instance set that is used as the training set for the algorithm.

The two classification algorithms that I’m going to use are Naive Bayes and k-nearest neighbors (k = 1, 3, 5). The quality measure that will be considered is the percentage of correctly classified instances. For the validation phase I’ll use the 10-fold cross validation method.
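For completeness, here is a sketch of how these evaluations could be run through the WEKA API instead of the Explorer GUI, continuing from the reduced dataset of the attribute selection sketch above:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IBk;

// 10-fold cross-validation of Naive Bayes
Evaluation nbEval = new Evaluation(reduced);
nbEval.crossValidateModel(new NaiveBayes(), reduced, 10, new Random(1));
System.out.println("Naive Bayes: " + nbEval.pctCorrect() + "% correctly classified");

// 10-fold cross-validation of the 3-nearest neighbors classifier (Euclidean distance by default)
Evaluation knnEval = new Evaluation(reduced);
knnEval.crossValidateModel(new IBk(3), reduced, 10, new Random(1));
System.out.println("3-NN: " + knnEval.pctCorrect() + "% correctly classified");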

Naive Bayes classifier

The Naive Bayes classifier assigns a class to an unlabeled object according to a maximum likelihood principle. More specifically, the instance is assigned the class that maximizes the “a posteriori” probability, which is a function of the class prior probability and of the instance likelihood w.r.t. the class (fig. 7).

Fig. 7: Naive Bayes classifier (1)

This classifier is called naive because it assumes the attribute independence hypothesis. Under that assumption, the computation of the conditional probability P(d | c) reduces to a product of the probabilities of the individual attributes (fig. 8).

Fig. 8: Naive Bayes classifier (2)

In WEKA, the Naive Bayes classifier is implemented in the NaiveBayes component from the weka.classifiers.bayes package. The best result achieved with this classifier was a correctness percentage of 81.45% (fig. 9), using a dataset on which only attribute selection was performed (no stop-words removal, no stemming – see the comparison table).

Fig. 9: best classification result with Naive Bayes

K-nearest neighbors classifier

The k-nearest neighbors classifier assigns to an instance the class that is prevalent among the k nearest instances. For this to be possible, a distance function between instances must be defined; in this experiment the Euclidean distance was used. The k parameter is chosen to be an odd number, so that a majority always exists. When k = 1, the classifier simply becomes a nearest neighbor classifier.

In WEKA, the k-NN classifier is implemented in the weka.classifiers.lazy.IBk component. As already said, NN, 3-NN and 5-NN were used in this experiment. The best result achieved with this kind of classifier was a correctness percentage of 73.95% (fig. 10), using the 3-nearest neighbors classifier on a dataset on which stop-words removal, stemming and attribute selection were performed (see the comparison table).

Fig. 10: best classification result with k-nearest neighbors (3-NN)

Comparison of results


Java implementation of a max flow algorithm on a graph

In this post I’ll describe a Java implementation of a fast maximum flow algorithm, known as the Ahuja-Orlin max flow algorithm. The implementation is part of a small, easy-to-use Java class library which can be used to model a flow graph, along with its nodes and edges, and to find the maximum flow that can be sent from a source node to a sink node. The library also provides a convenient way to calculate the maximum flow on a subgraph.

The Ahuja-Orlin max flow algorithm

The algorithm I chose to implement is Improved shortest augmenting path from Ahuja and Orlin. I chose it after finding that it came out as the best performing algorithm in an experimental analysis of max-flow graph algorithms from TopCoder.

This algorithm uses the concepts of shortest augmenting path and distance labels. It iteratively searches for the shortest augmenting path in the residual network of the graph and terminates when certain conditions on the distance labels are met: these conditions indicate that the flow distribution is optimal and that no other augmenting paths exist.

In the following listing, the algorithm pseudocode is reported (source: TopCoder). The source node is called s and the sink node is called t.

improved-shortest-augmenting-path-alg

Given an input graph, the algorithm initially builds the associated residual network. To obtain a residual network, back-edges are added to the graph: for each edge (i, j), a (j, i) edge with zero capacity and flow is created.

Then, an initial labelling of the nodes is performed. Each node i is associated with a distance label d[i] representing the length, in terms of nodes, of the shortest path between i and the sink node in the residual network, with d[t] = 0. This labelling is done by means of a reverse breadth-first search of the graph, starting from the sink node.

If, during the iterations of the algorithm, the distance label of node s becomes greater than or equal to the number of nodes, no more augmenting paths can exist in the residual network; this is the main termination criterion of the algorithm.
An edge (i, j) is called admissible if d[i] = d[j] + 1. The shortest augmenting path between s and t is defined as an s -> t path consisting only of admissible edges, each with a residual capacity rij > 0.

The algorithm is composed of 4 main procedures: the main cycle, the advance procedure, the retreat procedure, and the augment procedure. In the main cycle, the algorithm iteratively tries to build an augmenting path from s to t, by using admissible edges.

If an admissible edge can be taken from the current node, an advance is performed, by means of which the path and the current node are updated. If the t node is reached, the augment takes place, which calculates the flow increment of the path, updates the graph, and resets s as the current node.

If, conversely, there are no admissible outbound edges from the current node, the retreat operation is executed, which increments the current node’s distance label and backtracks the augmenting path to the previous node.

An additional termination criterion is also used. During the generic iterations, the algorithm keeps track of the distance labels distribution over the nodes. When a node’s distance label is incremented, the algorithm tests whether the number of nodes having the previous label value has become 0: in this case, there is the guarantee that no more augmenting paths exist, and so the algorithm terminates.

Ahuja-Orlin Java implementation

To implement the solution I first had to design the graph, which is the data structure the algorithm runs on. I implemented a directed graph suitable for flow problems, with limited-capacity edges. The graph is represented internally with adjacency and incidence lists; this makes the graph easier to navigate, since both the adjacent and the incident edges of a generic node can be obtained, and it also makes it possible to optimize some operations executed by the max flow algorithm. The following diagrams show the structure and the relationships of the 5 modules of the solution: the Graph interface and the FlowGraph, Node, Edge, and MaxFlowCalculator classes.

class-node-edge

The Edge and Node classes model the basic components of the graph. The edges use double values to model the edge capacity and flow. The nodes are represented by an integer id and a label (string).

graph-interface

The Graph interface (of which FlowGraph is a concrete implementation) exposes the common methods for graph manipulation and navigation, and additionally:

  • a clone() method that returns a “deep copy” of the graph;
  • The setSource(Node) and setSink(Node) methods, used to specify the source and the sink nodes;
  • a getSubGraph(Set<Integer> s) method that returns the induced subgraph from a given node subset (ids) and from the source and sink nodes.

The max flow algorithm has been implemented in the MaxFlowCalculator class. As represented by the following diagram, this class exposes a single public static method, double getMaxFlow(Graph), which calculates and returns the maximum flow value for the input graph.

maxflowcalculator

The various steps of the algorithm have been implemented in different private methods for clarity purposes. The distance labels are managed by a DistanceLabels class, which provides methods for obtaining and setting the node labels, and keeps track of the label distribution (this is useful to implement the algorithm’s secondary termination criterion). The getMaxFlow method implementation is reported below.

public static double getMaxFlow(Graph g)
{
    if(g.numNodes() == 0)
    {
        return 0;
    }

    DistanceLabels labels = calcDistanceLabels(g);
    double f = 0; // max flow
    int n = g.numNodes();

    // add back edges to create residual network
    List<Edge> backEdges = addBackEdges(g);

    // current augmenting path
    LinkedList<Edge> path = new LinkedList<Edge>();

    // distance of source from the sink
    int sourceDist;
    Node i = g.getSource(); // current node

    /*
      Main termination criterion: distance label of source node is greater or
      equal to the number of nodes. In that case, no more augmenting paths
      from the source to the sink can exist.
    */

    while(i != null && (sourceDist = labels.getLabel(g.getSource())) < n)
    {
        Edge e = getAdmissibleEdge(g, i, labels);
        if(e != null)
        {
            i = advance(e, path);
            if(i.equals(g.getSink()))
            {
                double delta = augment(g, path);
                f += delta;
                i = g.getSource();
                path.clear();
            }
        }
        else i = retreat(g, labels, i, path);
    }

    removeBackEdges(g, backEdges);

    return f;
}
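The DistanceLabels class is not reproduced here; the following is only a minimal sketch of the idea behind it, i.e. one label per node plus a count of how many nodes currently hold each label value, so that the gap-based termination test is cheap. getLabel matches the call used in getMaxFlow above; the other member names are illustrative:

import java.util.HashMap;
import java.util.Map;

class DistanceLabels
{
    private final Map<Node, Integer> labels = new HashMap<Node, Integer>();
    private final Map<Integer, Integer> counts = new HashMap<Integer, Integer>();

    public int getLabel(Node node)
    {
        return labels.get(node);
    }

    /**
     * Sets a node's distance label and returns true when no node holds the
     * old label value anymore: the "gap" that triggers the secondary
     * termination criterion of the algorithm.
     */
    public boolean setLabel(Node node, int newLabel)
    {
        Integer old = labels.put(node, newLabel);
        Integer newCount = counts.get(newLabel);
        counts.put(newLabel, newCount == null ? 1 : newCount + 1);
        if (old == null) return false;
        int remaining = counts.get(old) - 1;
        counts.put(old, remaining);
        return remaining == 0;
    }
}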

Usage example

Let’s show how this Java class library can be used to calculate the maximum flow on a simple test graph, reported in the following figure:

graph

This is a flow graph containing 7 nodes and 12 edges; node a is the source and node b is the sink. First of all, let’s see how to build the graph:

private static Graph buildTestGraph()
{
    Graph graph = new FlowGraph();

    Node na = new Node(0, "A");
    Node n1 = new Node(1, "N1");
    Node n2 = new Node(2, "N2");
    Node n3 = new Node(3, "N3");
    Node n4 = new Node(4, "N4");
    Node n5 = new Node(5, "N5");
    Node nb = new Node(6, "B");

    graph.addNode(na);
    graph.setSource(na);
    
    graph.addNode(n1);
    graph.addNode(n2);
    graph.addNode(n3);
    graph.addNode(n4);
    graph.addNode(n5);
    
    graph.addNode(nb);
    graph.setSink(nb);

    graph.addEdge(new Edge(na, n1, 3));
    graph.addEdge(new Edge(na, n3, 5));
    graph.addEdge(new Edge(na, n2, 3));
    graph.addEdge(new Edge(n1, n3, 4));
    graph.addEdge(new Edge(n1, n4, 3));
    graph.addEdge(new Edge(n3, n4, 2));
    graph.addEdge(new Edge(n3, nb, 1));
    graph.addEdge(new Edge(n3, n5, 4));
    graph.addEdge(new Edge(n2, n3, 2));
    graph.addEdge(new Edge(n2, n5, 2));
    graph.addEdge(new Edge(n4, nb, 4));
    graph.addEdge(new Edge(n5, nb, 4));

    return graph;
}

Now let’s calculate the max flow that can be sent from a to b on the whole graph:

// build the test graph
Graph g = buildTestGraph();
    
// calculate max flow from source to sink on g
double f = MaxFlowCalculator.getMaxFlow(g);
System.out.println("max flow on g = " + f);
// print the flow distribution on g
System.out.println("flow distribution on g = " + g.getEdges());

The output is:

max flow on g = 9.0
flow distribution on g = [(0, 1) [3.0 / 3.0], (0, 3) [5.0 / 5.0], (0, 2) [1.0 / 3.0], (1, 3) [0.0 / 4.0], (1, 4) [3.0 / 3.0], (2, 3) [0.0 / 2.0], (2, 5) [1.0 / 2.0], (3, 4) [1.0 / 2.0], (3, 6) [1.0 / 1.0], (3, 5) [3.0 / 4.0], (4, 6) [4.0 / 4.0], (5, 6) [4.0 / 4.0]]

So the maximum flow between a and b on the entire network is 9. The output also shows the flow distribution (obtained with the getEdges() method of the graph). Now let’s see how the maximum flow changes if only a part of the graph is used to route the flow from the source to the sink.

The subgraph induced from the 1, 3 and 4 nodes is the following:

graph-134

This subgraph is obtained from the original test graph by removing nodes 2 and 5 and their associated edges. The code for obtaining the subgraph and calculating the max flow follows:

// calculate the max flow from source to sink using
// the subgraph induced from nodes 1, 3, 4
Set<Integer> s134 = new HashSet<Integer>();
s134.add(1);
s134.add(3);
s134.add(4);
Graph g134 = buildTestGraph().getSubGraph(s134);
double f134 = MaxFlowCalculator.getMaxFlow(g134);
System.out.println("max flow on g[1, 3, 4] = " + f134);
System.out.println("flow distribution on g[1, 3, 4] = " +
    g134.getEdges());

To get the subgraph, you must call getSubGraph on the graph object, and pass a set of node ids (1, 3 and 4, in this case). The result we get is:

max flow on g[1, 3, 4] = 5.0
flow distribution on g[1, 3, 4] = [(0, 1) [3.0 / 3.0], (0, 3) [2.0 / 5.0], (1, 3) [0.0 / 4.0], (1, 4) [3.0 / 3.0], (3, 4) [1.0 / 2.0], (3, 6) [1.0 / 1.0], (4, 6) [4.0 / 4.0]]

As expected, the max flow that can be sent over a smaller network is lower. Finally, let’s consider an edge case (the irony), in which no flow at all can be sent between the two nodes. The subgraph induced from node 2 is:

graph-2

In this case, no path from the source node to the sink node exists,  so we expect the flow to be 0. The code is very similar to the previous examples:

Set<Integer> s2 = new HashSet<Integer>();
s2.add(2);
Graph g2 = buildTestGraph().getSubGraph(s2);
double f2 = MaxFlowCalculator.getMaxFlow(g2);
System.out.println("max flow on g[2] = " + f2);
System.out.println("flow distribution on g[2] = " + g2.getEdges());

The output is:

max flow on g[2] = 0.0
flow distribution on g[2] = [(0, 2) [0.0 / 3.0]]

Download source code

 
