After weeks of fighting with insufficient memory and massive processing time, the first analyzed cluster is finally up. While the results are not too impressive, I think it's a fair result for a first pre-alpha run.

The current content filtering and semantic aggregation / clustering is primarily based on Markov Clustering. Unfortunately, I'm going to have to move away from Markov Clustering simply because of its quadratic use of memory. The primary issue is that Markov Clustering requires adjacency matrices, and the size of an adjacency matrix grows with the square of the number of vertices. While there is great power in adjacency matrices, they are simply not scalable with respect to large graphs.
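For readers who have not seen it, here is a minimal sketch of the textbook Markov Clustering loop (expansion followed by inflation) on a dense adjacency matrix. It is not the code behind this cluster. The function names, the inflation value of 2.0 and the fixed iteration count are all assumptions chosen for illustration. It does show why the whole n x n matrix has to sit in memory: every expansion step multiplies the matrix by itself.

```python
def normalize_columns(matrix):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    result = [row[:] for row in matrix]
    for col in range(n):
        total = sum(matrix[row][col] for row in range(n))
        if total > 0:
            for row in range(n):
                result[row][col] = matrix[row][col] / total
    return result


def expand(matrix):
    """Expansion step: multiply the matrix by itself."""
    n = len(matrix)
    return [
        [sum(matrix[i][k] * matrix[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def inflate(matrix, power):
    """Inflation step: raise every entry to a power, then renormalize."""
    return normalize_columns([[value ** power for value in row] for row in matrix])


def markov_cluster(adjacency, inflation=2.0, iterations=50):
    """Return clusters as sorted lists of vertex indices (illustrative only)."""
    n = len(adjacency)
    # Add self-loops, as is customary, then make the matrix column-stochastic.
    matrix = normalize_columns(
        [[float(adjacency[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    )
    for _ in range(iterations):
        matrix = inflate(expand(matrix), inflation)
    # Every row that still carries weight lists the members of one cluster.
    clusters = set()
    for row in matrix:
        members = frozenset(j for j, value in enumerate(row) if value > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(cluster) for cluster in clusters)


# Two disconnected triangles: vertices 0-2 and vertices 3-5.
graph = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
print(markov_cluster(graph))
```

Even in this toy version, both the input and the result of each expansion are full n x n tables, which is exactly where the memory goes.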

The following list highlights the memory issues:

  • Graph with 10 vertices = adjacency matrix of 10 x 10 = 100 entries
  • Graph with 100 vertices = adjacency matrix of 100 x 100 = 10,000 entries
  • Graph with 10,000 vertices = adjacency matrix of 10,000 x 10,000 = 100,000,000 entries
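The growth in the list above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the function name `adjacency_entries` is made up for illustration.

```python
def adjacency_entries(vertices: int) -> int:
    """Number of cells in a dense vertices x vertices adjacency matrix."""
    return vertices * vertices


for vertices in (10, 100, 10_000):
    print(f"{vertices:>6,} vertices -> {adjacency_entries(vertices):>11,} cells")
```

Multiplying the number of vertices by 10 multiplies the size of the matrix by 100.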

This first cluster consists of just over 65,000 vertices (or content references, if you wish). In addition, it presented just under 420 million edges (or semantic relationships, if you wish). The edges themselves were not much of a challenge, as these can be aggregated / simplified. However, Markov Clustering required an adjacency matrix of 4.2 billion entries to analyze just over 65,000 content references. Translated into memory usage, this equates to an array of 8.2 billion integers which, assuming 4 bytes per integer, comes to a minimum of 32 GB of RAM plus overheads.
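As a rough check of these figures, the sketch below squares the vertex count and converts the quoted integer count into bytes. It takes the numbers in this post as given and assumes 4-byte (32-bit) integers; the helper names are hypothetical.

```python
BYTES_PER_INTEGER = 4  # assumption: 32-bit integers


def matrix_entries(vertices: int) -> int:
    """Cells in a dense adjacency matrix for the given number of vertices."""
    return vertices * vertices


def array_bytes(integers: int, bytes_per_integer: int = BYTES_PER_INTEGER) -> int:
    """Raw size in bytes of an integer array, ignoring any overheads."""
    return integers * bytes_per_integer


entries = matrix_entries(65_000)          # vertex count rounded down from the post
array_size = array_bytes(8_200_000_000)   # integer count quoted in the post

print(f"{entries:,} matrix entries")
print(f"{array_size / 1e9:.1f} GB for the integer array")
```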

In theory, I could have used a delegate-based graph using cached SQL for persistence, or a memory-mapped file for the adjacency matrix. However, the latency of disk I/O would simply blow out processing time to weeks, if not months.
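To make the memory-mapped option concrete, here is a minimal sketch using Python's standard `mmap` and `struct` modules. The class name `MappedAdjacencyMatrix`, the file name and the 4-byte entry format are assumptions for illustration, not something I actually ran against the cluster. It keeps the matrix on disk and lets the operating system page cells in and out, which is precisely where the disk latency would come from.

```python
import mmap
import os
import struct
import tempfile

ENTRY = struct.Struct("<i")  # assumption: one little-endian 32-bit integer per cell


class MappedAdjacencyMatrix:
    """Dense n x n adjacency matrix stored in a memory-mapped file (illustrative)."""

    def __init__(self, path, n):
        self.n = n
        size = n * n * ENTRY.size
        self._file = open(path, "w+b")
        self._file.truncate(size)  # reserve the full matrix on disk, zero-filled
        self._map = mmap.mmap(self._file.fileno(), size)

    def _offset(self, row, col):
        # Row-major layout: cell (row, col) lives at a fixed byte offset.
        return (row * self.n + col) * ENTRY.size

    def get(self, row, col):
        return ENTRY.unpack_from(self._map, self._offset(row, col))[0]

    def set(self, row, col, weight):
        ENTRY.pack_into(self._map, self._offset(row, col), weight)

    def close(self):
        self._map.close()
        self._file.close()


# Usage: a small 100 x 100 matrix in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "adjacency.bin")
matrix = MappedAdjacencyMatrix(path, 100)
matrix.set(3, 7, 1)
print(matrix.get(3, 7), os.path.getsize(path))
matrix.close()
```

The file still needs n x n cells, so this only moves the problem from RAM to disk.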

As you can see, Markov Clustering, or any other means of cluster analysis based on adjacency matrices, is simply not scalable.

Finally, the content references were processed in 10 different partitions based on various adjustments of the algorithm parameters. These partitions are listed below.

Semantic Collated Information and Resources Directory on the topic of MBA or Masters in Business Administration:

Hope the information resources in this directory are useful to those researching an MBA.

That's it for now, until the next version of the algorithms.