Semantic Relational Content Analysis

About this project:

This is my latest project, and unlike my previous research projects I’ve decided that this one will fund itself, mostly because I’m sick and tired of draining my bank account in the name of research. So why run ads? Why not ask for donations or release a shareware tool? Well, the answer is simple. Been there, done that:

  1. Releasing stuff for free and asking for donations: Out of a thousand people who use the stuff for free, perhaps one will make a small donation of a few dollars. Been there, done that, and it is not sustainable.
  2. Releasing shareware tools: It’s only a matter of time until some group of teenagers rips the tools and releases them via torrents and RapidShare. It’s really not worth it.

Why does this project need to fund itself? This research project requires rather hefty multi-processor servers. I’m currently running a leased 16-core server for analysis and my own quad-core server for coding. However, that is but a small fraction of what I really require. All in all, this project is currently costing me about 2 grand U.S. per month plus oodles of my time. At present 99% of that is coming out of my pocket. Over time, as the software improves, I expect that it should end up paying for itself (and hopefully for the extra infrastructure that I need).

Definition of the project:

Search engines are brilliant for finding content references (or annotations) based on keyword search patterns. Here by content annotation I mean a short snippet summarizing the content, with a reference link to the full document elsewhere. However, search engines are generalized systems which need to cater to a broad audience, and therefore their algorithms cannot be optimized for any particular content theme. In addition, search engines are required to analyze billions of web pages, so human input is simply not an option. As a result there is a significant difference between a search results page and a cluster of related content.

Industry application of the concept:

The most obvious industry application is in terms of content indexation, categorization and search for the likes of intranets, knowledge bases, and corporate document repositories.

The difference between human and computer:

Human analysis, categorization and clustering of content based on semantic relationships is only possible when the volume of content snippets is small. For example, consider that a list of 5000 content snippets within a given theme will have on the order of 10,000,000 individual relationships between each other (5000 × 4999 / 2, or roughly 12.5 million distinct pairs, to be precise). As a matter of fact, the number of snippet to snippet relationships increases quadratically with the volume of content snippets: doubling the number of snippets roughly quadruples the number of relationships. Therefore we could assume that this task might be best suited for a machine.
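To make the growth concrete, here is a minimal Python sketch that counts the distinct snippet to snippet pairs for a few list sizes. It assumes every unordered pair of snippets counts as one relationship; the function name pair_count is made up for illustration.

```python
def pair_count(n: int) -> int:
    """Number of distinct unordered pairs among n snippets: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Each tenfold increase in snippets gives roughly a hundredfold increase in pairs.
for n in (50, 500, 5000):
    print(f"{n:>5} snippets -> {pair_count(n):>10,} pairs")
```

For 5000 snippets this gives 12,497,500 pairs, which is where the "order of 10,000,000" figure comes from.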

Why would a machine fail in this context?

Unfortunately, it is often impossible for an algorithm to tell the meanings apart when the same term has multiple distinct meanings. For example, consider “Oil Mist”. Oil Mist has at least two distinct meanings.

1) Oil Mist based cosmetics products. For example Olive Dry Oil Mist

2) Oil Mist industrial filtration. For example Oil Mist Collectors

As you can see, these are two completely contrasting meanings of the same term, Oil Mist. Including both meanings in a single content cluster would result in a pile of worthless gibberish.

What is wrong with using the complete text for analysis?

Designing an algorithm to filter content where the entire text of the content is available is very feasible (although not a simple task). Unfortunately, processing content in its entirety is cost prohibitive in terms of computational resources (unless of course a supercomputer is added to the mix). For example, complete NLP based analysis of 5000 content pages would easily keep a quad processor server busy for a week or two. Add to this the fact that the majority of content repositories, such as knowledge bases, intranets and document repositories, tend to change quite frequently, so by the time the analysis finishes the index is already a week or more out of date. It should be obvious that providing the ability to search and find content one week in arrears is usually quite unacceptable. Finally, the complete NLP analysis of a large content repository will often require heavy use of very large databases in the range of 50 to 100 GB. Such databases require massive database servers.

The problem:

Hence, given the computational resources readily available to the average individual or organization, it is not usually feasible to implement fully automated algorithmic indexation, categorization and search functionality based on complete natural language processing of content in its entirety.

On the other hand, for all practical purposes it is not feasible for humans to perform the same task either. Let's assume that a human can evaluate 6 content to content relationships per minute and is able to work for 8 hours non stop each day. That is 6 × 60 × 8 = 2,880 relationships per person per day. The task is to fully evaluate the 10,000,000 (approx) distinct relationships within a cluster of 5000 content snippets (references, annotations etc.). That comes to roughly 3,500 person-days, or about 350 working days each for a team of 10. As you can see, this would take 10 people over a year.
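The arithmetic behind that estimate can be checked with a short Python sketch. The rates (6 relationships per minute, 8 hours per day, 10 people) are the assumptions stated above; the function name days_per_person is hypothetical.

```python
def days_per_person(relationships: int,
                    people: int = 10,
                    per_minute: int = 6,
                    hours_per_day: int = 8) -> float:
    """Working days each person needs if the relationships are split evenly."""
    per_person_per_day = per_minute * 60 * hours_per_day  # 2,880 with the defaults
    return relationships / (per_person_per_day * people)


# Rounded figure used in the text, then the exact pair count for 5000 snippets.
for total in (10_000_000, 5000 * 4999 // 2):
    print(f"{total:>10,} relationships -> {days_per_person(total):.1f} days each for 10 people")
```

With these assumptions the rounded 10,000,000 figure needs about 347 working days per person, and the exact count of 12,497,500 pairs needs about 434. Either way that is more working days than a normal year contains.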

The solution:

The solution, I believe, is a hybrid between the two: computer assisted human semantic categorization, relational grouping and sorting of content based on relevance to a given theme.
