Bill Whiteley

First a Simple Explanation

I actually did this homework two times. The first time I used Python like we were told to do in the assignment handout. I was able to make Python work relatively well, but I had many problems when I tried to set up a live recommender on my website. So I finished it in a static manner. The write up is below. Then when I realized we had two weeks instead of one I started looking at it again and with a ton of effort (because I don't or ehh didn't know PHP) I put a live recommender together. The write up for this is below as well. However, the live recommender may have a few bugs. For example, if you enter a URL, you will probably get back that same URL as a recommended link. I simply didn't have time to do all the polishing I would have liked, but in the end it turned out AWESOME!

Homework 5 - Version Two (Live Recommender)

The Short Version

I used PHP to create a live recommender using the data from del.icio.us. I got my start from this website and it pointed me to this code. However, the final implementation was entirely different and accomplished a completely different goal.
PHP Code

A Recommender System Using Del.icio.us Data

The Live Recommender system that I built with PHP works on a very simple algorithm.
  1. Receive an input URL
  2. Request the associated URL tags from del.icio.us
  3. Determine the most common (or least common) tags for the URL
  4. Query del.icio.us using those tags
  5. Rank the URLs with some criteria (they automatically come ranked by relevance)
There are three areas of this algorithm that I played with, in a similar fashion to the first version of this homework. I didn't have time to implement them with user controls on the Live Recommender. The first area is the set of tags that are selected to get the new URLs. If the most common tags are selected then the results tend to be very similar to the submitted URL. In fact, the Live Recommender will nearly always return the input URL as one of its results (need to fix). However, if someone is looking for unique URLs, all that needs to be done is to substitute in some of the less common tags. The results returned still tend to be interesting and useful, but perhaps in a slightly adjacent area to the submitted URL.
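The tag selection step can be sketched in a few lines of Python. This is only an illustration, not the PHP code behind the Live Recommender: the function name `pick_tags` and the shape of `posts` (one list of tags per bookmark of the URL) are assumptions made for the example.

```python
from collections import Counter


def pick_tags(posts, n=5, unique=False):
    """Return n tags for a URL: the most common ones, or the least
    common ones when unique=True.

    `posts` is a list of tag lists, one per bookmark of the URL
    (a hypothetical shape chosen for this sketch).
    """
    counts = Counter(tag.lower() for tags in posts for tag in tags)
    ranked = counts.most_common()  # ordered from most to least frequent
    if unique:
        ranked = ranked[::-1]      # flip so the rarest tags come first
    return [tag for tag, _ in ranked[:n]]
```

Calling `pick_tags(posts)` gives the tags for "similar" results, and `pick_tags(posts, unique=True)` gives the tags for "unique" results.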

A second way to vary the tags is to change the operator between the tag words. By default the Live Recommender puts an OR operator between the tags, and changing some of the operators to an AND greatly affects the results. When an AND is added to less common tags the results tend to be more unique and slightly less relevant, and of course adding an AND operator to the most common tags reinforces the similarity of the results to the input.
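To make the OR versus AND difference concrete, here is a small Python sketch that applies the two operators to bookmarks that have already been fetched. It does not reproduce the del.icio.us query syntax; `filter_by_tags` and the `bookmarks` mapping (URL to list of tags) are hypothetical names used only for illustration.

```python
def filter_by_tags(bookmarks, tags, operator="OR"):
    """Return URLs whose tags match `tags` under OR (any tag present)
    or AND (all tags present).

    `bookmarks` maps URL -> list of tags (hypothetical, already fetched).
    """
    wanted = {t.lower() for t in tags}
    matches = []
    for url, url_tags in bookmarks.items():
        have = {t.lower() for t in url_tags}
        if operator == "AND":
            ok = wanted <= have       # every wanted tag is present
        else:
            ok = bool(wanted & have)  # at least one wanted tag is present
        if ok:
            matches.append(url)
    return matches
```

With AND the result set can only shrink or stay the same compared to OR, which is why it sharpens whatever the chosen tags already emphasize.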

The final way to manipulate the algorithm is to look at the data returned with the result URLs from del.icio.us. The data contains the number of people that have linked to the particular web site. This information could be used to rank the resulting URLs. The more popular URLs tend to point to mainstream web sites, and the sites bookmarked by fewer people tend to be more on the unique side.
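A ranking along these lines might look like the following Python sketch. The dictionary keys `"url"` and `"saves"` are assumptions for the example and are not the field names del.icio.us actually returns.

```python
def rank_by_popularity(results, mainstream_first=True):
    """Sort results by how many people saved each URL.

    Each result is a dict with the hypothetical keys "url" and "saves".
    mainstream_first=True puts the most saved URLs first;
    False puts the least saved (more unique) URLs first.
    """
    ordered = sorted(results, key=lambda r: r["saves"], reverse=mainstream_first)
    return [r["url"] for r in ordered]
```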

Homework 5 - Version One (Static)

The Short Version

The long version of my homework is below. Here is the short version:
The input link for the recommender system was www.weigend.com
Python Code
List One - Used tag word matching with high occurrence tag words - Useful Links - The best list IMHO
List Two - Used tag word matching with low occurrence tag words - Unique Links

A Recommender System for Del.icio.us

The task of this homework assignment was to build a system to recommend bookmarks using del.icio.us given a specific URL. I originally intended to write a simple web app that accomplished this dynamically and displayed a list of 25 suggested links for a given URL entered into a search box. However, very early on I discovered that del.icio.us is very aggressive about throttling its API users, blocking an IP address for about two hours if it has an application querying the site too often. They do this even if you follow their rules of waiting one second between queries. Thus I abandoned the small web app idea.

Instead of building a web application, I wrote some code in Python that queries del.icio.us and saves the output to a file. This reduced the number of required queries to del.icio.us and the subsequent 503 errors. The real point of this homework was to explore different ideas and techniques one would use to build a useful recommender application.
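The "query once, save to a file" idea can be sketched roughly as below. This is not the original homework code: the class name `ThrottledCache`, the JSON cache file, and the `fetch` callable (whatever function actually talks to del.icio.us) are all assumptions for the example. The one-second default mirrors the wait between queries mentioned above.

```python
import json
import os
import time


class ThrottledCache:
    """Wrap a fetch function so repeated keys are read from a file
    and real requests are spaced at least `min_interval` seconds apart."""

    def __init__(self, fetch, cache_path, min_interval=1.0):
        self.fetch = fetch              # hypothetical: key -> JSON-serializable result
        self.cache_path = cache_path
        self.min_interval = min_interval
        self.last_call = None
        self.cache = {}
        if os.path.exists(cache_path):
            with open(cache_path) as f:
                self.cache = json.load(f)

    def get(self, key):
        if key in self.cache:           # already saved: no request needed
            return self.cache[key]
        if self.last_call is not None:
            wait = self.min_interval - (time.monotonic() - self.last_call)
            if wait > 0:
                time.sleep(wait)        # respect the minimum gap between requests
        result = self.fetch(key)
        self.last_call = time.monotonic()
        self.cache[key] = result
        with open(self.cache_path, "w") as f:
            json.dump(self.cache, f)    # persist so later runs skip the request
        return result
```

Because the cache lives on disk, rerunning the script while experimenting with ranking ideas costs no additional requests for tags that were already fetched.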
 

List One, List Two

List one was created focusing on the tag words associated with posts on del.icio.us. It was actually the last list that I did but was my favorite so I called it list one. Basically the steps to create this list are as follows:
  1. Get all the posts for a given URL 
  2. Get the top five most used keywords for the given URL
  3. Query del.icio.us for the most recent URLs for each of the top five tags and store in a file
  4. Sort the URLs based on occurrences of the tag words
  5. Return the top 25 URLs that do not match the input URL
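Steps 4 and 5 above can be sketched as follows. The sketch assumes "occurrences of the tag words" means the number of top tags under which a URL showed up, which is one plausible reading and not necessarily what the original Python code did. The names `recommend`, `normalize`, and `urls_by_tag` are made up for the example, and the URL normalization is a deliberately crude comparison aid.

```python
from collections import Counter


def normalize(url):
    """Crude normalization for comparing URLs (an assumption of this
    sketch): lowercase, drop the scheme and any trailing slash."""
    url = url.lower().rstrip("/")
    for prefix in ("http://", "https://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
    return url


def recommend(urls_by_tag, input_url, limit=25):
    """Rank URLs by how many of the chosen tags they appeared under,
    drop the input URL, and return the top `limit`.

    `urls_by_tag` maps tag -> list of URLs fetched for that tag.
    """
    hits = Counter()
    for urls in urls_by_tag.values():
        for url in dict.fromkeys(urls):  # count each URL once per tag
            hits[url] += 1
    target = normalize(input_url)
    ranked = [url for url, _ in hits.most_common() if normalize(url) != target]
    return ranked[:limit]
```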
The results of list one are pretty good in my opinion. There are six stories that either feature or include Andreas Weigend including a YouTube.com video of him. Some of the stories are interviews and most of them are in German. The rest of the links, while not including Andreas content, are relatively useful. They contain mostly web technology content with an emphasis on web2.0 and twitter.
The homework wanted us to explore systems that recommend the most likely URLs and systems that recommend the most unique URLs. With this algorithm you can add uniqueness by changing the tags used in step two above. Instead of using the top five most used tag words, you can introduce uniqueness by replacing often used tag words with rarely used ones. You can scale the uniqueness anywhere from replacing only a few of the most used words all the way to using only rarely used tag words.
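One way to picture this uniqueness dial is a function that swaps a chosen number of the top tags for the least used ones. This is a sketch under assumptions: `mixed_tags`, `unique_count`, and the `tag_counts` mapping (tag to number of uses for the input URL) are hypothetical names, not part of the original code.

```python
from collections import Counter


def mixed_tags(tag_counts, n=5, unique_count=0):
    """Return n tags: the most used ones, with `unique_count` of them
    replaced by the least used tags.

    `tag_counts` maps tag -> number of times it was used for the URL.
    unique_count=0 gives only common tags; unique_count=n gives only rare ones.
    """
    ranked = [tag for tag, _ in Counter(tag_counts).most_common()]
    unique_count = max(0, min(unique_count, n))  # clamp to the valid range
    common = ranked[:n - unique_count]
    rare = [t for t in reversed(ranked) if t not in common][:unique_count]
    return common + rare
```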

For the second list I used the same algorithm as above but chose to use the least used tag words for the input URL. This list was a little less useful considering it was based on www.weigend.com, which is the homepage of Andreas Weigend. It contained no Andreas content but was still largely made up of web technology articles, once again including web2.0 and twitter.

Final Thoughts

Overall this was a pretty cool thing to try. I am disappointed in the del.icio.us API being so sensitive to query frequency. I completely understand why, but am disappointed nonetheless. I initially had much grander ranking schemes, but they required far more data than I could obtain without getting blocked. With all that aside, I have to say, Good assignment!