
Free Online CS Courses Taught by Stanford Faculty Q1 2012

Free Online Courses

I’m currently taking the free Intro to Databases course taught by the excellent Professor Widom. The course consists of video lectures with homework exercises, quizzes, and midterms. There is also a helpful Q&A forum for when you get stuck.

At the beginning of 2012, there are more free online courses available, including:

The classes are high quality and a great fit if you can spend more than a few hours a week per course. As they are free, the courses don’t offer any official Stanford certificates, grades, or credit, but you do get a Statement of Accomplishment from the instructor.

I’d recommend these courses to anyone as they are free, and any investment of your time will be easily rewarded with an understanding of increasingly relevant topics.

Free Stanford Intro to Databases course

Starting October 10th, 2011:

A bold experiment in distributed education, “Introduction to Databases” is being offered free and online to students worldwide, October 10 – December 12, 2011. Students have access to lecture videos, are given assignments and exams, receive regular feedback on progress, and participate in a discussion forum. Those who successfully complete the course will receive a statement of accomplishment. Taught by Professor Jennifer Widom, the curriculum draws from Stanford’s popular Introduction to Databases course.

If you’re interested in software or technology, databases are incredibly relevant to modern-day computing. This free course may be technically challenging, but it will reward you with insights that you will be able to apply elsewhere.

URL Duplication

Let me start off by saying that there may be “the right way” to do this, but I have not found it yet. A site such as Digg would have more advanced techniques for dealing with this.

In my scenario, a user enters a URL and results are presented. Ideally, each unique URL shows up only once in the database. That way, no matter how the URL is entered, the user ends up at the correct landing page unique to that URL, without duplicated entries in the database. Duplicate entries make the system less efficient, as time is spent across multiple instances of the same URL rather than a single location.

For example, I may enter google.com while another person enters http://www.google.com/. The two inputs refer to the same page, but the strings are not the same. With many websites opting to remove the www. prefix through server-side scripting, this can start to get tricky. The leading elements to handle are typically http:// and www.

Some websites use a subdomain and do not accept a www. prefix; http://ps3.ign.com is one such example.

Another source of duplication is appended modifiers. With a URL such as http://www.nytimes.com/2010/01/09/nyregion/09gis.html?partner=rss&emc=rss, the ?partner=rss&emc=rss part is not necessary for a user to view the page, yet it can cause duplication in the database.
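As a rough sketch of the kind of normalization I have in mind (the helper name is my own, and dropping the query string wholesale is a simplification, since some pages genuinely need their parameters):

<?php
// Rough normalization helper (not a library function). Assumes http:// when
// the scheme is missing, lower-cases the host, drops a leading "www.",
// and strips the query string and any trailing slash.
function normalize_url($input) {
    $input = trim($input);
    if (!preg_match('#^https?://#i', $input)) {
        $input = 'http://' . $input; // assume http:// when the scheme is omitted
    }
    $parts = parse_url($input);
    if ($parts === false || empty($parts['host'])) {
        return null; // not something we can canonicalize
    }
    $host = strtolower($parts['host']);
    if (substr($host, 0, 4) === 'www.') {
        $host = substr($host, 4);
    }
    $path = isset($parts['path']) ? rtrim($parts['path'], '/') : '';
    // The query string (e.g. ?partner=rss&emc=rss) is deliberately dropped here.
    return $host . $path;
}

// Both inputs collapse to "google.com"
echo normalize_url('google.com'), "\n";
echo normalize_url('http://www.google.com/'), "\n";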

Unfortunately, I have come to accept that duplicate entries are inevitable and a fact of life. As such, my goal is to prevent and fix duplicates where possible, not to eliminate them entirely.

The way I addressed this was to look up variants of the user’s input. Extra energy spent? Yes. Duplicates reduced? Hopefully.

The preventative measure:
So for a given input, I concatenate several strings and check for matches. The script builds a mix of variants: http://www. in front, http:// in front, and / at the end. These are all run through matches against the relevant table column. If any of them returns a positive, exact match, I route the request to that existing entry accordingly.
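A minimal sketch of that variant lookup, assuming a PDO connection ($db) and a links table with a url column (both names are placeholders of mine):

<?php
// Sketch of the variant lookup described above. The "links" table and
// "url" column are hypothetical names.
function find_existing($db, $input) {
    // Strip any scheme, leading www., and trailing slash to get a bare form.
    $bare = preg_replace('#^https?://(www\.)?#i', '', rtrim(trim($input), '/'));

    // The mix of prefixes and suffixes to test against the table column.
    $variants = array(
        $bare,
        $bare . '/',
        'http://' . $bare,
        'http://' . $bare . '/',
        'http://www.' . $bare,
        'http://www.' . $bare . '/',
    );

    $placeholders = implode(',', array_fill(0, count($variants), '?'));
    $stmt = $db->prepare("SELECT id, url FROM links WHERE url IN ($placeholders) LIMIT 1");
    $stmt->execute($variants);
    return $stmt->fetch(PDO::FETCH_ASSOC); // false when no variant matches
}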

The remedial measure:
To deal with a duplicate entry already in the table, I created an additional column in the database. By default, this redirect value is null. If the value is set, the routing script redirects any request for the duplicated page to the entry it points to.
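Sketched out, again with the same hypothetical links table plus a nullable redirect_id column of my own naming, the routing check might look like this:

<?php
// Sketch of the redirect check: a duplicate row carries the id of the
// canonical entry in its nullable "redirect_id" column.
function resolve_link($db, $id) {
    $stmt = $db->prepare('SELECT id, url, redirect_id FROM links WHERE id = ?');
    $stmt->execute(array($id));
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    $stmt->closeCursor();

    // If a redirect is set, hop over to the canonical row before serving.
    if ($row && $row['redirect_id'] !== null) {
        $stmt->execute(array($row['redirect_id']));
        $row = $stmt->fetch(PDO::FETCH_ASSOC);
    }
    return $row;
}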

With any given URL, duplicates are highly likely. Any URL may have several variants that are all valid for use (google.com vs. http://www.google.com), many pages have appended $_GET values (such as ?partner=rss&emc=rss), and the recent mass resurgence of URL shortening services (bit.ly) adds another layer of URLs that all redirect to the same page.
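In principle, the shortener layer could be collapsed the same way: follow the redirect chain and store the final destination instead of the short link. A rough cURL sketch, with a made-up function name and illustrative settings:

<?php
// Illustrative only: follow a shortener's redirects and keep the final URL,
// so bit.ly-style links map to the page they actually point at.
function expand_url($short) {
    $ch = curl_init($short);
    curl_setopt($ch, CURLOPT_NOBODY, true);          // headers only, no body
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // chase 301/302 redirects
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_exec($ch);
    $final = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    curl_close($ch);
    return $final;
}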

It seems to me that duplicated URLs are inevitable in any sufficiently large data set where each URL is intended to represent a unique page.