Link rot, soft 404s, and DecentURL

Go straight to the “soft 404” detector code.

The problem

So, you’ve just put together a really good-looking résumé, saved it out as a “preserve my formatting” PDF file with clickable links, and you’re ready to go job-hunting. You email it out to a bunch of promising companies, not to mention a few recruiting agencies, just in case.

Then, horror of horrors, the previous company you worked for does a “website upgrade”, changing the structure of all their web addresses. Suddenly half of the links in your résumé — which is already in the hands of potential employers — are broken. Dead, link-rotting away. Not a good look for someone who’d called himself an “accomplished web developer”.

A solution

You hit yourself and wish you’d piped your URLs through a URL redirection service that lets you change where they point later. Happily, this is one of the services DecentURL provides.

Then you think, “Hey, it’d be nice if the redirection service could automatically email me when my links went bad, so I didn’t find out three weeks later from my friend’s cousin’s son.”

But (what a coincidence!) DecentURL’s premium services do that too. I’ve implemented a system that checks your URLs for dead pages every three days, and if any of them are bad, it lets you know.

Soft 404s and cleverly detecting dead pages

It turns out to be non-trivial to detect dead pages. Some web servers, instead of returning Not Found on dead pages (the 404 error code), return OK (200) and present you with the home page, or redirect you somewhere else. (I wish we could all just follow the standards.)

Alas. Here I’d thought that checking for dead pages would be this simple:

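import urllib2

def is_dead(url):
    # Naive check: any HTTP error status (404, 500, ...) counts as dead.
    try:
        fp = urllib2.urlopen(url)
        fp.read()
        # The server said 200 OK, but this could still be a soft 404.
        return False
    except urllib2.HTTPError:
        return True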

So I dreamed up a few ad-hoc ways to try and detect fake error pages (does the URL give me the home page? if so, it’s a bad link), but then I discovered a paper on the web’s decay by some IBM research guys.

Section 3 calls the fake 200 OK errors “soft 404 pages”, and gives pseudo-code for, and an explanation of, a fairly simple and general algorithm for detecting dead pages.

I’ve turned this into a little Python library, soft404.py. Feel free to use that in your own stuff, though I’d be interested in hearing about what you’re working on if you do.

How it works

Here’s just a quick overview of the algorithm, taken from the comment at the top of my code:

Basically, you fetch the URL in question. If you get a hard 404, it’s easy: the page is dead. But if it returns 200 OK with a page, then we don’t know if it’s a good page or a soft 404. So we fetch a known bad URL (the parent directory of the original URL plus some random chars). If that returns a hard 404 then we know the host returns hard 404s on errors, and since the original page fetched okay, we know it must be good.
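
As a rough illustration of the “known bad URL” step, here is a minimal sketch that builds one from the parent directory of the original URL plus random characters. It is written for Python 3 (`urllib.parse`). The function name `make_probe_url`, the `rng` parameter, and the 25-character length are illustrative assumptions, not something taken from soft404.py or the paper.

```python
import random
import string
from urllib.parse import urlsplit, urlunsplit


def make_probe_url(url, length=25, rng=random):
    """Return a URL in the same directory as `url` that almost certainly does not exist.

    Hypothetical helper for illustration; name and defaults are assumptions.
    """
    parts = urlsplit(url)
    # Keep everything up to and including the last "/" (the parent directory).
    parent = parts.path.rsplit("/", 1)[0] + "/"
    # Random lowercase letters make a collision with a real page very unlikely.
    junk = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
    # Drop the query string and fragment; they belong to the original page.
    return urlunsplit((parts.scheme, parts.netloc, parent + junk, "", ""))
```

For a URL such as `http://example.com/docs/page.html`, this returns `http://example.com/docs/` followed by 25 random letters.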

But if the known dead URL returns a 200 OK as well, we know it’s a host which gives out soft 404s. So then we need to test the contents of the two pages. If the content of the original URL is (almost) identical to the content of the known bad page, the original must be a dead page too. Otherwise, if the content of the original URL is different, it must be a good page.
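
The decision logic described above might be sketched like this in Python 3. This is not the code from soft404.py: the names `fetch_page` and `is_dead`, the `probe_url` argument (a URL on the same host that is assumed not to exist), and the 0.9 threshold are all assumptions, and `difflib.SequenceMatcher` is only a stand-in for whatever “(almost) identical” test the paper or the library really uses. Network failures and error codes other than 404 are not handled.

```python
import difflib
import urllib.error
import urllib.request


def fetch_page(url):
    """Return (status_code, body_text); HTTP errors are returned, not raised."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.getcode(), resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        return err.code, ""


def is_dead(url, probe_url, fetch=fetch_page, threshold=0.9):
    """Classify `url` as dead (True) or alive (False).

    `probe_url` must be a URL on the same host that is known not to exist.
    `threshold` is an arbitrary similarity cut-off (an assumption).
    """
    status, body = fetch(url)
    if status == 404:
        return True   # hard 404: the easy case
    probe_status, probe_body = fetch(probe_url)
    if probe_status == 404:
        return False  # host returns hard 404s, so the original page is good
    # The host hands out soft 404s: compare the page with the known-bad one.
    similarity = difflib.SequenceMatcher(None, body, probe_body).ratio()
    return similarity >= threshold
```

Passing the fetcher in as a parameter keeps the decision logic separate from the network, so it can be exercised with canned responses.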

That’s the heart of it. HTTP redirects complicate things just slightly, but not much. For more info, see my code or read the paper.

The end

You’re still reading? Good going. I’d be honoured if you’d sign up for DecentURL’s premium services, which use this algorithm. Otherwise, just have fun using the code!

25 January 2008 by Ben    3 comments

3 comments and pings (oldest first)

Paul 27 Jan 2008, 03:40

This is similar to functionality used by this firefox extension: http://www.openly.com/linkevaluator/ (fair warning: I work for that company, though I had nothing to do with creating that software).

The key difference is that extension uses green/red flag phrases to determine the ‘goodness’ of a link. For the resume example this is simple. Set a green flag phrase to your name. Since a soft404 is unlikely to return a page about you (on a corporate website), finding your name in the content of the page is a good sign that you have the right page.

The downside to this approach is that picking meaningful green and red flag phrases is something that requires a human — or at least a lot of statistical data and some analysis.

We do, of course, have a real application defined around a similar purpose, which compares links found on the same host (though for different content) and ranks them accordingly to indicate which is likely to be a soft404 error. That process is almost exactly what you describe here except we include a few items like page encoding (which turns out to be surprisingly useful) and response time in our comparison. Since we are primarily focused on scholarly linking, the task is somewhat easier than dealing with all links on the internet, so we are able to manually gather phrases for many of the sites we are going to need to check.

Paul

Ben 28 Jan 2008, 20:17

Hey, that’s pretty interesting, Paul. And a good run-down of a good-looking Fx extension to boot. :-) Thanks for the heads-up on page encoding being useful.

[…] How To Catch “soft” 404 Errors Some sites don’t return the HTTP 404 response when a nonexistent page is requested. Instead, they show something else, e.g. a search page or suggestions based on the URL. This is called a “soft” 404. This article explains how you can detect this kind of 404 programmatically. […]