So, you’ve just put together a really good-looking résumé, saved it out as a “preserve my formatting” PDF file with clickable links, and you’re ready to go job-hunting. You email it out to a bunch of promising companies, not to mention a few recruiting agencies, just in case.
Then, horror of horrors, the previous company you worked for does a “website upgrade”, changing the structure of all their web addresses. Suddenly half of the links in your résumé — which is already in the hands of potential employers — are broken. Dead, link-rotting away. Not a good look for someone who calls himself an “accomplished web developer”.
You kick yourself and wish you’d piped your URLs through some URL redirection service that would let you change where they point later. Happily, this is one of the services DecentURL provides.
Then you think, “Hey, it’d be nice if the redirection service could automatically email me when my links went bad, so I didn’t find out three weeks later from my friend’s cousin’s son.”
But (what a coincidence!) DecentURL’s premium services do that too. I’ve implemented a system that checks your URLs for dead pages every three days, and if any of them are bad, it lets you know.
Soft 404s and cleverly detecting dead pages
It turns out to be non-trivial to detect dead pages. Some web servers, instead of returning 404 Not Found on dead pages, return 200 OK and present you with the home page, or redirect you somewhere else. (I wish we could all just follow the standards.)
Alas. Here I’d thought that checking for dead pages would be this simple:

```python
import urllib2

def is_dead(url):
    try:
        fp = urllib2.urlopen(url)
        fp.read()
        return False
    except urllib2.HTTPError:
        return True
```
So I dreamed up a few ad-hoc ways to try to detect fake error pages (does the URL give me the home page? if so, it’s a bad link), but then I discovered a paper on the web’s decay by some IBM research guys.
Section 3 of that paper calls the fake 200 OK errors “soft 404 pages”, and gives pseudo-code for, and an explanation of, a fairly simple and general algorithm for detecting dead pages.
I’ve turned this into a little Python library, soft404.py. Feel free to use it in your own stuff — though I’d be interested in hearing about what you’re working on if you do.
How it works
Here’s just a quick overview of the algorithm, taken from the comment at the top of my code:
Basically, you fetch the URL in question. If you get a hard 404, it’s easy: the page is dead. But if it returns 200 OK with a page, then we don’t know if it’s a good page or a soft 404. So we fetch a known bad URL (the parent directory of the original URL plus some random chars). If that returns a hard 404, we know the host returns hard 404s on errors, and since the original page fetched okay, it must be good. But if the known bad URL returns a 200 OK as well, we know it’s a host which gives out soft 404s. So then we need to test the contents of the two pages. If the content of the original URL is (almost) identical to the content of the known bad page, the original must be a dead page too. Otherwise, if the content of the original URL is different, it must be a good page.
That’s the heart of it. HTTP redirects complicate things just slightly, but not much. For more info, see my code or read the paper.
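To make the overview above concrete, here’s a minimal sketch of the idea in modern Python (urllib.request rather than the old urllib2). The function names, the random-suffix length, and the difflib similarity threshold are my own illustrative choices, not the actual soft404.py API — see the real library for the full treatment, including redirects.

```python
import difflib
import random
import string
import urllib.error
import urllib.request


def random_sibling_url(url, length=12):
    """Build a URL in the same directory that almost certainly doesn't exist."""
    parent = url.rsplit("/", 1)[0]
    junk = "".join(random.choice(string.ascii_lowercase) for _ in range(length))
    return parent + "/" + junk


def similar(a, b, threshold=0.9):
    """Crude content comparison: are the two pages (almost) identical?"""
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold


def fetch(url):
    """Return (status, body); a 404 comes back as a status, not an exception."""
    try:
        with urllib.request.urlopen(url) as fp:
            return fp.getcode(), fp.read()
    except urllib.error.HTTPError as e:
        return e.code, e.read()


def is_dead(url):
    status, body = fetch(url)
    if status == 404:
        return True  # hard 404: the easy case
    # 200 OK -- could be a good page or a soft 404, so probe a known-bad URL
    junk_status, junk_body = fetch(random_sibling_url(url))
    if junk_status == 404:
        return False  # host gives hard 404s, and our page fetched okay
    # Soft-404 host: dead if the page looks just like the known-bad page
    return similar(body.decode("utf-8", "replace"),
                   junk_body.decode("utf-8", "replace"))
```

The similarity test is the fuzzy part: hosts often stamp the requested URL or a timestamp into their error pages, so an exact byte comparison would miss them, which is why the sketch compares with a ratio rather than equality.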
You’re still reading? Good going. I’d be honoured if you’d sign up for DecentURL’s premium services, which use this algorithm, otherwise just have fun using the code!
25 January 2008 by Ben