Conditional GETs in App Engine
I’m currently working on an app in Google App Engine that polls feeds periodically and then does stuff with them. I suppose I could use PubSubHubbub, but I have a feeling that most feeds don’t support it yet.
Anyway, I did a quick, naive implementation that polls about every hour. The feed parser I’m using seems to be pretty inefficient, because it’s eating up a lot of resources (relatively speaking) on App Engine. Then I remembered that HTTP is pretty smart about this: with a conditional GET you send back the ETag and Last-Modified values from your last fetch, and the server answers 304 Not Modified if nothing has changed since then.
Google’s urlfetch doesn’t seem to support conditional GETs (someone tell me if I am wrong). I looked around and found a few tutorials on how to accomplish this in Python using urllib2. The tutorials weren’t exactly what I wanted, so I had to change a few things here and there. Here’s the snippet of code that I’m using:
    import logging
    import urllib2

    feed = Feed.get()  # my feed object has etag, last_modified and url properties
    req = urllib2.Request(feed.url)
    if feed.etag:
        req.add_header("If-None-Match", feed.etag)
    if feed.last_modified:
        req.add_header("If-Modified-Since", feed.last_modified)

    try:
        url_handle = urllib2.urlopen(req)
    except urllib2.HTTPError, e:
        if e.code != 304:
            raise  # a real error, not "nothing changed"
        logging.info(e)  # 304: the feed hasn't changed, nothing to do
    else:
        content = url_handle.read()
        headers = url_handle.info()
        # remember the validators for the next poll
        feed.etag = headers.getheader("ETag")
        feed.last_modified = headers.getheader("Last-Modified")
        feed.put()
        dostuffwith(content)
This handles my use case: do the work if the feed is new, and skip it if it hasn’t been modified. I could probably wrap this in a function that returns False if the feed hasn’t changed and the content if it’s new… I’ll probably do that next.
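A rough sketch of what that wrapper could look like, under a few assumptions: the name fetch_if_changed and the opener argument are hypothetical (opener only exists so the function can be exercised without a network), and feed is assumed to expose url, etag, last_modified and put() like the Feed object above. The import fallback is there so the same code also runs on Python 3, where urllib2 was split into urllib.request and urllib.error.

```python
try:  # Python 2, as in the snippet above
    import urllib2 as request_lib
    from urllib2 import HTTPError
except ImportError:  # Python 3
    import urllib.request as request_lib
    from urllib.error import HTTPError


def fetch_if_changed(feed, opener=None):
    """Return the feed body, or False if the server answers 304 Not Modified.

    `feed` is assumed to have url, etag, last_modified and put().
    `opener` is a hypothetical test hook; it defaults to urlopen.
    """
    if opener is None:
        opener = request_lib.urlopen

    req = request_lib.Request(feed.url)
    if feed.etag:
        req.add_header("If-None-Match", feed.etag)
    if feed.last_modified:
        req.add_header("If-Modified-Since", feed.last_modified)

    try:
        handle = opener(req)
    except HTTPError as e:
        if e.code == 304:
            return False  # unchanged since the last poll
        raise  # anything else (404, 500, ...) is left to the caller

    content = handle.read()
    headers = handle.info()
    # Remember the validators so the next poll can be conditional.
    feed.etag = headers.get("ETag")
    feed.last_modified = headers.get("Last-Modified")
    feed.put()
    return content
```

Callers should test the result with "is False" rather than plain truthiness, since an empty response body is also falsy:

```python
# content = fetch_if_changed(feed)
# if content is not False:
#     dostuffwith(content)
```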