Parsing dates for web feeds in .NET

Posted on 6/26/2008

During the initial development of FeedFly last year, I was exposed to the evils of date parsing for web feeds.  While working on release 0.2.1, I revisited this issue when fixing a date parsing bug. Before I forget all of this, it helps to write it down.

Feed formats and dates

If you decide to start syndicating a web feed these days, you're basically limited to the two most popular feed formats: RSS or Atom. RSS seems to have the upper hand right now in terms of popularity, but some believe Atom will eventually take over because it's being pushed as an official standard. For the purposes of a web feed, both of these formats need to be able to store and transmit dates and times. Each format has its own date and time format, however.

RSS

RSS feeds are supposed to use the date format outlined in section 5 of RFC 822 (published in 1982). Here are some example RFC 822 dates:

Wed, 28 May 08 07:00:00 EDT
Thu, 10 Apr 08 02:30:00 UT
Tue, 10 Jun 08 05:15:00 -0600
Thu, 17 Apr 08 22:45:00 GMT

In 1989, RFC 1123 updated the RFC 822 format to use 4-digit years (section 5.2.14). It also recommended that all dates either be limited to GMT (universal time) or include a numerical offset. So, for RFC 1123's purposes, only the last two dates in the above list use a recommended time zone notation, and even those would need 4-digit years. But it's supposed to be backwards compatible. Don't take my word for it, though:

There is a strong trend towards the use of numeric timezone indicators, and implementations SHOULD use numeric timezones instead of timezone names. However, all implementations MUST accept either notation. If timezone names are used, they MUST be exactly as defined in RFC-822.

I interpret this to mean that while RFC 1123 strongly encourages you to use numerical offsets or GMT, you could ignore the recommendation and still be in compliance. Because of this, the following dates should be considered to be in both RFC 822 and RFC 1123 format. However, only the bottom 2 are recommended by RFC 1123.

Wed, 28 May 2008 07:00:00 EDT
Thu, 10 Apr 2008 02:30:00 UT
Tue, 10 Jun 2008 05:15:00 -0600
Thu, 17 Apr 2008 22:45:00 GMT

Researching the history of RSS led to some dark places on the web. This format was evidently born out of a lot of heated "collaboration". All versions of RSS history (controversial or not) I could dig up stated it began in 1997. The reason I looked up the history was to figure out why its flavor of RFC 822 did not follow the recommendations in RFC 1123, or make any references to it. Well, I didn't find an answer for that, and it's even more perplexing given that the HTTP specification was recommending RFC 1123 exclusively as early as 1996. Here's a revealing excerpt:

Sun, 06 Nov 1994 08:49:37 GMT ; RFC 822, updated by RFC 1123

... [this date format] is preferred as an Internet standard and represents a fixed-length subset of that defined by RFC 1123 [6] (an update to RFC 822 [7]).

Why am I spending so much precious time on this?  .NET only parses RFC 822 dates according to RFC 1123's recommendations, that's why. They (the .NET team) are in safe territory here (I guess) since they can simply point to HTTP's exclusive use of RFC 1123, but more on this later.

Atom

The Atom format specifies that its dates must conform to RFC 3339, which defines a simple profile of the ISO 8601 date format. The RFC number alone tells you it's more recent. Published in 2002, RFC 3339 is 20 years younger than RFC 822, so you'd expect it to include some "lessons learned" from previous date formats.

Here is an example date in RFC 3339 format: 1985-04-12T23:20:50.52Z

The benefit of this date format is that it's more efficient to parse from a computer's perspective.  It does this at the cost of being less user friendly, since we now don't know what day of the week it is or what time zone offset it was created in. Of course, modern software frameworks  like .NET or Java can easily produce most of this user friendly information for you after calling upon their rich date and time libraries.
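To make that last point concrete, here is a minimal sketch that recovers the day of the week from the example date above. The class and method names are made up for illustration, and it assumes DateTimeOffset's general-purpose parser (with the invariant culture) is acceptable for this kind of input.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, for illustration only.
public static class AtomDateInfo
{
    // Parses an Atom-style timestamp and returns the day of the week
    // as seen in the timestamp's own offset ("Z" means an offset of zero).
    public static DayOfWeek DayOfWeekFor(string atomDate)
    {
        DateTimeOffset parsed =
            DateTimeOffset.Parse(atomDate, CultureInfo.InvariantCulture);
        return parsed.DayOfWeek;
    }
}
```

Calling AtomDateInfo.DayOfWeekFor("1985-04-12T23:20:50.52Z") should give DayOfWeek.Friday. The offset is available the same way through the Offset property of the parsed value.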

Date parsing in .NET

For parsing date and time strings, .NET offers us DateTime.Parse and DateTime.ParseExact. Both can handle the standard date and time formats included in the framework. ParseExact can also parse custom date and time formats.

Parsing RSS dates

For RSS-type dates, .NET includes the standard date format string "r" for RFC 1123. When used in combination with DateTime.ToString, "r" always produces a string with a "GMT" suffix. Note that it does not convert the value for you, so hand it a DateTime that is already in universal time. For parsing purposes, .NET will accept a numerical offset without complaint. As hinted at above, though, .NET will throw an exception if it encounters one of the other time zone names specified in RFC 822, such as EDT.
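The behavior described above can be sketched roughly as follows. This is an illustration rather than FeedFly code: the class and method names are hypothetical, and it assumes DateTimeOffset (rather than DateTime) is used for parsing so that the offset from the feed is preserved.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, for illustration only.
public static class RssDateDemo
{
    // Formats a value with the "r" (RFC 1123) standard format string.
    // "r" only appends "GMT"; the conversion to UTC is done explicitly here.
    public static string ToRfc1123(DateTime value)
    {
        return value.ToUniversalTime().ToString("r", CultureInfo.InvariantCulture);
    }

    // Tries the framework's general parser on an RSS-style date.
    // Returns false instead of throwing when the text is rejected.
    public static bool TryParse(string text, out DateTimeOffset result)
    {
        return DateTimeOffset.TryParse(
            text, CultureInfo.InvariantCulture, DateTimeStyles.None, out result);
    }
}
```

With this sketch, "Thu, 17 Apr 2008 22:45:00 GMT" and "Tue, 10 Jun 2008 05:15:00 -0600" should both parse, while "Wed, 28 May 2008 07:00:00 EDT" should come back as false.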

What do you think about this?  Should the BCL team be forced to abide by RFC 1123's command that it "MUST accept either notation"? I think so. RFC 822 is certainly not going to change which time zones it contains, so hard-coding this into DateTime somewhere couldn't hurt.  It was certainly easy enough for me to build a ListDictionary solution as a workaround.
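A workaround along these lines can be sketched as follows. This is an illustration, not the FeedFly code: it uses a generic Dictionary rather than a ListDictionary, the class and method names are hypothetical, and it assumes the zone name is the last space-separated token of the date. The offsets are the ones RFC 822 assigns to its named zones; the single-letter military zones are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper, for illustration only.
public static class Rfc822Dates
{
    // Named zones from RFC 822, mapped to their numeric offsets.
    private static readonly Dictionary<string, string> ZoneOffsets =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "UT",  "+0000" }, { "GMT", "+0000" },
            { "EST", "-0500" }, { "EDT", "-0400" },
            { "CST", "-0600" }, { "CDT", "-0500" },
            { "MST", "-0700" }, { "MDT", "-0600" },
            { "PST", "-0800" }, { "PDT", "-0700" }
        };

    // Swaps a trailing zone name for its numeric offset, then lets the
    // framework's general parser do the rest.
    public static bool TryParse(string text, out DateTimeOffset result)
    {
        result = default(DateTimeOffset);
        if (string.IsNullOrEmpty(text))
        {
            return false;
        }

        string candidate = text.Trim();
        int lastSpace = candidate.LastIndexOf(' ');
        if (lastSpace > 0)
        {
            string zone = candidate.Substring(lastSpace + 1);
            string offset;
            if (ZoneOffsets.TryGetValue(zone, out offset))
            {
                candidate = candidate.Substring(0, lastSpace + 1) + offset;
            }
        }

        return DateTimeOffset.TryParse(
            candidate, CultureInfo.InvariantCulture, DateTimeStyles.None, out result);
    }
}
```

With this in place, "Wed, 28 May 2008 07:00:00 EDT" should parse to a value with an offset of -04:00. Dates that already carry a numeric offset pass through to the parser unchanged.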

Furthermore, given the fact that Java is about to release an updated API for date parsing (JSR 310) that will most likely parse these time zone names, .NET's going to have to catch up (hint). Now, I don't know this for a fact regarding JSR 310, but from what I could find from searching, they're considering a lot of ideas from the popular joda-time library.

Parsing Atom dates

After all that mess with RFC 822, Atom's use of RFC 3339 is refreshing from a parsing perspective. .NET includes the "o" format for parsing what appear to be RFC 3339 dates. However, MSDN says this format is for ISO 8601. Given the huge variety of date formats in ISO 8601 (the W3C commented briefly about this), I'd prefer the documentation to be a bit more specific about what it will and will not parse. But hey, it parses just fine, so I'll take it.
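Here is a minimal sketch of both directions, with hypothetical class and method names. One assumption worth flagging: as far as I can tell, ParseExact with "o" expects the exact round-trip shape that ToString("o") produces (seven fractional digits), so the sketch keeps "o" for round-tripping and falls back on the general parser for dates that come from someone else's feed.

```csharp
using System;
using System.Globalization;

// Hypothetical helper, for illustration only.
public static class AtomDates
{
    // Writes a value using the "o" round-trip format.
    public static string Format(DateTimeOffset value)
    {
        return value.ToString("o", CultureInfo.InvariantCulture);
    }

    // Reads back a string that was produced by Format above.
    public static DateTimeOffset ParseRoundTrip(string text)
    {
        return DateTimeOffset.ParseExact(text, "o", CultureInfo.InvariantCulture);
    }

    // More forgiving path for dates found in feeds, which may carry
    // fewer fractional digits or none at all.
    public static bool TryParseFeedDate(string text, out DateTimeOffset result)
    {
        return DateTimeOffset.TryParse(
            text, CultureInfo.InvariantCulture, DateTimeStyles.None, out result);
    }
}
```

A value passed through Format and then ParseRoundTrip should come back unchanged, and TryParseFeedDate should accept the example date 1985-04-12T23:20:50.52Z.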

I prefer Atom

I have no idea what preference the small slice of the development community that writes .NET feed reader software has in terms of feed formats. But I'm sure they appreciate Atom when it comes time to parse dates. I don't know if it's a freedom thing or a popularity thing, but I see much more abuse of date formats in RSS feeds than I do in Atom feeds. For example, AutoBlog's RSS feed uses RFC 3339 dates in its pubDate element. That's a no-no, but I still parse it.

 
