RSS issue with unicode characters

Post by hyperair » Wed Dec 26, 2007 6:24 pm

Hi it's me again and I bring you another RSS issue :D

Here's the link:
http://validator.w3.org/feed/check.cgi? ... %2Frss.php

As the validator shows, the feed contains a strange character that can't be displayed and violates the standard, so my RSS reader, Liferea, is giving problems again! I'm not sure whether other RSS readers have the same problem.

Solution: replace every character outside the ASCII range with the appropriate &#xxxx; numeric character reference.
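
A minimal sketch of that substitution, written in Python rather than the forum's PHP purely for illustration. The function name `to_ascii_xml` is hypothetical; it relies on Python's built-in `xmlcharrefreplace` error handler, which emits decimal references. Markup characters such as `&` and `<` are left alone here and would still need ordinary XML escaping.

```python
def to_ascii_xml(text: str) -> str:
    """Return text with every non-ASCII character replaced by a decimal
    numeric character reference such as &#233;."""
    # encode() swaps each character that ASCII cannot represent for &#NNNN;
    # and decode() turns the resulting pure-ASCII bytes back into a str.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(to_ascii_xml("caf\u00e9 \u20ac5"))  # prints caf&#233; &#8364;5
```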

Post by Kieran » Sat Feb 02, 2008 11:14 pm

Sorry for the delay in getting back to you on this. I did manage to re-create your issue and I have started a thread on the team forum to discuss how to fix this issue.

The problem is that the RSS feed simply pulls the post data straight out of the database and can only use the same conversion functions that phpBB provides, such as the BBCode parser. It seems as though HTML, XHTML and the like are less strict than RSS, so the built-in phpBB text-parsing functions are good enough to display the contents of the posts database on an HTML/XHTML page but not in an RSS feed.

The fix I came up with is to create a table of ASCII characters and then run everything shown in the feed through a string parser, replacing characters as required. The only minor issue is that this would put quite a load on the server, as the feed is recreated for every page view and the parsing would involve quite a number of on-the-fly checks.

The only way to mitigate such a load would be to do the conversion as posts were placed into the database, but I feel this might create more problems than it solves given the number of ways in which data can get into the DB aside from the usual posting screen.

The other option would be to combine the parsing with some kind of caching for the RSS page, but seeing as we don't do any caching at all in the community at present, this seems like a bit of overkill for just one area.
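
To make the caching idea concrete, here is a rough sketch in Python (not the forum's PHP). `FeedCache`, `render` and `latest_post_id` are hypothetical names, and the sketch assumes the newest post id is cheap to look up. In a PHP setup the cached copy would have to live on disk or in a shared cache, since each request starts fresh.

```python
from typing import Callable, Optional


class FeedCache:
    """Keep the last rendered feed and reuse it until a newer post exists."""

    def __init__(self, render: Callable[[], str]) -> None:
        self._render = render  # the expensive step: query posts, escape, build XML
        self._cached_xml: Optional[str] = None
        self._cached_post_id: Optional[int] = None

    def get(self, latest_post_id: int) -> str:
        # Rebuild only when nothing is cached yet or the newest post id changed.
        if self._cached_xml is None or latest_post_id != self._cached_post_id:
            self._cached_xml = self._render()
            self._cached_post_id = latest_post_id
        return self._cached_xml
```

With this shape the per-character parsing runs once per new post instead of once per page view. Edited posts would need an extra invalidation rule, which the sketch leaves out.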

I'm open to all suggestions as I'd like to fix this for you but I'm going round in circles at the moment.

Post by hyperair » Sun Feb 03, 2008 7:07 am

How about changing the build time of the RSS feed to every new post? Like, you put an rss.xml somewhere on your server, and every time there's a new post, this rss.xml is regenerated. I reckon it's quite feasible considering the RSS feed gets requested more often than the posts come in.

Of course, doing that will then require a lock file so that you don't corrupt the RSS feed in a race condition.
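
A sketch of how regenerating rss.xml under a lock file might look, again in Python for illustration only. `publish_feed` and the `.lock` suffix are hypothetical, `fcntl` is Unix-only, and the rename trick assumes the temporary file sits on the same filesystem as the feed.

```python
import fcntl
import os
import tempfile


def publish_feed(xml: str, path: str) -> None:
    """Write the feed to a temporary file, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    with open(path + ".lock", "w") as lock_file:
        # Blocks until any other writer has finished; the lock is released
        # when lock_file is closed at the end of the with block.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as tmp:
                tmp.write(xml)
            # os.replace swaps the file in one step, so readers see either
            # the old feed or the new one, never a half-written file.
            os.replace(tmp_path, path)
        except BaseException:
            os.unlink(tmp_path)
            raise
```

The posting code would call `publish_feed(new_xml, "rss.xml")` after each new post. Note that `mkstemp` creates the file readable by its owner only, so the permissions may need adjusting before a web server can serve it.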

EDIT: Also, Unicode characters can be converted without a lookup table by using the ord() function. ord() on its own only looks at a single byte, though, so you first need to decode the text with the multi-byte string functions. Then each Unicode character can be represented as &#<code point>; where the code point is the number you get for that character.
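
The same idea sketched in Python, whose `ord()` returns the code point once the bytes have been decoded. `escape_with_ord` is a hypothetical name, and UTF-8 storage is an assumption; the thread does not say which encoding the database uses.

```python
def escape_with_ord(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw bytes, then replace each non-ASCII character with &#N;."""
    text = raw.decode(encoding)  # bytes -> Unicode characters (assumed UTF-8)
    # ord(ch) is the character's code point; anything from 128 up is non-ASCII.
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)


if __name__ == "__main__":
    raw = "na\u00efve".encode("utf-8")
    # Before decoding, the single character U+00EF is two separate bytes,
    # which is why byte-wise conversion gives the wrong numbers.
    print(list(raw[2:4]))        # prints [195, 175]
    print(escape_with_ord(raw))  # prints na&#239;ve
```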