Keith Devens .com

Monday, June 3, 2002

RSS auto-discovery with PHP

I'm in the process of writing my own RSS aggregator. Naturally, I wanted to be able to use the new RSS auto-discovery method which has evolved over the past few days. Mark Pilgrim made some JavaScript bookmarklets and a Python implementation to do this, but I needed a PHP implementation, so I wrote one. First the code, and then I'll explain:

<?php
function getRSSLocation($html, $location){
    if(!$html or !$location){
        return false;
    }else{
        #search through the HTML, save all <link> tags
        # and store each link's attributes in an associative array
        preg_match_all('/<link\s+(.*?)\s*\/?>/si', $html, $matches);
        $links = $matches[1];
        $final_links = array();
        $link_count = count($links);
        for($n=0; $n<$link_count; $n++){
            #start from empty values so attributes never carry over from the previous link
            $final_link = array('rel' => '', 'type' => '', 'href' => '');
            $attributes = preg_split('/\s+/s', $links[$n]);
            foreach($attributes as $attribute){
                $att = preg_split('/\s*=\s*/s', $attribute, 2);
                if(isset($att[1])){
                    $att[1] = preg_replace('/([\'"]?)(.*)\1/', '$2', $att[1]);
                    $final_link[strtolower($att[0])] = $att[1];
                }
            }
            $final_links[$n] = $final_link;
        }
        #now figure out which one points to the RSS file
        for($n=0; $n<$link_count; $n++){
            $href = false;
            if(strtolower($final_links[$n]['rel']) == 'alternate'){
                if(strtolower($final_links[$n]['type']) == 'application/rss+xml'){
                    $href = $final_links[$n]['href'];
                }
                if(!$href and strtolower($final_links[$n]['type']) == 'text/xml'){
                    #kludge to make the first version of this still work
                    $href = $final_links[$n]['href'];
                }
                if($href){
                    if(strstr($href, "http://") !== false){ #if it's absolute
                        $full_url = $href;
                    }else{ #otherwise, 'absolutize' it
                        $url_parts = parse_url($location);
                        #only made it work for http:// links. Any problem with this?
                        $full_url = "http://$url_parts[host]";
                        if(isset($url_parts['port'])){
                            $full_url .= ":$url_parts[port]";
                        }
                        if($href[0] != '/'){ #it's a relative link on the domain
                            $full_url .= dirname($url_parts['path']);
                            if(substr($full_url, -1) != '/'){
                                #if the last character isn't a '/', add it
                                $full_url .= '/';
                            }
                        }
                        $full_url .= $href;
                    }
                    return $full_url;
                }
            }
        }
        return false;
    }
}
?>

The function takes two arguments: the raw HTML and the location you got it from. The location has to be there so relative links to the RSS file can be resolved.
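To illustrate why the location matters, here is a minimal sketch of the resolution step on its own. The helper name resolveFeedUrl() and the example.com URLs are made up for illustration and are not part of the function above. As an assumption of this sketch, it keeps everything up to the last slash of the page's path instead of calling dirname(), and like the original it only builds http:// URLs.

```php
<?php
// Hypothetical helper (not part of the original post): resolve $href
// against the page it was found on. Only http:// results are produced.
function resolveFeedUrl($href, $location) {
    if (preg_match('#^https?://#i', $href)) {
        return $href; // already absolute, nothing to do
    }
    $parts = parse_url($location);
    $base = 'http://' . $parts['host'];
    if (isset($parts['port'])) {
        $base .= ':' . $parts['port'];
    }
    if ($href !== '' && $href[0] === '/') {
        return $base . $href; // root-relative: keep only host (and port)
    }
    // Document-relative: keep the page's path up to its last slash.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $base . $dir . $href;
}

echo resolveFeedUrl('rss2.xml', 'http://www.phpheaven.net/rubrique14.html'), "\n";
echo resolveFeedUrl('/feeds/index.xml', 'http://example.com:8080/blog/post.html'), "\n";
echo resolveFeedUrl('index.rdf', 'http://example.com/blog/'), "\n";
```

The three calls cover a page-relative link, a root-relative link on a host with a port, and a link relative to a directory URL.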

First, PHP doesn't have an SGML parser built in like Python seems to, so I had to get all the <link> tags myself. A few regular expressions and some simple string splitting made it all real easy. Next, you cycle through all the links found, figure out which one is the RSS file, resolve it to an absolute URL if necessary, and return it.
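For comparison, later PHP versions (5 and up) ship a DOM extension that can do the tag extraction without regular expressions. The following is only a sketch under that assumption: findFeedLinks() is a hypothetical name, it returns the raw href values without resolving them, and it only looks for the application/rss+xml type.

```php
<?php
// Hypothetical alternative (not the original approach): collect the href of
// every <link rel="alternate" type="application/rss+xml"> using DOMDocument.
function findFeedLinks($html) {
    if ($html === '') {
        return array(); // loadHTML() rejects an empty string
    }
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so keep libxml's parse warnings quiet.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $feeds = array();
    foreach ($doc->getElementsByTagName('link') as $link) {
        // getAttribute() returns '' when the attribute is missing.
        $rel  = strtolower($link->getAttribute('rel'));
        $type = strtolower($link->getAttribute('type'));
        if ($rel === 'alternate' && $type === 'application/rss+xml') {
            $feeds[] = $link->getAttribute('href');
        }
    }
    return $feeds;
}

$html = '<html><head><title>Example</title>'
      . '<link rel="stylesheet" type="text/css" href="/style.css">'
      . '<link rel="alternate" type="application/rss+xml" title="RSS" href="/rss.xml">'
      . '</head><body></body></html>';
print_r(findFeedLinks($html));
```

Because it returns an array, a caller can decide what to do when a page advertises more than one feed.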

I'll gladly take any suggestions for improving the code, and if I change it I'll update it here. Hope this helps.

Update: if you need a function to retrieve the HTML for you, feel free to use mine.

<?php
function getFile($location){
    $ch = curl_init($location);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Connection: close'));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_TIMEOUT, 15);
    $response = curl_exec($ch);
    curl_close($ch);
    return $response;
}
?>

This uses PHP's interface to the excellent cURL library, so it requires PHP 4.0.2 or higher.

Update: cool, mentioned on Scripting News! But oy! My name is spelled wrong. Later: he's fixed it, thanks Dave! Also seen on dive into mark and Keith Gaughan.

Update (6/6/02): Fixed a small bug where the RSS location returned would be incorrect if the RSS location had a query string with equals signs.

$att = preg_split('/\s*=\s*/s', $attribute);

had to be

$att = preg_split('/\s*=\s*/s', $attribute, 2);

so that it would only return up to two values, and not split on any more equals signs. Found while retrieving the RSS feed for Langreiter.com.
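The difference is easy to see on a single attribute. This is a small sketch; the example.com URL is invented for illustration.

```php
<?php
// A made-up attribute whose value contains a query string with "=" signs.
$attribute = 'href="http://example.com/rss.php?feed=main&format=rss"';

// Without a limit, every "=" is a split point and the URL is broken apart.
$unlimited = preg_split('/\s*=\s*/s', $attribute);

// With a limit of 2, only the first "=" separates the name from the value.
$limited = preg_split('/\s*=\s*/s', $attribute, 2);

print_r($unlimited); // four pieces; the URL is cut at each "="
print_r($limited);   // two pieces; the quoted URL stays whole
```

Since only index 1 is used as the attribute value, the unlimited split would silently truncate the URL at `feed`.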

Update (6/19/02): small improvement thanks to Pepino.
Update (7/26/02): added a timeout to the getFile function.
Update (4/15/03): made it work with error_reporting(E_ALL), and changed the way it resolves relative URLs (see comments for more info).
Update (9/15/03): Note: this code is public domain.

Comments

Keith Gaughan (http://hereticmessiah.weblogs.com/) wrote:

Well, he's went and fixed it now.

∴ Keith Gaughan | 3-Jun-2002 10:57pm est | http://hereticmessiah.weblogs.com/ | #457

Keith (http://www.keithdevens.com/) wrote:

Yeah, I started to put the message I just put there a while ago, but my computer was wonky and then I had to go out for the night, so I didn't get to do it.

Keith | 3-Jun-2002 11:06pm est | http://www.keithdevens.com/ | #458

Nicolas Hoizey <nhoizey@php.net> (http://www.phpheaven.net/) wrote:

It seems there is a bug somewhere in the URL creation.

I tried using your script with the following URL:
http://www.phpheaven.net/rubrique14.html

It should give me this RSS feed :
http://www.phpheaven.net/rss2.xml

But instead it gives me this:
http://www.phpheaven.net/rubrique14.html/rss2.xml

Anyway, great job!

∴ Nicolas Hoizey <nhoizey@php.net> | 15-Apr-2003 8:48am est | http://www.phpheaven.net/ | #1820

Keith (http://www.keithdevens.com/) wrote:

Hi Nicolas. I should have been clearer in my documentation. The $location you're supposed to pass as the second parameter to the getRSSLocation function is supposed to be the base location, not the page itself. So in your case, the base location would be http://www.phpheaven.net/, not http://www.phpheaven.net/rubrique14.html.

However, it was easy to change the behavior to match what you expected, so I changed the code. I also updated it to work with error_reporting(E_ALL) on. So grab the new code and give it a try, and let me know if you have any problems.

Keith | 15-Apr-2003 1:41pm est | http://www.keithdevens.com/ | #1821

Jason DeFillippo (http://jason.defillippo.com/blog/) wrote:

Love it! Only suggestion would be to return an array of all possible hits. I have 2 RSS feeds on my site. One .rdf and one .xml but running the site's html thru your func it just returns the rdf. Beautiful work though. I'll definitely be using it.

∴ Jason DeFillippo | 1-Jun-2003 11:02pm est | http://jason.defillippo.com/blog/ | #2125

Keith (http://www.keithdevens.com/) wrote:

Jason, I'll consider it, but considering that this code will be used for web-based applications, where it's not easy to pop up a dialogue box to choose which feed you want, and in the absence of further metadata (like "full posts" or "excerpts") to allow the code to automatically choose a preferred version, it's likely that returning more than one result would be a mis-feature.

However, the code should have predictable behavior when faced with more than one option (for instance, always return the first feed listed), a possibility I never considered in the first place. So, I should probably look at the code to see what happens and see if that's preferable.

Keith | 2-Jun-2003 1:56am est | http://www.keithdevens.com/ | #2126

Jason DeFillippo (http://jason.defillippo.com/blog/) wrote:

The need to gather all feed possibilities was what I needed in my web app so I just hacked the feature into your func so no worries :-) Thanks for the quick reply.

∴ Jason DeFillippo | 2-Jun-2003 2:57pm est | http://jason.defillippo.com/blog/ | #2130

Keith (http://www.keithdevens.com/) wrote:

Good! Glad you were able to make do. There was no way I would have gotten to it soon, so it's good that you did what you did.

Keith | 2-Jun-2003 8:20pm est | http://www.keithdevens.com/ | #2131

Nicolas (http://nicolashb.free.fr) wrote:

Has anyone implemented this in ASP?

∴ Nicolas | 10-Sep-2003 7:29am est | http://nicolashb.free.fr | #2893

Keith (http://keithdevens.com/) wrote:

Dunno, I did a quick Google search just now and didn't see anything.

Keith | 10-Sep-2003 12:20pm est | http://keithdevens.com/ | #2895

philip (http://www.philipandrew.com/) wrote:

Does work for http://www.upsaid.com/beyan/ see there is a RSS feed for this page at http://www.101h.com/beyan/feed.xml

Thanks!

∴ philip | 4-Dec-2003 8:01pm est | http://www.philipandrew.com/ | #3448

Max wrote:

Keith, could you give me a live example : which values should I assign to $html and $location in order to get the feed from http://www.phpheaven.net/ ?

I tried this with no luck:

$location = 'http://www.phpheaven.net/';
$html = getFile($location);
getRSSLocation($html, $location);

Thanks,

Max

∴ Max | 4-Jun-2004 2:55pm est | #4728

Keith (http://keithdevens.com/) wrote:

Max, I tried exactly what you gave me and it worked fine, returning http://www.phpheaven.net/rss1.xml.

Note that you can use file_get_contents() instead of getFile() if you'd like, now that there's a function built into PHP that does exactly that.

Keith | 4-Jun-2004 3:50pm est | http://keithdevens.com/ | #4729

Max wrote:

Thanks Keith, it appeared the server I was working on had crashed... Works like a charm now ! thank you :-)

∴ Max | 5-Jun-2004 7:23am est | #4732

steve (http://www.buzznet.com) wrote:

Keith. This kicks major ass. I love your brute-force approach. I was messing with using SAX parsers etc. What a mess. Your solution is brief and accurate! Thanks, dude.

∴ steve | 21-Aug-2004 2:42am est | http://www.buzznet.com | #5316

some body wrote:

FYI, I just found your function in the code for the zfeeder web aggregator application:

http://zvonnews.sourceforge.net/

∴ some body | 19-Oct-2005 9:04am est | #8493

Enej wrote:

Great job. I was wondering what happens if the user actually puts in a URL to the feed instead of the HTML site?
It would be cool if it would point you to the feed regardless.

Thanks.

∴ Enej | 22-Nov-2005 3:47am est | #8720

81.10.126.86 wrote:

how about trackback URL auto discovery ?

∴ 81.10.126.86 | 15-Mar-2006 4:10am est | #9307

Mark_S wrote:

I tried this with no luck

$location = 'http://www.phpheaven.net/';
$html = getFile($location);
getRSSLocation($html, $location);

-------------------------
I'm struggling to get this to work?
I know its more down to my php confusion with calling
functions, like Max above.

Any help would be appreciated.
As my searches for Auto Discovery bring me back to this
page time and time again.

-------------------------
Does the above code make a .php page "Auto Discovered"
so to speak.
How do i include it in my php?
An example would be much appreciated.

The page I would like "Auto Discovery" for is a PHP page.
Using the default tags on an HTML page, I can
get auto discovery to work.
But I cannot get a PHP page to auto discover!

I'm at newbie / novice level.

Thanks in advance Mark.

∴ Mark_S | 18-Mar-2006 12:46pm est | #9327

Cristian wrote:

I am sorry Keith Devens, but I had to modify the function to get all the feeds on the page. This is the code:

function getRSSLocation($html, $location){
    if(!$html or !$location){
        return false;
    }else{
        #search through the HTML, save all <link> tags
        # and store each link's attributes in an associative array
        preg_match_all('/<link\s+(.*?)\s*\/?>/si', $html, $matches);
        $links = $matches[1];
        $final_links = array();
        $link_count = count($links);
        for($n=0; $n<$link_count; $n++){
            $attributes = preg_split('/\s+/s', $links[$n]);
            foreach($attributes as $attribute){
                $att = preg_split('/\s*=\s*/s', $attribute, 2);
                if(isset($att[1])){
                    $att[1] = preg_replace('/([\'"]?)(.*)\1/', '$2', $att[1]);
                    $final_link[strtolower($att[0])] = $att[1];
                }
            }
            $final_links[$n] = $final_link;
        }
        #now figure out which one points to the RSS file
        for($n=0; $n<$link_count; $n++){
            $href = false; #reset for each link so an earlier match isn't reused
            if(strtolower($final_links[$n]['rel']) == 'alternate'){
                if(strtolower($final_links[$n]['type']) == 'application/rss+xml'){
                    $href = $final_links[$n]['href'];
                }
                if(!$href and strtolower($final_links[$n]['type']) == 'text/xml'){
                    #kludge to make the first version of this still work
                    $href = $final_links[$n]['href'];
                }
                if($href){
                    if(strstr($href, "http://") !== false){ #if it's absolute
                        $full_url[] = $href;
                    }else{ #otherwise, 'absolutize' it
                        $url_parts = parse_url($location);
                        #only made it work for http:// links. Any problem with this?
                        $full_url[] = "http://$url_parts[host]";
                        if(isset($url_parts['port'])){
                            $full_url[count($full_url)-1] .= ":$url_parts[port]";
                        }
                        if($href[0] != '/'){ #it's a relative link on the domain
                            $full_url[count($full_url)-1] .= dirname($url_parts['path']);
                            if(substr($full_url[count($full_url)-1], -1) != '/'){
                                #if the last character isn't a '/', add it
                                $full_url[count($full_url)-1] .= '/';
                            }
                        }
                        $full_url[count($full_url)-1] .= $href;
                    }
                    //return $full_url;
                }
            }
        }
        if (isset($full_url)) {
            return $full_url;
        } else {
            return false;
        }
    }
}
∴ Cristian | 2-Oct-2006 10:46am est | #9695

Jake wrote:

Great code! Thanks a million, and thanks to Cristian for the ability to pull the feeds in an array.

∴ Jake | 7-Mar-2007 3:40pm est | #10011

Comments closed.
