2009-01-30
A while ago, I needed to spider all of the pages in a dynamically generated website and do something with the data. Since the spider only needed to work with the one website, I was free to add as much site-specific code as I needed. Since then, the script has been copied and adjusted to fit the needs of whatever site I happen to be working on at the time. In a nutshell, the script gets passed a web page address, downloads the page, uses a regex to find all of the links, parses the links, and then recurses through the links.
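The "use a regex to find all of the links" step can be tried on its own. This is only a minimal sketch, not the script itself: `extractLinks` is a hypothetical helper name, and the HTML string is made-up sample input so the example runs without touching the network.

```php
<?php
// Hypothetical helper: return every href value found in a chunk of HTML.
function extractLinks(string $html): array
{
    // Match href="..." or href='...', in any case, allowing spaces around "=".
    preg_match_all('/href\s*=\s*["\'](.*?)["\']/i', $html, $matches);
    return $matches[1];
}

// Made-up sample page, standing in for a downloaded document.
$sampleHtml = '<a href="home">Home</a> <a HREF = \'projects/hubcap\'>Hubcap</a>';

foreach (extractLinks($sampleHtml) as $link) {
    echo $link, "\n";
}
```

Run from the command line, this prints `home` and `projects/hubcap`, one per line. A real spider would then resolve each link against the page address and recurse into it.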

The current iteration of the script spiders a "url-beautified" site and generates a very basic sitemap suitable for submission to Google or Yahoo, and the current code looks like this:
#!/usr/bin/env php
<?php
function getRealPath($baseDir, $link)
{
    $newDir = '';
    //count the number of ../ in the link
    $rise = substr_count($link, "../");
    $folders = explode("/", $baseDir);
    //keep every folder except the ones the link climbs out of
    for ($i = 0; $i < count($folders) - $rise; $i++) {
        $newDir .= $folders[$i] . "/";
    }
    $link = str_replace("../", "", $link);
    $link = str_replace(" ", "%20", $link);
    //replace double forward slashes
    //$link = str_replace("//","/",$link);
    return $newDir . $link;
}

function getLinks($file, $parent = '')
{
    $dirname = '';
    global $brokenLinks;
    global $pageList;
    global $brokenLinksString;

    $text = @file_get_contents($file); //get the text from the file
    if ($text === false || trim($text) == "") { //the file can't be opened
        $brokenLinksString .= "$parent\n\t$file\n\n";
        echo "--Missing File--\n\t$file\n";
        return 1;
    }
    echo "$file\n";
    //normalize the different ways of writing href=
    $text = str_replace(array("href =", "HREF =", "href= ", "HREF= ", "HREF="), "href=", $text);
    preg_match_all("/href=[\"|'](.*?)[\"|']/", $text, $link_results);
    foreach ($link_results[1] as $link) {
        $link_is_absolute = false;
        //clean up ampersands in the link
        $link = str_replace("&amp;", "&", $link);
        if (stristr($link, "mailto")) {
            continue;
        }
        $processedLink = strtolower($link);
        //find out if the link is external before we figure out the realLink
        if (stristr($processedLink, "http:") || stristr($processedLink, "https:")) {
            if (strpos($processedLink, BaseHREF) === 0) {
                //this link is an absolute path within the site
                $link_is_absolute = true;
            } else {
                //this link is external
                continue;
            }
        }
        //ignore all extensions (this also skips absolute links,
        //since the host name in them contains a dot)
        if (strstr($processedLink, ".")) {
            continue;
        }
        //if the file is absolute local we don't need to discover the abspath
        if ($link_is_absolute) {
            $absLink = $processedLink;
        } else {
            $absLink = getRealPath(BaseHREF . "$dirname", $processedLink);
        }
        //only spider pages we have not seen yet
        if (!in_array($absLink, $pageList)) {
            array_push($pageList, $absLink);
            getLinks($absLink, $file);
        }
    }
    return 1;
}

function xmlEncode($string)
{
    //escape the five characters that are special in XML
    $newString = str_replace(
        array("&", "'", "\"", ">", "<"),
        array("&amp;", "&apos;", "&quot;", "&gt;", "&lt;"),
        $string
    );
    return $newString;
}

/*begin the script*/
if ($argc < 2) {
    fwrite(STDERR, "usage: $argv[0] <base url>\n");
    exit(1);
}
//make some global variables
$pageList = array();
$brokenLinksString = "";
$brokenLinks = array();
$urlString = "";
define("BaseHREF", $argv[1]);
getLinks(BaseHREF); //find the links in the file

//make the sitemap
foreach ($pageList as $filePath) {
    //create the url for our Google Site Map
    $subPath = explode("/", $filePath);
    $folderDepth = count($subPath) - 1;
    $subPath[$folderDepth] = xmlEncode($subPath[$folderDepth]);
    $encodedPath = implode("/", $subPath);
    $loc = $encodedPath;
    $urlString .= "\t<url>\n";
    $urlString .= "\t\t<loc>$loc</loc>\n";
    //deeper pages get a lower priority
    if (stristr($filePath, "home.")) {
        $priority = 1 - ($folderDepth * 0.1);
    } else {
        $priority = 0.8 - ($folderDepth * 0.1);
    }
    $urlString .= "\t\t<priority>$priority</priority>\n";
    $urlString .= "\t</url>\n";
}
$siteMapText = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n";
$siteMapText .= '<urlset';
$siteMapText .= ' xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">';
$siteMapText .= "\n";
$siteMapText .= $urlString;
$siteMapText .= "</urlset>";
$fhandle = fopen("sitemap.xml", "w");
fwrite($fhandle, $siteMapText);
fclose($fhandle);
?>
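The trickiest part of the script is turning a relative link with `../` in it into a full address. Here is a small standalone sketch of that idea. `resolveRelative` is a hypothetical name used for illustration (it is not the script's own function), the example.com addresses are made up, and it assumes the base directory is passed without the page name on the end.

```php
<?php
// Hypothetical helper: resolve a link containing "../" against a base directory.
function resolveRelative(string $baseDir, string $link): string
{
    // Each "../" climbs one folder above the base directory.
    $rise = substr_count($link, '../');

    // Drop a trailing slash so it does not count as an empty folder.
    $folders = explode('/', rtrim($baseDir, '/'));

    // Keep every folder except the last $rise ones.
    $kept = array_slice($folders, 0, max(0, count($folders) - $rise));

    // Remove the "../" markers and encode spaces.
    $link = str_replace(['../', ' '], ['', '%20'], $link);

    return implode('/', $kept) . '/' . $link;
}

echo resolveRelative('http://www.example.com/projects/hubcap', '../vplayer'), "\n";
echo resolveRelative('http://www.example.com/projects', 'my page'), "\n";
```

The first call climbs out of `hubcap` and prints `http://www.example.com/projects/vplayer`; the second has nothing to climb and prints `http://www.example.com/projects/my%20page`.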

As the script runs, it prints the path of each page being spidered to standard output, along with a warning for any missing pages. When I run the script against my jezra.net site, the output is:
http://www.jezra.net
http://www.jezra.net/home
http://www.jezra.net/projects
http://www.jezra.net/music
http://www.jezra.net/contact
http://www.jezra.net/projects/hubcap
http://www.jezra.net/projects/serial_switch
http://www.jezra.net/projects/svggraph
http://www.jezra.net/projects/vplayer

Good, no missing files. The sitemap that is generated by the script is as follows:
<?xml version="1.0" encoding="UTF-8" ?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <url>
  <loc>http://www.jezra.net/home</loc>
  <priority>0.5</priority>
 </url>
 <url>
  <loc>http://www.jezra.net/projects</loc>
  <priority>0.5</priority>
 </url>
 <url>
  <loc>http://www.jezra.net/music</loc>
  <priority>0.5</priority>
 </url>
 <url>
  <loc>http://www.jezra.net/contact</loc>
  <priority>0.5</priority>
 </url>
 <url>
  <loc>http://www.jezra.net/projects/hubcap</loc>
  <priority>0.4</priority>
 </url>
 <url>
  <loc>http://www.jezra.net/projects/serial_switch</loc>
  <priority>0.4</priority>
 </url>
 <url>
  <loc>http://www.jezra.net/projects/svggraph</loc>
  <priority>0.4</priority>
 </url>
 <url>
  <loc>http://www.jezra.net/projects/vplayer</loc>
  <priority>0.4</priority>
 </url>
</urlset>
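The 0.5 and 0.4 priorities come from counting slashes in the address, so a quick sketch of how a single entry could be built may help. `sitemapEntry` is a hypothetical helper, not part of the script, and it swaps in PHP's built-in `htmlspecialchars` for the escaping rather than the script's own `xmlEncode`.

```php
<?php
// Hypothetical helper: build one <url> entry the way the main loop does.
function sitemapEntry(string $loc): string
{
    // Splitting on "/" also counts the "http:" and empty segments,
    // so http://www.jezra.net/home has a depth of 3.
    $folderDepth = count(explode('/', $loc)) - 1;

    // Pages lose 0.1 of priority per level; "home." pages start at 1 instead of 0.8.
    $base = stristr($loc, 'home.') ? 1.0 : 0.8;
    $priority = sprintf('%.1f', $base - $folderDepth * 0.1);

    // Escape &, <, >, " and ' for XML (a swapped-in built-in, not xmlEncode).
    $safeLoc = htmlspecialchars($loc, ENT_QUOTES | ENT_XML1, 'UTF-8');

    return "\t<url>\n\t\t<loc>$safeLoc</loc>\n\t\t<priority>$priority</priority>\n\t</url>\n";
}

echo sitemapEntry('http://www.jezra.net/home');
echo sitemapEntry('http://www.jezra.net/projects/hubcap');
```

With a depth of 3, `/home` works out to 0.8 - 0.3 = 0.5, and with a depth of 4, `/projects/hubcap` works out to 0.8 - 0.4 = 0.4, which matches the sitemap above.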

Hey, it gets the job done.
  • Tags:
  • PHP