2009-01-30
A while ago, I needed to spider all of the pages in a dynamically generated website and do something with the data. Since the spider only needed to work with the one website, I was free to add as much site-specific code as I needed. The script has been copied and adjusted to fit the needs of whatever site I happen to be working on at the time. In a nutshell, the script gets passed a web page address, downloads the page, uses a regex to find all of the links, parses the links, and then recurses through the links.
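The regex step is the heart of the whole thing, so here is a minimal, standalone sketch of just that piece (the sample HTML and the extractLinks name are made up for illustration; the full script below does more cleanup first):

```php
<?php
// Extract every href="..." or href='...' value from a chunk of HTML,
// using a lazy quantifier so each match stops at the closing quote.
function extractLinks($html)
{
    preg_match_all("/href=[\"'](.*?)[\"']/", $html, $matches);
    return $matches[1];
}

// Example: one double-quoted link and one single-quoted link.
$html = '<a href="/home">home</a> <a href=\'/music\'>music</a>';
$links = extractLinks($html); // the spider would then recurse into each
```

The real script also normalizes variants like `href =` and `HREF=` before matching, which is why a single lowercase pattern is enough.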

The current iteration of the script spiders a "url-beautified" site and generates a very basic sitemap suitable for submission to Google or Yahoo, and the current code looks like this:
#!/usr/bin/env php
<?php
function getRealPath($baseDir, $link)
{
	$newDir = '';
	$rise = 0;
	//count the number of ../ in the link
	for ($i = 0; $i < strlen($link) - 2; $i++) {
		if ($link[$i] == "." && $link[$i + 1] == "." && $link[$i + 2] == "/") {
			$rise++;
		}
	}
	$folders = explode("/", $baseDir);
	//print_r($folders);
	for ($i = 0; $i < count($folders) - $rise; $i++) {
		$newDir .= $folders[$i] . "/";
	}
	$link = str_replace("../", "", $link);
	$link = str_replace(" ", "%20", $link);
	#replace double forward slashes
	//$link = str_replace("//", "/", $link);
	return $newDir . $link;
}

function getLinks($file, $parent = '')
{
	$dirname = '';
	global $brokenLinks;
	global $startTime;
	global $pageList;
	global $brokenLinksString;
	$text = @file_get_contents($file); //get the text from the file
	if (trim($text) == "") { //the file can't be opened
		$brokenLinksString .= "$parent\n\t$file\n\n";
		echo "--Missing File--\n\t$file\n";
		return 1;
	}
	echo "$file\n";
	$text = str_replace(array("href =", "HREF =", "href= ", "HREF= ", "HREF="), "href=", $text);
	preg_match_all("/href=[\"'](.*?)[\"']/", $text, $link_results);
	foreach ($link_results[1] as $link) {
		$link_is_absolute = false;
		//clean up ampersands in the link
		$link = str_replace("&amp;", "&", $link);
		if (stristr($link, "mailto")) {
			continue;
		}
		$processedLink = strtolower($link);
		//find out if the link is external before we figure out the realLink
		if (stristr($processedLink, "http:") || stristr($processedLink, "https:")) {
			if (strpos($processedLink, BaseHREF) === 0) {
				//this link is an absolute local path
				$link_is_absolute = true;
			} else {
				//this link is external
				continue;
			}
		}
		//ignore all extensions
		if (strstr($processedLink, ".") || strstr($processedLink, "mailto:")) {
			continue;
		}
		//if the file is absolute local we don't need to discover the abspath
		if ($link_is_absolute) {
			$absLink = $processedLink;
		} else {
			$absLink = getRealPath(BaseHREF . "$dirname", $processedLink);
		}
		if (!in_array($absLink, $pageList)) {
			array_push($pageList, $absLink);
			getLinks($absLink, $file);
		}
	}
	return 1;
}

function xmlEncode($string)
{
	$newString = str_replace(
		array("&", "'", "\"", ">", "<"),
		array("&amp;", "&apos;", "&quot;", "&gt;", "&lt;"),
		$string
	);
	return $newString;
}

/*begin the script*/
#make some global variables
$pageList = array();
$brokenLinksString = "";
$brokenLinks = array();
$urlString = "";
define("BaseHREF", $argv[1]);
getLinks(BaseHREF); //find the links in the file
//make the sitemap
foreach ($pageList as $filePath) {
	//echo $filePath."\n";
	//create the url for our Google Site Map
	$subPath = explode("/", $filePath);
	$folderDepth = count($subPath) - 1;
	$subPath[$folderDepth] = xmlEncode($subPath[$folderDepth]);
	$encodedPath = implode("/", $subPath);
	$loc = $filePath;
	$urlString .= "\t<url>\n";
	$urlString .= "\t\t<loc>$loc</loc>\n";
	if (stristr($filePath, "home.")) {
		$priority = 1 - ($folderDepth * 0.1);
	} else {
		$priority = 0.8 - ($folderDepth * 0.1);
	}
	//$priority = (stristr($filePath, "home.")) ? 0.9 : 0.5;
	$urlString .= "\t\t<priority>$priority</priority>\n";
	$urlString .= "\t</url>\n";
}
$siteMapText = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n";
$siteMapText .= "<urlset";
$siteMapText .= ' xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemaps/0.9 http://www.sitemaps.org/schemas/sitemaps/sitemap.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">';
$siteMapText .= "\n";
$siteMapText .= $urlString;
$siteMapText .= "</urlset>";
$fhandle = fopen("sitemap.xml", "w");
fwrite($fhandle, $siteMapText);
fclose($fhandle);
?>
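The trickiest part of the script is getRealPath, which resolves a relative link by counting its ../ segments and peeling that many folders off the base path. A standalone re-implementation (same behavior, condensed with array_slice) makes it easy to check by hand:

```php
<?php
// Standalone sketch of the script's getRealPath: resolve a relative
// link against a base path by counting "../" segments and dropping
// one trailing folder from the base per segment.
function getRealPath($baseDir, $link)
{
    $rise = 0;
    // count the number of ../ in the link
    for ($i = 0; $i < strlen($link) - 2; $i++) {
        if (substr($link, $i, 3) === "../") {
            $rise++;
        }
    }
    // keep all but the last $rise folders of the base path
    $folders = explode("/", $baseDir);
    $newDir = implode("/", array_slice($folders, 0, count($folders) - $rise)) . "/";
    // strip the ../ markers and escape spaces
    $link = str_replace("../", "", $link);
    $link = str_replace(" ", "%20", $link);
    return $newDir . $link;
}
```

For example, a link of "../contact" found on http://www.jezra.net/projects resolves to http://www.jezra.net/contact, while a plain "hubcap" on the same page stays under /projects/.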

As the script runs, it prints the paths of the files being spidered to standard out, along with a warning about any missing web pages. When I run the script against my jezra.net site, the output is:
http://www.jezra.net
http://www.jezra.net/home
http://www.jezra.net/projects
http://www.jezra.net/music
http://www.jezra.net/contact
http://www.jezra.net/projects/hubcap
http://www.jezra.net/projects/serial_switch
http://www.jezra.net/projects/svggraph
http://www.jezra.net/projects/vplayer

Good: no missing files. The sitemap generated by the script is as follows:
<?xml version="1.0" encoding="UTF-8" ?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	<url>
		<loc>http://www.jezra.net/home</loc>
		<priority>0.5</priority>
	</url>
	<url>
		<loc>http://www.jezra.net/projects</loc>
		<priority>0.5</priority>
	</url>
	<url>
		<loc>http://www.jezra.net/music</loc>
		<priority>0.5</priority>
	</url>
	<url>
		<loc>http://www.jezra.net/contact</loc>
		<priority>0.5</priority>
	</url>
	<url>
		<loc>http://www.jezra.net/projects/hubcap</loc>
		<priority>0.4</priority>
	</url>
	<url>
		<loc>http://www.jezra.net/projects/serial_switch</loc>
		<priority>0.4</priority>
	</url>
	<url>
		<loc>http://www.jezra.net/projects/svggraph</loc>
		<priority>0.4</priority>
	</url>
	<url>
		<loc>http://www.jezra.net/projects/vplayer</loc>
		<priority>0.4</priority>
	</url>
</urlset>
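Those priority values fall straight out of the folder depth: the script starts from 0.8 and drops 0.1 per "/"-separated segment (the "home." boost in the code never fires on this extensionless site). A sketch of just that arithmetic, using a hypothetical priority() helper:

```php
<?php
// Priority as the script computes it: 0.8 minus 0.1 per folder level,
// where the level is the number of "/"-separated segments minus one.
function priority($url)
{
    $folderDepth = count(explode("/", $url)) - 1;
    return 0.8 - ($folderDepth * 0.1);
}
```

So a top-level page like /home lands at 0.5 and a second-level page like /projects/hubcap at 0.4, matching the sitemap above.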

Hey, it gets the job done.
Tags: PHP