2009-01-30
A while ago, I needed to spider all of the pages in a dynamically generated website and do something with the data. Since the spider only needed to work with the one website, I was free to add as much site-specific code as I wanted. Since then, the script has been copied and adjusted to fit the needs of whatever site I happen to be working on at the time. In a nutshell, the script gets passed a web page address, downloads the page, uses a regex to find all of the links, parses the links, and then recurses through the links.
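Stripped down to its essence, the idea looks something like this (a bare-bones sketch for illustration only, not the real script; it skips the relative-path resolution and external-link filtering that the full version below handles):

#!/usr/bin/env php
<?php
//fetch a page, regex out the href values, and recurse into any unseen links
$seen = array();
function spider($url)
{
	global $seen;
	$html = @file_get_contents($url);
	if($html === false){ return; }
	preg_match_all('/href=["\'](.*?)["\']/', $html, $matches);
	foreach($matches[1] as $link)
	{
		if(!in_array($link,$seen))
		{
			array_push($seen,$link);
			//a real spider would resolve relative paths and skip external links here
			spider($link);
		}
	}
}
spider($argv[1]);
?>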
The current iteration of the script spiders a "url-beautified" site and generates a very basic sitemap suitable for submission to Google or Yahoo, and the current code looks like this:
#!/usr/bin/env php
<?php
function getRealPath($baseDir,$link)
{
	$newDir='';
	$rise=0;
	//count the number of ../ in the link
	for($i=0; $i<strlen($link)-2; $i++)
	{
		if($link[$i]=="." && $link[$i+1]=="." && $link[$i+2]=="/")
		{
			$rise++;
		}
	}
	//climb back up the directory tree, one folder per ../
	$folders = explode("/",$baseDir);
	for($i=0; $i<count($folders)-$rise; $i++)
	{
		$newDir .= $folders[$i]."/";
	}
	$link = str_replace("../","",$link);
	$link = str_replace(" ","%20",$link);
	return $newDir.$link;
}

function getLinks($file,$parent='')
{
	$dirname='';
	global $brokenLinks;
	global $pageList;
	global $brokenLinksString;
	$text = @file_get_contents($file); //get the text from the file
	if(trim($text)=="") //the file can't be opened
	{
		$brokenLinksString .= "$parent\n\t$file\n\n";
		echo "--Missing File--\n\t$file\n";
		return 1;
	}
	echo "$file\n";
	//normalize the href attributes, then regex out the link targets
	$text = str_replace(array("href =","HREF =","href= ","HREF= ","HREF="),"href=",$text);
	preg_match_all('/href=["\'](.*?)["\']/', $text, $link_results);
	foreach($link_results[1] as $link)
	{
		$link_is_absolute=false;
		//clean up encoded ampersands in the link
		$link = str_replace("&amp;","&",$link);
		if(stristr($link,"mailto"))
		{
			continue;
		}
		$processedLink = strtolower($link);
		//find out if the link is external before we figure out the real path
		if(stristr($processedLink,"http:") || stristr($processedLink,"https:"))
		{
			if(strpos($processedLink,BaseHREF)===0)
			{
				//this link is an absolute local path
				$link_is_absolute=true;
			}else{
				//this link is external
				continue;
			}
		}
		//ignore all links that contain a file extension
		if(strstr($processedLink,".") || strstr($processedLink,"mailto:"))
		{
			continue;
		}
		//if the link is already absolute we don't need to discover the abspath
		if($link_is_absolute)
		{
			$absLink = $processedLink;
		}else{
			$absLink = getRealPath(BaseHREF."$dirname",$processedLink);
		}
		if(!in_array($absLink,$pageList))
		{
			array_push($pageList,$absLink);
			getLinks($absLink,$file);
		}
	}
	return 1;
}

function xmlEncode($string)
{
	$newString = str_replace(array("&","'","\"",">","<"),array("&amp;","&apos;","&quot;","&gt;","&lt;"),$string);
	return $newString;
}

/*begin the script*/
#make some global variables
$pageList = array();
$brokenLinksString = "";
$brokenLinks = array();
define("BaseHREF",$argv[1]);
getLinks(BaseHREF); //spider the site, starting from the base URL
//make the sitemap
$urlString = "";
foreach($pageList as $filePath)
{
	//create the url entry for our sitemap
	$subPath = explode("/",$filePath);
	$folderDepth = count($subPath)-1;
	$subPath[$folderDepth] = xmlEncode($subPath[$folderDepth]);
	$loc = implode("/",$subPath);
	$urlString .= "\t<url>\n";
	$urlString .= "\t\t<loc>$loc</loc>\n";
	//deeper pages get a lower priority
	if(stristr($filePath,"home."))
	{
		$priority = 1-($folderDepth*0.1);
	}else{
		$priority = 0.8-($folderDepth*0.1);
	}
	$urlString .= "\t\t<priority>$priority</priority>\n";
	$urlString .= "\t</url>\n";
}
$siteMapText = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n";
$siteMapText .= '<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemaps/0.9 http://www.sitemaps.org/schemas/sitemaps/sitemap.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">';
$siteMapText .= "\n";
$siteMapText .= $urlString;
$siteMapText .= "</urlset>";
$fhandle = fopen("sitemap.xml","w");
fwrite($fhandle,$siteMapText);
fclose($fhandle);
?>
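To run it, pass the base URL of the site as the first argument. Assuming the script is saved as spider.php (the filename is arbitrary), the invocation would be something like:

php spider.php http://www.jezra.net

or, since the script starts with a shebang line, mark it executable with chmod +x and call it directly.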
As the script runs, it prints the path of each page being spidered to standard out, along with a warning for any missing pages. When I run the script against my jezra.net site, the output is:
http://www.jezra.net
http://www.jezra.net/home
http://www.jezra.net/projects
http://www.jezra.net/music
http://www.jezra.net/contact
http://www.jezra.net/projects/hubcap
http://www.jezra.net/projects/serial_switch
http://www.jezra.net/projects/svggraph
http://www.jezra.net/projects/vplayer
Good, no missing files. The sitemap generated by the script is as follows:
<?xml version="1.0" encoding="UTF-8" ?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	<url>
		<loc>http://www.jezra.net/home</loc>
		<priority>0.5</priority>
	</url>
	<url>
		<loc>http://www.jezra.net/projects</loc>
		<priority>0.5</priority>
	</url>
	<url>
		<loc>http://www.jezra.net/music</loc>
		<priority>0.5</priority>
	</url>
	<url>
		<loc>http://www.jezra.net/contact</loc>
		<priority>0.5</priority>
	</url>
	<url>
		<loc>http://www.jezra.net/projects/hubcap</loc>
		<priority>0.4</priority>
	</url>
	<url>
		<loc>http://www.jezra.net/projects/serial_switch</loc>
		<priority>0.4</priority>
	</url>
	<url>
		<loc>http://www.jezra.net/projects/svggraph</loc>
		<priority>0.4</priority>
	</url>
	<url>
		<loc>http://www.jezra.net/projects/vplayer</loc>
		<priority>0.4</priority>
	</url>
</urlset>
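Since hand-built XML is easy to break, a quick sanity check (my own habit here, not part of the script above) is to run the output through PHP's SimpleXML parser, which returns false when the file fails to parse:

<?php
//check that the generated sitemap is at least well-formed XML
$xml = @simplexml_load_file("sitemap.xml");
echo ($xml === false) ? "sitemap.xml is broken\n" : "sitemap.xml parses fine\n";
?>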
Hey, it gets the job done.