2009-01-30
A while ago, I needed to spider all of the pages in a dynamically generated website and do something with the data. Since the spider only needed to work with the one website, I was free to add as much site-specific code as I wanted. Since then, the script has been copied and adjusted to fit the needs of whatever site I happen to be working on at the time. In a nutshell, the script gets passed a web page address, downloads the page, uses a regex to find all of the links, parses the links, and then recurses through the links.
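Stripped down to its essence, the idea looks something like this (a bare-bones sketch for illustration only, not the real script; it skips the relative-path resolution and external-link filtering that the full version below handles):

#!/usr/bin/env php
<?php
//fetch a page, regex out the href values, and recurse into any unseen links
$seen = array();
function spider($url)
{
	global $seen;
	$html = @file_get_contents($url);
	if($html === false){ return; }
	preg_match_all('/href=["\'](.*?)["\']/', $html, $matches);
	foreach($matches[1] as $link)
	{
		if(!in_array($link,$seen))
		{
			array_push($seen,$link);
			//a real spider would resolve relative paths and skip external links here
			spider($link);
		}
	}
}
spider($argv[1]);
?>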
The current iteration of the script spiders a "url-beautified" site and generates a very basic sitemap suitable for submission to Google or Yahoo, and the current code looks like this:
#!/usr/bin/env php
<?php
function getRealPath($baseDir,$link)
{
	$newDir='';
	$rise=0;
	//count the number of ../ in the link
	for($i=0; $i<strlen($link)-2; $i++)
	{
		if($link[$i]=="." && $link[$i+1]=="." && $link[$i+2]=="/")
		{
			$rise++;
		}
	}
	//climb back up the directory tree, one folder per ../
	$folders = explode("/",$baseDir);
	for($i=0; $i<count($folders)-$rise; $i++)
	{
		$newDir .= $folders[$i]."/";
	}
	$link = str_replace("../","",$link);
	$link = str_replace(" ","%20",$link);
	return $newDir.$link;
}

function getLinks($file,$parent='')
{
	$dirname='';
	global $brokenLinks;
	global $pageList;
	global $brokenLinksString;
	$text = @file_get_contents($file); //get the text from the file
	if(trim($text)=="") //the file can't be opened
	{
		$brokenLinksString .= "$parent\n\t$file\n\n";
		echo "--Missing File--\n\t$file\n";
		return 1;
	}
	echo "$file\n";
	//normalize the href attributes, then regex out the link targets
	$text = str_replace(array("href =","HREF =","href= ","HREF= ","HREF="),"href=",$text);
	preg_match_all('/href=["\'](.*?)["\']/', $text, $link_results);
	foreach($link_results[1] as $link)
	{
		$link_is_absolute=false;
		//clean up encoded ampersands in the link
		$link = str_replace("&amp;","&",$link);
		if(stristr($link,"mailto"))
		{
			continue;
		}
		$processedLink = strtolower($link);
		//find out if the link is external before we figure out the real path
		if(stristr($processedLink,"http:") || stristr($processedLink,"https:"))
		{
			if(strpos($processedLink,BaseHREF)===0)
			{
				//this link is an absolute local path
				$link_is_absolute=true;
			}else{
				//this link is external
				continue;
			}
		}
		//ignore all links that contain a file extension
		if(strstr($processedLink,".") || strstr($processedLink,"mailto:"))
		{
			continue;
		}
		//if the link is already absolute we don't need to discover the abspath
		if($link_is_absolute)
		{
			$absLink = $processedLink;
		}else{
			$absLink = getRealPath(BaseHREF."$dirname",$processedLink);
		}
		if(!in_array($absLink,$pageList))
		{
			array_push($pageList,$absLink);
			getLinks($absLink,$file);
		}
	}
	return 1;
}

function xmlEncode($string)
{
	$newString = str_replace(array("&","'","\"",">","<"),array("&amp;","&apos;","&quot;","&gt;","&lt;"),$string);
	return $newString;
}

/*begin the script*/
#make some global variables
$pageList = array();
$brokenLinksString = "";
$brokenLinks = array();
define("BaseHREF",$argv[1]);
getLinks(BaseHREF); //spider the site, starting from the base URL
//make the sitemap
$urlString = "";
foreach($pageList as $filePath)
{
	//create the url entry for our sitemap
	$subPath = explode("/",$filePath);
	$folderDepth = count($subPath)-1;
	$subPath[$folderDepth] = xmlEncode($subPath[$folderDepth]);
	$loc = implode("/",$subPath);
	$urlString .= "\t<url>\n";
	$urlString .= "\t\t<loc>$loc</loc>\n";
	//deeper pages get a lower priority
	if(stristr($filePath,"home."))
	{
		$priority = 1-($folderDepth*0.1);
	}else{
		$priority = 0.8-($folderDepth*0.1);
	}
	$urlString .= "\t\t<priority>$priority</priority>\n";
	$urlString .= "\t</url>\n";
}
$siteMapText = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>\n";
$siteMapText .= '<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemaps/0.9 http://www.sitemaps.org/schemas/sitemaps/sitemap.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">';
$siteMapText .= "\n";
$siteMapText .= $urlString;
$siteMapText .= "</urlset>";
$fhandle = fopen("sitemap.xml","w");
fwrite($fhandle,$siteMapText);
fclose($fhandle);
?>
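To run it, pass the base URL of the site as the first argument. Assuming the script is saved as spider.php (the filename is arbitrary), the invocation would be something like:

php spider.php http://www.jezra.net

or, since the script starts with a shebang line, mark it executable with chmod +x and call it directly.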
As the script runs, it prints the path of each page being spidered to standard out, along with a warning for any missing pages. When I run the script against my jezra.net site, the output is:
http://www.jezra.net
http://www.jezra.net/home
http://www.jezra.net/projects
http://www.jezra.net/music
http://www.jezra.net/contact
http://www.jezra.net/projects/hubcap
http://www.jezra.net/projects/serial_switch
http://www.jezra.net/projects/svggraph
http://www.jezra.net/projects/vplayer
Good, no missing files. The sitemap generated by the script is as follows:
<?xml version="1.0" encoding="UTF-8" ?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	<url>
		<loc>http://www.jezra.net/home</loc>
		<priority>0.5</priority>
	</url>
	<url>
		<loc>http://www.jezra.net/projects</loc>
		<priority>0.5</priority>
	</url>
	<url>
		<loc>http://www.jezra.net/music</loc>
		<priority>0.5</priority>
	</url>
	<url>
		<loc>http://www.jezra.net/contact</loc>
		<priority>0.5</priority>
	</url>
	<url>
		<loc>http://www.jezra.net/projects/hubcap</loc>
		<priority>0.4</priority>
	</url>
	<url>
		<loc>http://www.jezra.net/projects/serial_switch</loc>
		<priority>0.4</priority>
	</url>
	<url>
		<loc>http://www.jezra.net/projects/svggraph</loc>
		<priority>0.4</priority>
	</url>
	<url>
		<loc>http://www.jezra.net/projects/vplayer</loc>
		<priority>0.4</priority>
	</url>
</urlset>
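Since hand-built XML is easy to break, a quick sanity check (my own habit here, not part of the script above) is to run the output through PHP's SimpleXML parser, which returns false when the file fails to parse:

<?php
//check that the generated sitemap is at least well-formed XML
$xml = @simplexml_load_file("sitemap.xml");
echo ($xml === false) ? "sitemap.xml is broken\n" : "sitemap.xml parses fine\n";
?>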
Hey, it gets the job done.