© 1999 Michael J. Hammel

Cataloging and Clipping - gathering online data (continued...)

The Template File

After you've got NewsClipper installed (there are a lot of prerequisites, but if you follow the order in the README backwards - from bottom to top - you'll have them installed in no time), you're ready to give it a whirl.  Included in the Open Source version is a template file.  This file contains samples of NewsClipper acquisition, general, and output filters.  You can actually try it out without modifying it.  If you have problems, you can comment out the NewsClipper commands for all but one filter by changing lines like this

<!-- newsclipper
to
<!-- Xnewsclipper
The format of a template file is fairly simple:
...HTML formatting ....
<!-- newsclipper
<input filter=name params=xxx>
<filter name=yyy>
<output name=zzz>
-->
...more HTML formatting...
The first line is just the tag telling NewsClipper that the following filters should be processed.  If you change the name from newsclipper to Xnewsclipper, NewsClipper will ignore the block and simply copy the entire comment to the HTML output file.  Since it's wrapped in a comment, the three filter lines will be ignored by the browser.

The input line tells NewsClipper which acquisition filter to run and what parameters to pass to it.  Acquisition filters can take as many parameters as they want, but in practice most of the ones provided with the distribution use few, if any.

The filter line lets you specify a processor for the output from the acquisition filter.  Acquisition filters will return data as a string, an array, or a hash.  There are stock filters for converting hashes to strings or arrays, and examples of using these are given in the template file.  Although it seemed at first that this was where I'd be making modifications or writing new filters, it turned out that I only needed the stock general handlers here.  Most of my changes happened in the acquisition handlers.
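
For instance, the NewsClipper block from the c.o.l.a. example later in this article names an acquisition filter and then hands its output to the stock map and hash2string general filters (the format string is trimmed here for illustration; the full version appears below):

<!-- newsclipper
  <input name=colagm department=sorted>
  <filter name=map filter=hash2string format='<li>%{url}</li>'>
  <output name=array numcols=1 prefix='' suffix=''>
-->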

Acquisition Handlers

Nearly all of the acquisition handlers provided in the distribution make use of a few subroutines provided in the NewsClipper Perl modules:  GetURL, GetLinks, GetText, GetHtml, and so forth.  These subroutines make it very easy to grab an entire page or just portions of it.  GetURL will grab the entire page and return it as a string.  GetText and GetHtml do similar things, but filter out parts of the page.  GetLinks is used to retrieve just the HREF links on the page.
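
To give a feel for how these are used, here is a bare-bones sketch of an acquisition handler's Get routine.  The URL and delimiter strings are invented for illustration; GetLinks is the same routine the colagm handler imports later in this article, and a real handler would also tag its return value with a NewsClipper type as that example does.

# Minimal sketch only - not a complete handler.
use NewsClipper::AcquisitionFunctions qw( &GetLinks );

sub Get
{
   my $self = shift;
   my $attributes = shift;

   # Grab just the <a href=...> links that appear between two landmark
   # strings on the page.  Returns a reference to an array of links,
   # or undef if the page couldn't be fetched.
   my $data = &GetLinks("http://www.example.com/news.html",
                        "Recent stories", "Older stories");
   return undef unless defined $data;

   return $data;
}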

Once the page, or portions of it, has been retrieved, the acquisition filters can do a little more processing on the data.  Some of the handlers will break the data into hashes, Perl's name/value pair structures.  When this is returned to NewsClipper, it can be passed to the map general filter, which then passes each hash to the hash2string filter for HTML formatting.  I'll show how this works in an example in a moment.

General Handlers

Once the acquisition handler completes, it returns data to NewsClipper, which then passes it to any general filters.  In practice I've found the best use of general filters is in formatting the HTML output using the hashes returned from the acquisition filters.  But they can be used for just about anything you can think of when processing the strings, arrays, and hashes that acquisition filters pass to you.

Output Handlers

These handlers, like general handlers, seem best suited for formatting.  One of the most common uses is to place arrays into multicolumn lists.  In practice, I used these only to output the formatted HTML from my general handler as a string.  In this way I was assured that no further processing was performed on my already formatted HTML by another output handler.
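
The output line I ended up using is the one you'll see in the template below:

   <output name=array numcols=1 prefix='' suffix=''>

As I read it, the array handler with a single column and empty prefix and suffix simply emits each already-formatted row one after the other, which is exactly what I wanted.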

Adding a database - mSQL

All handlers are stored in whatever directories (there can be more than one) are listed in the NewsClipper.cfg configuration file, which by default is installed as $HOME/.NewsClipper/NewsClipper.cfg.  Whenever NewsClipper automatically downloads an updated filter, it will place it in the first directory specified by the handlerlocations variable.  When I began to modify existing acquisition filters I copied them to similarly named files - such as from cola.pm to colagm.pm - in the same directory where the original resided.

Once I had a copy of the original, I added the following line to add mSQL access from within the file:

use Msql;
This was added right after the "use strict;" line.  Using the Msql.pm Perl module's interface, I could then access the database from the acquisition script.  Note:  it's important that the person or script that runs the NewsClipper.pl script, which in turn runs the acquisition filter and accesses the mSQL database, has read and probably write access to the mSQL databases.
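
The basic pattern for talking to mSQL from Perl is short.  This sketch uses only the Msql.pm calls that appear in the handler shown below (connect, selectdb, query, numrows and errmsg); the table name and query are just placeholders:

use Msql;

# Connect to the local mSQL server and select a database.
my $dbh = Msql->connect();
$dbh->selectdb('gm-news');

# Run a query and see how many rows came back.
my $sth = $dbh->query("SELECT title FROM accepted WHERE title = 'foo'");
if ( $sth->numrows > 0 )
{
   # ...the title is already in the table...
}

# Msql->errmsg holds the text of the last error, if any.
print "mSQL error: ", Msql->errmsg, "\n" if length(Msql->errmsg) > 0;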

The next step was to modify the acquisition handler to parse the data (if it wasn't already) for adding to the database.  The best way to explain this is to show an example.

Example - grabbing and logging submissions to comp.os.linux.announce

First, take a look at the template file for grabbing recent submissions to comp.os.linux.announce:

<!--
   #####################################
   comp.os.linux.announce
   #####################################
-->
<p>
<table width=100% border=0 cellpadding=2 cellspacing=0 NOSAVE>
<tr>
<td colspan=2 ALIGN=LEFT bgcolor="#00f000"><font size=4><i>c.o.l.a</i></font></td>
<td ALIGN=LEFT bgcolor="#00f000">
   <font size=2>Select All:
      &nbsp;&nbsp;Yes<input type=radio value=yes name=gmcolaall>
      &nbsp;&nbsp;No<input type=radio value=no name=gmcolaall CHECKED>
   </font>
   </td>
</tr>
<tr>
<td bgcolor="#00f000" ALIGN=CENTER VALIGN=TOP>Keep</td>
<td bgcolor="#00f000" ALIGN=CENTER VALIGN=TOP>Drop</td>
<td bgcolor="#00f000" ALIGN=LEFT VALIGN=TOP>Title</td>
</tr>
<!-- newsclipper
  <input name=colagm department=sorted>
   <filter name=map filter=hash2string format='
      <tr>
      <td ALIGN=CENTER VALIGN=MIDDLE>
            <input type=radio value=keep name=gmcola%{index}></td>
      <td ALIGN=CENTER VALIGN=MIDDLE>
            <input type=radio value=drop name=gmcola%{index} CHECKED></td>
      <td ALIGN=LEFT VALIGN=TOP><font size=1>%{url}</font></td>
      </tr>
      '>
   <output name=array numcols=1 prefix='' suffix=''>
-->
</table>
</p>
Here you can see the NewsClipper commands are embedded within a table.  For each hash returned by the colagm acquisition filter, the hash2string general handler is called.  It formats some HTML and fills in any variables, shown as %{var}, with the value of the hash entry of that name.  So the variable %{url} gets replaced with the value of the hash entry named url.
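
To make that concrete, each hash the handler returns looks roughly like this (see the push @results call later in the handler), so %{index} becomes the article's number and %{url} becomes the link itself:

   {
      index => 3,
      url   => '<a href="http://...">Some announcement title</a>'
   }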

The colagm.pm acquisition handler looks like this (my modifications are shown in red):
 

# -*- mode: Perl; -*-

# AUTHOR: John Goerzen
# EMAIL: jgoerzen@complete.org
# ONE LINE DESCRIPTION: Latest messages from comp.os.linux.announce
# URL: https://www.cs.helsinki.fi/~mjrauhal/linux/cola.archive/*.html
# TAG SYNTAX:
# <input name=colagm department=X>
#   Returns an array of links
# X: One of:
#     last50   - search cola-last-50.html for entries.
#     sorted   - search cola-sorted.html for entries.
#     www      - search cola-www.html for entries.
# LICENSE: GPL
# NOTES:

package NewsClipper::Handler::Acquisition::colagm;

use strict;
use Msql;

use NewsClipper::Handler;
use NewsClipper::Types;
use vars qw( @ISA $VERSION );
@ISA = qw(NewsClipper::Handler);

# DEBUG for this package is the same as the main.
use constant DEBUG => main::DEBUG;

use NewsClipper::AcquisitionFunctions qw( &GetLinks );

$VERSION = 0.3;

# ------------------------------------------------------------------------------
sub gettime
{
   my $sec;
   my $min;
   my $hour;
   my $mday;
   my $mon;
   my $year;
   my $wday;
   my $yday;
   my $isdist;
   my $datestring;

   ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdist) = localtime(time);

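   # Fold the date into a single YYYYMMDD-style integer for the database
   # (note that $mon from localtime is zero-based, so January is month 00).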
   $datestring = $year + 1900;
   $datestring *= 10000;
   $datestring += $mon*100 + $mday;

   return $datestring;
}

# This function is used to get the raw data from the URL.
sub Get
{
   my $self = shift;
   my $attributes = shift;
   my $url;
   my $data;
   my $colafile;
   my $start_delimiter;
   my $end_delimiter;
   my $urllink;
   my @results;
   my $tempRef;
   my $query_line;
   my @query_lines;
   my $sth;
   my $newcount;
   my $dbdate;
   my $title;

   $attributes->{department} = "last50"
      unless defined $attributes->{department};

   #
   # Determine which file to search.
   #
   if ( "$attributes->{department}" eq "last50" )
   {
      $colafile = "cola-last-50.html";
      $start_delimiter = "newest ones first";
      $end_delimiter = "Last modified";
   }
   elsif ( "$attributes->{department}" eq "sorted" )
   {
      $colafile = "cola-sorted.html";
      $start_delimiter = "order by the subject";
      $end_delimiter = "Last modified";
   }
   else
   {
      $colafile = "cola-www.html";
      $start_delimiter = "the last one first";
      $end_delimiter = "Last modified";
   }

   #
   # Build the URL which is to be queried.
   #
   $url = join("",
      "https://www.cs.helsinki.fi/~mjrauhal/linux/cola.archive/",
      $colafile);

   #
   # Now run off and get those links!
   #
   $data = &GetLinks($url, $start_delimiter, $end_delimiter);

   return undef unless defined $data;

   #
   # Weed out any User Group messages
   #
   @$data = grep {!/(LOCAL:)/} @$data;

   # Open the Msql connections and select the databases of interest.
   my $dbh1 = Msql->connect();
   $dbh1->selectdb('gm-news');

   # Clear the "new article" table - if we haven't processed those
   # entries yet, then we'll see them again anyway.
   $query_line = join("", "DELETE FROM new_cola");
   $sth = $dbh1->query($query_line);
   $newcount = 1;

   #
   # Now run through the list to find only the new ones.  Then add these
   # to the proper database.
   #
   while (@{$data})
   {
      $_= shift @{$data};

      # Escape single quotes.  We take them out later when we display them, if
      # necessary.
      $title = $_;
      $title =~ s/'/\\'/g;

      # Query the Accepted table for this article name.
      $query_line =
         join("", "SELECT title FROM accepted WHERE title = '", $title, "'");

      $sth = $dbh1->query($query_line);
      if ( $sth->numrows > 0 )
      {
         next;
      }

      # Query the Rejected table for this article name.
      $query_line =
         join("", "SELECT title FROM rejected WHERE title = '", $title, "'");

      $sth = $dbh1->query($query_line);
      if ( $sth->numrows > 0 )
      {
         next;
      }

      # Article has not been seen previously.  Add it to the new database.
      $dbdate = gettime();
      $query_lines[0] = "INSERT INTO new_cola VALUES (";
      $query_lines[1] = $newcount;
      $query_lines[2] = ", ";
      $query_lines[3] = $dbdate;
      $query_lines[4] = ", '";
      $query_lines[5] = $title;
      $query_lines[6] = "', '";
      $query_lines[7] = "cola";
      $query_lines[8] = "', '";
      $query_lines[9] = " ";
      $query_lines[10] = "', '";
      $query_lines[11] = " ";
      $query_lines[12] = "', '";
      $query_lines[13] = " ";
      $query_lines[14] = "')";

      $query_line = join('', @query_lines);
      $sth = $dbh1->query($query_line);

      # Abort on errors encountered while inserting into the new article table.
      if ( length(Msql->errmsg) > 0 )
      {
         print "<!--News Clipper message:\n",
            "Failed new article db update for colaGM handler\n",
            "Error message: ", Msql->errmsg, "\n",
            "-->\n" and return undef;
      }

      # Everything went ok, and it's a new article.  Save it for return to the
      # caller.
      push @results,
      {
         index    => $newcount,
         url      => $_
      };

      $newcount++;
   }

   $tempRef = \@results;
   MakeSubtype('ArrayOfCOLAHash','ArrayOfHash');

   bless $tempRef,'ArrayOfCOLAHash';
   return $tempRef;
}

# ------------------------------------------------------------------------------

sub GetDefaultHandlers
{
  my $self = shift;
  my $inputAttributes = shift;

  my @returnVal = (
     {'name' => 'limit','number' => '10'},
     {'name' => 'array'},
  );

  return @returnVal;
}

1;

One thing I didn't show here was that I added some new parameters to the handler so that you can grab different c.o.l.a. archives.  You can grab a copy of the source if you want to try it out yourself, but you'll need to create the proper databases too.

The gettime subroutine is just something I added to format a date stamp for the database.  There may be easier ways to do this - I'm not the world's best Perl programmer.

The rest of the changes, from the middle to the end of the Get subroutine, are used to parse the returned links and check whether they already exist in one of two tables.  If not, the new link is added to a third table and added to the outgoing hash.  It may look a little complex if you're not familiar with SQL syntax, but really there isn't much to this.  In this case, the value returned from GetLinks() is just a list of links, which makes processing the site's data pretty easy.  In other cases I had to break apart the HTML line by line, searching for key words, then stripping out extra tags and HTML to get at the text and/or links of interest.  The thing is, since this is all done in Perl, and Perl is great for parsing text, none of it was really too difficult.
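
For the sites where I had to dig links out of raw HTML myself, the work boils down to ordinary Perl pattern matching.  The snippet below is only a rough sketch of that kind of parsing - the keyword and markup are invented, and it assumes the page text has already been fetched into $html (with GetURL, for instance):

# Rough sketch only; the keyword ("Headlines") and markup are made up.
my @results;
foreach my $line ( split /\n/, $html )
{
   # Only look at lines mentioning the section we care about.
   next unless $line =~ /Headlines/;

   # Pull out the target and text of the first link on the line...
   if ( $line =~ m{<a\s+href="([^"]+)"[^>]*>(.*?)</a>}i )
   {
      my ($link, $text) = ($1, $2);

      # ...and strip any stray tags left inside the link text.
      $text =~ s/<[^>]+>//g;

      push @results, { url => "<a href=\"$link\">$text</a>" };
   }
}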

What all this gets me is a table of new entries from the c.o.l.a. archives, from which I select the articles of interest.  The template file produces an HTML page with a form which gets submitted to a CGI script that moves new articles into either the accepted or rejected table.  Later, I can write scripts for producing web pages with the accepted or rejected entries.  And any future runs of NewsClipper on this template file will only produce new entries from c.o.l.a!  Pretty nifty.
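
I haven't shown the CGI script here, but its database half is only a query or two per article.  Something along these lines would do it - the column layout (other than title, which the handler itself queries) is hypothetical:

use Msql;

my $dbh = Msql->connect();
$dbh->selectdb('gm-news');

# $title is the (quote-escaped) article title from the form; $keep is the
# value of its Keep/Drop radio button.  The single-column INSERT here is
# just a sketch - adjust it to match however you define the tables.
my $table = $keep eq 'keep' ? 'accepted' : 'rejected';
$dbh->query("INSERT INTO $table (title) VALUES ('$title')");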

Writing your own acquisition filter - MakeHandler.pl

There is a Perl script included with the NewsClipper distribution called MakeHandler.pl that assists you in writing your own acquisition handlers from scratch.  Although I've heard it's quite useful and very easy to use, I've never used it myself.  All of the sites I was interested in (at least so far) have handlers written for them already, so I just had to modify them to work with my mSQL database.

Supported Linux sites

The default handlers include support for downloading Slashdot, Freshmeat, LinuxToday, Linux Daily News, and c.o.l.a, along with many, many others.  Interestingly enough, the Slashdot handler grabs the main page instead of the backend pages, apparently because the author was worried that the backend was not kept up to date.

Although I didn't see one, I suspect using NewsClipper to download the Freshmeat database file would also be possible.  Since it's a simple text file with a common format for each entry it should be pretty easy to parse.

Caveats

One of the limitations of NewsClipper appears to be that it doesn't like sites that don't have closing tags in all the right places.  For example, did you know that paragraph tags, <p>, have closing tags, </p>?  Without them, it's possible for NewsClipper's acquisition filters to get confused.  You might be able to process these sites using GetURL() and parsing the pages manually, but you'll be happier if you can just find sites that do the right thing.  Interestingly enough, I've discovered that my own pages here in the Muse are not right - Netscape's Composer doesn't add that closing </p> tag.

And who said writing this column wouldn't be educational?
 
© 1999 by Michael J. Hammel