#!/usr/local/bin/perl # getcomics 1.10 12/22/97 16:43:46 # This is a script to download my comics daily. Can be put into a # cron job easily since it takes no args. # # Written by Mark Hershberger # Added to by John F. Stoffel # Documentation added by Mark Hershberger =head1 NAME getcomics - retrieve a list of comics from the internet and create an HTML page for them. =head1 SYNOPSIS getcomics =head1 DESCRIPTION To use B, you need a C<$HOME/.comicsrc> file. B will read this file and fetch the comics listed. The format of the file is as follows: options Name: name URL: url Refer: refer Link: link options =head2 Properly Formating the C<.comicsrc> File When specifying a comic, three elements are necesary. They are C, C, and C. The fourth item (C) is availible if you do not wish to link to the URL given in Refer. =over 4 =item name: The name of the comic. Name: Dilbert =item url: Where to find the comic. This can be a pure URL, or a regular expression (regexp[L]), or an index. If it is just a URL, then B goes straight to that URL and gets the comic. In this case, it sets the HTTP-Referrer field to whatever is in the C field. Currently, only URLs beginning with "http:" are supported. URL: http://www.uokhsc.edu/~blundmar/today.gif If it is a regular expression, then B first grabs the page given in the C field and looks at the URLs for the images. If it finds one that matches the regular expression, it will set the HTTP-Referrer field to the page given in the C field and fetch the image. URL: dt[0-9]+.*\.gif If it is an index, then B first grabs the page given in the C field and then just counts the images till it gets to the one given in the index. Once it reaches that image, it will set the HTTP-Referrer field to the page given in the C field and fetch the image. Indexes must begin with the pound (#) sign. URL: #2 =item refer: This field is used to "fool" the remote HTTP server. Some will only serve up a page if the request has a HTTP-Referrer field saying that you came from a certain page. It is also used when for instances where the information given in the URL field is not a a pure URL. If the C field is not supplied, then the C field is the link for the source of the comic in the produced HTML page. Refer: http://www.unitedmedia.com/comics/dilbert/ =item link: (Optional) When creating a page, B puts the title linked to the source before each comic. If the value given in the C field is not appropriate, then you can add a C field. Link: http://www.startext.net/today/news/life/laughter/ =item creds: (Optional) Some comics may need a username and password combination to access them. The C option allows you to specify your username and password. The user name and password are given in the form "username/password". (See C for a more general solution.) I Credentials remain set until you reset them. For example, in the following example, both Goofy and Mikey have the same creds. Creds: user/password Name: Goofy Link: http://www.startext.net/today/news/life/laughter/ URL: #2 Name: Mikey Link: http://www.startext.net/today/news/life/laughter/ URL: #2 =back =head2 Authenticating in Realms Sometimes a password is needed to access a comic strip. Generally, more than one comic can be retrieved for a given username/password pair. For any website that uses username/password authentication. =over 4 =item realm: The realm command first gives a quoted realm name, a comma, and then a username/password pair. In determining which password to use, b first looks to see if it has a password for the realm. If it does not and the c parameter has been used, then the c username/password combonation is used. If no c parameter has been given and there is no matching realm, then the comic cannot be retrieved. realm: "HCI Registration Wall", user/pass =back =head2 Cookie Authentication If a site requires certain cookies to access the comics (e.g. a username and password), then b will check for a Netscape cookie file and use the cookies there. For this to work, you must have previously visited the site and had Netscape save cookies from that site. =head2 Variable Information for the C<.comicsrc> File b does minimal variable substitution on each line of the C<.comicsrc> file. If it finds any of the following "variables," it will substitute proper information. =over 4 =item $year4digit The current year in a YYYY format. Example: 1997 =item $year2digit The current year in a YY format. Example: 97 =item $month2digit The current month in DD format. Example: 03 =item $month3alpha The three letter abreviation for the current month. Example: Mar =item $day2digit The current day in 2 digit format. Example: 28 =item $day3alpha The current day of the week. Example: Tue =back =head2 Options =over 4 =item background: The image used as the background. (Default: white.jpeg) Background: white.gif =item cookiejar: The file to be used for a storing cookies. (Default: $ENV{HOME}/.netscape/cookies) CookieJar: ~/.comiccookies =item debug: Boolean directive. Display debugging info. (Default: false) Debug: true =item footer: The text (html code okay) to place at the end of the page. (Default: none) Footer: A footer here =item ident: The identity to give to the Server as our name. (Default: ComixBot) Ident: Mozilla/3.01 (X11; I; SunOS 5.5 sun4m) =item imageDir: The absolute pathname to the directory to put the comic images in. (Default: C<$HOME/html/images>) ImageDir: /WWW/Mydir/images =item imageURL: The path to use in the constructed URL (usually relative). (Default: C) ImageUrl: comics =item linkImage: Boolean directive. Link the image of the comic to it\'s source. This is handy if, for whatever reason, you need to download the image. (Default: false) LinkImage: true =item new: The image placed next to the comic to designate that it has been updated since the comics were last downloaded. (Default: new.jpeg) New: new.gif =item newOnly: Boolean directive. Display only the new comics. Useful for displaying comics that are updated weekly or sporadically. NewOnly: true =item noError: Boolean directive. Don\'t display comics with errors. Usually, if an error occurs while getting a comic, the error is displayed instead of the comic. This directive allows you to get rid of error messages. NoError: true =item page: The path to the html page that will hold the comics. (Default: C<$HOME/html/comics.html>) Page: /WWW/MyDir/comics.html =item quiet: Boolean directive. Stops all messages except for errors. (Default: false) Quiet: true =item toc: Boolean directive. Display a hyperlinked Table of Contents at the beginning of the comics page. (Default: true) TOC: true =back =head1 SEE ALSO An excellent source of online comics is http://studentweb.tulane.edu/~jseifert/comics/ You must have the following perl modules installed to use this script: LWP MD5 These can be found on CPAN (http://www.perl.com/perl/CPAN/). =head1 AUTHORS Mark Hershberger wrote the inital version of this script. He then gave it to several people, one of whom was John F. Stoffel . John made several nifty changes (the C<.comicsrc> file was one) and gave Mark some ideas for further devlopment. Mark did some more work on it, and this is the result. All comments, criticisms, and suggestions should be sent to Mark Hershberger . =cut require 5.002; use strict; use HTML::Parse; use HTML::Element; use HTTP::Request; require HTTP::Date; use LWP::UserAgent; use URI::URL; require POSIX; require MD5; # Comics RC file my $comicsrc = "$ENV{HOME}/.comicsrc"; ############################################################################ # Configurable variables # Over-ride these in .comicsrc ############################################################################ # Where we want the page written to (full path needed). my $page = "$ENV{HOME}/html/comics.html"; # Where we physically store the images (full path needed). my $imagedir = "$ENV{HOME}/html/images"; # Where we refer to them in storage (relative url to $page above). my $imageurl = "images"; # Customized footer at bottom of page. Perfect for back links, etc. my $footer; # Want a Table of Contents? my $hastoc=1; # Want links on images? (makes downloading in Lynx easy) my $linkimage=0; # Are we debuging? 0 means no. my $debug=0; # Should we keep quiet? 0 mean no. my $quiet=0; ############################################################################ # don't touch below here ############################################################################ # Idea of RequestAgent is cut-n-paste from lwp-request # # We make our own specialization of LWP::UserAgent that asks for # user/password if document is protected. { package RequestAgent; use vars qw(@ISA); @ISA = qw(LWP::UserAgent); my @comic_creds = (undef, undef); my %comic_realm; sub new { my $self = LWP::UserAgent::new(@_); $self; } sub register_realm { my $self = shift; my $realm = shift; my $pass = shift; $comic_realm{$realm} = $pass; } # A hack ... all my very own. sub set_password { my $self = shift; @comic_creds = split('/', $_[0], 2); } sub get_basic_credentials { my $self = shift; my $realm = $_[0]; if(defined $comic_realm{$realm}) { return split('/', $comic_realm{$realm}, 2); } else { return @comic_creds; } } } # initialize a global $HTML::Parse::IGNORE_TEXT = "true"; my $i; # counter my $rc; # the response code. my $thiscomic; # the url of the current comic. my $ua = new RequestAgent; # the object that actually fetches our info my $req; # an http request my $parsed; # the parsed html my $linkelem; # a link object my $count; # we use this to keep count of stuff # where cookies are stored. my $cookiefile = "$ENV{HOME}/.netscape/cookies"; my $cookiejar; # object for cookies my $base_url; # the base url for a particular page my $version = "0.5"; # Version of this code my $VERSION = $version; my @Comics; # array of comic records my $comic; # comic record my $title; # title of comic my $url; # url of comic my $refer; # referer of comic my $link; # link to comic's page (optional my $creds; # username/password combo my $imagename; # name of image file, based on title my $today; # today's date for title my $background="white.jpeg"; # name of pic for background my $new="new.jpeg"; # name of pic to mark new comics my $newonly; # display only new comics? my $isnew; # status of this comic. my $noerror; # display errors? my @page; # where we store the page temporarily my @toc; # where we store the toc temporarily # What we call ourself my $name = "ComixBot/$VERSION"; my(@ltime, $year4digit, $year2digit, $month2digit, $month3alpha, $day2digit, $day3alpha, $lasttime, $filehash, $context, $newhash, $yesterday); # Get today's date @ltime = localtime(); $today = POSIX::strftime('%A %B %d, %Y', @ltime); $yesterday = POSIX::strftime('%A.%B.%d.%Y', localtime(time - 24*3600)); $year4digit = POSIX::strftime('%Y', @ltime); $year2digit = POSIX::strftime('%y', @ltime); $month2digit = POSIX::strftime('%m', @ltime); $month3alpha = POSIX::strftime('%b', @ltime); $day2digit = POSIX::strftime('%d', @ltime); $day3alpha = POSIX::strftime('%a', @ltime); # Read in the list of comics to get $comicsrc = glob($comicsrc); open(RC, "<$comicsrc") || die "Error: Can't read $comicsrc: $!\n"; while () { # ignore comments next if /^\s*#/; # insert Date Info s:\$year4digit:$year4digit:g; s:\$year2digit:$year2digit:g; s:\$month2digit:$month2digit:g; s:\$month3alpha:$month3alpha:g; s:\$day2digit:$day2digit:g; s:\$day3alpha:$day3alpha:g; if (/^Name:\s+(.+)$/i) { $title = $1; next; } # get title if (/^Url:\s+(.+)$/i) { $url = $1; next; } # get URL if (/^Refer:\s+(.+)$/i) { $refer = $1; next; } # Get referer if (/^Link:\s+(.+)$/i) { $link = $1; next; } # Get link if (/^Creds:\s+(.+)$/i) { $creds = $1; next; } # Get creds if (/^Realm:\s+\"(.+)\",\s*(.*)$/i) { $ua->register_realm($1, $2); } if (/^Ident:\s+(.+)$/i) { $name = $1; next; } # hide our identity if (/^Page:\s+(.+)$/i) { $page = $1; next; } # get the path for the page if (/^Background:\s+(.+)$/i) # background image { $background = $1; next; } if (/^CookieJar:\s+(.+)$/i) # cookie file { $cookiefile = $1; next; } if (/^New:\s+(.+)$/i) { $new = $1; next; } # "new" image if (/^Footer:\s+(.+)$/i) # the footer { $footer = $1; next; } if (/^ImageDir:\s+(.+)$/i) # the image directory { $imagedir = $1; next; } if (/^ImageUrl:\s+(.+)$/i) # the image url { $imageurl = $1; next; } if (/^Quiet:\s+(.+)$/i) { # Are we gonna be quiet? if($1=~/true/i){ $quiet = 1; } else { $quiet = 0; } next; } if (/^Debug:\s+(.+)$/i) { # Debugging? if($1=~/true/i){ $debug = 1; } else { $debug = 0; } next; } if (/^Toc:\s+(.+)$/i) { # Want a table of contents? if($1=~/true/i){ $hastoc = 1; } else { $hastoc = 0; } next; } if (/^LinkImage:\s+(.+)$/i) { # link on the image to download? if($1=~/true/i){ $linkimage = 1; } else { $linkimage = 0; } next; } if (/^NewOnly:\s+(.+)$/i) { # Display only new comics? if($1=~/true/i){ $newonly = 1; } else { $newonly = 0; } next; } if (/^NoError:\s+(.+)$/i) { # Display errors? if($1=~/true/i){ $noerror = 1; } else { $noerror = 0; } next; } } continue { # This was in the above section inside an if(blankline) block. # I've moved it here because it didn't work for the last comic # if no blankline followed. if ($title and $url and ($refer or $link)) { # do we already have a title? $comic = { 'TITLE' => $title, # yup, we got a record. 'URL' => $url, 'REFER' => $refer, 'LINK' => $link, 'CREDS' => $creds, }; } if((/^\s*$/ or /^\s*#.*$/) and $comic){ push(@Comics,$comic); # add record to array # Clean start for next one. $title = undef; $url = undef; $refer = undef; $link = undef; $comic = undef; } } close(RC); # Do some globbing... $page = glob($page); $imagedir = glob($imagedir); $cookiefile = glob($cookiefile); warn "Checking for cookies in $cookiefile: " if $debug; # Cookies implemented. if (-f $cookiefile) { warn "found\n" if $debug; use HTTP::Cookies; $cookiejar = HTTP::Cookies::Netscape->new(File => "$cookiefile", AutoSave => 0, ); $cookiejar->load(); } else { warn "not found\n" if $debug; } # check the $imagedir exists, create if not if (!-d $imagedir) { `mkdir $imagedir`; print "Created: $imagedir\n" unless $quiet; } # Print out the records if we're debugging if ($debug) { for $comic (@Comics) { warn "Comic->title: $comic->{'TITLE'}\n"; warn "Comic->url: $comic->{'URL'}\n"; warn "Comic->creds: $comic->{'CREDS'}\n"; warn "Comic->refer: $comic->{'REFER'}\n"; warn "Comic->link: $comic->{'LINK'}\n\n"; } } # Turn on Autoflush. $| = 1; # Geronimo! TOPLOOP: foreach $comic (@Comics){ # Tell 'em what we're getting print "Getting ", $comic->{'TITLE'}, "... " unless $quiet; # set the creds (if we don't have any, than undef is given) warn "Setting password: $comic->{'CREDS'}" if $debug; $ua->set_password($comic->{'CREDS'}); # Is it just a plain ole url? if($comic->{'URL'} =~ /^http:/i){ # if it is, use it. $thiscomic = $comic->{'URL'}; } else { # otherwise, find the url # First, get the refering document $rc = $ua->request(new HTTP::Request 'GET', $comic->{'REFER'}); $count = 0; while( ($rc->code() != 200) && $count < 10){ print "." unless $quiet; $rc = $ua->request(new HTTP::Request 'GET', $comic->{'REFER'}); $count = $count + 1; } if($count == 10 && !$rc->is_success){ print "Can't Connect\n" unless $quiet; next TOPLOOP; } # Parse out the document $parsed = parse_html($rc->content()); # Get the base url $base_url = $rc->base; # Find the img tags $count = 0; for(@{ $parsed->extract_links(qw(img)) }) { ($link, $linkelem) = @$_; # if this is a relative position tag... if($comic->{'URL'} =~ /^#/){ # increase the counter $count = $count + 1; # see if we are where we want to be: if($count == substr($comic->{'URL'}, 1)){ # save the location $thiscomic = url($link)->abs($base_url)->as_string; } } elsif($link =~ /$comic->{'URL'}/){ $thiscomic = url($link)->abs($base_url)->as_string; } } $parsed->delete; } # the filename is now based on the title of the comic. $imagename = $comic->{'TITLE'}; $imagename =~ s/[ .-]/_/g; # change to underscores $imagename =~ s/__+/_/g; # get rid of multiples $imagename =~ s/_+$//g; # trash any underscores at the end. # the set up $req = new HTTP::Request 'GET', $thiscomic; $req->push_header('Referer', $comic->{'REFER'}); if(defined($cookiejar)){ $cookiejar->add_cookie_header($req); } # Go get it, bucko! $rc = $ua->request($req); # Retry until we get it if we get a User Agent Timeout. while ($rc->code == 408){ print "." unless $quiet; $rc = $ua->request($req); } # Tell them what the result was print $rc->code() , " " , $rc->message , "\n" unless $quiet; # Bail out if we got an error if($rc->code() != 200 and $noerror){ warn "Got an error -- bailing\n" if $debug; next TOPLOOP; } # if we worked, then (200 = Ok) if($rc->code() == 200){ # See if it already exists if(-e "$imagedir/$imagename"){ warn "Old comic exists\n" if $debug; # Get the MD5 fingerprint of the file. open(IMAGE, "<$imagedir/$imagename") || die "Can't open $imagedir/$imagename\n"; $context = new MD5; $context->MD5::reset(); $context->MD5::add(); $filehash = $context->hexdigest(); close(IMAGE); # Get the MD5 fingerprint of the new picture open(TEMP, ">/tmp/getcomics.$$") || die "Can't open tempfile\n"; print TEMP $rc->content(); close(TEMP); open(TEMP, "MD5::reset(); $context->MD5::add(); $newhash = $context->hexdigest(); close(TEMP); unlink "/tmp/getcomics.$$"; # See if the old info is different from the new. if( $newhash ne $filehash ) { # Save it if it is different open(IMAGE, ">$imagedir/$imagename") || die "Can't open $imagedir/$imagename\n"; print IMAGE $rc->content(); close(IMAGE); warn "OverWrote: $imagedir/$imagename\n" if $debug; $isnew = 1; } else { warn "Not a new comic\n" if $debug; $isnew = 0; } } else { # Save it if it didn't previously exist. open(IMAGE, ">$imagedir/$imagename"); print IMAGE $rc->content(); close(IMAGE); warn "Wrote new comic: $imagedir/$imagename\n" if $debug; $isnew = 1; } } if(!$newonly or ($isnew and $newonly)){ # Print our divider push @page, "
\n"; if($isnew and !$newonly){ # Show "New" in the heading push @page, " \"New!\"\n"; } # put it in the TOC push @toc, "{TITLE}\">$comic->{TITLE}
"; # Show the title & link push @page, " {TITLE}\" "; if($comic->{'LINK'} eq ''){ push @page, "href=\"$comic->{REFER}\">"; } else { push @page, "href=\"$comic->{LINK}\">"; } push @page, "$comic->{TITLE}

\n"; # if it was found, if($rc->code() == 200) { # Put the image in. if($linkimage){ push @page, " \n"; } else { push @page, " \n"; } } else { push @page, "

Error: ", $rc->code(), " ", $rc->message, "

\n"; } } # if $newonly... } # TOPLOOP open(PAGE, ">$page"); # Print out the header. print PAGE <<"EOH"; Today\'s Comics - $today

Comics for Today - $today

EOH # Print Table of Contents, if needed. print PAGE join("\n",@toc) if $hastoc; # Print out the page print PAGE join(" ",@page); # Print the footer print PAGE < $footer EOF # bye-bye. close(PAGE);