
sitescooper Change Log:
-----------------------

(Please note: versions marked DEVEL have not been released yet!)

2.2.8  May 18 2000 jm 

Robb Canfield: fixes to the Table module. A temporary debug file, t.txt,
was being written, and a bug in force_headers caused all rows after the
header count reached -1 to become HEADER rows. Stefan Schwingeler provided
ntfaq.site,
new de_spiegel.site, and betanews.site. Marko Bozikovic <redbyron /at/
fly.srk.fer.hr> provided calvin_and_hobbes.site. Derek Glidden <dglidden
/at/ illusionary.com>: fixed crypto-gram.site, hotair_features.site,
i_cringley.site, kernel_traffic.site, sendmail.net.site; added
science_daily.site, spaceref.site. Alastair Rankine provided
daemonnews.site for some much-needed BSD support ;)  Redirected links are
now rewritten in the output file.  Mac support fixed, thanks to Vincent
Oberson's beta-testing skills.  Fixed obscure bug with StoryURL not being
inherited from a Layout, demonstrated by the BBC news sites. Sites with
all levels specified with the same URL pattern now work as expected.
Wrote exhaustive test for redirection handling, redirect.t, so those
problems should clear up quite well. NY Times fixed. URLProcess bug fixed
(nothing was being run). A fine selection of sites from Justin Henry:
updated salon.site; gist_tv.site; cats_cradle.site; clark_howard.site;
morbid_fact_du_jour.site; news_observer.site; ny_times_handheld.site;
roger_ebert.site; usa_today.site; weather24.site, and wral_tv.site.
Finally took the plunge and created an "odd" category for vaguely
weird/fortean/morbid sites ;)


2.2.7  Apr  7 2000 jm 

Added -set parameter, and cleaned up command-line-driven scooping in
general.  Fixed bug in -html etc.; they now print out the path to the
output file before exiting. "using already-set password for ..." warning
now only output once per realm. Took a good look at the final install
steps, and rewrote to be more logical and output the correct path to the
.prc file.  Added -disc argument, to disconnect after scooping;
experimental. Removed File::Spec, as it wasn't available on the OS it was
included for (MacOS)! Added BASE HREF tag support.  LeMonde_AutoMoto.site,
sia_fr.site from Michael Memeteau.


2.2.6  Mar 24 2000 jm 

sitescooper initialization fix from Alexander Skwar <askwar /at/
digitalprojects.com>; config file was being "copied" from wrong location
if sitescooper is installed in non-default loc. Makefile bug
(/usr/share/sitescooper/sitescooper) noted by Alexander Skwar and Richard
Cohen, thanks guys ;)  TableRender added... see weather/wu_redmond.site
for an example. Not sure if this should be the default weather rendering
or not... -name arg added. Now searches for modules and site_samples using
File::Spec to support MacOS. Robb Canfield <robbc /at/ canfield.com>:
donated Exten::Table.pm, a *COOL* module which allows tables to be
reformatted for better display on a Palm.


2.2.5  Mar 22 2000 jm 

KPilot support changed to not require that the pilotListener proc be
running at the same time for KPilot to be used; csmonitor.site &
iol_africa_south.site added -- all thx to Andy Rabagliati.  Bug in
image-scaling with multipage output fixed. jerusalem_post.site added thx
to Richard Cohen.  Default size limit upped to 300k. "Cannot find
SitesDir" rpm/deb bug fixed!  No longer upgrades cache directories as this
should be well obsolete by now.  Fixed bugs in reading configuration
files.


2.2.4	Mar 14 2000 jm 

Robb Canfield <robbc /at/ canfield.com>: fix for bug where config was not
being processed at all, and a fix for diff-checking on windows, which did
an unnecessary check when the diff module was used. Thx!  Now doesn't edit
site_choices.txt if the -site or -sites args are used.  Tracked down a bug
which caused sitescooper to not read ANY sites unless the SitesDir was set
and available.  Added code from Kevin L. Dupree <kdupree /at/ flash.net>
to support scooping image-only sites: see the new sites in the "comics"
section to check this out.  Now includes subs-to-site.pl, don't know how I
forgot that. michael d. ivey patch: use $ENV{'VISUAL'} and /usr/bin/editor
for Debian std compliance. Updated site files for NY Times from Kennis
Koldewyn. Washtech sites from <MMiller /at/ media-general.com>. KPilot
support thanks to Andy Rabagliati <andyr /at/ wizzy.com>.


2.2.3	Mar  6 2000 jm 

Quick fix; rpms were being built incorrectly.


2.2.2	Mar  6 2000 jm 

.deb files now available, thanks to michael d. ivey -- thanks!  Alastair
Rankine pointed out that Win32 perl cannot seem to run a command when the
command binary is named "in quotes" (argh). Jacques Turb: new site for Le
Soir (regional_belgium).  Distros now include PDA::Pilot::Install, renamed
to PDA::PilotInstall to (hopefully) fit better in CPAN.  Most of the
sitescooper main logic moved to Scoop.pm.


2.2.1	Feb 21 2000 jm

Oops -- forgot 2 site files from Richard Cohen for The Times (UK).  Now
included. Also updated site files: Le Monde (full) thanks to Jacques
Turb, and an update for Le Journal du Net and a new one for Multimédium
by Philippe Renard. Carsten Clasohm updated de_tvspielfilm.site and
provided a new freshmeat site with improved contents page.  Changed
NY_Times sites with a fix suggested by Marc Vaillant <marc /at/ jhu.edu>.
Use of PDA::Pilot::Install module to install PRC files now the default on
Windows and UNIX. Renamed all scripts back to .pl, as the .plx extension
seems to have other meanings on Win32 with the ActiveState perl - bad!


2.2.0	Feb 16 2000 jm

Now using the modularised, parallelised code. Use "-parallel" switch to
enable parallel mode.  Configuration now separated entirely from the
script.  Added "palm_boulevard.site" and tip for finding more AvantGo
sites in the writing_site.html documentation. Added meta-tag refresh check
to ensure that long-term refreshes such as ArsTechnica's "999;
http://www.arstechnica.com/" are not followed. "UseTableSmarts: 0" was not
working. Added Car and Truck News, Gabriel's Mobile Channel, msex, rhat in
stocks cat, wu_new_mexico in weather cat, all thanks to Joe Pfeiffer.
Added AuthorName, AuthorEmail headers for future compatibility.  New site:
le_temps.site from Vince. Fix for (no text to write) when the text started
with a quote char, thx to Dave Collins.  Added MAC_ARGS variable so Mac
users can use command-line arguments. -maxlinks, -maxstories parameters
added.  Now does not expire cache dirs every time script is run. LWP,
HTML-Parser and URI now bundled with distribution.  Added all sites in
regional_hk hierarchy, thanks to Albert K T Hui.  Fixed doc to reflect
external config file throughout. .pl files changed to .plx, more
up-to-date with current Perl practice.
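
The meta-tag refresh check mentioned in this entry can be sketched roughly
as follows. This is an illustration in Python, not sitescooper's own Perl
code; the function names and the 60-second cut-off are assumptions made
for the example, since the log does not state the real threshold.

```python
import re

# Hypothetical cut-off: refreshes slower than this count as "long-term"
# and are ignored. The actual value used by sitescooper is not given here.
MAX_REFRESH_SECONDS = 60

# Matches "999; http://host/" as well as "0; URL=http://host/".
REFRESH = re.compile(
    r"\s*(\d+)\s*(?:[;,]\s*(?:url\s*=\s*)?(.*))?$",
    re.IGNORECASE | re.DOTALL,
)


def parse_refresh(content):
    """Split a refresh 'content' value into (delay_seconds, url_or_None)."""
    match = REFRESH.match(content)
    if not match:
        return None
    delay = int(match.group(1))
    url = (match.group(2) or "").strip().strip("'\"") or None
    return delay, url


def should_follow_refresh(content, max_seconds=MAX_REFRESH_SECONDS):
    """Follow only refreshes that name a URL and fire reasonably soon."""
    parsed = parse_refresh(content)
    if parsed is None:
        return False
    delay, url = parsed
    return url is not None and delay <= max_seconds
```

With these assumptions, the ArsTechnica value quoted above parses to a
999-second delay and is not followed, while a "0; URL=..." redirect is.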


2.1.2   Jan 26 2000 jm 

If a redirect happens when a story URL is requested, turn-over links now
use that redirected URL as a base. Richreader support had a bug in the
command line, fixed, thx to Kevin Olson.  Added site file for The
Guardian, thx to Jason Yanowitz <yanowitz /at/ poboxes.com>.  Test suite
written -- and site freshness logic test added, resulting in this finally
getting sorted out to my satisfaction! ;) Palmgear.site updated to use
"palmsized" version of their site. Ditto for LinuxToday.  Bug fixed in
named-anchor links -- anchors can contain weird chars now, such as spaces
or slashes.  Several new sites added, courtesy of PliNk!
(http://plink.cjb.net/ ).


2.1.1   Jan 17 2000 jm

New sites added from a nifty AltaVista search.  ny_times sites moved to
regional_new_york, as it was unreasonable keeping them in "news" while the
European news sites were kept in "regional_*" categories. Added
linux_magazine.site after they printed an article by Larry Wall ;) Added
http-equiv meta tag refresh support (urgh).  Fixed a few sites with y2k
errors -- embedded 199x dates in URLs or patterns. Added "-sites"
argument. Big change to table-smarts code, now correctly pairs tags at the
expense of a little speed. Diff is now not required; if it's not
available, the new page will be used instead. Problem with 3- or higher
level sites not being scooped due to "HTML has not changed" error now
fixed. New section "politics2000" added to Salon site, thx to rkrichbaum.
Bug with "-refresh -fromcache" fixed, bugfix from mlapsley.


2.1.0	Dec 22 1999 jm 

Preliminary work on image support. Fixed a bug where the conversion script
was not being run in 2.0.2.  Also spaces in sync filename were not
protected against when using iSiloC32.exe on Windows.


2.0.2	Dec 13 1999 jm

Added parameters Level{n}AddURL, IssueAddURL, ContentsAddURL, and
StoryAddURL: throw in a StoryAddURL line and it'll always pick up that
story (unless the text hasn't changed). No longer expires contents of
TextSaveDir and PilotInstallDir, as it has a tendency to cause nasty
accidents if the user accidentally specifies an incorrect directory.
Automatic detection of UNIX pilot desktop software was taking place too
early and causing trouble; fixed. Now strips empty TD,TR,TABLEs etc.


2.0.1	Dec  7 1999 jm

"$HOME" now supported in configuration and site files. Fixed change entry
for 2.0.0 regarding de_ sites; some of them had been obsoleted by a later
one. Added builtin support for JPilot, gnome-pilot, and PilotManager on
UNIX platforms. RPMs are now generated for sitescooper. de_heise_tp.site
updated. Some new weblog sites added (honeyguide, dsl.org). Bugfix for
iSilo with 3-level multi-page sites on Windows machines, which works
around a bug in the iSiloW32 converter. ExceptionURL support added;
ExceptionURL is like LayoutURL, but it takes priority over both LayoutURL
and the normal site file rules. This way you can define bits of a site
that uses different layouts, caching rules etc. by matching pages' URLs
against the ExceptionURL regular expression.
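
One way to picture the ExceptionURL priority described above is as layered
overrides. The sketch below is Python for illustration only, not the Perl
implementation; the function name, the dictionary keys and the idea of
merging rule sets are all assumptions.

```python
import re


def effective_rules(url, site_rules, layouts=(), exceptions=()):
    """Return the rules for url, letting exceptions win over layouts.

    layouts and exceptions are sequences of (regex, overrides) pairs,
    standing in for LayoutURL and ExceptionURL blocks respectively.
    """
    rules = dict(site_rules)              # start from the normal site rules
    for pattern, overrides in layouts:    # first matching layout applies
        if re.search(pattern, url):
            rules.update(overrides)
            break
    for pattern, overrides in exceptions: # applied last, so it wins
        if re.search(pattern, url):
            rules.update(overrides)
            break
    return rules
```

For example, a page whose URL matches both a layout and an exception
pattern ends up with the exception's values wherever the two disagree.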

Randy Krichbaum provided some new sites: The Toronto Globe and Mail site
just changed its format so the old site file doesn't work so well anymore.
These modified files (Globe_and_Mail_National.site and
Globe_and_Mail_Business.site) scoop the appropriate sections.  The
Wired_Top_Stories_PP.site file scoops the special Wired page specifically
formatted for the Palm Pilot [now included as wired_news_top_stories to
replace the other one].  The Salon_Magazine.site file scoops from the
current day's issue of Salon Magazine rather than the archives.


2.0.0	Nov 25 1999 jm 

"hh" and "mm" now supported in -prctitle and -filename. Added
de_alpenwetter.site and de_br_news.site from Carsten Clasohm; and
de_tvtoday.site, and changes to de_zdnet.site and de_computerwoche.site
from Stefan. New Mac support fixes, including truncation and hashing of
cache filenames to fit in 32 characters, and builtin diff support using
MJD's diff implementation.  lemondecomplet.site, nouvelobs.site,
libe_portrait_du_jour.site, libe_rebonds.site, libe_q.site,
journaldunet_dossiers.site, echos_infos.site, journaldunet.site all
submitted by Jacques Turb.  StoryHTML{Header,Footer} added courtesy of
Jason Simpson. Problem with links not being linkified in 3-level sites
fixed. Converted documentation and README into HTML files. Login cookies
for subscription sites now supported, using the "-admin import-cookies"
flag. Layouts implemented and used for Wired News. Now under GPL.  Added
support for 'site_choices' file to ease installation and use for
first-timers. Version now in 3 bits, major, minor and patchlevel. Tags now
closed at end of each story.  Now checks HTTP Content-Type and limits
large Content-lengths.
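
The truncation and hashing of cache filenames mentioned in this entry
might look something like the following. It is a Python sketch under
stated assumptions, not the code sitescooper uses: the function name, the
choice of SHA-1 and the 8-digit suffix are made up for the example; only
the 32-character limit comes from the entry itself.

```python
import hashlib

MAX_NAME_LENGTH = 32  # limit quoted in the entry above
HASH_LENGTH = 8       # assumed number of hex digits kept from the digest


def short_cache_name(name, max_length=MAX_NAME_LENGTH):
    """Shorten name to max_length, keeping long names distinguishable."""
    if len(name) <= max_length:
        return name
    # Hash the full name so two long names sharing a prefix still differ.
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()[:HASH_LENGTH]
    keep = max_length - HASH_LENGTH - 1   # room for prefix + "_" + digest
    return name[:keep] + "_" + digest
```

Short names pass through unchanged, so existing cache entries with short
names would keep their filenames under this scheme.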


1.9	Oct 19 1999 jm

One-letter images with alt tags are now translated into their
corresponding letter (using the ALT tag). StoryHTMLPreProcess added.
Authentication logins that fail are cleared and the request retried, so
the user can enter the working username & password. Links to anchors
inside documents now work OK in iSilo/HTML/m-iSilo modes.  -noheaders,
-nofooters, -filename, -prctitle functionality added. Bug fixed in
UseTableSmarts code where some <td> tags were being lost.
Thanks to Kennis Koldewyn for NY Times sites, and Pierre-Yves Letournel
for the regional_francais sites.


1.8	Oct  7 1999 jm 

Top-of-story anchors for URLs are now after the printed URL, as it takes
up too much space when surfing through articles.  Bug fix for passwords,
found by KapKom. Plenty of site file fixes: mozillazine, palmstation,
cryptome, javaworld, developernation, and others. Now tries to keep down
size of the already_seen file by tracking the oldest links on the current
contents pages and removing any cached URLs older than these.  [<<] and
[>>] navigation links added to HTML output.  Also now has support for
[[MM]] etc. keywords in URL patterns, and now you can use [[MM-1]] etc.
to specify offsets from the current date to limit articles to stuff in the
last 3 months, etc. URLProcess inline perl support added as well, to allow
any degree of granularity for filtering which URLs are scooped, in
addition to the StoryURL and ContentsURL parameters.
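
The date keywords described above can be illustrated with a small
expansion routine. This is a Python sketch rather than sitescooper's Perl;
[[MM]] and [[DD]] appear in this log, but the [[YYYY]] token, the exact
offset syntax and the function name are assumptions for the example.

```python
import datetime
import re

# [[MM]], [[DD]] or (assumed) [[YYYY]], optionally followed by +N or -N.
TOKEN = re.compile(r"\[\[(YYYY|MM|DD)([+-]\d+)?\]\]")


def expand_date_tokens(pattern, today):
    """Replace date tokens in pattern using the date given as today."""

    def replace(match):
        name = match.group(1)
        offset = int(match.group(2) or 0)
        if name == "YYYY":
            return "%04d" % (today.year + offset)
        if name == "MM":
            # Wrap around the year: January minus one month is 12.
            return "%02d" % ((today.month - 1 + offset) % 12 + 1)
        # DD: shift by whole days, then take the day of the month.
        return "%02d" % (today + datetime.timedelta(days=offset)).day

    return TOKEN.sub(replace, pattern)
```

Note that each token is expanded independently in this sketch, so a
pattern combining a year token with [[MM-1]] would need extra care in
January.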


1.7	Sep 23 1999 jm

-refresh option changed -- now it inhibits diffing and the use of the
cache, entirely. Fixed -doc bug (it was using iSilo instead!)


1.6	Sep 22 1999 jm

A quick bugfix; a few bugs were found in the previous release. -misilo
had problems with 1-level sites and multiple sites overwriting each other;
the doco was wrong, stating that "-doc" was still the default;
"-keep-tmps" and "-stdout-to" added; -refresh now uses .prev cache files
to work well with diffed pages.


1.5	Sep 20 1999 jm 

iSilo output is now the default, unless the -doc (or other) command-line
switch is used.  Carsten Clasohm <cc /at/ clasohm.com> is back, with
another patch -- again, this should provide more reliable diffing
behaviour, this time by putting newlines *before* <p> and <td> tags.
Stefan Schwingeler provides de_ard_videotext, de_dwelle, de_teltarif,
de_gnn, de_yahoo, de_zdnet, and PalmGear site files.  [[MM]] and [[DD]]
tags in URLs were not being padded to two-character width correctly. Added
ExpireCacheAfter keyword to allow better cache control.  Added function to
ensure that the directory a binary is in is added to the command search
path; apparently that can be a problem on Win32 platforms.  Now does not
save index and diffed pages until the end of the run, so that in case
something crashes, you will not lose scooped articles.  multi-page output
now supported for iSilo. -html, -mhtml, -misilo output switches added.
Cleaned up diff preprocessing (again).  RSS support now built-in, with the
new "ContentsFormat: rss" keyword.  rss-to-site.pl script added, and more
RSS doco added to README.


1.4	Aug 31 1999 jm

This is a big change, but it's really lots of little changes ;)

A warning is now output if MakeDocW is used under Windows, and sitescooper
is installed in a directory containing spaces such as "Program Files" --
MakeDocW cannot handle it. Thanks to William Goosey for spotting this.
-pipe switch added. This allows any program to be run to process scooped
documents. Added bugfix for diffing pages with <a href> split over several
lines; thanks to Carsten Clasohm. Added timeouts for HTTP gets (on UNIX).
Changed download location for makedoc7.cpp and updated info on iSilo
licensing. Added "-admin" and "-quiet" args. Cleaned up POD documentation,
and added pre-made html and txt to the distribution.  Added [[magic]] date
tokens for URLs and URL patterns, which are replaced with the current
month or year in various formats.  Added zdnet_de.site and
yahoo_de_long.site from Stefan Schwingeler.  (we're up to 46 sites now!)
Altered <pre> handling to handle NTK, and hopefully other <pre> docs,
better. Added Yahoo! Top Stories, from Patrick Clochesy.  Further
modifications for Mac support from Andrew Fletcher, and a new distribution
for Mac users with Mac line endings.


1.3	Jul 20 1999 jm

Incorporated MacOS support from Andrew Fletcher, and RichReader support.
Implemented a size limit on scooped documents (200k by default).  Links
are now de-highlighted if they point to a document that hasn't been
scooped. Also several bugfixes. Now seems quite stable (which has got to
be good news ;).


1.2	Jul  3 1999 jm

Silly bug with DOC format output fixed; iSilo output tweaked to look
better with Internet Alchemy's <strong> tags, links to "printing version"
pages, and generally better rendering of HTML. cnet.site fixed. URL moved
from zap.to/snarfnews to zap.to/sitescooper. Lines now broken at </div>
for diffing. More reliable stray-tag cleanup code added.


1.1	Jun 28 1999 jm

Renamed to sitescooper to avoid confusion with Alec Muffett's SnarfNews
scripts! Also added iSilo support and a preliminary support for
downloading sites as a set of files instead of just one big file.
Quite a big change this one.


1.0	Jun  4 1999 jm

snarfnews now keeps track of which sites it has snarfed, avoiding snarfing 
the same site twice. Some minor warnings fixed. New sites added -- The
Register, Jaundiced Eye, Scripting News, Crypto-Gram, Slashdot (using 
RSS instead of index.html).


0.9	May 26 1999 jm

Now supports finding stories and contents through <FRAME> tags and <LINK>
XML tags, as well as <A HREF>s. Can read RDF files, following links and
keeping titles for those links. Basis for future CGI mode added. Now uses
LWP instead of the homebrew SimpleHTTP stuff -- which also means that
basic authentication to log into websites works. HTML rendering now
cleanest ever (honest). Command-line retrieval of a 2-level site
supported. -dumpprc, -longlines, -levels, -storyurl arguments added.


0.8	May 18 1999 jm

Cleaned up the HTML rendering a little with some regexps from Joe Pfeiffer
<pfeiffer /at/ cs.nmsu.edu> and some hacking of my own. Also avoided a
warning message regarding a my variable which didn't show up in my
versions of perl for some reason. Sites dir is now loaded properly if it's
not the default (~/sites or ./sites). Removed beta status, hopefully
this'll be pretty stable.


0.7	May  6 1999 jm

Bugfix: eval statement was not protecting its contents sufficiently,
causing snarfnews to barf with "Win32::TieRegistry not found" errors on
non-win32 platforms. Minor bugfix: the default headline patterns are not
complained about if they are not found.  Also added StoryPostProcess
keyword and scoping, and snarfnews now uses more Netscape-like HTTP GET
request headers.


0.6	Apr 24 1999 jm

Some minor bugs fixed, logging for debugging, '-nowrite' mode, errors in
config now flagged with file/line number, PointCast headlines supported by
default.


0.5	Apr 1999 jm

'sites' directory concept; first public release.


0.4	Apr 1999 jm

Bit more doco, file:/// urls, diffing.


0.3	Mar 1999 jm

Supports 3-level sites now.


0.2	Feb 1999 jm

Now supports 'one-big-page' sites.


0.1	Jan 1999 jm

Initial rev.


// vim:sw=2:tw=74:
