[PW] Archives
John Franklin
jfranklin at project-wombat.org
Fri Jan 19 20:02:16 PST 2007
Before I begin my long-winded reply (I'm a techie nerd at heart; I
can't resist answering a question like this in detail), I'd like to
mention: I've received some notes thanking me for the server
changeover. You're all welcome, but the real hero here is Mr. Newby,
our server administrator, who did all the heavy lifting. I will pass
on the thanks to him.
Sorry for the delay in answering this; I spent the day outside, and
it took some time to thaw out when I finally got home.
Mea culpa on the non-functional search boxes. I set them up when the
list was new, and they seemed to work at the time. They no longer do.
(There is no need to send me a reply quoting that sentence and saying
"duh". I get it, thank you.) Test early and test often, I suppose.
We have only two choices for searching because Project Wombat uses
Mailman. The old server (that is, the OLD old server, the one at a
certain academic institution which shall remain nameless) used
Listserv. Listserv has a built-in search function which allows some
nice boolean logic in its queries. Unfortunately, Listserv's manual
makes it pretty clear that their software is not designed for a list
like this one, where messages are basically archived for eternity.
Those of you who were members of the old list may remember that the
search engine was terribly slow unless you limited your search to a
particular time period; now you know why: Listserv expects a few
thousand archived messages at most to be retained for the long term,
and we had... a lot more than that. (The Open version of the current
list generates over a thousand messages every three months.
Considering that the old list ran for about a dozen years, although
not always at that clip, it's hardly surprising that Listserv was
overwhelmed.)
The current version of Mailman, on the other hand, has NO built-in
search function. In fact, it doesn't even store the archives in a
"real" database, which would allow me to write one. (According to
their development site, the next major release will solve both these
problems -- the latter implies the former, since writing a basic
search form for an SQL-driven system is trivial -- but the last time
I checked, it had a release date of "someday".) The options under the
current version are: allow some third party to build a search index
from the contents of your archives, or allow the standard search
engines to index your archives and use their results. Since we will
eventually upgrade to the next major release of Mailman, it doesn't
seem worthwhile to try to arrange for an indexing service when -- in
theory -- Google will do the dirty work for us, without adding any
administrative overhead.
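
To illustrate why a basic search is trivial once the messages live in
an SQL database, here is a minimal sketch in Python using the standard
sqlite3 module. The table name "messages", its columns, and the sample
rows are all hypothetical, made up for illustration; this is not
Mailman's schema.

```python
import sqlite3

def search_messages(conn, terms):
    """Return (subject, body) rows that contain every one of the terms."""
    if not terms:
        return []
    # One LIKE clause per term, ANDed together. The values are bound as
    # parameters, so user input is never pasted into the SQL string.
    # (A '%' or '_' inside a term still acts as a LIKE wildcard.)
    clause = " AND ".join("(subject || ' ' || body) LIKE ?" for _ in terms)
    params = [f"%{term}%" for term in terms]
    sql = f"SELECT subject, body FROM messages WHERE {clause} ORDER BY id"
    return conn.execute(sql, params).fetchall()

# Hypothetical in-memory archive, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, subject TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO messages (subject, body) VALUES (?, ?)",
    [
        ("Cartoon villains", "Boris and Natasha were the spies."),
        ("Poem source", "O Captain! my Captain! our fearful trip is done."),
        ("Russian names", "Boris is a common given name."),
    ],
)

for subject, _body in search_messages(conn, ["Boris", "Natasha"]):
    print(subject)
```

SQLite's LIKE ignores case for ASCII letters by default, which is
usually what you want for a keyword search. For an archive of any real
size a full-text index would likely be a better fit than LIKE, but
that is beyond this sketch.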
As you can find out by reading their documentation, Google has all
these nice options to narrow down your search. You can restrict your
search to a particular domain; you can restrict to a particular type
of document; you can exclude certain terms; etc. etc. etc.
Unfortunately, some of the possible settings break the search. Even
more unfortunately, they don't quite break it enough to be instantly
noticeable if your tests only range over a few pages, as did mine
back when I set up the forms. In this case, I set up the search form
to restrict the search to lists.project-wombat.org/pipermail/[list-
variant-name]/ because that made the search only return results from
that particular list, and then restricted it to documents of type
"txt" because all the individual messages were returned by the server
that way, which prevented any of the archive listings from popping up
because they contain a list of subject lines. Clever, no? (The idea,
that is, not the run-on sentence.)
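
For the curious, the kind of query described above can be sketched in
Python roughly as follows. It assumes the restrictions are expressed
with Google's "site:" and "filetype:" operators typed into the query
itself; the function names are hypothetical, and the actual search
form may pass these settings differently.

```python
from urllib.parse import urlencode

def build_query(terms, site=None, filetype=None):
    """Join the search terms with optional site: and filetype: operators."""
    parts = list(terms)
    if site:
        # Restrict results to one host, or to a path prefix on that host.
        parts.append(f"site:{site}")
    if filetype:
        # Restrict results to documents of a single type.
        parts.append(f"filetype:{filetype}")
    return " ".join(parts)

def build_search_url(terms, site=None, filetype=None):
    """Return a Google search URL for the combined query."""
    query = build_query(terms, site=site, filetype=filetype)
    return "https://www.google.com/search?" + urlencode({"q": query})

print(build_search_url(
    ["copyright"],
    site="lists.project-wombat.org/pipermail/project-wombat/",
    filetype="txt",
))
```

Keeping the query builder separate from the URL builder makes it easy
to try one restriction at a time when a search comes back empty.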
Unfortunately, I didn't notice two things. First off, although Google
will treat "html" (with no period) and ".html" (with a period) as
identical document types, it will not do the same for "txt" and
".txt". If you restrict your search to documents of type "txt" (with
no period), it turns up fewer results, and sometimes returns
nothing at all. (At least, that's how it behaves on our server.)
Since all the individual messages on the server are retrieved as
".txt" files, giving the wrong value (as I seem to have done on the
existing search form) stops the search from returning anything. I
really should have noticed this. (Maybe it used to work and no longer
does? I'd like to think I didn't knowingly set up a form that ALWAYS
failed...)
The other problem I didn't notice (and this one would have been
impossible to detect at the time) is that Google will still return
results if the "restrict to this site" option isn't functioning
properly. In actual fact, restricting to a single variant of the list
using the formula above makes the search return three results at most
when using the format in the search form. I don't know why this
should be -- Google seems to know perfectly well that the other
results (which show up if you just restrict to the server instead of
the whole path) are still there, but they drop out.
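
One possible workaround, sketched here in Python, is to restrict the
search to the server only and then throw away results from the other
lists afterwards by checking each result's path. This is only an
illustration of the idea, not something the current form does; the
function name is hypothetical, and it assumes the results are already
available as a plain list of URLs.

```python
from urllib.parse import urlsplit

def only_from_list(urls, list_name, host="lists.project-wombat.org"):
    """Keep URLs under /pipermail/<list_name>/ on the host, dropping repeats."""
    prefix = f"/pipermail/{list_name}/"
    kept = []
    seen = set()
    for url in urls:
        # Results are often displayed without a scheme; add one so that
        # urlsplit can separate the host from the path.
        parts = urlsplit(url if "://" in url else "http://" + url)
        key = (parts.hostname, parts.path)
        if parts.hostname == host and parts.path.startswith(prefix):
            if key not in seen:
                seen.add(key)
                kept.append(url)
    return kept

results = [
    "lists.project-wombat.org/pipermail/project-wombat/2006-March/001252.html",
    # Hypothetical second list name, for illustration only.
    "lists.project-wombat.org/pipermail/some-other-list/2006-March/000001.html",
    "http://lists.project-wombat.org/pipermail/project-wombat/2006-March/001252.html",
]
print(only_from_list(results, "project-wombat"))
```

This only removes exact repeats of the same page; the same message
archived under two different lists would still appear once per list
unless you filter down to a single list as shown.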
(I've also asked Mr. Newby, our administrator, whether it would be
possible for us to map a few new domains to the individual lists,
thus forcing Google to search only one list at a time.)
I will try to fix the search form to do the right thing, which will
mean lots of duplicate results (on the other hand, better duplicate
results than no results), but right now I can't. I just tried, and an
old permissions problem which had been fixed on the old server (the
NEW old server, this time -- the one we were using until a week ago)
is still broken on this installation, so I can't alter the template
files at all for the time being.
As for adding an archive link on the website, which was Mr. Dyson's
original request: it shall be done, but not right away. That website
has been overdue for a complete overhaul for a while now, but I
haven't had time to work on it, what with one thing and another. Look
at the bright side: the new server has slightly different paths for
some things, and the archive links on the list information pages at
least work, even if they're buried. If I had added an archive link to
the website, it might have broken when we moved servers.
-John Franklin
On Jan 19, 2007, at 2:25 PM, Dennis Lien wrote:
> At 12:43 PM 1/19/2007, you wrote:
>> Hello John,
>>
>> Would it be possible to put a direct link to the archives in the
>> navigation box of the P-W home page? I confess to a rickety memory for
>> where the archives are currently located. On the occasions when I wish
>> to consult them, I have a hard time remembering the logic of why the
>> link occurs on its present page but not on any other.
>>
>> Also, what are the chances of a key-word search option in the archives?
>> For example, Boris + Natasha or life + Captain for every time those
>> appear in the same context.
>>
>> I'm copying the list in case others might find such notions useful.
>>
>> Thanks,
>>
>> John Dyson
>
>
> Actually, the PW page at
>
> http://lists.project-wombat.org/mailman/listinfo.cgi/project-wombat
>
> *has* a "Search the Project Wombat Archives for" box -- but in my
> experience it's never worked. (Is it just me?) Any search I've put
> in comes back with (usually) nothing other than an error message or
> (rarely) at most one or two random entries for a word that should be
> appearing scores of times...
>
> A search for Dyson, for instance, gives me this reply:
>
> Your search - Dyson -filetype:txt - did not match any documents.
>
> and one for the word copyright produces only this single
> offering:
>
> [PW] Re: Ordeal by Cheque short story
> The author is "Wuther Crue"; the story appeared in a 1932 Vanity Fair
> issue; the author renewed the copyright in 1960 and 1988. ...
> lists.project-wombat.org/pipermail/project-wombat/2006-March/001252.html -
> 6k - Cached - Similar pages
>
>
> Dennis Lien / U of Minnesota Libraries // d-lien at umn.edu
>
>
> _______________________________________________
> Project Wombat
> list at project-wombat.org
> http://www.project-wombat.org/