[PW] Archives

John Franklin jfranklin at project-wombat.org
Fri Jan 19 20:02:16 PST 2007


Before I begin my long-winded reply (I'm a techie nerd at heart; I  
can't resist answering a question like this in detail), I'd like to  
mention: I've received some notes thanking me for the server  
changeover. You're all welcome, but the real hero here is Mr. Newby,  
our server administrator, who did all the heavy lifting. I will pass  
on the thanks to him.

Sorry for the delay in answering this; I spent the day outside, and  
it took some time to thaw out when I finally got home.

Mea culpa on the non-functional search boxes. I set them up when the  
list was new, and they seemed to work at the time. They no longer do.  
(There is no need to send me a reply quoting that sentence and saying  
"duh". I get it, thank you.) Test early and test often, I suppose.

We have only two choices on the subject because Project Wombat uses  
Mailman. The old server (that is, the OLD old server, the one at a  
certain academic institution which shall remain nameless) used  
Listserv. Listserv has a built-in search function which allows some  
nice boolean logic in its queries. Unfortunately, Listserv's manual  
makes it pretty clear that their software is not designed for a list  
like this one, where messages are basically archived for eternity.  
Those of you who were members of the old list may remember that the  
search engine was terribly slow unless you limited your search to a  
particular time period; now you know why: Listserv expects a few  
thousand archived messages at most to be retained for the long term,  
and we had... a lot more than that. (The Open version of the current  
list generates over a thousand messages every three months.  
Considering that the old list ran for about a dozen years, although  
not always at that clip, it's hardly surprising that Listserv was  
overwhelmed.)

The current version of Mailman, on the other hand, has NO built-in  
search function. In fact, it doesn't even store the archives in a  
"real" database, which would allow me to write one. (According to  
their development site, the next major release will solve both these  
problems -- the latter implies the former, since writing a basic  
search form for an SQL-driven system is trivial -- but the last time  
I checked, it had a release date of "someday".) The options under the  
current version are: allow some third party to build archives based  
on the contents of your archives, or allow the standard search  
engines to index your archives and use their results. Since we will  
eventually upgrade to the next major release of Mailman, it doesn't  
seem worthwhile to try to arrange for an indexing service when -- in  
theory -- Google will do the dirty work for us, without adding any  
administrative overhead.
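
To illustrate the "trivial" part: here is a minimal sketch of a keyword search over a SQL-backed archive, assuming a hypothetical `messages` table with `subject` and `body` columns. Mailman does not currently provide any such table; the schema, the sample rows, and the `search` helper are all invented for illustration.

```python
import sqlite3

# Hypothetical schema and sample rows; nothing here reflects Mailman's real storage.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, subject TEXT, body TEXT)"
)
conn.executemany(
    "INSERT INTO messages (subject, body) VALUES (?, ?)",
    [
        ("Cartoon spies", "Boris and Natasha never did catch the moose."),
        ("Russian names", "Boris is a common given name."),
        ("Quotation hunt", "O Captain! my Captain! our fearful trip is done."),
    ],
)

def search(conn, *terms):
    """Return subjects of messages whose body contains every term (logical AND)."""
    if not terms:
        return []
    # One "body LIKE ?" clause per term; values are bound as parameters.
    where = " AND ".join("body LIKE ?" for _ in terms)
    params = [f"%{term}%" for term in terms]
    rows = conn.execute(
        f"SELECT subject FROM messages WHERE {where} ORDER BY id", params
    )
    return [row[0] for row in rows]

print(search(conn, "Boris", "Natasha"))
```

A real archive would want a proper full-text index rather than `LIKE`, but the shape of the query would be much the same.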

As you can find out by reading their documentation, Google has all  
these nice options to narrow down your search. You can restrict your  
search to a particular domain; you can restrict to a particular type  
of document; you can exclude certain terms; and so on.  
Unfortunately, some of the possible settings break the search. Even  
more unfortunately, they don't quite break it enough to be instantly  
noticeable if your tests only range over a few pages, as did mine  
back when I set up the forms. In this case, I set up the search form  
to restrict the search to  
lists.project-wombat.org/pipermail/[list-variant-name]/  
because that made the search only return results from  
that particular list, and then restricted it to documents of type  
"txt" because all the individual messages were returned by the server  
that way, which prevented any of the archive listings from popping up  
because they contain a list of subject lines. Clever, no? (The idea,  
that is, not the run-on sentence.)
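
For the curious, the kind of query described above can be sketched roughly as follows. This assumes the restrictions are written with Google's `site:` and `filetype:` operators; the actual search form may pass them as separate form fields instead, and the `build_archive_query` helper is hypothetical.

```python
from urllib.parse import urlencode

def build_archive_query(terms, list_name, filetype="txt"):
    """Compose a Google search URL restricted to one list's pipermail path.

    The site path follows the pattern described in the message; the
    filetype value is passed through exactly as given (no period added).
    """
    site = f"lists.project-wombat.org/pipermail/{list_name}/"
    query = f"{terms} site:{site} filetype:{filetype}"
    return "https://www.google.com/search?" + urlencode({"q": query})

url = build_archive_query("copyright", "project-wombat")
print(url)
```

Checking a few generated URLs by hand against terms known to appear many times in the archive is a cheap way to see whether a given restriction quietly drops results.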

Unfortunately, I didn't notice two things. First off, although Google  
will treat "html" (with no period) and ".html" (with a period) as  
identical document types, it will not do the same for "txt" and  
".txt". If you restrict your search to documents of type "txt" (with  
no period), it will turn up fewer results, and sometimes returns  
nothing at all. (At least, that's how it behaves on our server.)  
Since all the individual messages on the server are retrieved as  
".txt" files, giving the wrong value (as I seem to have done on the  
existing search form) stops the search from returning anything. I  
really should have noticed this. (Maybe it used to work and no longer  
does? I'd like to think I didn't knowingly set up a form that ALWAYS  
failed...)

The other problem I didn't notice (and this one would have been  
impossible to detect at the time) is that Google will still return  
results if the "restrict to this site" option isn't functioning  
properly. In actual fact, restricting the search to a single variant  
of the list, in the format the search form uses, makes it return  
three results at most. I don't know why this  
should be -- Google seems to know perfectly well that the other  
results (which show up if you just restrict to the server instead of  
the whole path) are still there, but they drop out.

(I've also asked Mr. Newby, our administrator, whether it would be  
possible for us to map a few new domains to the individual lists,  
thus forcing Google to search only one list at a time.)

I will try to fix the search form to do the right thing, which will  
mean lots of duplicate results (on the other hand, better duplicate  
results than no results), but right now I can't. I just tried, and an  
old permissions problem which had been fixed on the old server (the  
NEW old server, this time -- the one we were using until a week ago)  
is still broken on this installation, so I can't alter the template  
files at all for the time being.

As for adding an archive link on the website, which was Mr. Dyson's  
original request: it shall be done, but not right away. That website  
has been overdue for a complete overhaul for a while now, but I  
haven't had time to work on it, what with one thing and another. Look  
at the bright side: the new server has slightly different paths for  
some things, and the archive links on the list information pages at  
least work, even if they're buried. If I had added an archive link to  
the website, it might have broken when we moved servers.

-John Franklin

On Jan 19, 2007, at 2:25 PM, Dennis Lien wrote:

> At 12:43 PM 1/19/2007, you wrote:
>> Hello John,
>>
>> Would it be possible to put a direct link to the archives in the
>> navigation box of the P-W home page? I confess to a rickety memory
>> for where the archives are currently located. On the occasions when
>> I wish to consult them, I have a hard time remembering the logic of
>> why the link occurs on its present page but not on any other.
>>
>> Also, what are the chances of a key-word search option in the
>> archives? For example, Boris + Natasha or life + Captain for every
>> time those appear in the same context.
>>
>> I'm copying the list in case others might find such notions useful.
>>
>> Thanks,
>>
>> John Dyson
>
>
> Actually, the PW page at
>
> http://lists.project-wombat.org/mailman/listinfo.cgi/project-wombat
>
> *has* a "Search the Project Wombat Archives for" box -- but in my
> experience it's never worked.  (Is it just me?)  Any search I've put
> in comes back with (usually) nothing other than an error message or
> (rarely) at most one or two random entries for a word that should be
> appearing scores of times...
>
> A search for Dyson, for instance, gives me this reply:
>
> Your search - Dyson -filetype:txt - did not match any documents.
>
> and one for the word "copyright" produces only this single offering:
>
> [PW] Re: Ordeal by Cheque short story
> The author is "Wuther Crue"; the story appeared in a 1932 Vanity Fair
> issue; the author renewed the copyright in 1960 and 1988. ...
> lists.project-wombat.org/pipermail/project-wombat/2006-March/001252.html -
> 6k - Cached - Similar pages
>
>
> Dennis Lien / U of Minnesota Libraries // d-lien at umn.edu
>
>
> _______________________________________________
> Project Wombat
> list at project-wombat.org
> http://www.project-wombat.org/


