Klarinet Archive - Posting 000587.txt from 1997/04

From: Mark Charette <charette@-----.com>
Subj: Sneezy databases
Date: Fri, 18 Apr 1997 19:57:24 -0400

I probably have to re-write that help file - I was rushing a bit
trying to do this before the weekend.

Anyway, you have to specify at least two things when you're searching:

where to look and what to look for

The "where" is the databases: clarinet1992-1997, composer, discography,
sneezy. Any combination.

The "what" is the keywords you're interested in. Let's say you want
to do research on the Nielsen Concerto, and would like to know
EVERYTHING on sneezy that references it. My first stab at it might
look like:

databases clarinet1992 clarinet1993 clarinet1994 clarinet1995
databases clarinet1996 clarinet1997 discography composers sneezy
find nielsen concerto

(The only reason I have 2 lines with the word "databases" on them
is because I don't like writing long lines).
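
For anyone who would rather generate the request than type it, here is a minimal sketch in Python. The function build_query and its per_line parameter are hypothetical names made up for this illustration; nothing on sneezy provides them. The sketch only assembles the text of the message body, split across two short "databases" lines as in the example above.

```python
def build_query(databases, terms, per_line=5):
    """Return the lines of a search request (hypothetical helper)."""
    lines = []
    # Split the database names across several short "databases" lines.
    for start in range(0, len(databases), per_line):
        chunk = databases[start:start + per_line]
        lines.append("databases " + " ".join(chunk))
    # The keywords go on a single "find" line.
    lines.append("find " + " ".join(terms))
    return lines


# Database names as used in the example request above.
names = [f"clarinet{year}" for year in range(1992, 1998)]
names += ["discography", "composers", "sneezy"]

query = build_query(names, ["nielsen", "concerto"])
print("\n".join(query))
```

Where the database list is broken is arbitrary; the sketch just keeps each line short.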

This will find everything and more because of one little problem:
find nielsen concerto
really means

find nielsen OR concerto

and I'll get back possibly 1000s of entries out of the 38000
records kept on sneezy, probably more records than most people
thought were available. Any entry with the word "concerto" or
"nielsen" will be returned. This is pretty common for search
engines, which is why most of them offer "advanced" search
criteria.

What we probably REALLY wanted was the phrase:

find nielsen AND concerto

which will return only those entries where BOTH words occur.
We'll probably still get back some spurious entries, but at least
we'll be much closer to what we want. The find command will
examine every Klarinet archive along with the discography,
the composer database, and every HTML page on sneezy
(including Stan Geidel's famous & fabulous Online Clarinet
Resource and my own Clarinet Info Pages).
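
The difference between the OR and AND forms can be sketched with a toy word index in Python. This is an illustration only, not sneezy's actual search code: the records are invented, and find_any and find_all are hypothetical names. OR is a union of the per-word hit sets; AND is their intersection.

```python
from collections import defaultdict

# Toy records; the texts are invented for illustration only.
records = {
    1: "Nielsen concerto recording question",
    2: "Mozart concerto cadenza",
    3: "Nielsen wind quintet",
    4: "reed strength advice",
}

# Inverted index: word -> set of record ids containing that word.
index = defaultdict(set)
for record_id, text in records.items():
    for word in text.lower().split():
        index[word].add(record_id)


def find_any(*words):
    """OR search: records containing at least one of the words."""
    hits = set()
    for word in words:
        hits |= index.get(word.lower(), set())
    return sorted(hits)


def find_all(*words):
    """AND search: records containing every one of the words."""
    sets = [index.get(word.lower(), set()) for word in words]
    return sorted(set.intersection(*sets)) if sets else []


print(find_any("nielsen", "concerto"))  # [1, 2, 3]
print(find_all("nielsen", "concerto"))  # [1]
```

Even in this tiny example the OR form returns three of the four records, while the AND form returns only the one that mentions both words.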

The system is divided into multiple databases for two reasons:
1) Reindexing all the words in the entire set of Klarinet archives
takes over 20 minutes; reindexing just this year's takes about 3
minutes.
2) A "loose" query might return more data than your mailer can
handle - searching one year just to check out your query will
probably keep you from getting a big mail hit by accident.
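
Both reasons can be sketched in a few lines of Python, assuming one independent word index per database. How sneezy really stores its indexes isn't described here, so the layout, the sample messages, and the names build_index and find are all assumptions made for illustration.

```python
from collections import defaultdict


def build_index(messages):
    """Map each lowercase word to the set of message ids containing it."""
    index = defaultdict(set)
    for message_id, text in messages.items():
        for word in text.lower().split():
            index[word].add(message_id)
    return index


# Invented sample data: one dict of messages per database name.
archives = {
    "clarinet1996": {1: "pedlar clarinet history", 2: "reed advice"},
    "clarinet1997": {3: "pedlar serial numbers"},
}

# One index per database, so each can be rebuilt on its own.
indexes = {name: build_index(msgs) for name, msgs in archives.items()}


def find(databases, word):
    """Search only the named databases; return (database, id) pairs."""
    return [
        (name, message_id)
        for name in databases
        for message_id in sorted(indexes[name].get(word.lower(), set()))
    ]


# A new message arrives: only this year's index needs rebuilding.
archives["clarinet1997"][4] = "pedlar mouthpiece question"
indexes["clarinet1997"] = build_index(archives["clarinet1997"])

# Try the query on one year first, then widen it.
narrow = find(["clarinet1997"], "pedlar")
wide = find(["clarinet1996", "clarinet1997"], "pedlar")
print(len(narrow), len(wide))
```

Checking the size of the narrow result before widening the search is the same habit as trying a query against one year before mailing it to every database.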

The search mechanism on the majordomo server was very similar to
this, but it limited the amount of I/O and CPU that would be
consumed for a query and would not return the full results of a
large query (along with not having the older archive files
available).

Simple queries such as

database clarinet1997
find pedlar

still work as expected.

The Web interfaces
(http://sneezy.mika.com/clarinet/Databases) are still
there, too. The e-mail edition is specifically set up for
those with no Web access or for people who'd like to peruse
the data off-line.

Sneezy is not associated at all with the Klarinet
mailing list, even though I have help files and
status reports available. Sneezy has an entry in the majordomo
mailing list so that a copy of every Klarinet message gets
saved off, and the status and help messages are just a public
service type of thing (the status messages are MY OPINION of
list status only, not any official word). However, if sneezy
doesn't get Klarinet messages, then there's an indication of
trouble somewhere. The Klarinet messages get saved and indexed
periodically into my set of archives (normally after I have
100Kb of unindexed messages).

I'll be more than happy to distribute the entire set of
archived Klarinet messages (about 30,000 messages I think)
in tar format on 4mm tape (42.4 Mb uncompressed). You supply
the tape & return postage. E-mail me for further details.
--
Mark Charette
charette@-----.com
http://sneezy.mika.com/clarinet

"How can you be in two places at once
when you're not anywhere at all?"
- Firesign Theater

   