
[debian-devel:11919] Re: id-utils (Was: Re: Search)



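This is Sano in Hamamatsu.

 Here is a mail about web search from debian-www@lists.debian.org.

Judging by its speed and index size, it looks like this will be the choice.
However, for HTML files it apparently cannot be used as-is
right away.

 Do any of the namazu demi-gods have comments?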

In article <msd7opdtj7.fsf@xxxxxxxxxxxxxxxxxxxxxx>,
  at 20 Mar 2000 18:02:52 -0700,
   on Re: id-utils (Was: Re: Search),
 Greg McGary <gkm@xxxxxxxxxxxxxx> writes:

> csmall@xxxxxxxxxxxxxxxxxxxxxx (Craig Small) writes:
> 
> > > Number of files (C, C++, asm, some text) was 104564.
> > > Total size of indexed files was 2.60 GBytes.
> > > There were 477960 distinct tokens, and the average token occurred 215 times.
> > > High-water mark for memory consumed during indexing was 57 MB.
> > > The size of the output index file was 50.4 MB.
> > > The process took approx 14 minutes of CPU time and 30 minutes of real time.
> > 
> > So it took 30 minutes to index 2.6 Gig of "stuff" and the resulting
> > index is 50 Meg?  That is definitely impressive and is faster and
> > smaller by several orders of magnitude (most indexers would take, say,
> > 8-16 hours on that size).
> > 
> > I still cannot get over its speed!  Why is it so fast? Are we missing
> > some important tokens?
> 
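As a quick sanity check on the figures quoted above, the ratios can be
worked out directly. This is only back-of-the-envelope arithmetic; it
assumes "GBytes" means 1024 MB, which the original message does not state.

```python
# Figures quoted in the thread; 1 GByte is assumed to mean 1024 MB here.
source_mb = 2.60 * 1024        # total size of the indexed files
index_mb = 50.4                # size of the output index file
real_seconds = 30 * 60         # approx. wall-clock time of the run

index_ratio = index_mb / source_mb       # index size as a fraction of the source
throughput = source_mb / real_seconds    # MB indexed per wall-clock second
speedup_low = (8 * 60) / 30              # versus the 8-hour estimate
speedup_high = (16 * 60) / 30            # versus the 16-hour estimate
occurrences = 477960 * 215               # distinct tokens * average occurrences

print(f"index is {index_ratio:.1%} of the source")
print(f"throughput is {throughput:.2f} MB/s")
print(f"speedup is {speedup_low:.0f}x to {speedup_high:.0f}x")
print(f"about {occurrences:,} token occurrences")
```

Against the 8-16 hour estimate this works out to a speedup of 16x to 32x,
so a bit more than one order of magnitude in time rather than several.
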
> It's fast because I worked hard to make it fast!  8^)
> The lexer has a fast, simple inner loop.  The in-memory symbol table
> uses double hashing with open addressing, and errs on the side of
> sizing tables too large so the collision rate is very low.
> 
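For readers unfamiliar with the technique Greg names, here is a minimal
sketch of a token-counting table using double hashing with open addressing.
It is not the id-utils code: the class name SymbolTable, the 0.25 load
limit, and the way the second hash is derived are all assumptions made for
illustration. Keeping the load factor low mirrors the idea of sizing tables
generously so that collisions stay rare.

```python
class SymbolTable:
    """Open-addressing hash table that counts token occurrences."""

    def __init__(self, capacity=16, max_load=0.25):
        # Capacity must be a power of two (>= 2). Combined with an odd
        # probe step, every probe sequence then visits every slot.
        self.capacity = capacity
        self.max_load = max_load          # hypothetical limit, kept low on purpose
        self.slots = [None] * capacity    # each slot: None or [token, count]
        self.used = 0

    def _find(self, token):
        """Return the slot index holding token, or the empty slot for it."""
        h = hash(token)
        mask = self.capacity - 1
        index = h & mask                  # first hash: starting slot
        step = ((h >> 16) | 1) & mask     # second hash: odd probe stride
        while self.slots[index] is not None and self.slots[index][0] != token:
            index = (index + step) & mask
        return index

    def add(self, token):
        index = self._find(token)
        if self.slots[index] is None:
            self.slots[index] = [token, 0]
            self.used += 1
        self.slots[index][1] += 1
        if self.used > self.capacity * self.max_load:
            self._grow()

    def _grow(self):
        # Quadruple the table and re-insert the existing entries.
        entries = [slot for slot in self.slots if slot is not None]
        self.capacity *= 4
        self.slots = [None] * self.capacity
        for entry in entries:
            self.slots[self._find(entry[0])] = entry

    def count(self, token):
        entry = self.slots[self._find(token)]
        return entry[1] if entry else 0


table = SymbolTable()
for tok in "the lexer has a fast simple inner loop the lexer".split():
    table.add(tok)
```

The table is never allowed to fill up, so the probe loop always reaches
either the token or an empty slot.
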
> > I ran some tests on www.debian.org
> > $ du -s /debian/web/debian.org/{intro,devel,ports,events,News}/
> > 945     /debian/web/debian.org/intro
> > 3304    /debian/web/debian.org/devel
> > 1989    /debian/web/debian.org/ports
> > 837     /debian/web/debian.org/events
> > 4191    /debian/web/debian.org/News
> > $ time mkid -m myid.map -o dwww.id /debian/web/debian.org/{intro,devel,ports,events,News}/
> > [removing some errors about not being able to stat some files]
> > real    0m22.884s
> > user    0m10.020s
> > sys     0m0.430s
> 
> For fun, run `mkid -V' to see progress and get statistics at the end
> of the run.  `mkid -s' will just give you the stats without the
> progress.
> 
> > The main problem is it doesn't understand html pages and the context of
> > them and different languages.  So something in the body of a page gets 
> > the same weighting as something in the title.
> 
> Yes.  It needs a specific html scanner.  I'm hoping there's a simple
> way to fiddle with the locale at runtime to influence the behavior of
> ctype, so that the scanner can be written in a language-independent
> fashion in terms of isalpha/isdigit.  Beyond that, there needs to be
> some special handling for some html directives such as "<meta ...>" to
> extract keywords.
> 
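To make the HTML scanner idea concrete, here is a rough sketch using
Python's standard html.parser module. It is not part of id-utils; the class
name HtmlScanner and the three weight values are made up for illustration.
It tokenizes text in terms of isalpha/isdigit, pulls keywords out of
"<meta name="keywords" ...>", and gives title words a higher weight than
body words.

```python
from collections import Counter
from html.parser import HTMLParser


def tokens(text):
    """Yield lowercase runs of alphabetic or digit characters."""
    word = []
    for ch in text:
        if ch.isalpha() or ch.isdigit():
            word.append(ch.lower())
        elif word:
            yield "".join(word)
            word = []
    if word:
        yield "".join(word)


class HtmlScanner(HTMLParser):
    # Hypothetical weights, chosen only to show the idea.
    TITLE_WEIGHT = 5
    KEYWORD_WEIGHT = 3
    BODY_WEIGHT = 1

    def __init__(self):
        super().__init__()
        self.weights = Counter()
        self._in_title = False
        self._skip = 0          # depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            for tok in tokens(attrs.get("content") or ""):
                self.weights[tok] += self.KEYWORD_WEIGHT

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._skip:
            return              # ignore script and stylesheet text
        weight = self.TITLE_WEIGHT if self._in_title else self.BODY_WEIGHT
        for tok in tokens(data):
            self.weights[tok] += weight


scanner = HtmlScanner()
scanner.feed('<html><head><title>Debian ports</title>'
             '<meta name="keywords" content="debian, hurd">'
             '<style>body { color: red }</style></head>'
             '<body><p>Ports of Debian to 2 kernels</p></body></html>')
scanner.close()
```

Note that Python's str.isalpha follows Unicode character classes, whereas
the C ctype functions mentioned above depend on the current locale, so the
locale question raised there does not go away in a C implementation.
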
> > But you've convinced me at least that it is on the right track. If you
> > need help with some of the programming, let me know.
> 
> I definitely could use some help!  First, you need to assign copyright
> to the FSF for id-utils:
> http://www.gnu.org/software/gcc/fsf-forms/assignment-instructions.html
> 
> Can you work on a scanner for HTML & language-specific text?  I think
> that would be the best project for someone other than me to do.
> 
> Greg

-- 
     # (My home is Hamamatsu City, famous for its "nighttime sweets".)
    <kgh12351@xxxxxxxxxxx> : Taketoshi Sano (佐野 武俊)