[debian-devel:11919] Re: id-utils (Was: Re: Search)
Sano@Hamamatsu here.
A mail from debian-www@lists.debian.org about web search.
Judging by the speed and the index size, it looks like this is going to be the winner.
However, it seems HTML files cannot quite be used as-is yet.
Any comments from the namazu demi-gods?
In article <msd7opdtj7.fsf@xxxxxxxxxxxxxxxxxxxxxx>,
at 20 Mar 2000 18:02:52 -0700,
on Re: id-utils (Was: Re: Search),
Greg McGary <gkm@xxxxxxxxxxxxxx> writes:
> csmall@xxxxxxxxxxxxxxxxxxxxxx (Craig Small) writes:
>
> > > Number of files (C, C++, asm, some text) was 104564.
> > > Total size of indexed files was 2.60 GBytes.
> > > There were 477960 distinct tokens, and the average token occurred 215 times.
> > > High-water mark for memory consumed during indexing was 57 MB.
> > > The size of the output index file was 50.4 MB.
> > > The process took approx 14 minutes of CPU time and 30 minutes of real time.
> >
> > So it took 30 minutes to index 2.6 Gig of "stuff" and the resulting
> > index is 50 Meg? That is definitely impressive and is faster and
> > smaller by several orders of magnitude (most indexers would take, say,
> > 8-16 hours on that size).
> >
> > I still cannot get over its speed! Why is it so fast? Are we missing
> > some important tokens?
>
> It's fast because I worked hard to make it fast! 8^)
> The lexer has a fast, simple inner loop. The in-memory symbol table
> uses double hashing with open addressing, and errs on the side of
> sizing tables too large so the collision rate stays very low.
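[The symbol-table design Greg describes can be sketched roughly as follows. This is a hypothetical illustration, not the actual id-utils code: double hashing derives the probe step from a second hash of the key, and oversizing the table keeps the load factor, and hence the collision rate, low.]

```python
# Hypothetical sketch of a double-hashed, open-addressed symbol table
# (illustration only -- NOT the real id-utils implementation).

class SymbolTable:
    def __init__(self, expected, load_factor=0.5):
        # Err on the side of too large: a low load factor keeps
        # collisions rare, trading memory for speed.
        size = 1
        while size < expected / load_factor:
            size *= 2
        self.size = size
        self.slots = [None] * size

    def _hash1(self, token):
        # Primary hash: picks the initial slot.
        return hash(token) % self.size

    def _hash2(self, token):
        # Secondary hash: picks the probe step.  Forcing it odd
        # guarantees the probe sequence visits every slot of a
        # power-of-two-sized table.
        return (hash(token[::-1]) % self.size) | 1

    def intern(self, token):
        """Return the [token, count] entry, creating it if new."""
        i = self._hash1(token)
        step = self._hash2(token)
        while self.slots[i] is not None:
            if self.slots[i][0] == token:
                self.slots[i][1] += 1
                return self.slots[i]
            i = (i + step) % self.size  # double-hash probe
        self.slots[i] = [token, 1]
        return self.slots[i]
```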
>
> > I ran some tests on www.debian.org
> > $ du -s /debian/web/debian.org/{intro,devel,ports,events,News}/
> > 945 /debian/web/debian.org/intro
> > 3304 /debian/web/debian.org/devel
> > 1989 /debian/web/debian.org/ports
> > 837 /debian/web/debian.org/events
> > 4191 /debian/web/debian.org/News
> > $ time mkid -m myid.map -o dwww.id /debian/web/debian.org/{intro,devel,ports,events,News}/
> > [removing some errors about not being able to stat some files]
> > real 0m22.884s
> > user 0m10.020s
> > sys 0m0.430s
>
> For fun, run `mkid -V' to see progress and get statistics at the end
> of the run. `mkid -s' will just give you the stats without the
> progress.
>
> > The main problem is it doesn't understand html pages and the context of
> > them and different languages. So something in the body of a page gets
> > the same weighting as something in the title.
>
> Yes. It needs a dedicated HTML scanner. I'm hoping there's a simple
> way to fiddle with the locale at runtime to influence the behavior of
> ctype, so that the scanner can be written in a language-independent
> fashion in terms of isalpha/isdigit. Beyond that, there needs to be
> some special handling for some html directives such as "<meta ...>" to
> extract keywords.
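[The special handling Greg mentions for directives like "<meta ...>" might look something like this. A minimal sketch using Python's standard html.parser; the weighting scheme and the KeywordScanner class are purely hypothetical, not part of id-utils.]

```python
from html.parser import HTMLParser

class KeywordScanner(HTMLParser):
    """Collect tokens from an HTML page, weighting <title> text and
    <meta name="keywords"> entries above ordinary body text.
    (Hypothetical illustration; the weights are arbitrary.)"""

    def __init__(self):
        super().__init__()
        self.weights = {}       # token -> accumulated weight
        self.in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and attrs.get("name", "").lower() == "keywords":
            # Special-case the <meta keywords> directive.
            for kw in attrs.get("content", "").split(","):
                self._add(kw.strip().lower(), 10)

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Title text weighs more than body text.
        weight = 5 if self.in_title else 1
        for token in data.split():
            self._add(token.lower(), weight)

    def _add(self, token, weight):
        if token:
            self.weights[token] = self.weights.get(token, 0) + weight
```

[Feeding it a page accumulates per-token weights, so a search front end could rank a title or keyword hit above a plain body hit.]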
>
> > But you've convinced me at least that it is on the right track, if you
> > need help with some of the programming let me know.
>
> I definitely could use some help! First, you need to assign copyright
> to the FSF for id-utils:
> http://www.gnu.org/software/gcc/fsf-forms/assignment-instructions.html
>
> Can you work on a scanner for HTML & language-specific text? I think
> that would be the best project for someone other than me to do.
>
> Greg
--
# (My home is Hamamatsu City, famous for its "sweets of the night".)
<kgh12351@xxxxxxxxxxx> : Taketoshi Sano (佐野 武俊)