Ch 4
· The design of an IR system should take several factors into account, such as
hardware limitations, the size of the collection, and how the files and
computers are distributed.
· The chapter covers several algorithms for constructing a nonpositional index:
blocked sort-based indexing, single-pass in-memory indexing, distributed
indexing, and dynamic indexing.
· The distributed indexing algorithm is introduced for collections too large to
index on a single machine, such as the World Wide Web.
· I do not quite understand the dynamic indexing algorithm (logarithmic
merging).
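To make logarithmic merging more concrete, here is a minimal sketch of the idea as I understand it. Everything in it is an assumption for illustration: the class name `LogMergeIndex`, the parameter `n` (capacity of the in-memory index), and the use of plain sorted lists of (term, doc_id) pairs in place of real on-disk indexes. New postings go into a small in-memory index; when it fills up, it is flushed to level 0, and whenever a level is already occupied the two indexes are merged and carried up one level.

```python
import heapq


class LogMergeIndex:
    """Toy logarithmic merging over (term, doc_id) postings."""

    def __init__(self, n):
        self.n = n        # capacity of the in-memory index
        self.z0 = []      # in-memory auxiliary index
        self.levels = {}  # level i -> sorted list of n * 2**i postings ("on disk")

    def add(self, term, doc_id):
        self.z0.append((term, doc_id))
        if len(self.z0) < self.n:
            return
        # The in-memory index is full: flush it and carry it upward.
        z = sorted(self.z0)
        self.z0 = []
        i = 0
        while i in self.levels:
            # Level i is occupied: merge it with the carried index, move up.
            z = list(heapq.merge(self.levels.pop(i), z))
            i += 1
        self.levels[i] = z  # the first free level keeps the merged index

    def postings(self, term):
        """A query must look at every level plus the in-memory index."""
        runs = list(self.levels.values()) + [self.z0]
        return sorted(d for run in runs for t, d in run if t == term)


idx = LogMergeIndex(n=2)
for doc_id, term in enumerate(["a", "b", "a", "c", "b", "a", "a", "b"]):
    idx.add(term, doc_id)

print(sorted(idx.levels))   # occupied levels after 8 postings
print(idx.postings("a"))    # doc IDs containing "a"
```

The occupied levels behave like the bits of a binary counter of flushes, so each posting takes part in only a logarithmic number of merges. The price is that a query has to consult several indexes instead of one.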
Ch 5
· The advantages of compressing the index are less disk space, increased use of
caching, and faster transfer of data from disk to memory. The caching and
transfer benefits only pay off if decompression is fast enough, so
decompression speed is critical.
· There are two kinds of compression techniques. One is lossless compression, in
which all information is preserved. The other is lossy compression, which
discards some information but can provide better compression ratios.
· Section 5.1 introduces some useful laws for estimating the statistical
properties of terms in an IR system, for example Heaps' law, which estimates
the number of distinct terms as a function of collection size, and Zipf's law,
which models how term frequencies are distributed.
· Front coding is used to compress dictionary terms that share the same prefix.
· This chapter gives two ways to encode small numbers in less space than large
numbers: bytewise compression and bitwise compression. These methods attempt to
encode the gaps between document IDs with the minimum number of bytes and bits,
respectively.
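To illustrate the last note, here is a minimal sketch of one bytewise scheme (variable byte codes) and one bitwise scheme (gamma codes), both applied to gaps. The function names are my own. The variable byte code assumes the common convention that each byte carries 7 payload bits and the high bit is set on the last byte of a number. The gamma code is returned as a string of '0' and '1' characters for readability rather than as packed bits.

```python
def to_gaps(doc_ids):
    """Turn a sorted docID list into the first ID followed by the gaps."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def vb_encode(numbers):
    """Variable byte code: 7 payload bits per byte, high bit marks the last byte."""
    out = bytearray()
    for n in numbers:
        chunk = [n % 128]
        n //= 128
        while n > 0:
            chunk.insert(0, n % 128)
            n //= 128
        chunk[-1] += 128  # set the high bit on the final byte of this number
        out.extend(chunk)
    return bytes(out)


def vb_decode(data):
    """Inverse of vb_encode."""
    numbers, n = [], 0
    for byte in data:
        if byte < 128:
            n = 128 * n + byte          # more bytes of this number follow
        else:
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers


def gamma_encode(n):
    """Gamma code for n >= 1: the length in unary, then the offset."""
    offset = bin(n)[3:]                 # binary digits without the leading 1
    return "1" * len(offset) + "0" + offset


gaps = to_gaps([824, 829, 215406])      # [824, 5, 214577]
encoded = vb_encode(gaps)
print(len(encoded), vb_decode(encoded)) # 6 bytes, round-trips to the gaps
print(gamma_encode(13))                 # 1110101
```

Small gaps cost one byte under the variable byte code and only a few bits under the gamma code, which is why storing gaps instead of raw document IDs pays off for frequent terms.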
