Wednesday, January 22, 2014

Unit 3 Reading Notes

Ch 4
·          The design of an IR system should take several factors into account, such as hardware limitations, the size of the collection, and how the files are distributed across computers.
·          Algorithms for constructing a nonpositional index: blocked sort-based indexing (BSBI), single-pass in-memory indexing (SPIMI), distributed indexing, and dynamic indexing.
·          The distributed indexing algorithm is introduced for large-scale collections that must be indexed across many machines, such as the World Wide Web.
·          I do not quite understand the dynamic indexing algorithm (logarithmic merging).
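
To work through logarithmic merging, here is a minimal Python sketch of the idea as I read it: new postings collect in a small in-memory index, and whenever it fills up it is merged with on-disk indexes of doubling size until a free level is found. The class name `LogMergeIndex`, the buffer capacity `n`, and the use of plain Python lists in place of on-disk indexes are all my own assumptions for illustration, not the book's code.

```python
import heapq


class LogMergeIndex:
    """Toy model of logarithmic merging (lists stand in for on-disk indexes)."""

    def __init__(self, n):
        self.n = n          # assumed capacity of the in-memory index Z0, in postings
        self.z0 = []        # in-memory postings, e.g. (term, doc_id) tuples
        self.levels = {}    # level i -> sorted postings list of size n * 2**i

    def add(self, posting):
        self.z0.append(posting)
        if len(self.z0) < self.n:
            return
        # Z0 is full: flush it and cascade merges through occupied levels.
        z = sorted(self.z0)
        self.z0 = []
        i = 0
        while i in self.levels:
            # Level i is occupied: merge it with z and carry the result upward.
            z = list(heapq.merge(self.levels.pop(i), z))
            i += 1
        # Level i is free: z becomes the permanent index at this level.
        self.levels[i] = z

    def level_sizes(self):
        return {i: len(p) for i, p in sorted(self.levels.items())}


idx = LogMergeIndex(n=2)
for doc_id in range(6):
    idx.add(("term", doc_id))
print(idx.level_sizes())
```

The occupied levels behave like the bits of a binary counter, so each posting takes part in at most about log2(T/n) merges for T postings, instead of being re-merged on every flush as it would be with a single auxiliary index. The trade-off is that a query has to consult several indexes, one per occupied level.
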
Ch 5
·          Advantages of compressing the index: less disk space, better use of caching, and faster transfer of data from disk to memory. For the caching and transfer benefits to pay off, decompression must be fast enough that it does not cancel out the time saved.
·          There are two kinds of compression techniques. One is lossless compression, in which all information is preserved. The other is lossy compression, which discards some information but can achieve better compression ratios.
·          Section 5.1 introduces some useful laws for estimating the statistical properties of terms in an IR system, for example Heaps' law for estimating the number of distinct terms and Zipf's law for modeling the distribution of term frequencies.
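·          Front coding is used to compress the dictionary when consecutive sorted terms share a common prefix: the prefix is stored once, and only the differing suffixes are stored for each term.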

·          This chapter gives two ways to encode small numbers in less space than large numbers: bytewise compression (variable byte codes) and bitwise compression (γ codes). These methods encode the gaps between successive document IDs in a postings list with as few bytes or bits as possible, respectively.
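
To make the bytewise case concrete, here is a small Python sketch of variable byte encoding applied to gaps. It assumes the convention in which the low 7 bits of each byte carry payload and the high bit is set only on the last byte of a number. The function names (`to_gaps`, `vb_encode`, and so on) are my own, chosen for illustration.

```python
def to_gaps(doc_ids):
    """Turn a sorted list of doc IDs into the first ID followed by gaps."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def vb_encode_number(n):
    """Encode one non-negative integer as variable-byte code."""
    out = []
    while True:
        out.insert(0, n % 128)   # prepend the next 7 payload bits
        if n < 128:
            break
        n //= 128
    out[-1] += 128               # set the high bit on the final byte
    return bytes(out)


def vb_encode(numbers):
    return b"".join(vb_encode_number(n) for n in numbers)


def vb_decode(stream):
    numbers, n = [], 0
    for byte in stream:
        if byte < 128:           # high bit clear: more bytes follow
            n = 128 * n + byte
        else:                    # high bit set: last byte of this number
            numbers.append(128 * n + (byte - 128))
            n = 0
    return numbers


def from_gaps(gaps):
    """Rebuild doc IDs by running sum over the gaps."""
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids


postings = [824, 829, 215406]
encoded = vb_encode(to_gaps(postings))
print(len(encoded), from_gaps(vb_decode(encoded)) == postings)
```

In this example the gaps are 824, 5, and 214577, which take 2, 1, and 3 bytes, so the small gap costs a single byte while the large one still fits. Bitwise codes such as γ follow the same principle at the level of individual bits.
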
