Thursday, January 30, 2014

Unit 4 Reading Notes

Unit 4 1.3, 1.4, 6 IIR
·          Section 1.3 introduces algorithms for intersecting postings lists. Moreover, it explains some refinements that improve the efficiency of intersection, like starting the merge with the shortest list. It examines how to do efficient retrieval via linear-time merges and simple query optimization.
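To check my understanding, the linear-time merge of two sorted postings lists can be sketched like this (my own Python rendition of the idea, not the book's pseudocode; the names are mine):

```python
# Linear-time intersection of two postings lists (IIR Section 1.3).
# Both lists must be sorted by docID; we walk two pointers in lockstep,
# advancing whichever points at the smaller docID.
def intersect(p1, p2):
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer
```

Each element of each list is visited at most once, which is where the linear time bound comes from; intersecting the two shortest lists first keeps the intermediate results small.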

·          Section 1.4 compares extended Boolean search and ranked retrieval.
·          There are several examples of Boolean search queries, including the use of proximity operators, for example on westlaw.com.
·          Boolean queries are precise: a document either matches the query or it does not. This offers the user greater control and transparency over what is retrieved.
·          However, Boolean queries also have some problems. A general problem with Boolean search is that using AND operators tends to produce high-precision but low-recall searches, while using OR operators gives low-precision but high-recall searches.

·          Chapter 6 mainly discusses the mechanisms used to rank-order the documents matching a query.
·          Weighted zone scoring is sometimes also referred to as ranked Boolean retrieval. For a query q and a document with ℓ zones, the weighted zone score is defined to be the sum over i = 1 … ℓ of g_i s_i, where each weight g_i ∈ [0, 1] and g_1 + … + g_ℓ = 1. Here g_i is the weight for the ith zone, and s_i is the Boolean score denoting a match (or absence thereof) between q and the ith zone. For instance, the Boolean score from a zone could be 1 if all the query term(s) occur in that zone, and zero otherwise; indeed, it could be any Boolean function that maps the presence of query terms in a zone to 0, 1.
·          The weights g_i can be learned from a set of training examples. (I don't understand the learning process.)
·          Tf-idf weighting scores a term by taking both its frequency in the document (tf) and its rarity across the collection (idf) into consideration.
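A minimal sketch of the tf-idf weight, using the log-weighted variants described in the chapter (the function name and corpus numbers are mine for illustration):

```python
import math

# tf-idf weight of a term: log-scaled term frequency times inverse
# document frequency. A term occurring in fewer documents gets a
# higher idf, so rare terms are weighted more heavily.
def tf_idf(term_freq, doc_freq, n_docs):
    """Weight for a term occurring term_freq times in a document,
    and appearing in doc_freq out of n_docs documents overall."""
    if term_freq == 0 or doc_freq == 0:
        return 0.0
    tf = 1 + math.log10(term_freq)
    idf = math.log10(n_docs / doc_freq)
    return tf * idf
```

For example, a term appearing once in a document but in only 10 of 100 documents gets weight 1.0, while a term appearing in all 100 documents gets idf 0 and therefore weight 0.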

·          Sections 6.3 and 6.4 explain many functions and definitions for using vector spaces to compute the similarities between different documents and queries.
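The core similarity measure in those sections is the cosine of the angle between two term-weight vectors. A minimal sketch (representing vectors as plain dicts mapping term to weight is my own simplification):

```python
import math

# Cosine similarity between two sparse term-weight vectors:
# the dot product divided by the product of the Euclidean norms.
# Returns a value in [0, 1] for non-negative weights.
def cosine_similarity(v1, v2):
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)
```

Normalizing by vector length is what makes the measure ignore document length: a document and the same document concatenated with itself get similarity 1.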

Monday, January 27, 2014

Unit 3 Muddiest Point


Compared with other representations, γ-encoded postings provide better compression. 

Does that mean we should always encode postings with γ codes? 
If not, how can we choose the proper compression scheme in different circumstances?
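To make the question concrete, here is a sketch of γ encoding and decoding as I understand it from the book (my own Python; the code represents bit strings as Python strings for readability, which a real implementation would not do):

```python
# Elias gamma code: a number n >= 1 is written as its binary offset
# (binary representation with the leading 1 removed) preceded by the
# length of that offset in unary (a run of 1s terminated by a 0).
def gamma_encode(n):
    assert n >= 1
    offset = bin(n)[3:]               # binary of n, leading '1' dropped
    length = '1' * len(offset) + '0'  # unary code for len(offset)
    return length + offset

def gamma_decode(bits):
    length = bits.index('0')                    # count the leading 1s
    offset = bits[length + 1: length + 1 + length]
    return int('1' + offset, 2)                 # restore the leading 1
```

So 9 = binary 1001 becomes length 1110 plus offset 001, i.e. 1110001. Small gaps get short codes, which is why γ codes work well on the small gaps that dominate long postings lists.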

Wednesday, January 22, 2014

Unit 3 Reading Notes

Ch 4
·          The design of an IR system should take several aspects into consideration, like hardware limitations, the size of the collection, and the distribution of files and computers.
·          There are different algorithms to construct the nonpositional index: blocked sort-based indexing, single-pass in-memory indexing, distributed indexing, and dynamic indexing.
·          The distributed indexing algorithm is introduced for large-scale distributed systems, like the World Wide Web.
·          I don't quite understand the dynamic indexing algorithm (logarithmic merging).
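To work through logarithmic merging for myself, here is a toy sketch of the scheme as I understand it (indexes are just sorted lists of postings; the class, threshold n, and all names are mine for illustration, not the book's pseudocode):

```python
# Toy logarithmic merging (dynamic indexing, IIR Chapter 4).
# An in-memory index Z0 holds up to n postings; on-disk indexes
# I_0, I_1, ... hold n, 2n, 4n, ... postings. When a level is
# already occupied, the two same-size indexes merge and the doubled
# result moves up one level, like binary carry propagation.
class LogarithmicMerger:
    def __init__(self, n=4):
        self.n = n          # capacity of the in-memory index Z0
        self.z0 = []        # in-memory index
        self.indexes = {}   # level i -> on-disk index of size n * 2**i

    def add(self, posting):
        self.z0.append(posting)
        if len(self.z0) >= self.n:
            self._merge_down(sorted(self.z0), 0)
            self.z0 = []

    def _merge_down(self, index, i):
        if i not in self.indexes:
            self.indexes[i] = index          # level i was free
        else:
            # level i occupied: merge the two size-(n * 2**i) indexes
            # and try to place the doubled index at level i + 1
            merged = sorted(self.indexes.pop(i) + index)
            self._merge_down(merged, i + 1)
```

Since each posting can take part in at most log2(T/n) merges over its lifetime, the total merging work is logarithmic per posting rather than linear, which (I think) is where the name comes from.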
Ch5
·          The advantages of compressing the index: less disk space, increased use of caching, and faster transfer of data from disk to memory. For caching and disk-to-memory transfer to actually benefit, decompression speed is critical and must be high.
·          There are two kinds of compression techniques. One is lossless compression, in which all information is preserved. The other is lossy compression, which can provide better compression ratios.
·          Section 5.1 introduces some useful laws and equations for estimating the statistical properties of terms in an IR system, for example Heaps' law for estimating the number of terms and Zipf's law for modeling the distribution of terms.

·          Front coding is used to compress consecutive dictionary terms that share a common prefix.

·          This chapter gives two ways to encode small numbers in less space than large numbers: bytewise compression and bitwise compression. These methods attempt to encode gaps with the minimum number of bytes and bits, respectively.
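The bytewise scheme is variable byte (VB) encoding. A sketch of it as I understand the chapter (my own Python; function names are mine):

```python
# Variable byte encoding (IIR Chapter 5): each number is split into
# 7-bit chunks, one chunk per byte; the high bit is set only on the
# last byte of a number, marking where it ends.
def vb_encode_number(n):
    chunks = []
    while True:
        chunks.insert(0, n % 128)
        if n < 128:
            break
        n //= 128
    chunks[-1] += 128              # continuation bit on the final byte
    return bytes(chunks)

def vb_decode(data):
    numbers, n = [], 0
    for b in data:
        if b < 128:
            n = 128 * n + b        # middle byte: accumulate 7 bits
        else:
            numbers.append(128 * n + (b - 128))  # final byte of a number
            n = 0
    return numbers
```

VB codes waste up to 7 bits per gap compared with γ codes, but decode much faster because they work on whole bytes, which illustrates the space/speed trade-off between the two families.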

Tuesday, January 14, 2014

Week 2 Muddiest Point


Forgot the answers to these questions.

My answers would be 2 for question 1, and 3 and 5 for question 2.


         This table is not covered in week 2.

Thursday, January 9, 2014

Week 2 Reading Notes

IIR sections 1.2, chapters 2 and 3
·          Section 1.2 illustrates the construction of the inverted index. The indexing result is split into a dictionary and postings. It is important to choose a proper data structure for the postings lists, because the postings may be extremely large, and the storage space needs to be taken into consideration.
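The basic construction can be sketched in a few lines (my own simplification: tokenization here is just a naive lowercase split, and the dictionary and postings live in one Python dict rather than separate structures):

```python
# Minimal inverted index construction (IIR Section 1.2):
# map each term to the sorted list of docIDs it occurs in.
def build_index(docs):
    """docs: dict mapping docID -> document text."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):   # dedupe within a doc
            index.setdefault(term, []).append(doc_id)
    for postings in index.values():
        postings.sort()                          # keep postings sorted
    return index
```

Keeping each postings list sorted by docID is what later enables the linear-time merge used for Boolean AND queries.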

·          Chapter 2 explains the term vocabulary and postings lists in detail.
·          Different decoders are needed for different documents because of the various encoding schemes, like ASCII and UTF-8. More importantly, choosing an indexing granularity is necessary for a large document: for example, split the document into small units, index each unit, and then merge the separate results into an overall result.
·          One interesting issue is how to tokenize the document and normalize the terms. How do we determine the meaningful terms and character sequences? Some special characters/words need further handling, like apostrophes, stop words, and hyphenation.
·          I agree with the idea that “lowercasing everything often remains the most practical solution”, because when users type a query they usually use lowercase even when uppercase would be correct, such as typing windows instead of Windows.
·          In summary, there are several important issues in preprocessing documents: tokenization, stop words, normalization, stemming, and lemmatization.
·          There are also different ways to improve the efficiency of IR, like intersecting postings lists with skip pointers.
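A sketch of intersection with skip pointers, to see how the skips help (my own rendition: placing skips every √length entries is the heuristic the book suggests, but computing them on the fly instead of storing them in the index is my simplification):

```python
import math

# Postings intersection with skip pointers (IIR Section 2.3).
# When one pointer lags far behind, we jump ahead in strides of
# sqrt(len) as long as the skipped-to docID is still below the target,
# instead of advancing one posting at a time.
def intersect_with_skips(p1, p2):
    skip1 = max(1, int(math.sqrt(len(p1))))
    skip2 = max(1, int(math.sqrt(len(p2))))
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            # take skips while they land strictly below the target
            while i + skip1 < len(p1) and p1[i + skip1] < p2[j]:
                i += skip1
            i += 1
        else:
            while j + skip2 < len(p2) and p2[j + skip2] < p1[i]:
                j += skip2
            j += 1
    return answer
```

The result is identical to the plain merge; the skips only reduce the number of comparisons when the lists are long and the intersection is sparse.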

·          Chapter 3 focuses on the data structures that help the search for terms in an inverted index, wildcard queries, and tolerant retrieval.
·          This is the first time I have seen the terminology “wildcard query”. We can use two trees to handle wildcard queries with a single *: the normal B-tree and a reverse B-tree. And there are two techniques for handling general wildcard queries: permuterm indexes and k-gram indexes.
·          In Section 3.3.1, when two correctly spelled queries are tied (or nearly tied), the author introduces two heuristics to select the one that is more common: considering the number of occurrences of the term in the collection, or using the correction that is most common among queries typed in by other users. In my opinion, I prefer the latter because it is closer to more users’ information needs.
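The permuterm idea can be sketched quickly (my own toy version: a real permuterm index stores the rotations in a B-tree so the prefix lookup is fast, whereas this scans a dict for clarity):

```python
# Toy permuterm index (IIR Chapter 3): append the end marker $ to each
# term and index every rotation. A query X*Y is rotated to Y$X*, which
# turns any single-* wildcard query into a prefix lookup.
def permuterm_rotations(term):
    t = term + '$'
    return [t[i:] + t[:i] for i in range(len(t))]

def build_permuterm(terms):
    index = {}
    for term in terms:
        for rot in permuterm_rotations(term):
            index[rot] = term
    return index

def lookup(index, query):
    """Answer a query containing exactly one *, e.g. 'h*o' -> 'o$h*'."""
    x, y = query.split('*')
    prefix = y + '$' + x
    return sorted({t for rot, t in index.items() if rot.startswith(prefix)})
```

For example, "hel*" rotates to "$hel*" and matches both hello and help, because each stored rotation carries the full term it came from.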




I don't understand the above figure (Figure 3.6 in Chapter 3).