Friday, March 21, 2014

Unit 10 Reading Notes

Unit 9 Reading Notes
IIR 19 & 21
•   The essential feature that led to the explosive growth of the web – decentralized content publishing with essentially no central control of authorship – is also the biggest challenge for web search engines in their quest to index and retrieve this content.
•   Web search users tend not to know (or care) about the heterogeneity of web content, the syntax of query languages, or the art of phrasing queries.
•   Common web search queries fall into three broad categories: (i) informational, (ii) navigational and (iii) transactional.
•   Among the sampling approaches for estimating the size of a search engine's index, the random-queries approach is noteworthy for two reasons: it has been successfully built upon for a series of increasingly refined estimates, and conversely it has turned out to be the approach most likely to be misinterpreted and carelessly implemented, leading to misleading measurements.
•   Chapter 21 focuses on the use of hyperlinks for ranking web search results.
•   Technique for link analysis:
    ▪   PageRank. The PageRank of a node depends on the link structure of the web graph.
•   In the hubs-and-authorities approach (HITS), given a query, every web page is assigned two scores: a hub score and an authority score.
•   Procedure for compiling the subset of the Web for which to compute hub and authority scores:
    ▪   Given a query (say leukemia), use a text index to get all pages containing leukemia. Call this the root set of pages.
    ▪   Build the base set of pages, to include the root set as well as any page that either links to a page in the root set, or is linked to by a page in the root set.
    ▪   Use the base set for computing hub and authority scores.
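
The root-set and base-set steps above can be sketched in a few lines of Python. This is only a minimal illustration under my own assumptions: `text_index` (term to pages) and `out_links` (page to the pages it links to) are hypothetical in-memory dictionaries with made-up page names, standing in for a real text index and link graph.

```python
def build_base_set(query_term, text_index, out_links):
    """Return (root_set, base_set) for a one-word query.

    text_index: dict mapping a term to the pages that contain it (assumed).
    out_links:  dict mapping a page to the pages it links to (assumed).
    """
    # Root set: every page that contains the query term.
    root = set(text_index.get(query_term, ()))

    # Base set starts as the root set ...
    base = set(root)

    # ... plus every page that a root page links to ...
    for page in root:
        base.update(out_links.get(page, ()))

    # ... plus every page that links to some root page.
    for page, targets in out_links.items():
        if root.intersection(targets):
            base.add(page)

    return root, base


# Toy data, invented purely for illustration.
text_index = {"leukemia": ["a", "b"]}
out_links = {
    "a": ["c"],  # root page a links out to c
    "b": ["a"],
    "d": ["b"],  # d links into the root set
    "e": ["f"],  # unrelated pages
}

root, base = build_base_set("leukemia", text_index, out_links)
print(sorted(root))  # ['a', 'b']
print(sorted(base))  # ['a', 'b', 'c', 'd']
```

Pages e and f stay out of the base set because they neither contain the term nor touch the root set by a link.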

Reading: Authoritative Sources in a Hyperlinked Environment
•   This paper introduces a technique for locating high-quality information related to a broad search topic on the WWW, based on a structural analysis of the link topology surrounding “authoritative” pages on the topic.
•   It defines an algorithm for identifying hubs and authorities in a subgraph of the WWW with respect to a broad search topic, and describes some applications of this algorithm.
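
To make the hub/authority idea concrete, here is a rough Python sketch of the iterative update on a small subgraph: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The function name `hits`, the fixed iteration count, the L2 normalization and the toy graph are my own assumptions for illustration, not details taken from the paper.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0.0:
        return dict(scores)
    return {k: v / norm for k, v in scores.items()}


def hits(out_links, iterations=50):
    """Return (hub, authority) score dicts for a small link graph.

    out_links: dict mapping a page to a list of distinct pages it links to.
    """
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)

    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority update: add up hub scores of the pages pointing at each page.
        new_auth = {n: 0.0 for n in nodes}
        for src, targets in out_links.items():
            for dst in targets:
                new_auth[dst] += hub[src]

        # Hub update: add up the new authority scores of the pages pointed to.
        new_hub = {
            n: sum(new_auth[dst] for dst in out_links.get(n, ()))
            for n in nodes
        }

        auth = normalize(new_auth)
        hub = normalize(new_hub)

    return hub, auth


# Toy subgraph, invented for illustration: h1 and h2 act as hubs.
graph = {"h1": ["a1", "a2"], "h2": ["a1"]}
hub, auth = hits(graph)
print(max(auth, key=auth.get))  # a1
print(max(hub, key=hub.get))    # h1
```

In this toy graph a1 is pointed to by both hubs, so it ends up as the top authority, and h1 points to both authorities, so it ends up as the top hub.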

Reading: The Anatomy of a Large-Scale Hypertextual Web Search Engine
•   Google is designed to provide high-quality search results over a rapidly growing World Wide Web.
•   The biggest problem facing users of web search engines today is the quality of the results they get back.
•   The use of link text as a description of what the link points to helps the search engine return relevant (and to some degree high-quality) results. The use of proximity information also helps increase relevance a great deal for many queries.
•   Google is designed to be efficient in both space and time, because constant factors are very important when dealing with the entire Web. Google is also intended to be a research tool.
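
Google's ranking in this paper builds on PageRank, the link-analysis technique noted under IIR 21 above. Below is a rough Python sketch of PageRank by repeated iteration on a tiny graph. It is a sketch under my own assumptions: it uses the variant where ranks sum to 1, spreads the rank of pages with no out-links evenly over all pages, and runs a fixed number of iterations. The damping factor 0.85 is the value the paper says is usually used; the function name and the toy graph are made up.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Return a dict of PageRank scores that sum to 1.

    out_links: dict mapping a page to a list of distinct pages it links to.
    """
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)
    n = len(nodes)

    # Start from a uniform distribution over pages.
    rank = {p: 1.0 / n for p in nodes}

    for _ in range(iterations):
        # Rank held by pages with no out-links is shared among all pages
        # (an assumption of this sketch).
        dangling = sum(rank[p] for p in nodes if not out_links.get(p))

        # Every page gets the random-jump share plus the dangling share.
        new_rank = {
            p: (1.0 - damping) / n + damping * dangling / n for p in nodes
        }

        # Each page passes its damped rank equally along its out-links.
        for src, targets in out_links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share

        rank = new_rank

    return rank


# Toy graph, invented for illustration.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = pagerank(graph)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```

Page c collects links from both a and b, and a receives all of c's rank, so those two end up well ahead of b. This matches the note above that a node's PageRank depends on the link structure of the graph.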
