IS2140 Hanying Huang: Unit 10 Reading Notes

Unit 9 Reading Notes

IIR 19 & 21

- The essential feature that led to the explosive growth of the web, decentralized content publishing with essentially no central control of authorship, is also the biggest challenge for web search engines in their quest to index and retrieve this content.

- Web search users tend to not know (or care) about the heterogeneity of web content, the syntax of query languages, or the art of phrasing queries.

- Common web search queries can be grouped into three broad categories: (i) informational, (ii) navigational and (iii) transactional.

- Among the sampling approaches for estimating the relative sizes of search engine indexes, the random queries approach is noteworthy for two reasons: it has been successfully built upon for a series of increasingly refined estimates, and conversely it has turned out to be the approach most likely to be misinterpreted and carelessly implemented, leading to misleading measurements.

- Chapter 21 focuses on the use of hyperlinks for ranking web search results.

- A technique for link analysis:

  - PageRank. The PageRank of a node depends on the link structure of the web graph.

- Given a query, every web page is assigned two scores: a hub score and an authority score.

- Procedure for compiling the subset of the Web for which to compute hub and authority scores:

  - Given a query (say leukemia), use a text index to get all pages containing leukemia. Call this the root set of pages.

  - Build the base set of pages, which includes the root set as well as any page that either links to a page in the root set or is linked to by a page in the root set.

  - Use the base set for computing hub and authority scores.
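The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the page names, page texts and links are made up, a plain dictionary scan stands in for the text index, and the helper names `build_base_set` and `hits` are hypothetical. Scores are updated iteratively (an authority score sums the hub scores of pages linking in, a hub score sums the authority scores of pages linked to) and normalized each round.

```python
from math import sqrt

def build_base_set(root, out_links):
    """Root set plus any page that links to, or is linked to by, a root page."""
    root = set(root)
    base = set(root)
    for src, targets in out_links.items():
        if src in root:
            base.update(targets)            # pages linked to by a root page
        if any(t in root for t in targets):
            base.add(src)                   # pages that link to a root page
    return base

def hits(pages, out_links, iterations=50):
    """Iteratively compute hub and authority scores on the given pages."""
    pages = set(pages)
    # Keep only links whose two endpoints are both in the page set.
    links = [(s, t) for s in pages for t in out_links.get(s, ()) if t in pages]
    hub = dict.fromkeys(pages, 1.0)
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking to the page.
        auth = dict.fromkeys(pages, 0.0)
        for s, t in links:
            auth[t] += hub[s]
        # Hub score: sum of authority scores of pages the page links to.
        hub = dict.fromkeys(pages, 0.0)
        for s, t in links:
            hub[s] += auth[t]
        # Normalize so the scores stay bounded.
        for scores in (hub, auth):
            norm = sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical toy collection (page name -> text) and link graph.
docs = {
    "a": "leukemia treatment overview",
    "b": "childhood leukemia research",
    "c": "hospital directory",
    "h1": "cancer resources",
    "h2": "health links",
    "x": "cooking",
    "y": "recipes",
}
out_links = {"h1": ["a", "b"], "h2": ["a"], "a": ["c"], "x": ["y"]}

# Step 1: root set = pages containing the query term (stand-in for a text index).
root = {page for page, text in docs.items() if "leukemia" in text.split()}
# Step 2: base set. Step 3: hub and authority scores on the base set.
base = build_base_set(root, out_links)
hub, auth = hits(base, out_links)

print(sorted(base))
print("top authority:", max(auth, key=auth.get))
print("top hub:", max(hub, key=hub.get))
```

In this toy graph, pages `x` and `y` never enter the base set because they neither contain the query term nor touch a root page, so they play no part in the scoring.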

Reading: Authoritative Sources in a Hyperlinked Environment

- This paper introduces a technique for locating high-quality information related to a broad search topic on the WWW, based on a structural analysis of the link topology surrounding “authoritative” pages on the topic.

- It defines an algorithm for identifying hubs and authorities in a sub-graph of the WWW with respect to a broad search topic, and describes some applications of this algorithm.

Reading: The Anatomy of a Large-Scale Hypertextual Web Search Engine

- Google is designed to provide high-quality search results over a rapidly growing World Wide Web.

- The biggest problem facing users of web search engines today is the quality of the results they get back.

- The use of link text as a description of what the link points to helps the search engine return relevant (and to some degree high-quality) results. Finally, the use of proximity information helps increase relevance a great deal for many queries.

- Google is efficient in both space and time, and constant factors are very important when dealing with the entire Web. Google is also a research tool.
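Google's ranking builds on PageRank, which the IIR 21 notes above describe as depending on the link structure of the web graph. A minimal sketch of that idea follows, assuming a random-surfer formulation with a teleport probability of 0.15 (an illustrative value, not one taken from these notes) and a made-up four-page graph; the function name `pagerank` is hypothetical and this is not the paper's production implementation.

```python
def pagerank(out_links, teleport=0.15, iterations=100):
    """Power-iteration PageRank on a small graph given as page -> list of targets."""
    pages = set(out_links) | {t for targets in out_links.values() for t in targets}
    n = len(pages)
    rank = dict.fromkeys(pages, 1.0 / n)
    for _ in range(iterations):
        # Every page receives the teleport share of the total rank.
        new_rank = dict.fromkeys(pages, teleport / n)
        for page in pages:
            targets = out_links.get(page, [])
            if targets:
                # Split the remaining rank evenly over the out-links.
                share = (1 - teleport) * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                for q in pages:
                    new_rank[q] += (1 - teleport) * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical toy web graph: page "c" is linked to by three other pages.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)

for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

The page with the most in-links, `c`, ends up with the highest score, and `a` comes close behind because it receives all of the rank that `c` passes on: a page's score depends on who links to it, not only on how many links it has.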


Friday, March 21, 2014
