Unit 9 Reading Notes
IIR Chapters 19 & 21
• The essential feature that led to the explosive growth of the web (decentralized content publishing with essentially no central control of authorship) is also the biggest challenge for web search engines in their quest to index and retrieve this content.
• Web search users tend not to know (or care) about the heterogeneity of web content, the syntax of query languages, or the art of phrasing queries.
• Common web search queries can be grouped into three broad categories: (i) informational, (ii) navigational and (iii) transactional.
• Among the sampling approaches for estimating the relative sizes of search engine indexes, the random queries approach is noteworthy for two reasons: it has been successfully built upon for a series of increasingly refined estimates, and, conversely, it has turned out to be the approach most likely to be misinterpreted and carelessly implemented, leading to misleading measurements.
• Chapter 21 focuses on the use of hyperlinks for ranking web search results.
• A technique for link analysis:
  ▪ PageRank. The PageRank of a node depends on the link structure of the web graph.
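To make the dependence on link structure concrete, here is a minimal sketch of PageRank by power iteration. The notes do not give the formula, so the teleport probability `alpha`, the uniform handling of pages without out-links, the iteration count and the toy graph `toy_links` are all assumptions made for illustration.

```python
def pagerank(links, alpha=0.1, iterations=50):
    """Approximate PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    alpha: assumed teleport probability (jump to a uniformly random page).
    """
    # Collect every page that appears as a source or a target.
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page receives the teleport share first.
        new_rank = {p: alpha / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Split the remaining rank evenly over the out-links.
                share = (1 - alpha) * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Assumption: a page with no out-links jumps uniformly at random.
                share = (1 - alpha) * rank[p] / n
                for q in pages:
                    new_rank[q] += share
        rank = new_rank
    return rank


# Hypothetical four-page web graph.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, page C is linked to by three pages and ends up with the highest score, while D, which nothing links to, keeps only the teleport share.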
• Given a query, every web page is assigned two scores: a hub score and an authority score.
• Procedure for compiling the subset of the Web for which to compute hub and authority scores:
  ▪ Given a query (say, leukemia), use a text index to get all pages containing leukemia. Call this the root set of pages.
  ▪ Build the base set of pages, to include the root set as well as any page that either links to a page in the root set or is linked to by a page in the root set.
  ▪ Use the base set for computing hub and authority scores.
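The procedure above can be sketched in a few lines. This is an illustration under stated assumptions, not the textbook's code: the root set is given directly (standing in for the text-index lookup), the link graph `web_links` and its page names are hypothetical, and the iterative update (authority = sum of hub scores of in-linking pages, hub = sum of authority scores of linked pages, then normalize) with a fixed iteration count is one common way to compute the two scores.

```python
import math


def build_base_set(root_set, links):
    """Root set plus every page that links to it or is linked to by it."""
    base = set(root_set)
    for src, targets in links.items():
        for dst in targets:
            if src in root_set:
                base.add(dst)   # linked to by a root page
            if dst in root_set:
                base.add(src)   # links to a root page
    return base


def hits(base_set, links, iterations=20):
    """Iteratively compute hub and authority scores on the base set."""
    # Keep only the links whose endpoints are both in the base set.
    edges = [(s, d) for s, targets in links.items() for d in targets
             if s in base_set and d in base_set]
    hub = {p: 1.0 for p in base_set}
    auth = {p: 1.0 for p in base_set}
    for _ in range(iterations):
        # Authority score: sum of hub scores of pages linking in.
        auth = {p: 0.0 for p in base_set}
        for s, d in edges:
            auth[d] += hub[s]
        # Hub score: sum of authority scores of pages linked to.
        hub = {p: 0.0 for p in base_set}
        for s, d in edges:
            hub[s] += auth[d]
        # Normalize so the scores stay bounded.
        for scores in (hub, auth):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Hypothetical link graph; the root set stands in for "pages containing leukemia".
web_links = {
    "portal": ["clinic", "foundation"],
    "directory": ["clinic", "foundation", "blog"],
    "clinic": ["journal"],
    "unrelated": ["elsewhere"],
}
root = {"clinic", "foundation"}
base = build_base_set(root, web_links)
hub, auth = hits(base, web_links)
print(sorted(base))
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

Here "blog" stays outside the base set because only "directory", which is not in the root set, links to it. The pages that point at many root pages come out as the strongest hubs.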
Reading: Authoritative Sources in a Hyperlinked Environment
• This paper introduces a technique for locating high-quality information related to a broad search topic on the WWW, based on a structural analysis of the link topology surrounding “authoritative” pages on the topic.
• It defines an algorithm for identifying hubs and authorities in a subgraph of the WWW with respect to a broad search topic, and describes some applications of this algorithm.
Reading: The Anatomy of a Large-Scale Hypertextual Web Search Engine
• Google is designed to provide high-quality search results over a rapidly growing World Wide Web.
• The biggest problem facing users of web search engines today is the quality of the results they get back.
• The use of link text as a description of what the link points to helps the search engine return relevant (and to some degree high-quality) results. The use of proximity information also helps increase relevance a great deal for many queries.
• Google is designed to be efficient in both space and time, because constant factors are very important when dealing with the entire Web. Google is also a research tool.