COMP249: Web Search
Steve Cassidy and Rolf Schwitter
The Web as Data Store
- The Web can be seen as a very large, unstructured data
store
- There exist hundreds of millions of Web pages, but
there is no central index
- Even worse: it is unknown where all the Web servers
are
- The solution: search engines and directories
How does a Search Engine work?
Have you ever wondered ...
- ... why a search engine is so fast?
- ... how the search engine can find something?
- ... where the search happens?
Search Engines
A search engine basically consists of:
- A
crawler that finds and fetches web pages
("offline")
- An
indexer that extracts words from web pages,
generates an inverted index and stores this index in a huge
database ("offline")
- The
query processor compares a search query of the user
with the index and recommends the most relevant documents
("online")
Collecting and Storing the Data
Web Crawlers
- The Web can be seen as a network consisting of nodes (pages)
and arcs (links)
- There is no unique starting point to access all web
pages
- Take heavily used servers with very popular pages as
starting point
- Index these web pages and follow hyperlinks to other
pages
- Continue indexing until you only see pages that are
already indexed
- This is in essence the task of a
crawler (also called spider) of a search
engine
Term Indexing
- Terms (important words) are extracted by the indexer
from web pages that the crawler retrieved
- Stop words (function words) are not extracted by the
indexer
- The extracted terms are stored as
inverted index with pointers to the documents:
Term1: [URL1, URL2, URL3]
Term2: [URL2, URL4, URL5, URL6]
- Weights are (sometimes) assigned to each entry
- Hits are automatically ranked and presented by
relevance
- The algorithms for ranking are often company secrets
(Google considers over a 100 factors)
What Information to Keep?
Relevance - Internal Factors
Factors based on a Web page's content:
- Word frequency (how often search terms occur in a
page)
- Location of search terms in the page (title, top of the
page)
- Relational clustering (how many pages in a site contain
a term)
- Page design (does it use frames, does it load
fast)
Relevance - External Factors
Factors external to the page:
- Link popularity (number of links pointing to a
page)
- Click popularity (pages visited more often are
prioritised)
- Sector popularity (pages visited by demographic
groups)
- Business alliances among services
- Pay-for-placement rankings
How Google PageRank Works
- Google relies on the democratic nature of the Web
- A link from page A to page B is a vote by A for B
- In addition Google analyses the page that casts the
vote
- Votes from pages that are "important" weigh more
heavily
- These votes help to make other pages "important"
- Google combines page ranking with text-matching
techniques
- Pages have to be both important and relevant to a
search
- Read:
How
Google
works
Boolean Search
Boolean Search
- Idea: Convert boolean operations into set operations
- Taking advantage of the information stored in
inverted indices
- Return the pages that satisfy the boolean query
- Need to implement an additional method to rank the
pages
Boolean Search - Example
- Documents:
D1 = {unit, comp225, exam},
D2 = {comp225},
D3 = {unit, comp225, comp248, exam},
D4 = {unit, comp248},
D5 = {comp248, comp249, exam}
D6 = {unit}, D7 = {unit, comp249}
- Inverted index:
unit: {D1, D3, D4, D6, D7}
comp225: {D1, D2, D3}
comp248: {D3, D4, D5}
comp249: {D5, D7}
exam: {D1, D3, D5}
Example (cont'd)
- The query:
unit & (comp225 or comp249) & not exam
- Map
& to INTERSECTION, or to UNION, & not to DIFFERENCE:
- The equivalent set operation for this query is:
({D1, D3, D4, D6, D7}
INTERSECTION
({D1, D2, D3} UNION {D5, D7})
)
DIFFERENCE
{D1, D3, D5} = {D7}
Implicit Boolean Queries
- But my favourite search engine does not use boolean
queries ... does it?
- Have a look at the advanced search mode
- Implied boolean query
- By default all query words are connected with &
(or)
- +: explicit &
- -: explicit negation
Making use of Meta-Data
What are Meta Tags?
- Meta tags are data that are not displayed by the
browser but they can be accessed by crawlers
- Meta tags are inserted between
<head>and
</head>tags
- Meta tags allow you to interact with the search
engines
Alternatives to Search Engines
Directories
- Directories (Yahoo!,
dmoz) have real people who review
and index links
- Directory editors look at the quality of a site:
- functionality
- content
- design
- As a result, directories, indexes tend to contain
high-quality links, but:
- It may take long for a web page to be included in the
directory
- The coverage of the directory is much smaller than
that of a search engine
Hybrid Search Engines
- Hybrid search engines combine a directory with a search
engine
- Most search sites are hybrids
- For example, Yahoo! started out as a directory
- Yahoo! later supplemented its manually compiled listings with
search results from Google
- Yahoo! announced the use of an in-house search engine
instead (
The Australian , 24 Feb 2004)
- Google uses Open Directory Project's directory to
enrich its automatically generated listings
Metasearch
- Unlike search engines, meta search engines do not crawl
the Web
- Instead, they allow searches to be sent to several
search engines
- The results are then blended together onto one
page
Limitations of Search Engines
The "Invisible" Web
- Opaque Web: search engines choose not to index
- Private Web: password protected
- Proprietary Web: registration required (either fee or free)
- Truly invisible Web:
- cannot search certain file formats and
databases
- most crawlers do not crawl sites that include
scripts
Recall
- Not all the relevant web pages are found:
- Because they were not indexed
- Because the retrieval engine failed to spot their
relevance
- Recall is the number of documents the search engine got
right divided by the number of possible right
documents
- Sometimes high recall is not important, as long as we
find
one web page that is relevant
Precision
- Irrelevant web pages can be found
- How is this possible?
- Ambiguous terms (e.g. "chips (silicon)" and "chips (food)" and "chips (wood)")
- Spamming techniques (e.g. the authors introduce irrelevant metadata)
- The topic of the web page is just missed (need for
more sophisticated language technology techniques)
- A document is more than the sum of its
keywords
- Precision is the number of documents the search engine
got right divided by the number of documents the search
engine found.
Example
To find web pages about sales of beef in Chile, try the following queries:
Future Developments
Concept Based Search
- One research area is concept-based searching:
- car, automobile (synonymy)
- station wagon, car, vehicle (hyponymy)
- Some of this research involves using
- statistical analysis
- ontologies (collections of standardised terms)
in order to find other pages you might be interested
in
- The information stored is greater for a concept-based
search engine
- Far more processing is required for each search
Natural Language-based Search
- The idea behind natural-language queries is that you
can type a question in the same way you would ask it to a
human
Who is the president of the United States?
- No need to keep track of Boolean operators or query
structures
- The most popular natural language query site is
AskJeeves
- AskJeeves parses the query for keywords
- It then applies the keywords to a list of canned
questions of which it knows the answer (typically linked to
generic Web databases)
- Recent entrant: Powerset
tries to understand the target documents.
Web-based Question Answering
- We start seeing sites that search the Web for specific
facts
- Rather, "factoids": the answer may be wrong!
- Examples:
-
START[http://www.ai.mit.edu/projects/infolab/]
-
BrainBoost[http://www.brainboost.com/]
- Question answering is a hot research topic (see (
TREC[http://trec.nist.gov] ))