THE GOOGLE SIMILARITY DISTANCE
ABSTRACT
Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers, the equivalent of 'society' is 'database,' and the equivalent of 'use' is 'way to search the database.' We present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity. To fix thoughts, we use the world-wide-web as database and Google as search engine; the method is also applicable to other search engines and databases. This theory is then applied to construct a method to automatically extract the similarity, the Google similarity distance, of words and phrases from the world-wide-web using Google page counts.
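The distance extracted from page counts is the normalized Google distance (NGD). The sketch below shows the computation from the raw counts; the counts used in the example calls are illustrative placeholders, not real Google page counts.

```python
import math

def ngd(f_x, f_y, f_xy, n):
    """Normalized Google distance computed from page counts.

    f_x, f_y -- number of pages containing each term on its own
    f_xy     -- number of pages containing both terms together
    n        -- total number of pages indexed by the search engine
    """
    log_fx, log_fy = math.log(f_x), math.log(f_y)
    log_fxy = math.log(f_xy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))

# Illustrative counts only: terms that co-occur on many pages
# come out close to 0 ...
close = ngd(10_000, 8_000, 6_000, 10**10)
# ... while terms that rarely co-occur get a larger distance.
far = ngd(10_000, 8_000, 50, 10**10)
```

Semantically related terms appear together on a large fraction of the pages where either appears, so their NGD is near 0; unrelated terms yield a larger value.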
The world-wide-web is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality. We give applications in hierarchical clustering, classification, and language translation.
We give examples to distinguish between colors and numbers, cluster names of paintings by 17th-century Dutch masters and names of books by English novelists, demonstrate the ability to understand emergencies and primes, and perform a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification using support vector machines to learn categories based on our Google distance, resulting in a mean agreement of 87% with the expert-crafted WordNet categories.
SYSTEM ANALYSIS
EXISTING SYSTEM
Since Google is the most popular search engine, many webmasters have become eager to influence their websites' Google rankings. An industry of consultants has arisen to help websites increase their rankings on Google and on other search engines. This field, called search engine optimization, attempts to discern patterns in search engine listings and then develop a methodology for improving rankings to draw more searchers to their clients' sites.
PROPOSED SYSTEM
Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved in using the additional information present in hypertext to produce better search results. Fast crawling technology is needed to gather the Web documents and keep them up to date. Storage space must be used efficiently to store indices and, optionally, the documents themselves. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at the rate of hundreds to thousands per second.
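The indexing and query-handling requirements above can be illustrated with a minimal in-memory inverted index. This is a toy sketch, not the system's actual storage format: each term maps to the set of document ids containing it, and a conjunctive query intersects those sets.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Tiny illustrative corpus
docs = {
    1: "fast crawling technology gathers web documents",
    2: "storage space stores indices and documents",
    3: "queries must be handled quickly",
}
index = build_index(docs)
hits = search(index, "documents")  # documents 1 and 2
```

A production engine adds compressed postings lists on disk, term positions for phrase queries, and ranking, but the lookup-and-intersect core is the same reason queries can be answered in milliseconds rather than by scanning every document.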
RESOURCE REQUIREMENTS
Hardware Requirements:
Processor : Pentium 4
Processor Speed : 2.40 GHz
RAM : 512 MB
Hard Disk : 80 GB
CD Drive : Samsung 52X
Software Requirements:
Environment : Visual Studio .NET 2005
.NET Framework : Version 2.0
Language : ASP.NET with C#
Operating System : Windows 2000/XP
Back End : SQL Server 2000