THE GOOGLE SIMILARITY DISTANCE
ABSTRACT
Words and phrases acquire meaning from the way they are used in society,
from their relative semantics to other words and phrases. For computers the
equivalent of ‘society’ is ‘database,’ and the equivalent of ‘use’ is ‘way to
search the database.’ We present a new theory of similarity between words and
phrases based on information distance and Kolmogorov complexity. To fix
thoughts we use the world-wide-web as database, and Google as search engine.
The method is also applicable to other search engines and databases. This
theory is then applied to construct a method to automatically extract
similarity, the Google similarity distance, of words and phrases from the
world-wide web using Google page counts.
The world-wide-web is the largest database on earth, and the context
information entered by millions of independent users averages out to provide
automatic semantics of useful quality. We give applications in hierarchical
clustering, classification, and language translation. We give examples to
distinguish between colors and numbers, cluster names of paintings by 17th
century Dutch masters and names of books by English novelists, the ability to
understand emergencies, and primes, and we demonstrate the ability to do a simple
automatic English-Spanish translation. Finally, we use the WordNet database as
an objective baseline against which to judge the performance of our method. We
conduct a massive randomized trial in binary classification using support
vector machines to learn categories based on our Google distance, resulting in
a mean agreement of 87% with the expert-crafted WordNet categories.
INTRODUCTION
Objects can be given literally, like the literal
four-letter genome of a mouse, or the literal text of War and Peace by Tolstoy.
For simplicity we take it that all meaning of the object is represented by the
literal object itself. Objects can also be given by name, like “the four-letter
genome of a mouse,” or “the text of War and Peace by Tolstoy.” There are
also objects that cannot be given literally, but only by name, and that acquire
their meaning from their contexts in background common knowledge in humankind,
like “home” or “red.” To make computers more intelligent one would like to
represent meaning in computer-digestible form. Long-term and labor-intensive efforts
like the Cyc project and the WordNet project try to establish
semantic relations between common objects, or, more precisely, names for
those objects. The idea is to create a semantic web of such vast proportions that
rudimentary intelligence, and knowledge about the real world, spontaneously
emerge. This comes at the great cost of designing structures capable of
manipulating knowledge, and entering high quality contents in these structures
by knowledgeable human experts.
While the efforts are long running and large scale,
the overall information entered is minute compared to what is available on the
world-wide web. The rise of the world-wide-web has enticed millions of users to
type in trillions of characters to create billions of web pages of on average
low quality contents. The sheer mass of the information about almost every
conceivable topic makes it likely that extremes will cancel and the majority or
average is meaningful in a low-quality approximate sense. We devise a general
method to tap the amorphous low-grade knowledge available for free on the
world-wide web, typed in by local users aiming at personal gratification of
diverse objectives, and yet globally achieving what is effectively the largest
semantic electronic database in the world. Moreover, this database is available
for all by using any search engine that can return aggregate page-count estimates
for a large range of search-queries, like Google.
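The text above does not spell out the distance itself. As a minimal sketch, assuming the normalized form in which the Google similarity distance is usually stated, it can be computed from three page counts and a normalizing constant. The function name `ngd`, its parameter names, and the sample counts are illustrative assumptions, not values taken from this project.

```python
import math


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google distance computed from page counts.

    fx, fy -- page counts for each search term on its own
    fxy    -- page count for pages containing both terms
    n      -- normalizing constant (assumed: total number of indexed pages)
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Terms that never occur, or never occur together, are treated
        # as infinitely far apart.
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))


# Illustrative counts only (assumed, not measured for this project).
print(round(ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651), 3))
```

Smaller values mean the two terms tend to appear on the same pages; a term compared with itself gives 0.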
PROJECT DESCRIPTION
This project has seven modules:
1. User Page
2. Home Page
3. Adding New Keyword
4. New User Registration
5. User Login
6. Feedback
7. Search Engine
Module 1: USER PAGE
The first module is the User Page. From here the user can follow any of the
available links: Home Page, Books, Images, Maps, Search Keywords, and News.
Module 2: HOME PAGE
The Home Page is the second module of this project. The user types a keyword
describing the information wanted into a text box and submits it. The keyword
is looked up in a table called Google, and the corresponding information is
displayed on this web page.
Module 3: ADDING NEW KEYWORD
The third module of this project is Adding New Keyword. The admin has
permission to add keywords to the Google table by entering a keyword, its data,
and an author name.
Module 4: NEW USER REGISTRATION
The fourth module of this project is New User Registration. The user enters
details such as User ID, Username, Password, Confirm Password, Gender, and
Address. These details are stored in the New Forms database.
Module 5: USER LOGIN
The fifth module of this project is User Login. The user supplies a username
and password, which are checked. Only a valid user can then submit a keyword,
data, and author name, and these details are stored in the Google table.
Module 6: FEEDBACK
The sixth module of this project is Feedback. The user can give any type of
feedback by providing a user ID, username, and description, and then submitting
the form. The feedback is stored in the Feedback details database.
Module 7: SEARCH ENGINE
The seventh module is the Search Engine. The user enters a keyword, the search
engine looks it up, and the matching result from the database is displayed.
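The keyword flow described in the modules above (add a keyword with its data and author, then look it up) can be sketched as follows. This is a minimal sketch under stated assumptions: it uses Python's built-in `sqlite3` with an in-memory database as a stand-in for the project's SQL Server back end, and the column names and the helpers `create_store`, `add_keyword`, and `search_keyword` are hypothetical. Only the table name Google comes from the description.

```python
import sqlite3


def create_store() -> sqlite3.Connection:
    """Create an in-memory stand-in for the keyword table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Google ("
        "keyword TEXT PRIMARY KEY, data TEXT NOT NULL, author TEXT NOT NULL)"
    )
    return conn


def add_keyword(conn: sqlite3.Connection, keyword: str, data: str, author: str) -> None:
    """Store one keyword entry; keywords are normalized to lower case."""
    conn.execute(
        "INSERT INTO Google (keyword, data, author) VALUES (?, ?, ?)",
        (keyword.strip().lower(), data, author),
    )
    conn.commit()


def search_keyword(conn: sqlite3.Connection, keyword: str) -> list:
    """Return all rows whose keyword matches exactly after normalization."""
    cur = conn.execute(
        "SELECT keyword, data, author FROM Google WHERE keyword = ?",
        (keyword.strip().lower(),),
    )
    return cur.fetchall()
```

Parameterized queries (the `?` placeholders) keep user-supplied keywords from being interpreted as SQL, which matters because the keyword comes straight from a text box.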
RESOURCE REQUIREMENTS
Hardware Requirements:
Processor : Pentium 4
Processor Speed : 2.40 GHz
RAM : 512 MB
Hard Disk : 80 GB
Software Requirements:
Environment : Visual Studio .NET 2005
.NET Framework : Version 2.0
Language : ASP.NET with C#
Operating System : Windows 2000/XP
Back End : SQL Server 2000
SYSTEM ANALYSIS
EXISTING SYSTEM
Since Google is the most popular search engine, many webmasters
have become eager to influence their websites' Google rankings. An industry of
consultants has arisen to help websites increase their rankings on Google and
on other search engines. This field, called search engine optimization, attempts to
discern patterns in search engine listings, and then develop a methodology for
improving rankings to draw more searchers to their clients' sites.
PROPOSED SYSTEM
Apart from the problems of scaling
traditional search techniques to data of this magnitude, there are new
technical challenges involved with using the additional information present in
hypertext to produce better search results. Fast crawling technology is needed
to gather the Web documents and keep them up to date. Storage space must be
used efficiently to store indices and, optionally, the documents themselves.
The indexing system must process hundreds of gigabytes of data efficiently.
Queries must be handled quickly, at the rate of hundreds to thousands per
second.
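To make the indexing requirement concrete, here is a toy inverted index. It is only a sketch, assuming whitespace tokenization and an in-memory dictionary; a production index would need compression, on-disk storage, and ranking. The names `build_index` and `search` are hypothetical.

```python
from collections import defaultdict


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Map each lower-cased token to the set of document ids containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return dict(index)


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Copy the first posting set so intersecting does not mutate the index.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

The size of the set returned by `search` is a page count for the query, which is the kind of aggregate figure the similarity distance relies on.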