Xtream Technologies: THE GOOGLE SIMILARITY DISTANCE

ABSTRACT

Words and phrases acquire meaning from the way they are used in society, from their semantics relative to other words and phrases. For computers the equivalent of ‘society’ is ‘database,’ and the equivalent of ‘use’ is ‘way to search the database.’ We present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity. To fix thoughts we use the world-wide web as the database and Google as the search engine. The method is also applicable to other search engines and databases. This theory is then applied to construct a method to automatically extract similarity, the Google similarity distance, of words and phrases from the world-wide web using Google page counts.

The world-wide web is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality. We give applications in hierarchical clustering, classification, and language translation. We give examples that distinguish between colors and numbers, cluster names of paintings by 17th-century Dutch masters and names of books by English novelists, and show the ability to understand emergencies and primes, and we demonstrate a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification using support vector machines to learn categories based on our Google distance, resulting in a mean agreement of 87% with the expert-crafted WordNet categories.

INTRODUCTION

Objects can be given literally, like the literal four-letter genome of a mouse, or the literal text of War and Peace by Tolstoy. For simplicity we take it that all meaning of the object is represented by the literal object itself. Objects can also be given by name, like “the four-letter genome of a mouse,” or “the text of War and Peace by Tolstoy.” There are also objects that cannot be given literally, but only by name, and that acquire their meaning from their contexts in background common knowledge in humankind, like “home” or “red.” To make computers more intelligent one would like to represent meaning in computer-digestible form. Long-term and labor-intensive efforts like the Cyc project and the WordNet project try to establish semantic relations between common objects, or, more precisely, names for those objects. The idea is to create a semantic web of such vast proportions that rudimentary intelligence, and knowledge about the real world, spontaneously emerge. This comes at the great cost of designing structures capable of manipulating knowledge, and of having knowledgeable human experts enter high-quality content into these structures.

While these efforts are long-running and large-scale, the overall information entered is minute compared to what is available on the world-wide web. The rise of the world-wide web has enticed millions of users to type in trillions of characters to create billions of web pages of, on average, low-quality content. The sheer mass of the information about almost every conceivable topic makes it likely that extremes will cancel and the majority or average is meaningful in a low-quality approximate sense. We devise a general method to tap the amorphous low-grade knowledge available for free on the world-wide web, typed in by local users aiming at personal gratification of diverse objectives, and yet globally achieving what is effectively the largest semantic electronic database in the world. Moreover, this database is available to all through any search engine that can return aggregate page-count estimates for a large range of search queries, like Google.
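The text above says the distance is derived from page counts but does not spell out the formula. As a minimal sketch, assume the commonly cited definition of the normalized Google distance, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))), where f(x) is the number of pages containing x, f(x, y) is the number containing both terms, and N is the total number of indexed pages. The function name and the page counts used below are illustrative assumptions, not figures from this project.

```python
import math


def normalized_google_distance(fx, fy, fxy, n):
    """Return the NGD of two terms from page counts (assumed definition).

    fx, fy -- page counts for each term on its own
    fxy    -- page count for pages containing both terms
    n      -- total number of pages indexed by the search engine
    """
    # Terms that never occur, or never co-occur, are treated as infinitely far apart.
    if min(fx, fy, fxy) <= 0:
        return float("inf")
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


if __name__ == "__main__":
    # Illustrative counts only; a real run would query a search engine for them.
    distance = normalized_google_distance(
        fx=46_700_000, fy=12_200_000, fxy=2_630_000, n=8_058_044_651
    )
    print(f"NGD = {distance:.3f}")
```

Smaller values mean the two terms tend to appear on the same pages; two terms that always co-occur get a distance of 0.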

PROJECT DESCRIPTION

This project has seven modules:

1. User Page

2. Home Page

3. Adding New Keyword

4. New User Registration

5. User Login

6. Feedback

7. Search Engine

Module 1: USER PAGE

The first module is the User Page. From it the user can follow any of the available links: Home Page, Books, Images, Maps, Search Keywords, and News.

Module 2: HOME PAGE

The Home Page is the second module of this project. The user opens this page, types a keyword into the text box, and submits it. The keyword is looked up in a table called Google, and the corresponding information is displayed on the page.

Module 3: ADDING NEW KEYWORD

The third module of this project is Adding New Keyword. The admin has permission to add keywords to the Google table, entering a keyword, its data, and an author name for each entry.
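Modules 2 and 3 together amount to a lookup and an insert on one table. A minimal sketch follows, using Python's built-in sqlite3 module as a stand-in for the SQL Server 2000 back end and ASP.NET pages named in the requirements. The table name Google and the fields keyword, data, and author come from the descriptions above; the column types, function names, and case-insensitive matching are assumptions.

```python
import sqlite3


def create_keyword_table(conn):
    """Create the keyword table if it does not exist (column types are assumed)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Google ("
        "keyword TEXT NOT NULL, data TEXT NOT NULL, author TEXT NOT NULL)"
    )


def add_keyword(conn, keyword, data, author):
    """Module 3: store a keyword with its data and author name."""
    conn.execute(
        "INSERT INTO Google (keyword, data, author) VALUES (?, ?, ?)",
        (keyword, data, author),
    )
    conn.commit()


def search_keyword(conn, keyword):
    """Module 2: return (data, author) rows whose keyword matches, ignoring case."""
    cursor = conn.execute(
        "SELECT data, author FROM Google WHERE keyword = ? COLLATE NOCASE",
        (keyword,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    create_keyword_table(connection)
    add_keyword(connection, "similarity", "Sample entry text", "admin")
    print(search_keyword(connection, "Similarity"))
```

Parameterized queries (the `?` placeholders) keep user-typed keywords from being interpreted as SQL, which matters for any page that passes text-box input to the database.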

Module 4: NEW USER REGISTRATION

The fourth module of this project is New User Registration. The user enters details such as User ID, Username, Password, Confirm Password, Gender, and Address. These details are stored in the New Forms database.

Module 5: USER LOGIN

The fifth module of this project is User Login. The user supplies a username and password, and these details are checked. Only a valid user can then submit a keyword, data, and author name, which are stored in the Google table.
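The registration and login checks in Modules 4 and 5 can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: an in-memory dictionary stands in for the New Forms database, the function names are hypothetical, and the salted PBKDF2 hashing is an added assumption, since the description does not say how passwords are stored.

```python
import hashlib
import hmac
import os


def hash_password(password, salt):
    """Derive a salted hash so the plain password is never stored (assumed policy)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def register_user(users, user_id, username, password, confirm_password, gender, address):
    """Module 4: validate the form fields and store the new user."""
    if password != confirm_password:
        raise ValueError("Password and Confirm Password do not match")
    if username in users:
        raise ValueError("Username is already registered")
    salt = os.urandom(16)
    users[username] = {
        "user_id": user_id,
        "gender": gender,
        "address": address,
        "salt": salt,
        "password_hash": hash_password(password, salt),
    }


def is_valid_login(users, username, password):
    """Module 5: return True only for a registered username with the right password."""
    record = users.get(username)
    if record is None:
        return False
    candidate = hash_password(password, record["salt"])
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(candidate, record["password_hash"])
```

A page handler would call `is_valid_login` first and allow the keyword submission only when it returns True.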

Module 6: FEEDBACK

The sixth module of this project is Feedback. The user can give any type of feedback by providing a user ID, username, and description, and then submitting the form. The feedback is stored in the Feedback details database.

Module 7: SEARCH ENGINE

The seventh module is the Search Engine. The user enters a keyword, the search engine looks it up, and the matching results from the database are displayed.

RESOURCE REQUIREMENT

Hardware Requirements:

Processor : Pentium 4

Processor Speed : 2.40 GHz

RAM : 512 MB

Hard Disk : 80 GB

CD Drive : Samsung 52X

Software Requirements:

Environment : Visual Studio .NET 2005

.NET Framework : Version 2.0

Language : ASP.NET with C#

Operating System : Windows 2000/XP

Back End : SQL Server 2000

SYSTEM ANALYSIS

EXISTING SYSTEM

Since Google is the most popular search engine, many webmasters have become eager to influence their websites' Google rankings. An industry of consultants has arisen to help websites increase their rankings on Google and on other search engines. This field, called search engine optimization, attempts to discern patterns in search engine listings and then develop a methodology for improving rankings to draw more searchers to clients' sites.

PROPOSED SYSTEM

Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved in using the additional information present in hypertext to produce better search results. Fast crawling technology is needed to gather the Web documents and keep them up to date. Storage space must be used efficiently to store indices and, optionally, the documents themselves. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at a rate of hundreds to thousands per second.
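The indexing requirement can be illustrated with a toy inverted index, a data structure commonly used to answer keyword queries without scanning every document. This is a sketch only: the function names and sample documents are assumptions, and it leaves out the crawling, storage efficiency, and query throughput concerns raised above.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start from the first word's postings, then intersect with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


if __name__ == "__main__":
    sample_documents = {
        1: "google similarity distance",
        2: "similarity of words",
        3: "search engine",
    }
    sample_index = build_inverted_index(sample_documents)
    print(sorted(search(sample_index, "similarity")))
```

Building the index costs one pass over the documents, after which each query touches only the posting sets of its own words.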

Friday, August 14, 2015
