
Sunday, August 16, 2015

THE GOOGLE SIMILARITY DISTANCE

ABSTRACT

            Words and phrases acquire meaning from the way they are used in society, from their relative semantics to other words and phrases. For computers the equivalent of ‘society’ is ‘database,’ and the equivalent of ‘use’ is ‘way to search the database.’ We present a new theory of similarity between words and phrases based on information distance and Kolmogorov complexity.
            To fix thoughts we use the world-wide-web as database, and Google as search engine. The method is also applicable to other search engines and databases. This theory is then applied to construct a method to automatically extract similarity, the Google similarity distance, of words and phrases from the world-wide web using Google page counts.
            The world-wide-web is the largest database on earth, and the context information entered by millions of independent users averages out to provide automatic semantics of useful quality. We give applications in hierarchical clustering, classification, and language translation.
            We give examples to distinguish between colors and numbers, cluster names of paintings by 17th-century Dutch masters and names of books by English novelists, demonstrate understanding of notions such as emergencies and primes, and show a simple automatic English-Spanish translation. Finally, we use the WordNet database as an objective baseline against which to judge the performance of our method. We conduct a massive randomized trial in binary classification using support vector machines to learn categories based on our Google distance, resulting in a mean agreement of 87% with the expert-crafted WordNet categories.
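The similarity measure summarized above is the normalized Google distance (NGD), computed from nothing more than page counts: how often each term occurs on the web, how often both occur together, and the total number of indexed pages. A minimal sketch of that computation follows; the page counts used in the example are made-up illustrative numbers, not real Google results.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalized Google distance from raw page counts.

    fx, fy -- pages containing term x (respectively y)
    fxy    -- pages containing both terms
    n      -- total number of pages indexed (normalizing constant)
    """
    if fx == 0 or fy == 0 or fxy == 0:
        # Terms that never occur, or never co-occur, are maximally distant.
        return float("inf")
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

# Illustrative (made-up) counts: term x on 47M pages, term y on 12M,
# both together on 2.6M, out of 8 billion indexed pages.
print(ngd(47_000_000, 12_000_000, 2_600_000, 8_000_000_000))
```

A distance near 0 means the terms almost always appear together; values near 1 (or above) mean they are essentially unrelated. These pairwise distances are the input that the clustering, classification, and translation experiments described above build on.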



SYSTEM ANALYSIS

EXISTING SYSTEM
            Since Google is the most popular search engine, many webmasters have become eager to influence their websites' Google rankings. An industry of consultants has arisen to help websites increase their rankings on Google and on other search engines. This field, called search engine optimization, attempts to discern patterns in search engine listings and then develop a methodology for improving rankings to draw more searchers to their clients' sites.
PROPOSED SYSTEM
            Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved in using the additional information present in hypertext to produce better search results. Fast crawling technology is needed to gather the Web documents and keep them up to date. Storage space must be used efficiently to store indices and, optionally, the documents themselves. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at the rate of hundreds to thousands per second.
RESOURCE REQUIREMENTS

Hardware Requirements:
            Processor             :  Pentium 4
            Processor Speed       :  2.40 GHz
            RAM                   :  512 MB
            Hard Disk             :  80 GB
            CD Drive              :  Samsung 52X

Software Requirements:
            Environment           :  Visual Studio .NET 2005
            .NET Framework        :  Version 2.0
            Language              :  ASP.NET with C#
            Operating System      :  Windows 2000/XP
            Back End              :  SQL Server 2000





