
Saturday, August 22, 2015

DATA MINING WITH BIG DATA


AIM:  
             The aim is to examine Big Data processing from a data mining perspective. Data now comes from everywhere: sensors, media sites, social media, and so on. Useful information can be extracted from this big data using data mining techniques to discover interesting patterns.

Introduction:

            With recent developments in IT, the volume of stored data has surpassed the zettabyte scale, and improving business efficiency by increasing predictive ability through efficient analysis of these data has become a pressing issue for society.
The HACE theorem suggests that the key characteristics of Big Data are:

Huge with heterogeneous and diverse data sources:

                        One of the fundamental characteristics of Big Data is the huge volume of data represented by heterogeneous and diverse dimensionalities. This huge volume of data comes from various sites such as Twitter, MySpace, Orkut, and LinkedIn.

                        Decentralized control: Autonomous data sources with distributed and decentralized controls are a main characteristic of Big Data applications. Being autonomous, each data source is able to generate and collect information without involving (or relying on) any centralized control.
                       
                        This is similar to the World Wide Web (WWW) setting where each web server provides a certain amount of information and each server is able to fully function without necessarily relying on other servers.

                        Complex data and knowledge associations: Multistructured, multisource data is complex data. Examples of complex data types include bills of materials, word-processing documents, maps, time series, images, and video.
                       
                         Such combined characteristics suggest that Big Data require a “big mind” to consolidate data for maximum value.
                       
                        Such autonomous sources are not only outcomes of technical design, but also results of legislation and regulation in different countries/regions. For example, Wal-Mart's Asian markets are inherently different from its North American markets in terms of seasonal promotions, top-selling items, and customer behavior.
           
                        More specifically, local government regulations also affect the wholesale management process and result in restructured data representations and data warehouses for local markets.
           




Existing System:

                        The main drawback of existing data mining systems is that they require computationally intensive units for data analysis; they therefore demand two types of resources: data and computing processors.
                       
                        Many methods exist for classifying big data by aligning, comparing, and analyzing it with algorithms such as decision trees and K-means. These methods, however, are not always effective at Big Data scale.
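As a concrete illustration of one such method, the sketch below implements a minimal one-dimensional k-means (Lloyd's algorithm) in Java, the project's stated language. It is an illustrative toy, not the system's actual implementation; the naive seeding and the choice k = 2 are arbitrary assumptions.

```java
import java.util.Arrays;

/** Minimal 1-D k-means sketch with k = 2; illustrative only. */
public class KMeans1D {
    /** Runs Lloyd's iterations on 1-D points; returns the two centroids, sorted. */
    public static double[] cluster(double[] points, int iterations) {
        double c0 = points[0], c1 = points[points.length - 1]; // naive seeding
        for (int it = 0; it < iterations; it++) {
            double sum0 = 0, sum1 = 0;
            int n0 = 0, n1 = 0;
            for (double p : points) {
                // assign each point to its nearest centroid
                if (Math.abs(p - c0) <= Math.abs(p - c1)) { sum0 += p; n0++; }
                else { sum1 += p; n1++; }
            }
            // recompute centroids as cluster means
            if (n0 > 0) c0 = sum0 / n0;
            if (n1 > 0) c1 = sum1 / n1;
        }
        double[] centroids = {c0, c1};
        Arrays.sort(centroids);
        return centroids;
    }
}
```

On well-separated data such as {1, 2, 10, 11}, the centroids converge to the two group means (1.5 and 10.5) in a single pass.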

                        Big Data applications have arisen in which data collection has grown tremendously and is beyond the ability of commonly used software tools to capture, manage, and process within a “tolerable elapsed time.”
                        The most fundamental challenge for Big Data applications is to explore the large volumes of data and extract useful information or knowledge for future actions.
                       
                        In many situations, the knowledge extraction process has to be very efficient and close to real time because storing all observed data is nearly infeasible.



Proposed System:

                        We propose a HACE theorem to model Big Data characteristics. The characteristics of HACE make it an extreme challenge to discover useful knowledge from Big Data.
The HACE theorem suggests that the key characteristics of Big Data are:
  • Huge with heterogeneous and diverse data sources.
  • Autonomous with distributed and decentralized control.
  • Complex and evolving in data and knowledge associations.
To support Big Data mining, high-performance computing platforms are required, which impose systematic designs to unleash the full power of the Big Data.


Problem Definition:

                        The desired properties of the system are: robustness and fault tolerance, scalability, generality, extensibility, support for ad hoc queries, minimal maintenance, and debuggability.

  • Statistical significance: It is important to achieve significant statistical results, and not be fooled by randomness.

  • Distributed mining: Many data mining techniques are not trivial to parallelize. Producing distributed versions of some methods requires substantial practical and theoretical research to develop new methods.

  • Hidden Big Data: Large quantities of useful data are being lost, since new data is largely untagged, file-based, and unstructured. The 2012 IDC study on Big Data estimates that in 2012, 23 percent (643 exabytes) of the digital universe would have been useful for Big Data if tagged and analyzed.

  • However, currently only 3 percent of the potentially useful data is tagged, and even less is analyzed. For an intelligent learning database system to handle Big Data, the essential key is to scale up to the exceptionally large volume of data and provide treatments for the characteristics featured by the aforementioned HACE theorem. This suggests a conceptual view of the Big Data processing framework, which includes three tiers, from the inside out: data accessing and computing (Tier I), data privacy and domain knowledge (Tier II), and Big Data mining algorithms (Tier III).

Detail Design:

Twitter Data Generation:

                        Twitter, a highly popular platform for information exchange, can be used as a data mining source to aid in the aforementioned challenges; here, the data is collected by sensor nodes. Specifically, to harvest a large data set of tweets, sensor nodes connect to a sink that transfers the data set to the HDFS system. The REST APIs provide programmatic access to read and write Twitter data: author a new Tweet, read author profile and follower data, and more. The REST API identifies Twitter applications and users using OAuth, and responses are available in JSON.
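Since the source gives no code for consuming the REST API, the following is a hedged sketch of pulling one field out of a flat JSON tweet payload. A real application would use a proper JSON library (e.g. Jackson) and OAuth-signed requests; this indexOf-based parser handles only simple, unescaped string values and is for illustration only.

```java
/** Illustrative extractor for one string field from a flat JSON object.
 *  Not a real JSON parser: no escape handling, no nesting. */
public class TweetField {
    public static String extract(String json, String field) {
        String key = "\"" + field + "\":\"";
        int start = json.indexOf(key);
        if (start < 0) return null;          // field absent
        start += key.length();
        int end = json.indexOf('"', start);  // closing quote of the value
        return end < 0 ? null : json.substring(start, end);
    }
}
```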

Exploration:

                        This stage usually starts with data preparation, which may involve cleaning the data, applying transformations, and selecting subsets of records; for data sets with large numbers of variables (“fields”), it may also include some preliminary feature selection to bring the number of variables into a manageable range. Making use of complex data is a major challenge for Big Data applications, because any two parties in a complex network are potentially linked by a social connection. The number of such connections is quadratic in the number of nodes, so a million-node network may contain on the order of one trillion connections.
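One preliminary feature-selection step of the kind described above can be sketched as a variance filter: variables (columns) whose variance falls below a threshold carry little information and are dropped. The threshold and the row-major data layout are illustrative assumptions, not taken from the source.

```java
/** Sketch of a simple variance-based feature filter. */
public class VarianceFilter {
    /** Returns the indices of columns whose (population) variance exceeds minVariance. */
    public static int[] selectColumns(double[][] data, double minVariance) {
        int rows = data.length, cols = data[0].length;
        java.util.List<Integer> kept = new java.util.ArrayList<>();
        for (int c = 0; c < cols; c++) {
            double mean = 0;
            for (int r = 0; r < rows; r++) mean += data[r][c];
            mean /= rows;
            double var = 0;
            for (int r = 0; r < rows; r++) var += (data[r][c] - mean) * (data[r][c] - mean);
            var /= rows;
            if (var > minVariance) kept.add(c); // low-variance columns are dropped
        }
        return kept.stream().mapToInt(Integer::intValue).toArray();
    }
}
```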

Data collection:

                        In this module, the collected data is transferred to the HDFS system, and spectral clustering is performed for data analytics based on hashtag, location, and retweet count. Because Big Data applications feature autonomous sources and decentralized control, aggregating distributed data sources at a centralized site for mining is systematically prohibitive due to the potential transmission cost and privacy concerns. On the other hand, although mining can always be carried out at each distributed site, the biased view of the data collected at each site often leads to biased decisions or models.
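The per-hashtag analytics mentioned above can be illustrated with a simple aggregation: total retweet count keyed by hashtag. The row layout (hashtag, retweet count as a string) is an assumed schema for illustration; the source does not specify one.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of per-hashtag aggregation over harvested tweet records. */
public class HashtagStats {
    /** Each row: { hashtag, retweetCount-as-string }. Returns total retweets per hashtag. */
    public static Map<String, Integer> retweetsByTag(String[][] tweets) {
        Map<String, Integer> totals = new HashMap<>();
        for (String[] t : tweets) {
            // merge sums the counts for rows sharing a hashtag
            totals.merge(t[0], Integer.parseInt(t[1]), Integer::sum);
        }
        return totals;
    }
}
```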

Spectral Clustering:

                        For this analysis, we used different user lists on Twitter as the ground-truth data for a group of users. We obtained all the tweets from the users who were listed in the lists and then tried to obtain clusters with the spectral clustering algorithm using different similarity metrics. We also explore connections between users beyond purely social ones, to find other features that affect users being listed together. We present results of applying the spectral clustering algorithm using the modularity matrix and the symmetric normalized Laplacian matrix, and compare these approaches on several input matrices formed from different combinations of the above similarity measures.
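The symmetric normalized Laplacian used in this step can be computed directly from an adjacency matrix A as L = I - D^(-1/2) A D^(-1/2), where D is the diagonal degree matrix. The sketch below builds L from a dense adjacency matrix; the subsequent eigen-decomposition and clustering of the eigenvectors, which complete the algorithm, are omitted here.

```java
/** Sketch: symmetric normalized graph Laplacian L = I - D^{-1/2} A D^{-1/2}. */
public class Laplacian {
    public static double[][] symmetricNormalized(double[][] a) {
        int n = a.length;
        double[] dInvSqrt = new double[n];
        for (int i = 0; i < n; i++) {
            double deg = 0;
            for (int j = 0; j < n; j++) deg += a[i][j]; // row sum = node degree
            dInvSqrt[i] = deg > 0 ? 1.0 / Math.sqrt(deg) : 0; // isolated nodes get 0
        }
        double[][] l = new double[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                l[i][j] = (i == j ? 1 : 0) - dInvSqrt[i] * a[i][j] * dInvSqrt[j];
        return l;
    }
}
```

For a two-node graph with a single edge, both degrees are 1 and L is {{1, -1}, {-1, 1}}, as expected.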
 SYSTEM CONFIGURATION:
                         HARDWARE CONFIGURATION:
                       
                        Processor    -        Pentium IV
                        Speed          -        1.1 GHz
                        RAM          -        512 MB (min)
                        Hard Disk   -        20GB
                        Keyboard    -        Standard Keyboard
                        Mouse         -        Two or Three Button Mouse
                        Monitor      -        LCD/LED Monitor

SOFTWARE CONFIGURATION:
           
            Operating System                    -        Windows XP/7
            Programming Language          -        Java/J2EE
            Software Version                     -        JDK 1.7 or above






Conclusion:

                        Big data is the term for a collection of complex data sets. Data mining is an analytic process designed to explore data (usually large amounts of business- or market-related data, also known as “big data”) in search of consistent patterns, and then to validate the findings by applying the detected patterns to new subsets of data.

                         To support Big Data mining, high-performance computing platforms are required, which impose systematic designs to unleash the full power of the Big Data. We regard big data as an emerging trend, and the need for Big Data mining is rising in all science and engineering domains. With Big Data technologies, we will hopefully be able to provide the most relevant and most accurate social-sensing feedback to better understand our society in real time.
