Data Mining with Big Data
AIM:
The aim is to examine Big Data processing from a data mining perspective. Data now comes from everywhere: sensors, media sites, social media, and so on. Useful knowledge can be extracted from this Big Data using data mining techniques that discover interesting patterns.
Introduction:
With the recent development of information technology, the volume of stored data has surpassed the zettabyte scale, and improving business efficiency by increasing predictive ability through efficient analysis of these data has become a pressing issue for society.
The HACE theorem suggests that the key characteristics of Big Data are:
Huge with heterogeneous and diverse data sources:
One of the fundamental characteristics of Big Data is its huge volume, represented by heterogeneous and diverse dimensionalities. This huge volume of data comes from sites such as Twitter, MySpace, Orkut, and LinkedIn.
Decentralized control:
Autonomous data sources with distributed and decentralized controls are a main
characteristic of Big Data applications. Being autonomous, each data source is
able to generate and collect information without involving (or relying on) any
centralized control.
This is similar to the World
Wide Web (WWW) setting where each web server provides a certain amount of
information and each server is able to fully function without necessarily
relying on other servers.
Complex data and knowledge associations:
Multi-structured, multi-source data is complex data. Examples of complex data types include bills of materials, word processing documents, maps, time series, images, and video.
Such combined characteristics suggest that Big Data requires a "big mind" to consolidate the data for maximum value.
Such autonomous sources are not only the result of technical design choices, but also of legislation and regulation in different countries and regions. For example, Wal-Mart's Asian markets are inherently different from its North American markets in terms of seasonal promotions, top-selling items, and customer behavior. More specifically, local government regulations also affect the wholesale management process and result in restructured data representations and data warehouses for local markets.
Existing System:
The main drawback of existing data mining systems is that they require computationally intensive units for data analysis; they therefore depend on two types of resources: data and computing processors.
There exist many methods for classifying Big Data by aligning, comparing, and analyzing it with algorithms such as decision trees and K-means. These methods are, however, not always effective at Big Data scale.
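As a rough illustration of the clustering step such algorithms perform, the following is a minimal K-means sketch in Java (the language listed in the software configuration below); the toy data, the cluster count, and the fixed iteration limit are assumptions for illustration only, not part of the original system.

import java.util.Arrays;
import java.util.Random;

public class KMeansSketch {

    // Assigns each point to the nearest centroid and recomputes centroids,
    // repeating for a fixed number of iterations.
    static int[] kmeans(double[][] points, int k, int iterations) {
        Random rnd = new Random(42);
        int dim = points[0].length;
        double[][] centroids = new double[k][dim];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[rnd.nextInt(points.length)].clone();
        }
        int[] assignment = new int[points.length];
        for (int it = 0; it < iterations; it++) {
            // Assignment step: nearest centroid by squared Euclidean distance.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = 0;
                    for (int j = 0; j < dim; j++) {
                        double diff = points[i][j] - centroids[c][j];
                        d += diff * diff;
                    }
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                assignment[i] = best;
            }
            // Update step: each centroid becomes the mean of its assigned points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int j = 0; j < dim; j++) sums[assignment[i]][j] += points[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) continue; // keep the old centroid for an empty cluster
                for (int j = 0; j < dim; j++) centroids[c][j] = sums[c][j] / counts[c];
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[][] toy = { {1, 1}, {1.2, 0.9}, {8, 8}, {7.9, 8.1} }; // toy data, illustration only
        System.out.println(Arrays.toString(kmeans(toy, 2, 10)));
    }
}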
Big Data applications have emerged where data collection has grown tremendously and is beyond the ability of commonly used software tools to capture, manage, and process within a "tolerable elapsed time."
The most fundamental
challenge for Big Data applications is to explore the large volumes of data and
extract useful information or knowledge for future actions.
In many situations, the
knowledge extraction process has to be very efficient and close to real time
because storing all observed data is nearly infeasible.
Proposed System:
We propose the HACE theorem to model Big Data characteristics. The characteristics of HACE make it an extreme challenge to discover useful knowledge from Big Data.
The HACE theorem suggests that the key characteristics of Big Data are:
- Huge, with heterogeneous and diverse data sources.
- Autonomous, with distributed and decentralized control.
- Complex and evolving in data and knowledge associations.
To support Big Data mining, high-performance computing platforms are required, which impose systematic designs to unleash the full power of the Big Data.
Problem Definition:
The desired properties of the system are: robust and fault-tolerant, scalable, general and extensible, supportive of ad hoc queries, low-maintenance, and debuggable.
- Statistical significance: It is important to
achieve significant statistical results, and not be fooled by randomness.
- Distributed mining: Many data mining techniques are not trivial to parallelize. Producing distributed versions of some methods requires substantial research, with both practical and theoretical analysis, to arrive at new methods.
- Hidden Big Data: Large quantities of useful data are being lost because new data is largely untagged, file-based, and unstructured. The 2012 IDC study on Big Data explains that in 2012, 23% (643 exabytes) of the digital universe would have been useful for Big Data if tagged and analyzed. However, currently only 3% of the potentially useful data is tagged, and even less is analyzed.
For an intelligent learning database system to handle Big Data, the essential key is to scale up to the exceptionally large volume of data and to provide treatments for the characteristics captured by the aforementioned HACE theorem. A conceptual view of the Big Data processing framework includes three tiers, from the inside out, with considerations on data accessing and computing (Tier I), data privacy and domain knowledge (Tier II), and Big Data mining algorithms (Tier III).
Detail Design:
Twitter Data Generation:
Twitter, a highly popular platform for information exchange, can be used as a data mining source that helps address the aforementioned challenges; here the data is collected by sensor nodes. Specifically, for a large data set of harvested tweets, the sensor nodes connect with a sink to transfer the data set to the HDFS system. The REST APIs provide programmatic access to read and write Twitter data: author a new Tweet, read author profile and follower data, and more. The REST API identifies Twitter applications and users using OAuth, and responses are available in JSON.
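As a sketch of how such harvesting might be done, the following Java example assumes the open-source Twitter4J library as a client for the REST search API; the OAuth keys, the query term "#bigdata", and the tweet count are placeholders, not values from the original project.

import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterException;
import twitter4j.TwitterFactory;
import twitter4j.conf.ConfigurationBuilder;

public class TweetHarvester {
    public static void main(String[] args) throws TwitterException {
        // OAuth credentials are placeholders; real keys come from a registered Twitter application.
        ConfigurationBuilder cb = new ConfigurationBuilder()
                .setOAuthConsumerKey("CONSUMER_KEY")
                .setOAuthConsumerSecret("CONSUMER_SECRET")
                .setOAuthAccessToken("ACCESS_TOKEN")
                .setOAuthAccessTokenSecret("ACCESS_TOKEN_SECRET");
        Twitter twitter = new TwitterFactory(cb.build()).getInstance();

        // Search the REST API for tweets matching a hashtag (query term is an example).
        Query query = new Query("#bigdata");
        query.setCount(100);
        QueryResult result = twitter.search(query);

        // Print the fields later used in the analytics: user, retweet count, location, text.
        for (Status status : result.getTweets()) {
            System.out.println(status.getUser().getScreenName() + "\t"
                    + status.getRetweetCount() + "\t"
                    + status.getGeoLocation() + "\t"
                    + status.getText().replace('\n', ' '));
        }
    }
}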
Exploration:
This stage usually starts with data preparation, which may involve cleaning the data, applying data transformations, and selecting subsets of records; for data sets with large numbers of variables ("fields"), it also involves preliminary feature selection operations to bring the number of variables down to a manageable size. Making use of complex data is a major challenge for Big Data applications, because any two parties in a complex network are potentially linked to each other by a social connection. The number of such connections is quadratic in the number of nodes in the network, so a million-node network may be subject to one trillion connections.
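A minimal sketch of such a preliminary feature selection step is shown below, assuming a simple variance-based filter (one common choice; the original text does not name a specific method). The toy matrix and the number of retained variables are illustrative only.

import java.util.Arrays;
import java.util.Comparator;

public class VarianceFeatureSelection {

    // Returns the indices of the k columns (variables) with the highest variance,
    // a simple preliminary filter to reduce the number of fields to a manageable size.
    static int[] topVarianceFeatures(double[][] data, int k) {
        int rows = data.length, cols = data[0].length;
        double[] variance = new double[cols];
        for (int j = 0; j < cols; j++) {
            double mean = 0;
            for (int i = 0; i < rows; i++) mean += data[i][j];
            mean /= rows;
            double var = 0;
            for (int i = 0; i < rows; i++) var += (data[i][j] - mean) * (data[i][j] - mean);
            variance[j] = var / rows;
        }
        // Rank columns by variance, highest first, and keep the top k.
        Integer[] order = new Integer[cols];
        for (int j = 0; j < cols; j++) order[j] = j;
        Arrays.sort(order, Comparator.comparingDouble((Integer j) -> variance[j]).reversed());
        int[] selected = new int[Math.min(k, cols)];
        for (int j = 0; j < selected.length; j++) selected[j] = order[j];
        return selected;
    }

    public static void main(String[] args) {
        double[][] toy = { {1, 100, 5}, {2, 102, 5}, {3, 98, 5} }; // toy data, illustration only
        System.out.println(Arrays.toString(topVarianceFeatures(toy, 2)));
    }
}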
Data Collection:
In this stage, the collected data is transferred to the HDFS system, and spectral clustering is performed to carry out data analytics based on hashtag, location, and retweet count. As Big Data applications are characterized by autonomous sources and decentralized controls, aggregating distributed data sources to a centralized site for mining is systematically prohibitive due to the potential transmission cost and privacy concerns. On the other hand, although we can always carry out mining activities at each distributed site, the biased view of the data collected at each site often leads to biased decisions or models.
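A minimal sketch of the transfer into HDFS, using the standard Hadoop FileSystem API, might look as follows; the NameNode address and both file paths are placeholders, not values from the original setup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUploader {
    public static void main(String[] args) throws Exception {
        // The NameNode address and both paths below are placeholders for illustration.
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:9000");

        FileSystem fs = FileSystem.get(conf);
        Path local = new Path("/tmp/tweets.json");           // harvested tweets on local disk
        Path remote = new Path("/user/bigdata/tweets.json"); // destination in HDFS

        // Copy the harvested data set into HDFS so the clustering job can read it.
        fs.copyFromLocalFile(local, remote);
        System.out.println("Uploaded to " + remote + ", size = " + fs.getFileStatus(remote).getLen());
        fs.close();
    }
}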
Spectral Clustering:
For this analysis, we used different user lists on Twitter as the ground truth data for a group of users. We obtained all the tweets from the users who appeared in the lists and then tried to obtain clusters by using different similarity metrics with the spectral clustering algorithm. In addition, we explored further connections between users, beyond just the social connections, in order to find other features that affect users being listed together. We present the results of applying the spectral clustering algorithm using the modularity matrix and the symmetric normalized Laplacian matrix, and we compare the results of these approaches across several different input matrices formed by different combinations of the above similarity measures.
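For reference, the symmetric normalized Laplacian used here is L_sym = I - D^(-1/2) A D^(-1/2), where A is the similarity (adjacency) matrix and D is the diagonal degree matrix, and the modularity matrix has the standard form B = A - k k^T / (2m), with k the degree vector and m the number of edges. The following Java sketch builds L_sym from a similarity matrix; the toy matrix is an assumption for illustration, and the eigendecomposition plus K-means step that completes spectral clustering is only noted in the comments.

public class NormalizedLaplacian {

    // Builds L_sym = I - D^(-1/2) A D^(-1/2) from a symmetric similarity (adjacency) matrix A.
    // Spectral clustering then takes the eigenvectors of L_sym for the smallest eigenvalues
    // and runs K-means on the rows of that eigenvector matrix.
    static double[][] symmetricNormalizedLaplacian(double[][] a) {
        int n = a.length;
        double[] dInvSqrt = new double[n];
        for (int i = 0; i < n; i++) {
            double degree = 0;
            for (int j = 0; j < n; j++) degree += a[i][j];
            dInvSqrt[i] = degree > 0 ? 1.0 / Math.sqrt(degree) : 0.0;
        }
        double[][] l = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double normalized = dInvSqrt[i] * a[i][j] * dInvSqrt[j];
                l[i][j] = (i == j ? 1.0 : 0.0) - normalized;
            }
        }
        return l;
    }

    public static void main(String[] args) {
        // Toy 3-node similarity matrix, illustration only.
        double[][] a = { {0, 1, 0}, {1, 0, 1}, {0, 1, 0} };
        for (double[] row : symmetricNormalizedLaplacian(a)) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}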
SYSTEM CONFIGURATION:
HARDWARE CONFIGURATION:
Processor - Pentium IV
Speed - 1.1 GHz
RAM - 512 MB (min)
Hard Disk - 20 GB
Keyboard - Standard Keyboard
Mouse - Two or Three Button Mouse
Monitor - LCD/LED Monitor
SOFTWARE CONFIGURATION:
Operating System - Windows XP/7
Programming Language - Java/J2EE
Software Version - JDK 1.7 or above
Conclusion:
Big Data is the term for a collection of complex data sets. Data mining is an analytic process designed to explore data (usually large amounts of business- or market-related data, also known as "Big Data") in search of consistent patterns, and then to validate the findings by applying the detected patterns to new subsets of data.
To support Big Data mining, high-performance computing platforms are required, which impose systematic designs to unleash the full power of the Big Data. We regard Big Data as an emerging trend, and the need for Big Data mining is rising in all science and engineering domains. With Big Data technologies, we will hopefully be able to provide the most relevant and most accurate social sensing feedback to better understand our society in real time.