Data Mining with Big Data
ABSTRACT:
Big Data concern large-volume, complex, growing data sets with multiple, autonomous sources. With the fast development of networking, data storage, and the data collection capacity, Big Data are now rapidly expanding in all science and engineering domains, including physical, biological and biomedical sciences. This paper presents a HACE theorem that characterizes the features of the Big Data revolution, and proposes a Big Data processing model, from the data mining perspective. This data-driven model involves demand-driven aggregation of information sources, mining and analysis, user interest modeling, and security and privacy considerations. We analyze the challenging issues in the data-driven model and also in the Big Data revolution.
EXISTING SYSTEM:
• Big Data applications have emerged in which data collection has grown tremendously and is beyond the ability of commonly used software tools to capture, manage, and process within a “tolerable elapsed time.” The most fundamental challenge for Big Data applications is to explore the large volumes of data and extract useful information or knowledge for future actions. In many situations, the knowledge extraction process has to be very efficient and close to real time because storing all observed data is nearly infeasible.
• The unprecedented data volumes require an effective data analysis and prediction platform to achieve fast response and real-time classification for such Big Data.
DISADVANTAGES OF EXISTING SYSTEM:
• The challenges at Tier I focus on data accessing and arithmetic computing procedures. Because Big Data are often stored at different locations and data volumes may continuously grow, an effective computing platform will have to take distributed large-scale data storage into consideration for computing.
• The challenges at Tier II center around semantics and domain knowledge for different Big Data applications. Such information can provide additional benefits to the mining process, as well as add technical barriers to the Big Data access (Tier I) and mining algorithms (Tier III).
• At Tier III, the data mining challenges concentrate on algorithm designs in tackling the difficulties raised by the Big Data volumes, distributed data distributions, and by complex and dynamic data characteristics.
PROPOSED SYSTEM:
• We propose a HACE theorem to model Big Data characteristics. The characteristics described by HACE make discovering useful knowledge from Big Data an extreme challenge.
• The HACE theorem suggests that the key characteristics of Big Data are 1) huge with heterogeneous and diverse data sources, 2) autonomous with distributed and decentralized control, and 3) complex and evolving in data and knowledge associations.
• To support Big Data mining, high-performance computing platforms are required, which impose systematic designs to unleash the full power of the Big Data.
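To make the "autonomous sources with decentralized control" point concrete, here is a minimal Java sketch of one common pattern: each source summarizes its own records locally, and a coordinator merges only the summaries. This is an illustration under assumptions, not the paper's processing model; the class name LocalThenGlobalCounts, both method names, and the sample records are hypothetical, and the code assumes Java 8 or later.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: names and structure are illustrative only.
class LocalThenGlobalCounts {

    // Each autonomous source counts its own records locally,
    // so the raw data never has to leave the site that owns it.
    static Map<String, Long> localCounts(List<String> records) {
        Map<String, Long> counts = new HashMap<>();
        for (String record : records) {
            counts.merge(record, 1L, Long::sum);
        }
        return counts;
    }

    // The coordinator sees only the compact per-source summaries and adds them up.
    static Map<String, Long> mergeCounts(List<Map<String, Long>> summaries) {
        Map<String, Long> global = new HashMap<>();
        for (Map<String, Long> summary : summaries) {
            summary.forEach((key, value) -> global.merge(key, value, Long::sum));
        }
        return global;
    }

    public static void main(String[] args) {
        // Two made-up sources with made-up records.
        Map<String, Long> siteA = localCounts(Arrays.asList("flu", "cold", "flu"));
        Map<String, Long> siteB = localCounts(Arrays.asList("flu", "asthma"));
        System.out.println(mergeCounts(Arrays.asList(siteA, siteB)));
    }
}
```

Counting is used here only because it merges exactly; many mining models do not, which is part of what makes the Tier III algorithm design hard.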
ADVANTAGES OF PROPOSED SYSTEM:
The proposed system provides the most relevant and most accurate social sensing feedback, helping us better understand our society in real time.
SYSTEM ARCHITECTURE:
SYSTEM CONFIGURATION:
HARDWARE CONFIGURATION:
• Processor - Pentium IV
• Speed - 1.1 GHz
• RAM - 512 MB (min)
• Hard Disk - 20 GB
• Keyboard - Standard Keyboard
• Mouse - Two or Three Button Mouse
• Monitor - LCD/LED Monitor
SOFTWARE CONFIGURATION:
• Operating System - Windows XP/7
• Programming Language - Java/J2EE
• Software Version - JDK 1.7 or above
• Database - MySQL
REFERENCE:
Xindong Wu, Fellow, IEEE, Xingquan Zhu, Senior Member, IEEE, Gong-Qing Wu, and Wei Ding, Senior Member, IEEE, “Data Mining with Big Data”, IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 1, JANUARY 2014.
Discovering Emerging Topics in Social Streams via Link-Anomaly Detection
ABSTRACT:
Detection of emerging topics is now receiving renewed interest, motivated by the rapid growth of social networks. Conventional term-frequency-based approaches may not be appropriate in this context, because the information exchanged in social-network posts includes not only text but also images, URLs, and videos. We focus on the emergence of topics signaled by social aspects of these networks. Specifically, we focus on mentions of users, that is, links between users that are generated dynamically (intentionally or unintentionally) through replies, mentions, and retweets. We propose a probability model of the mentioning behavior of a social network user, and propose to detect the emergence of a new topic from the anomalies measured through the model. Aggregating anomaly scores from hundreds of users, we show that we can detect emerging topics based only on the reply/mention relationships in social-network posts. We demonstrate our technique on several real data sets we gathered from Twitter. The experiments show that the proposed mention-anomaly-based approaches can detect new topics at least as early as text-anomaly-based approaches, and in some cases much earlier when the topic is poorly identified by the textual contents in posts.
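The aggregation step mentioned in the abstract can be pictured with a short Java sketch. It averages per-post anomaly scores from all users in fixed-length time windows and reports the first window whose mean crosses a threshold. The fixed window, the plain threshold, and every name below (WindowedAnomaly, aggregate, firstAlarm) are assumptions for illustration; the threshold is a stand-in for whatever change detection the authors actually apply to the aggregated scores.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: a simple threshold replaces a proper change-point detector.
class WindowedAnomaly {

    // times[i] is a post's timestamp in seconds (assumed non-negative),
    // scores[i] is that post's anomaly score. Returns window start -> mean score.
    static TreeMap<Long, Double> aggregate(long[] times, double[] scores, long windowSeconds) {
        TreeMap<Long, double[]> sums = new TreeMap<>(); // window start -> {sum, count}
        for (int i = 0; i < times.length; i++) {
            long start = (times[i] / windowSeconds) * windowSeconds;
            double[] acc = sums.computeIfAbsent(start, k -> new double[2]);
            acc[0] += scores[i];
            acc[1] += 1;
        }
        TreeMap<Long, Double> means = new TreeMap<>();
        for (Map.Entry<Long, double[]> e : sums.entrySet()) {
            means.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return means;
    }

    // Start time of the earliest window whose mean score exceeds the threshold, or -1 if none.
    static long firstAlarm(TreeMap<Long, Double> means, double threshold) {
        for (Map.Entry<Long, Double> e : means.entrySet()) {
            if (e.getValue() > threshold) {
                return e.getKey();
            }
        }
        return -1L;
    }
}
```

Averaging within a window keeps one very active user from dominating the signal simply by posting more often than others.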
EXISTING SYSTEM:
• A new (emerging) topic is something people feel like discussing, commenting on, or forwarding to their friends. Conventional approaches for topic detection have mainly been concerned with the frequencies of (textual) words.
DISADVANTAGES OF EXISTING SYSTEM:
A term-frequency-based approach could suffer from the ambiguity caused by synonyms or homonyms. It may also require complicated preprocessing (e.g., segmentation) depending on the target language. Moreover, it cannot be applied when the contents of the messages are mostly nontextual information. On the other hand, the “words” formed by mentions are unique, require little preprocessing to obtain (the information is often separated from the contents), and are available regardless of the nature of the contents.
PROPOSED SYSTEM:
• In this paper, we have proposed a new approach to detect the emergence of topics in a social network stream.
• The basic idea of our approach is to focus on the social aspect of the posts reflected in the mentioning behavior of users instead of the textual contents.
• We have proposed a probability model that captures both the number of mentions per post and the frequency of each mentionee.
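A per-user model of this kind can be sketched in Java as follows. The synopsis does not give the distributions, so the choices here are assumptions: a geometric distribution with a Beta(alpha, beta) prior for the number of mentions in a post, and a count-based rule with a pseudo-count gamma for unseen mentionees. The class name MentionAnomaly, its methods, and the hyperparameter values are hypothetical. The anomaly score of a post is its negative log-probability under the user's history, so unusual mentioning behavior scores higher.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a per-user mention model; the distributions are assumed.
class MentionAnomaly {
    private final double alpha;  // Beta prior parameter (assumed)
    private final double beta;   // Beta prior parameter (assumed)
    private final double gamma;  // pseudo-count reserved for unseen mentionees (assumed)
    private long posts = 0;      // n: training posts seen for this user
    private long mentions = 0;   // m: total mentions in those posts
    private final Map<String, Long> perMentionee = new HashMap<>();

    MentionAnomaly(double alpha, double beta, double gamma) {
        this.alpha = alpha;
        this.beta = beta;
        this.gamma = gamma;
    }

    // Adds one past post (its list of mentioned users) to the user's history.
    void observe(List<String> mentionees) {
        posts++;
        for (String v : mentionees) {
            mentions++;
            perMentionee.merge(v, 1L, Long::sum);
        }
    }

    // Predictive probability of k mentions: geometric likelihood, Beta posterior.
    double probMentionCount(int k) {
        double a = posts + alpha;
        double b = mentions + beta;
        double p = a / (a + b + k);
        for (int j = 0; j < k; j++) {
            p *= (b + j) / (a + b + j);
        }
        return p;
    }

    // Probability of mentioning user v: proportional to past counts, gamma if unseen.
    double probMentionee(String v) {
        long seen = perMentionee.getOrDefault(v, 0L);
        return (seen > 0 ? seen : gamma) / (mentions + gamma);
    }

    // Anomaly score of a new post: negative log-probability of its mention pattern.
    double score(List<String> mentionees) {
        double s = -Math.log(probMentionCount(mentionees.size()));
        for (String v : mentionees) {
            s -= Math.log(probMentionee(v));
        }
        return s;
    }
}
```

For example, after a user has posted five times mentioning only one friend, a new post that mentions three previously unseen accounts scores much higher than another post mentioning that same friend.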
ADVANTAGES OF PROPOSED SYSTEM:
• Because the proposed method does not rely on the textual contents of social network posts, it is robust to rephrasing and can be applied to cases where topics are concerned with information other than text, such as images, video, and audio.
• The proposed link-anomaly-based methods performed even better than the keyword-based methods on the “NASA” and “BBC” data sets.
SYSTEM ARCHITECTURE:
SYSTEM REQUIREMENTS:
HARDWARE REQUIREMENTS:
• System : Pentium IV 2.4 GHz.
• Hard Disk : 40 GB.
• Floppy Drive : 1.44 MB.
• Monitor : 15-inch VGA Colour.
• Mouse : Logitech.
• RAM : 512 MB.
SOFTWARE REQUIREMENTS:
• Operating System : Windows XP/7.
• Coding Language : Java/J2EE
• IDE : NetBeans 7.4
• Database : MySQL
REFERENCE:
Toshimitsu Takahashi, Ryota Tomioka, and Kenji Yamanishi, Member, IEEE, “Discovering Emerging Topics in Social Streams via Link-Anomaly Detection”, IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 1, JANUARY 2014.