amazon

Saturday, August 22, 2015

ADAPTIVE JOIN OPERATORS FOR RESULT RATE OPTIMIZATION ON STREAMING INPUTS

Adaptive Join Operators for Result Rate
Optimization on Streaming Inputs

INTRODUCTION
1.     GENERAL:
        MODERN information processing is moving into a realm where we often need to process data that are pushed or pulled from autonomous data sources through heterogeneous Networks. Adaptive query processing has emerged as an answer to the problems that arise because of the fluidity and unpredictability of data arrivals in such environments [1]. An important line of research in adaptive query processing has been toward developing join algorithms that can produce tuples “online,” from streaming, partially available input relations, or while waiting for one or more inputs [4], [7], [9], [12], [14], [15]. Such non blocking join behavior can improve pipelining by smoothing or “masking” varying data arrival rates and can generate join results with high rates, thus, improving performance in a variety of query processing scenarios in data integration, online aggregation, and approximate query answering systems. Compared to traditional join algorithms (be they sort-, hash-, or nested-loop-based [13]), adaptive joins are designed to deal with some additional challenges: The input relations they use are provided by external network sources. The implication is that one has little or no control over the order or rate of arrival of tuples. Since the data source reply speed, streaming rate and streaming order, as well as network traffic and congestion levels, are unpredictable, traditional join algorithms are often unsuitable or inefficient. For example, most traditional join algorithms cannot produce results until at least one of the relations is completely available. Waiting for one relation to arrive completely to produce results is often unacceptable. Moreover, and more importantly, in emerging data integration or online aggregation environments, a key performance metric is rapid availability of first results and a continuous rate of tuple production. Adaptive join algorithms were created in order to lift the limitations of traditional join algorithms in such environments. By being able to produce results whenever input tuples become available, adaptive join algorithms overcome situations like initial delay, slow data delivery, or busty arrival, which can affect the efficiency of the join. All existing algorithms work in three stages. During the Arriving phase, a newly arrived tuple is stored in memory and it is matched against memory-resident tuples belonging to the other relations participating in the join. Since the allocated memory for the join operation is limited and often much smaller than the volume of the incoming data, this
Results in tuple migration to disk. The decision on what to flush to disk influences significantly the number of results produced during the Arriving phase. The Arriving phase is suspended when all data sources are temporarily blocked
And a Reactive phase kicks in and starts joining part of the tuples that have been flushed to disk. An important desideratum of this phase is the prompt handover to the Arriving phase as soon as any of the data sources restarts sending tuples. Each algorithm has a handover delay which depends on the minimum unit of work that needs to be completed before switching phases. This delay has not received attention in the past, but we show that it can easily lead to input buffer overflow, lost tuples, and hence incorrect results. When all sources complete the data transmission, a Cleanup phase is activated and the tuples that were not joined in the previous phases (due to flushing of tuples to disk) are brought from disk and joined. Even if the overall strategy has been proposed for a multiday join, most existing algorithms are limited to a two-way join. Devising an effective multiday adaptive join operator is a challenge in which little progress has been made [15].
In this paper, we propose two new adaptive join algorithms for output rate maximization in data processing over autonomous distributed sources. The first algorithm, Double Index Nested-loop Reactive join (DINER) is applicable for two inputs; while Multiple Index Nested-loop Reactive join (MINER) can be used for joining an arbitrary number of input sources. DINER follows the same overall pattern of execution discussed earlier but incorporates a series of novel techniques and ideas that make it faster, leaner (in memory use), and more adaptive than its predecessors. More specifically, the key differences between DINER and existing algorithms are 1) an intuitive flushing policy for the Arriving phase that aims at maximizing the amount of overlap of the join attribute values between memory resident tuples of the two relations and 2) a lightweight Reactive phase that allows the algorithm to quickly move into processing tuples that were flushed to disk when both data sources block. The key idea of our flushing policy is to create and adaptively maintain three non-overlapping value regions that partition the join attribute domain, measure the “join benefit” of each region at every flushing decision point, and flush tuples from the region that doesn’t produce Many join results in a way that permits easy maintenance of the three-way partition of the values. As will be explained, proper management of the three partitions allows us to increase the number of tuples with matching values on their join attribute from the two relations, thus, maximizing the output rate. When tuples are flushed to disk they are organized into sorted blocks using an efficient index structure, maintained separately for each relation (thus, the part “Double Index” in DINER). This optimization results in faster processing of these tuples during the Reactive and Cleanup phases. The Reactive phase of DINER employs a symmetric nested loop join process, combined with novel bookkeeping that allows the algorithm to react to the unpredictability of the data sources. The fusion of the two techniques allows DINER to make much more efficient use of available main memory.
 We demonstrate in our experiments that DINER has a higher rate of join result production and is much more adaptive to changes in the environment, including changes in the value distributions of the streamed tuples and in their arrival rates. MINER extends DINER to multiday joins and it maintains all the distinctive and efficiency generating properties of DINER. MINER maximizes the output rate by: 1) adopting an efficient probing sequence for new incoming tuples which aims to reduce the processing overhead by interrupting index lookups early for those tuples that do not participate in the overall result; 2) applying an effective flushing policy that keeps in memory the tuples that produce results, in a manner similar to DINER; and 3) activating a Reactive phase when all inputs are blocked, which joins on-disk tuples while keeping the result correct and being able to promptly hand over in the presence of new input. Compared to DINER, MINER faces additional challenges namely: 1) updating and synchronizing the statistics for each join attribute during the online phase, and 2) more complicated bookkeeping in order to be able to guarantee correctness and prompt handover during reactive phase. We are able to generalize the reactive phase of DINER for multiple inputs using a novel scheme that dynamically changes the roles between relations. An important feature of both DINER and MINER is that they are the first adaptive, completely unblocking join techniques that support range join conditions. Range join queries are a very common class of joins in a variety of applications, from traditional business data processing to financial analysis applications and spatial data processing. Progressive Merge Join (PMJ) [4], one of the early adaptive algorithms, also supports range conditions, but its blocking behavior makes it a poor solution given the requirements of current data integration scenarios. Our experiments show that PMJ is outperformed by DINER.


The contributions of our paper are:                                     
                    We introduce DINER a novel adaptive join algorithm that supports both equality and ranges join predicates. DINER builds on an intuitive flushing policy that aims at maximizing the productivity of tuples that are kept in memory. DINER is the first algorithm to address the need to quickly respond to bursts of arriving data during the Reactive phase. We propose a novel extension to nested loops join for processing disk-resident tuples when both sources block, while being able to swiftly respond to new data arrivals. We introduce MINER, a novel adaptive multiday join algorithm that maximizes the output rate, designed for dealing with cases where data are held by multiple remote sources. . We provide a thorough discussion of existing algorithms, including identifying some important limitations, such as increased memory consumption because of their inability to quickly switch to the Arriving phase and not being responsive enough when value distributions change. . We provide an extensive experimental study of DINER including performance comparisons to existing adaptive join algorithms and a sensitivity analysis. Our results demonstrate the superiority of DINER in a variety of realistic scenarios. During the online phase of the algorithm, DINER manages to produce up to three times more results compared to previous techniques. The performance gains of DINER are realized when using both real and synthetic data and are increased when fewer resources (memory) are given to the algorithm. We also evaluate the performance of MINER, and we show that it is still possible to obtain early a large percentage of results even in more elaborated setups where data are provided through multiple inputs. Our experimental study shows that the performance of MINER is 60 times higher compared to the existing Miner Join algorithm when a four-way star join is executed in a constrained memory environment.
                   The rest of the paper is organized as follows: Section 2 presents prior work in the area. In Section 3, we introduce the DINER algorithm and discuss in detail its operations. MINER is analyzed in Section 4. Section 5 contains our Experiments, and Section 6 contains concluding remarks.
Existing system:
    1. The input relations they use are provided by external network sources. The implication is that one has little or no control over the order or rate of arrival of tuples.
     2. Since the data source reply speed, streaming rate and streaming order, as well as network traffic and congestion levels, are unpredictable,
    3.  Traditional join algorithms are often unsuitable or inefficient.
Proposal System:
    1. Adaptive join algorithms were created in order to lift the limitations of traditional join algorithms in such environments.
    2.  By being able to produce results whenever input tuples become available, adaptive join algorithms overcome situations like initial delay, slow data delivery, or busty arrival, which can affect the efficiency of the join.
    3. We propose DINER and MINER algorithm for result rate maximization
Why adaptive joins?
Ø Data provided by autonomous sources
Ø Data is transported through unreliable network
Ø Data arrival rate cannot be controlled
Ø Limited memory budget at the local site
Ø Producing results as soon as first input tuple is available
Ø Smoothing of join result production
Ø Masking source and network delay
Diner: Novel, Better performance, Adaptive join
Ø An intuitive  and effective flushing policy
Ø Introducing a more responsive reactive phase
Ø Takes advantage of source blocking and continues to produce results
Ø Prompt handover when new input tuples arrive
Ø Significantly higher result rate compared to existing algorithms
Ø A learner algorithm-better performance in more constrained memory environment
Ø Supports equijoins and range queries





No comments:

Post a Comment