Adaptive
Join Operators for Result Rate
Optimization
on Streaming Inputs
INTRODUCTION
1. GENERAL:
MODERN information processing is moving
into a realm where we often need to process data that are pushed or pulled from
autonomous data sources through heterogeneous Networks. Adaptive query
processing has emerged as an answer to the problems that arise because of the
fluidity and unpredictability of data arrivals in such environments [1]. An
important line of research in adaptive query processing has been toward
developing join algorithms that can produce tuples “online,” from streaming,
partially available input relations, or while waiting for one or more inputs
[4], [7], [9], [12], [14], [15]. Such non blocking join behavior can improve
pipelining by smoothing or “masking” varying data arrival rates and can
generate join results with high rates, thus, improving performance in a variety
of query processing scenarios in data integration, online aggregation, and
approximate query answering systems. Compared to traditional join algorithms
(be they sort-, hash-, or nested-loop-based [13]), adaptive joins are designed
to deal with some additional challenges: The input relations they use are
provided by external network sources. The implication is that one has little or
no control over the order or rate of arrival of tuples. Since the data source
reply speed, streaming rate and streaming order, as well as network traffic and
congestion levels, are unpredictable, traditional join algorithms are often
unsuitable or inefficient. For example, most traditional join algorithms cannot
produce results until at least one of the relations is completely available.
Waiting for one relation to arrive completely to produce results is often
unacceptable. Moreover, and more importantly, in emerging data integration or
online aggregation environments, a key performance metric is rapid availability
of first results and a continuous rate of tuple production. Adaptive join
algorithms were created in order to lift the limitations of traditional join
algorithms in such environments. By being able to produce results whenever
input tuples become available, adaptive join algorithms overcome situations
like initial delay, slow data delivery, or busty arrival, which can affect the
efficiency of the join. All existing algorithms work in three stages. During
the Arriving phase, a newly arrived tuple is stored in memory and it is matched
against memory-resident tuples belonging to the other relations participating
in the join. Since the allocated memory for the join operation is limited and
often much smaller than the volume of the incoming data, this
Results
in tuple migration to disk. The decision on what to flush to disk influences
significantly the number of results produced during the Arriving phase. The
Arriving phase is suspended when all data sources are temporarily blocked
And
a Reactive phase kicks in and starts joining part of the tuples that have been
flushed to disk. An important desideratum of this phase is the prompt handover
to the Arriving phase as soon as any of the data sources restarts sending
tuples. Each algorithm has a handover delay which depends on the minimum unit
of work that needs to be completed before switching phases. This delay has not
received attention in the past, but we show that it can easily lead to input
buffer overflow, lost tuples, and hence incorrect results. When all sources
complete the data transmission, a Cleanup phase is activated and the tuples
that were not joined in the previous phases (due to flushing of tuples to disk)
are brought from disk and joined. Even if the overall strategy has been
proposed for a multiday join, most existing algorithms are limited to a two-way
join. Devising an effective multiday adaptive join operator is a challenge in
which little progress has been made [15].
In
this paper, we propose two new adaptive join algorithms for output rate
maximization in data processing over autonomous distributed sources. The first
algorithm, Double Index Nested-loop Reactive join (DINER) is applicable for two
inputs; while Multiple Index Nested-loop Reactive join (MINER) can be used for
joining an arbitrary number of input sources. DINER follows the same overall
pattern of execution discussed earlier but incorporates a series of novel
techniques and ideas that make it faster, leaner (in memory use), and more
adaptive than its predecessors. More specifically, the key differences between
DINER and existing algorithms are 1) an intuitive flushing policy for the
Arriving phase that aims at maximizing the amount of overlap of the join
attribute values between memory resident tuples of the two relations and 2) a
lightweight Reactive phase that allows the algorithm to quickly move into
processing tuples that were flushed to disk when both data sources block. The
key idea of our flushing policy is to create and adaptively maintain three non-overlapping
value regions that partition the join attribute domain, measure the “join
benefit” of each region at every flushing decision point, and flush tuples from
the region that doesn’t produce Many join results in a way that permits easy
maintenance of the three-way partition of the values. As will be explained,
proper management of the three partitions allows us to increase the number of
tuples with matching values on their join attribute from the two relations,
thus, maximizing the output rate. When tuples are flushed to disk they are
organized into sorted blocks using an efficient index structure, maintained
separately for each relation (thus, the part “Double Index” in DINER). This
optimization results in faster processing of these tuples during the Reactive
and Cleanup phases. The Reactive phase of DINER employs a symmetric nested loop
join process, combined with novel bookkeeping that allows the algorithm to react
to the unpredictability of the data sources. The fusion of the two techniques
allows DINER to make much more efficient use of available main memory.
We demonstrate in our experiments that DINER
has a higher rate of join result production and is much more adaptive to
changes in the environment, including changes in the value distributions of the
streamed tuples and in their arrival rates. MINER extends DINER to multiday
joins and it maintains all the distinctive and efficiency generating properties
of DINER. MINER maximizes the output rate by: 1) adopting an efficient probing
sequence for new incoming tuples which aims to reduce the processing overhead
by interrupting index lookups early for those tuples that do not participate in
the overall result; 2) applying an effective flushing policy that keeps in
memory the tuples that produce results, in a manner similar to DINER; and 3)
activating a Reactive phase when all inputs are blocked, which joins on-disk
tuples while keeping the result correct and being able to promptly hand over in
the presence of new input. Compared to DINER, MINER faces additional challenges
namely: 1) updating and synchronizing the statistics for each join attribute
during the online phase, and 2) more complicated bookkeeping in order to be
able to guarantee correctness and prompt handover during reactive phase. We are
able to generalize the reactive phase of DINER for multiple inputs using a
novel scheme that dynamically changes the roles between relations. An important
feature of both DINER and MINER is that they are the first adaptive, completely
unblocking join techniques that support range join conditions. Range join
queries are a very common class of joins in a variety of applications, from
traditional business data processing to financial analysis applications and
spatial data processing. Progressive Merge Join (PMJ) [4], one of the early
adaptive algorithms, also supports range conditions, but its blocking behavior
makes it a poor solution given the requirements of current data integration
scenarios. Our experiments show that PMJ is outperformed by DINER.
The contributions of our paper are:
We introduce DINER a novel adaptive join
algorithm that supports both equality and ranges join predicates. DINER builds
on an intuitive flushing policy that aims at maximizing the productivity of
tuples that are kept in memory. DINER is the first algorithm to address the
need to quickly respond to bursts of arriving data during the Reactive phase.
We propose a novel extension to nested loops join for processing disk-resident
tuples when both sources block, while being able to swiftly respond to new data
arrivals. We introduce MINER, a novel adaptive multiday join algorithm that
maximizes the output rate, designed for dealing with cases where data are held
by multiple remote sources. . We provide a thorough discussion of existing
algorithms, including identifying some important limitations, such as increased
memory consumption because of their inability to quickly switch to the Arriving
phase and not being responsive enough when value distributions change. . We
provide an extensive experimental study of DINER including performance
comparisons to existing adaptive join algorithms and a sensitivity analysis.
Our results demonstrate the superiority of DINER in a variety of realistic
scenarios. During the online phase of the algorithm, DINER manages to produce
up to three times more results compared to previous techniques. The performance
gains of DINER are realized when using both real and synthetic data and are
increased when fewer resources (memory) are given to the algorithm. We also
evaluate the performance of MINER, and we show that it is still possible to
obtain early a large percentage of results even in more elaborated setups where
data are provided through multiple inputs. Our experimental study shows that
the performance of MINER is 60 times higher compared to the existing Miner Join algorithm when a four-way
star join is executed in a constrained memory environment.
The rest of the paper is
organized as follows: Section 2 presents prior work in the area. In Section 3,
we introduce the DINER algorithm and discuss in detail its operations. MINER is
analyzed in Section 4. Section 5 contains our Experiments, and Section 6
contains concluding remarks.
Existing
system:
1. The input relations they use are
provided by external network sources. The implication is that one has little or
no control over the order or rate of arrival of tuples.
2. Since the data source reply speed,
streaming rate and streaming order, as well as network traffic and congestion
levels, are unpredictable,
3. Traditional
join algorithms are often unsuitable or inefficient.
Proposal
System:
1. Adaptive join algorithms were created in
order to lift the limitations of traditional join algorithms in such
environments.
2.
By being able to produce results whenever input tuples become available,
adaptive join algorithms overcome situations like initial delay, slow data
delivery, or busty arrival, which can affect the efficiency of the join.
3. We propose DINER and MINER algorithm for
result rate maximization
Why adaptive joins?
Ø Data
provided by autonomous sources
Ø Data
is transported through unreliable network
Ø Data
arrival rate cannot be controlled
Ø Limited
memory budget at the local site
Ø Producing
results as soon as first input tuple is available
Ø Smoothing
of join result production
Ø Masking
source and network delay
Diner: Novel, Better performance,
Adaptive join
Ø An
intuitive and effective flushing policy
Ø Introducing
a more responsive reactive phase
Ø Takes
advantage of source blocking and continues to produce results
Ø Prompt
handover when new input tuples arrive
Ø Significantly
higher result rate compared to existing algorithms
Ø A
learner algorithm-better performance in more constrained memory environment
Ø Supports
equijoins and range queries
No comments:
Post a Comment