QUERY AWARE DETERMINIZATION OF UNCERTAIN OBJECTS
Abstract—This paper considers the problem of determinizing probabilistic data so that such data can be stored in legacy systems that accept only deterministic input. Probabilistic data may be generated by automated data analysis/enrichment techniques such as entity resolution, information extraction, and speech processing. The legacy system may correspond to pre-existing web applications such as Flickr, Picasa, etc. The goal is to generate a deterministic representation of probabilistic data that optimizes the quality of the end-application built on deterministic data. We explore the determinization problem in the context of two different data processing tasks—triggers and selection queries. We show that approaches such as thresholding or top-1 selection, traditionally used for determinization, lead to suboptimal performance for such applications. Instead, we develop a query-aware strategy and show its advantages over existing solutions through a comprehensive empirical evaluation over real and synthetic datasets.
EXISTING SYSTEM:
Determinizing Probabilistic Data. While we are not aware of any prior work that directly addresses the issue of determinizing probabilistic data as studied in this paper, the most closely related work is [6], which explores how to determinize answers to a query over a probabilistic database. In contrast, we are interested in the best deterministic representation of the data itself (and not of an answer to a query), so as to continue using existing end-applications that take only deterministic input. The differences in the two problem settings lead to different challenges. The authors of [13] address the problem of choosing the set of uncertain objects to be cleaned in order to achieve the best improvement in the quality of query answers. However, their goal is to improve the quality of a single query, while ours is to optimize the quality of an overall query workload. Also, they focus on how to select the best set of objects, and each selected object is cleaned through human clarification, whereas we determinize all objects automatically. These differences lead to fundamentally different optimization challenges. Another related area is MAP inference in graphical models, which aims to find the assignment to each variable that jointly maximizes the probability defined by the model. The determinization problem for the cost-based metric can potentially be viewed as an instance of the MAP inference problem. If we view the problem that way, the challenge becomes that of developing fast, high-quality approximate algorithms for solving the corresponding NP-hard problem. Section 3.3 provides exactly such algorithms, heavily optimized and tuned specifically to our problem setting.

Probabilistic Data Models. A variety of advanced probabilistic data models have been proposed in the past. Our focus, however, is on determinizing probabilistic objects, such as image tags and speech output, for which the probabilistic attribute model suffices. We note that determinizing probabilistic data stored in more advanced probabilistic models, such as the And/Xor tree, might also be interesting. Extending our work to deal with data of such complexity remains an interesting future direction of work.
PROPOSED SYSTEM:
Overall, the main contributions of this paper are:
• We introduce the problem of determinizing probabilistic data. Given a workload of triggers/queries, the main challenge is to find the deterministic representation of the data that optimizes certain quality metrics of the answers to these triggers/queries.
• We propose a framework that solves the determinization problem by minimizing the expected cost of the answers to queries. We develop a branch-and-bound algorithm that finds an approximate, near-optimal solution to the resulting NP-hard problem.
• We address the problem of determinizing a collection of objects to optimize set-based quality metrics, such as the F-measure. We develop an efficient algorithm that reaches near-optimal quality.
• We extend the solutions to handle a data model in which mutual exclusion exists among tags. We show that correlations among tags can be leveraged in our solutions to obtain better results. We also demonstrate that our solutions are designed to handle various types of queries.
• We empirically demonstrate that the proposed algorithms are very efficient and reach high-quality results that are very close to those of the optimal solution. We also demonstrate that they are robust to small changes in the original query workload.
Module 1: Data Quality
Data quality refers to the degree to which data is fit for use. There are many definitions of data quality, but data is generally considered high quality if "they are fit for their intended uses in operations, decision making and planning" (J. M. Juran). Alternatively, data is deemed of high quality if it correctly represents the real-world construct to which it refers. Furthermore, apart from these definitions, as data volume increases, the question of internal consistency within the data becomes significant, regardless of fitness for use for any particular external purpose. People's views on data quality can often be in disagreement, even when discussing the same set of data used for the same purpose. A considerable amount of data quality research involves investigating and describing various categories of desirable attributes (or dimensions) of data. These lists commonly include accuracy, correctness, currency, completeness, and relevance. Nearly 200 such terms have been identified, and there is little agreement on their nature (are these concepts, goals, or criteria?), their definitions, or their measures (Wang et al., 1993). Software engineers may recognize this as a problem similar to "ilities".
Data quality control is the process of controlling the usage of data with known quality measurements for an application or a process. This process is usually performed after a Data Quality Assurance (QA) process, which consists of discovering and correcting data inconsistencies. The QA process assesses aspects such as:
· Severity of inconsistency
· Incompleteness
· Accuracy
· Precision
· Missing/unknown values
The data QC process uses the information from the QA process to decide whether to use the data for analysis or in an application or business process. For example, if a data QC process finds that the data contains too many errors or inconsistencies, it prevents that data from being used for its intended process, where it could cause disruption. For instance, providing invalid measurements from several sensors to the automatic pilot feature of an aircraft could cause it to crash. Thus, establishing a data QC process protects downstream data usage and ensures that information is used safely.
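To illustrate the kind of gate such a QC process implements, the sketch below shows a minimal batch-level check in Python. The `has_error` flag, the error-rate threshold, and the sensor example are illustrative assumptions, not part of the paper.

```python
# Minimal sketch of a data QC gate, assuming an upstream QA step has
# already produced per-record error flags. Thresholds are illustrative.

def qc_gate(records, max_error_rate=0.05):
    """Decide whether a batch of records is fit for downstream use.

    records: list of dicts with a boolean 'has_error' flag set by QA.
    Returns True if the batch passes, False if it should be blocked.
    """
    if not records:
        return False  # an empty batch carries no usable information
    errors = sum(1 for r in records if r.get("has_error", False))
    return errors / len(records) <= max_error_rate

# Example: block a sensor feed in which too many readings failed QA.
readings = [{"value": 3.2, "has_error": False},
            {"value": -1.0, "has_error": True},
            {"value": 3.1, "has_error": False}]
if not qc_gate(readings, max_error_rate=0.1):
    print("Batch rejected: error rate exceeds QC threshold")
```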
Module 2: Determinization Problem
Having defined the notation, we can now define the determinization problem:

Definition 1 (Determinization). Given a set of uncertain objects O = {O1, O2, ..., O|O|}, a query workload Q, and a quality metric F, the goal of the determinization problem is, for each object O ∈ O, to select from WO a set of tags AO ⊆ WO as the deterministic representation of O, such that F(O,Q) is optimized.

The ground-truth tags G are not stored with the uncertain objects and are thus not known to any approach to the determinization problem. Therefore, such algorithms cannot directly measure the quality F(O,Q) during their execution. Thus, in the following section, we present our solution, which is based on maximizing the expected quality measure E(F(O,Q)).
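To make E(F(O,Q)) concrete, consider a minimal sketch for a single object under two simplifying assumptions that are ours, not the paper's: tags are independent, and each query asks for a single tag, incurring unit cost (scaled by a query weight) whenever the deterministic answer disagrees with the ground truth. Including tag w then risks a false positive with probability 1 − P(w), while excluding it risks a false negative with probability P(w).

```python
# Hedged sketch of expected-cost computation for one uncertain object,
# assuming independent tags and weighted single-tag queries. All names,
# probabilities, and weights below are illustrative.

def expected_cost(answer_set, tag_probs, workload):
    """answer_set: set of tags chosen as the deterministic representation.
    tag_probs: dict tag -> probability the tag is in the ground truth.
    workload: list of (tag, weight) single-tag queries."""
    cost = 0.0
    for tag, weight in workload:
        p = tag_probs.get(tag, 0.0)
        # false-positive risk if the tag is kept, false-negative if dropped
        cost += weight * ((1.0 - p) if tag in answer_set else p)
    return cost

tag_probs = {"sky": 0.9, "lake": 0.45, "car": 0.1}
workload = [("sky", 5), ("lake", 2), ("car", 3)]
print(expected_cost({"sky", "lake"}, tag_probs, workload))
# 5*0.1 + 2*0.55 + 3*0.1 = 1.9
```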
Before we describe our solution to the determinization problem, we note that a natural baseline algorithm for determinization is a thresholding (threshold-cut) approach. The thresholding approach employs a predefined threshold τ. For each object O it composes its answer set AO by choosing those tags wi from WO for which P(wi) ≥ τ. The advantage of this approach is that it is computationally efficient and can potentially be tuned to a particular dataset O and workload Q by changing τ. Its main drawback is that it is unaware of the query workload ("query-unaware") and thus does not necessarily optimize the given quality metrics, which leads to lower quality.
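A minimal sketch of this query-unaware baseline, using the same illustrative tag-probability representation as the earlier sketch:

```python
# Query-unaware thresholding baseline: keep every tag whose probability
# meets a fixed threshold tau, ignoring the query workload entirely.

def threshold_determinize(tag_probs, tau=0.5):
    """tag_probs: dict tag -> P(tag); returns the answer set A."""
    return {w for w, p in tag_probs.items() if p >= tau}

print(threshold_determinize({"sky": 0.9, "lake": 0.45, "car": 0.1}, tau=0.5))
# {'sky'} -- 'lake' is dropped even if the workload queries it heavily,
# which is exactly the query-unawareness the paper argues against.
```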
Module 3: Branch and Bound Algorithm
Instead of performing a brute-force enumeration, we can employ a faster branch-and-bound (BB) technique. The approach discovers answer sets in a greedy fashion, so that answer sets with lower cost tend to be discovered first.

Outline of the BB algorithm
The algorithm uses a branch-and-bound strategy to explore a search tree. Each node v in the tree corresponds to a partial tag selection, where decisions of whether or not to include certain tags in the answer set have been made. We capture the partial selection of tags using the concept of a sequence. The performance of the algorithm depends upon the choice of the node (amongst all the non-leaf nodes) to branch, the computation of the upper and lower bounds of a node, and the choice of the tag used to expand the node. We next describe these components of the algorithm in detail.
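The following is a simplified best-first branch-and-bound sketch under the separable single-tag cost model assumed in the earlier sketches. It is illustrative only: with that model the per-tag optimum is trivial and the bounds are tight, whereas the paper's bounds handle general workloads in which query costs couple multiple tags. The priority queue is keyed by a lower bound, so low-cost answer sets tend to surface first.

```python
import heapq
import itertools

# Simplified best-first branch-and-bound for one object, assuming the
# separable single-tag cost model from the earlier sketch. It only
# illustrates the search structure described above.

def bb_determinize(tags, tag_probs, workload):
    weight = {t: 0.0 for t in tags}
    for t, w in workload:
        weight[t] = weight.get(t, 0.0) + w

    def lower_bound(decided):
        # Cost locked in by yes/no decisions plus, for each undecided
        # tag, the cheaper of its false-positive/false-negative risks.
        lb = sum(weight[t] * ((1 - tag_probs[t]) if yes else tag_probs[t])
                 for t, yes in decided.items())
        for t in tags:
            if t not in decided:
                lb += weight[t] * min(tag_probs[t], 1 - tag_probs[t])
        return lb

    tie = itertools.count()  # breaks ties so dicts are never compared
    heap = [(lower_bound({}), next(tie), {})]
    while heap:
        lb, _, decided = heapq.heappop(heap)
        if len(decided) == len(tags):
            # First complete node popped is optimal: its key is its exact
            # cost, and all other keys are admissible lower bounds.
            return {t for t, yes in decided.items() if yes}, lb
        t = tags[len(decided)]  # branch on the next undecided tag
        for yes in (True, False):
            child = dict(decided, **{t: yes})
            heapq.heappush(heap, (lower_bound(child), next(tie), child))

tags = ["sky", "lake", "car"]
tag_probs = {"sky": 0.9, "lake": 0.45, "car": 0.1}
workload = [("sky", 5), ("lake", 2), ("car", 3)]
print(bb_determinize(tags, tag_probs, workload))
# ({'sky'}, ~1.7): 'lake' is excluded since 2 * 0.45 < 2 * 0.55
```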
Module 4: Node Selection
The algorithm maintains a priority queue H for picking the node v that contains the most promising sequence Sv to continue with. Among the nodes in the priority queue, which one should we choose to branch next? Consider the sequence Sv that corresponds to a node v ∈ H. For v we define Av as the set of answer sequences that correspond to the leaf nodes derivable from node v via the branching procedure described above. That is, Av corresponds to the leaf nodes of the subtree rooted at node v. For example, for node v2 of the tree in Fig. 3, Av2 = {Sv8, Sv9, Sv10, Sv11}. Then, for node v, let mv = min_{A ∈ Av} E(cost(A,G,Q)) be the minimum expected cost among these sequences. Notice that if we knew the exact value of mv, it would be an ideal key for the priority queue H, since it would lead to the quickest way to find the best sequences. The problem is that the exact value of mv is unknown when v is branched, since the subtree rooted at v has not yet been constructed at that moment. Even though mv is unknown, it is possible to quickly determine good lower and upper bounds on its value, lv ≤ mv ≤ hv, without comparing the scores of each sequence in Av. For any node v, if Sv is an answer sequence then the bounds are equal to the cost of the sequence itself; otherwise, the bounds are computed as explained next.
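Under the illustrative separable cost model used in the earlier sketches (an assumption of ours; the paper's bounds cover general query workloads), such bounds can be computed in one pass over the undecided tags: the lower bound takes the cheaper of each tag's two risks and the upper bound the costlier, so lv ≤ mv ≤ hv holds without enumerating the subtree's leaves.

```python
# Hedged sketch of per-node bounds on m_v, reusing the separable cost
# model from the branch-and-bound sketch.

def bounds(decided, tags, tag_probs, weight):
    fixed = sum(weight[t] * ((1 - tag_probs[t]) if yes else tag_probs[t])
                for t, yes in decided.items())
    lo = hi = fixed
    for t in tags:
        if t not in decided:
            fp = weight[t] * (1 - tag_probs[t])  # risk if tag is included
            fn = weight[t] * tag_probs[t]        # risk if tag is excluded
            lo += min(fp, fn)
            hi += max(fp, fn)
    return lo, hi

weight = {"sky": 5, "lake": 2, "car": 3}
tag_probs = {"sky": 0.9, "lake": 0.45, "car": 0.1}
print(bounds({"sky": True}, ["sky", "lake", "car"], tag_probs, weight))
# ~ (1.7, 4.3); for an answer sequence (all tags decided) lo == hi.
```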
Module 5: Query-Level Optimization
The performance of the BB algorithm can be significantly improved further by employing query-level optimizations. For a given sequence Sv of node v, we might be able to determine the expected cost of certain queries exactly (e.g., Cases 1 and 2 above), even if Sv is not an answer sequence. In other words, for these queries we can obtain their cost without expanding Sv into the corresponding answer sequences A ∈ Av. We refer to such cost as the fixed cost of v, and to such queries as fixed or decided queries with respect to O. Thus, each node v is also associated with the set of undecided queries Qv, its fixed cost ĉv, as well as its lower and upper bounds l̂v and ĥv on the non-fixed cost. Note that lv = ĉv + l̂v and hv = ĉv + ĥv. Fixed costs and fixed queries give us an opportunity to make the BB algorithm even more efficient by reducing the number of queries involved in computing the expected cost. When the algorithm performs a branching at a node, it creates two new nodes, vyes and vno, by making a yes/no decision for one undecided tag w. Since one additional tag is now decided for each new node v, it is possible that the expected cost of some query Q ∈ Qv can now be computed (Cases 1 and 2). If so, Q becomes a fixed query for the new node and will not be considered in future computations derived from that node. Based on the relation between Sv and Q, there are three different cases for updating ĉv, l̂v, and ĥv.
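As a concrete illustration, the sketch below maintains this bookkeeping during branching, under two assumptions of ours: queries are conjunctive (an object matches Q only if it carries every tag of Q) and tags are independent. A query whose outcome is settled by the decisions made so far leaves Qv, and its expected cost is folded into ĉv.

```python
from math import prod

# Hedged sketch of fixed-query bookkeeping, assuming conjunctive queries
# and independent tag probabilities. When branching decides tag w, any
# query whose outcome is now determined moves out of Q_v and its expected
# cost is folded into the node's fixed cost c_v.

def branch(node, tag, yes, tag_probs):
    """node: dict with 'decided' (tag -> bool), 'undecided' (list of
    (query_tags, weight)), and 'fixed_cost'. Returns the child node."""
    decided = dict(node["decided"], **{tag: yes})
    fixed_cost = node["fixed_cost"]
    still_undecided = []
    for q_tags, w in node["undecided"]:
        p_match = prod(tag_probs[t] for t in q_tags)  # P(truth matches Q)
        if any(decided.get(t) is False for t in q_tags):
            fixed_cost += w * p_match        # never retrieved: FN risk
        elif all(decided.get(t) is True for t in q_tags):
            fixed_cost += w * (1 - p_match)  # always retrieved: FP risk
        else:
            still_undecided.append((q_tags, w))  # outcome still open
    return {"decided": decided, "undecided": still_undecided,
            "fixed_cost": fixed_cost}

root = {"decided": {}, "fixed_cost": 0.0,
        "undecided": [(("sky", "lake"), 2), (("car",), 3)]}
child = branch(root, "car", False, {"sky": 0.9, "lake": 0.45, "car": 0.1})
print(child["fixed_cost"], child["undecided"])
# ~0.3 [(('sky', 'lake'), 2)] -- the 'car' query is now fixed
```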
CONCLUSION:
In this paper we have considered the problem of determinizing uncertain objects to enable such data to be stored in pre-existing systems, such as Flickr, that take only deterministic input. The goal is to generate deterministic representations that optimize the quality of answers to queries/triggers that execute over the deterministic data representation. We have proposed efficient determinization algorithms that are orders of magnitude faster than the enumeration-based optimal solution while achieving almost the same quality. As future work, we plan to explore determinization techniques in the context of applications wherein users are also interested in retrieving objects in a ranked order.
REFERENCES
[1] D. V. Kalashnikov, S. Mehrotra, J.
Xu, and N. Venkatasubramanian, “A semantics-based approach for speech
annotation of images,” IEEE Trans. Knowl. Data Eng., vol. 23, no. 9, pp.
1373–1387, Sept. 2011.
[2] J. Li and J. Wang, “Automatic
linguistic indexing of pictures by a statistical modeling approach,” IEEE
Trans. Pattern Anal. Mach. Intell., vol. 25, no. 9, pp. 1075–1088, Sept.
2003.
[3] C. Wang, F. Jing, L. Zhang, and H. Zhang, “Image annotation refinement using random walk with restarts,” in Proc. 14th Annu. ACM Int. Conf. Multimedia, New York, NY, USA, 2006.
[4] B. Minescu, G. Damnati, F. Bechet,
and R. de Mori, “Conditional use of word lattices, confusion networks and
1-best string hypotheses in a sequential interpretation strategy,” in Proc. ICASSP,
2007.
[5] R. Nuray-Turan, D. V. Kalashnikov,
S. Mehrotra, and Y. Yu, “Attribute and object selection queries on objects with
probabilistic attributes,” ACM Trans. Database Syst., vol. 37, no. 1,
Article 3, Feb. 2012.
[6] J. Li and A. Deshpande, “Consensus answers for queries over probabilistic databases,” in Proc. 28th ACM SIGMOD-SIGACT-SIGART Symp. PODS, New York, NY, USA, 2009.
[7] M. B. Ebrahimi and A. A. Ghorbani, “A novel approach for frequent phrase mining in web search engine query streams,” in Proc. 5th Annu. Conf. CNSR, Fredericton, NB, Canada, 2007.
[8] S. Bhatia, D. Majumdar, and P.
Mitra, “Query suggestions in the absence of query logs,” in Proc. 34th Int.
ACM SIGIR, Beijing, China, 2011.
[9] C. Manning and H. Schutze, Foundations
of Statistical Natural Language Processing, Cambridge, MA, USA: MIT Press, 1999.
[10] D. V. Kalashnikov and S. Mehrotra,
“Domain-independent data cleaning via analysis of entity-relationship graph,” ACM
Trans. Database Syst., vol. 31, no. 2, pp. 716–767, Jun. 2006.
[11] K. Schnaitter, S. Abiteboul, T.
Milo, and N. Polyzotis, “On-line index selection for shifting workloads,” in Proc.
IEEE 23rd Int. Conf. Data Eng. Workshop, Istanbul, Turkey, 2007.
[12] P. Unterbrunner, G. Giannikis, G.
Alonso, D. Fauser, and D. Kossmann, “Predictable performance for unpredictable
workloads,” in Proc. VLDB, Lyon, France, 2009.
[13] R. Cheng, J. Chen, and X. Xie,
“Cleaning uncertain data with quality guarantees,” in Proc. VLDB,
Auckland, New Zealand, 2008.