PAGE: A PARTITION AWARE ENGINE FOR PARALLEL GRAPH COMPUTATION
Abstract—Graph partition quality affects the overall performance of parallel graph computation systems. The quality of a graph partition is measured by the balance factor and the edge cut ratio. A balanced graph partition with a small edge cut ratio is generally preferred, since it reduces the expensive network communication cost. However, according to an empirical study on Giraph, the performance over a well-partitioned graph can be even two times worse than over simple random partitions. This is because these systems are optimized only for simple partition strategies and cannot efficiently handle the increased workload of local message processing when a high quality graph partition is used. In this paper, we propose a novel partition aware graph computation engine named PAGE, which is equipped with a new message processor and a dynamic concurrency control model. The new message processor concurrently processes local and remote messages in a unified way, and the dynamic model adaptively adjusts the concurrency of the processor based on online statistics. The experimental evaluation demonstrates that PAGE performs efficiently over graph partitions of various qualities.
EXISTING SYSTEM:
This work is related to several research areas: not only graph computation systems, but also graph partition techniques and their effective integration, which is essential to push forward current parallel graph computation systems. Here we briefly discuss these related research directions. Graph computation systems. Parallel graph computation is a popular technique to process and analyze large scale graphs. Different from traditional big data analysis frameworks, most graph computation systems store the graph data in memory and cooperate with other computing nodes via a message passing interface. Besides, these systems adopt the vertex-centric programming model and release users from tedious communication protocols. Such systems also provide fault tolerance and high scalability compared to traditional graph processing libraries, such as Parallel BGL and CGMgraph. There exist several excellent systems, like Pregel, Giraph, GPS, and Trinity. Since messages are the key intermediate results in graph computation systems, all of these systems apply optimization techniques to message processing. Pregel and Giraph handle local messages in the computation component but process remote messages concurrently. This is optimized only for simple random partitions and cannot efficiently exploit a well-partitioned graph. Based on Pregel and Giraph, GPS applies several other optimizations for performance improvement. One optimization for message processing is that GPS uses a centralized message buffer in a worker to reduce the number of synchronizations. This optimization enables GPS to utilize high quality graph partitions, but it is still preliminary and cannot be extended to a variety of graph computation systems. Trinity optimizes the global message distribution with bipartite graph partition techniques to reduce memory usage, but it does not discuss the message processing of a single computing node. In this paper, PAGE focuses on the efficiency of a worker processing messages.
PROPOSED SYSTEM:
The main contributions of our work are summarized as follows:
_ We identify the problem that existing graph computation systems cannot efficiently exploit the benefit of high quality graph partitions.
_ We design a new partition aware graph computation engine, called PAGE. It can effectively harness the partition information to guide parallel processing resource allocation and improve the computation performance.
_ We introduce a dual concurrent message processor. The message processor concurrently processes incoming messages in a unified way and is the cornerstone of PAGE.
_ We present a dynamic concurrency control model. The model estimates the concurrency of the dual concurrent message processor by satisfying the producer-consumer constraints, and it always generates proper configurations for PAGE when the graph applications or the underlying graph partitions change.
_ We implement a prototype of PAGE and test it with real-world graphs and various graph algorithms. The results clearly demonstrate that PAGE performs efficiently over various graph partitioning qualities.
This paper extends a preliminary work [32] in the following aspects. First, we analyze in detail the relationship among the workload of message processing, graph algorithms, and graph partitions. Second, the technical details behind the dynamic concurrency control model are analyzed clearly. Third, a practical dynamic concurrency control model, which estimates the concurrency in an incremental fashion, is discussed. Fourth, to show the effectiveness of DCCM, a comparison experiment between DCCM and manual tuning is conducted. Fifth, to show the advantage and generality of PAGE, more graph algorithms, such as a diameter estimator, breadth first search (BFS), and single source shortest path (SSSP), are run on PAGE.
Module 1
Graph Algorithms
In reality, besides the graph partition, the actual workload of message processing in an execution instance is also related to the characteristics of the graph algorithm. Here we follow a common categorization of graph algorithms. On the basis of their communication characteristics when running on a vertex-centric graph computation system, graph algorithms are classified into stationary and non-stationary ones. Stationary graph algorithms have the feature that all vertices send and receive the same distribution of messages between supersteps; PageRank is an example. In contrast, in non-stationary graph algorithms the destination or size of the outgoing messages changes across supersteps. For example, traversal-based graph algorithms, e.g., breadth first search and single source shortest path, are non-stationary. In stationary graph algorithms, every vertex has the same behavior during the execution, so the workload of message processing only depends on the underlying graph partition. When a high quality graph partition is applied, the local message processing workload is higher than the remote one, and vice versa. A high quality graph partition helps improve the overall performance of stationary graph algorithms, since processing local messages is cheaper than processing remote messages, which incur a network transfer cost. For the traversal-based graph algorithms belonging to the non-stationary category, it is also true that the local message processing workload is higher than the remote one when a high quality graph partition is applied, because a high quality graph partition always clusters a dense subgraph together, which is then traversed in successive supersteps. However, a high quality graph partition cannot guarantee a better overall performance for non-stationary algorithms, because of the workload imbalance of the algorithm itself. This problem can be addressed by dynamic repartition techniques. In this paper, we focus on the efficiency of a worker when graph partitions of different qualities are applied. The system then achieves better overall performance by improving the performance of each worker, leaving the workload imbalance to dynamic repartition solutions. The next subsection reveals the drawback of existing systems when handling graph partitions of different qualities.
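To make the distinction concrete, the following minimal Python sketch contrasts the stationary message pattern of PageRank with the non-stationary pattern of a BFS frontier. The toy in-memory graph and the function names are illustrative assumptions, not part of PAGE's API:

graph = {0: [1, 2], 1: [2], 2: [0], 3: [2]}  # vertex -> out-neighbors

def pagerank_messages(ranks):
    # Stationary: every vertex messages all out-neighbors in every superstep,
    # so the message distribution is identical across supersteps.
    msgs = {v: [] for v in graph}
    for v, outs in graph.items():
        for u in outs:
            msgs[u].append(ranks[v] / len(outs))
    return msgs

def bfs_messages(frontier, visited):
    # Non-stationary: only the current frontier sends messages, so the
    # destinations and the message volume change from superstep to superstep.
    msgs = {v: [] for v in graph}
    for v in frontier:
        for u in graph[v]:
            if u not in visited:
                msgs[u].append(v)
    return msgs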
Module 2
Message processing
In Pregel-like graph computation systems, vertices exchange their status through message passing. When a vertex sends a message, the worker first determines whether the destination vertex of the message is owned by a remote worker or by the local worker. In the remote case, the message is buffered first; when a buffer size exceeds a certain threshold, the largest buffer is asynchronously flushed and delivered to its destination worker as a single message block. In the local case, the message is directly placed in the destination vertex's incoming message queue. Therefore, the communication cost in a single worker is split into a local communication cost and a remote communication cost. Combined with the computation cost, the overall cost of a worker has three components. The computation cost, denoted by tcomp, is the cost of executing the vertex programs. The local communication cost, denoted by tcomml, represents the cost of processing messages generated by the worker itself. The remote communication cost, denoted by tcommr, includes the cost of sending messages to other workers and waiting for them to be processed. In this paper, we use the cost of processing remote incoming messages locally to approximate the remote communication cost. There are two reasons for adopting such an approximation. First, the difference between the two costs is the network transfer cost, which is relatively small compared to the remote message processing cost. Second, the waiting cost within the remote communication cost is dominated by the remote message processing cost. The workload of local (remote) message processing determines the local (remote) communication cost, and the graph partition influences the workload distribution between local and remote message processing. A high quality graph partition, which is balanced and has a small edge cut ratio, usually leads to a local message processing workload that is higher than the remote message processing workload, and vice versa.
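The routing logic described above can be sketched in Python as follows. This is a minimal sketch under assumed interfaces (transport for peer delivery, owner_of for vertex ownership lookup); the names and the flush threshold are illustrative, not PAGE's implementation:

from collections import defaultdict

BLOCK_THRESHOLD = 4096  # assumed flush threshold, in buffered messages

class Worker:
    def __init__(self, owned_vertices, transport, owner_of):
        self.owned = set(owned_vertices)
        self.transport = transport            # assumed: sends blocks to peers
        self.owner_of = owner_of              # assumed: vertex -> worker id
        self.incoming = defaultdict(list)     # vertex -> incoming message queue
        self.out_buffers = defaultdict(list)  # peer worker -> buffered messages

    def send(self, target, msg):
        if target in self.owned:
            # Local case: place the message directly in the vertex's queue.
            self.incoming[target].append(msg)
        else:
            # Remote case: buffer, then flush the whole buffer as one block.
            peer = self.owner_of(target)
            self.out_buffers[peer].append((target, msg))
            if len(self.out_buffers[peer]) >= BLOCK_THRESHOLD:
                self.transport.send_block(peer, self.out_buffers.pop(peer))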
Module 3
PAGE
PAGE, which stands for Partition Aware Graph computation Engine, is designed to support graph partitions of different qualities and to maintain high performance through an adaptive tuning mechanism and new cooperation methods. Similar to the majority of existing parallel graph computation systems, PAGE follows the master-worker paradigm. The computing graph is partitioned and stored distributively among the workers' memory. The master is responsible for aggregating global statistics and coordinating global synchronization. Thus the workers in PAGE can be aware of the underlying graph partition information and optimize the graph computation task.
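The paradigm can be summarized by the following high-level superstep loop, a sketch under assumed master and worker interfaces rather than PAGE's actual code:

def run(master, workers, max_supersteps):
    for superstep in range(max_supersteps):
        for w in workers:
            w.run_superstep(superstep)  # compute phase plus message passing
        # The master aggregates global statistics (e.g., partition-related
        # metrics) and coordinates the global synchronization barrier.
        stats = master.aggregate(w.stats() for w in workers)
        master.broadcast(stats)
        if all(w.is_halted() for w in workers):
            break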
Module 4
Dual concurrent message processor
The dual concurrent message processor is the core of the enhanced communication model: it concurrently processes local and remote incoming messages in a unified way. With proper configurations for this new message processor, PAGE can efficiently deal with incoming messages over graph partitions of various qualities. Messages are delivered in blocks, because network communication is expensive. But this optimization introduces extra overhead: when a worker receives incoming message blocks, it needs to parse them and dispatch the extracted messages to the specific vertices' message queues. In PAGE, the message process unit is responsible for this extra overhead; it is the minimal independent processing unit in the communication module. A remote (local) message process unit only processes remote (local) incoming message blocks. A message processor is a collection of message process units, and the remote (local) message processor consists only of remote (local) message process units.
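One way to realize such a processor is with two pools of units draining two block queues, as in the following Python sketch (illustrative, assuming Python threads; not PAGE's actual implementation):

import queue
import threading

def message_process_unit(block_queue, dispatch):
    # A minimal independent unit: parse blocks and dispatch their messages.
    while True:
        block = block_queue.get()
        if block is None:           # poison pill: shut the unit down
            break
        for target, msg in block:
            dispatch(target, msg)   # append to the target vertex's queue

def start_dual_processor(n_local, n_remote, dispatch):
    local_q, remote_q = queue.Queue(), queue.Queue()
    units = []
    for q, n in ((local_q, n_local), (remote_q, n_remote)):
        for _ in range(n):
            t = threading.Thread(target=message_process_unit,
                                 args=(q, dispatch), daemon=True)
            t.start()
            units.append(t)
    return local_q, remote_q, units

Local and remote blocks are processed in a unified way here; only the number of units per pool differs, and that number is exactly the knob the dynamic concurrency control model tunes.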
Module 5
Dynamic concurrency control model
The concurrency of the dual concurrent message processor heavily affects the performance of PAGE, but it is expensive and also challenging to determine a reasonable concurrency ahead of the real execution without any assumptions. Therefore, PAGE needs a mechanism to adaptively tune the concurrency of the dual concurrent message processor. This mechanism is named the Dynamic Concurrency Control Model, DCCM for short. In PAGE, the concurrency control problem can be modeled as a typical producer-consumer scheduling problem, where the computation phase generates messages as the producer, and the message process units in the dual concurrent message processor are the consumers. Therefore, the producer-consumer constraints should be satisfied when solving the concurrency control problem. In the PAGE setting, the concurrency control problem gives rise to constraints on the consumers. Since the behavior of the producers is determined by the graph algorithm, PAGE only needs to adjust the consumers to satisfy the constraints imposed by that behavior, which are stated as follows. First, PAGE provides sufficient message process units to make sure that new incoming message blocks can be processed immediately and do not block the whole system; meanwhile, no message process unit is idle. Second, the assignment strategy of these message process units ensures that each local/remote message process unit has a balanced workload, since a disparity can seriously degrade the overall performance of parallel processing.
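The sizing idea behind these two constraints can be sketched as follows; the formulas are illustrative assumptions based on the producer-consumer view, not the paper's exact model. Match the aggregate service rate of the units to the observed message arrival rate, then split the units by the observed local/remote workload ratio:

import math

def estimate_concurrency(arrival_rate, unit_service_rate, local_fraction):
    # Constraint 1: enough consumers to keep up with the producer,
    # but no more, so that no unit sits idle.
    total = max(2, math.ceil(arrival_rate / unit_service_rate))
    # Constraint 2: balance the units against the local/remote workload.
    local = max(1, round(total * local_fraction))
    remote = max(1, total - local)
    return local, remote

# Example: 2M msgs/s arriving, a unit serves 250K msgs/s, 75% local traffic.
print(estimate_concurrency(2_000_000, 250_000, 0.75))  # -> (6, 2)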
CONCLUSION
In this paper, we have identified the partition unaware problem in current graph computation systems and its severe drawbacks for efficient parallel processing of large scale graphs. To address this problem, we proposed a partition aware graph computation engine named PAGE that monitors three high-level key running metrics and dynamically adjusts the system configuration. In the adjusting model, we elaborated two heuristic rules to effectively extract the system characteristics and generate proper parameters. We have implemented a prototype system and conducted extensive experiments that prove PAGE to be an efficient and general parallel graph computation engine.