Stale wiki pages
In all subsequent experiments, Aspen was built according to its developers' specifications (compiled with g++-7 -fcilkplus and the Cilk Plus runtime libraries) and given access to all 48 cores and the full memory of the new lab machine.
| Input Stream | Ingestion Speed (10^6 updates / sec) | Total Runtime (sec) |
|---|---|---|
| kron_13_fourth | 1.64828 | 10.597 |
| kron_15_fourth | 1.27607 | 218.918 |
| kron_16_fourth | 1.22838 | 909.619 |
| kron_17_fourth | 1.20785 | 3700.27 |
| Input Stream | Ingestion Speed (10^6 updates / sec) | Total Runtime (sec) |
|---|---|---|
| kron_13_fourth | 1.47915 | 11.8092 |
| kron_15_fourth | 1.12393 | 248.551 |
| kron_16_fourth | 1.12532 | 992.933 |
| kron_17_fourth | 1.07073 | 4174.12 |
On the kron_13_fourth stream:
| Batch Size (# updates) | Ingestion Speed (10^6 updates / sec) | Total Runtime (sec) |
|---|---|---|
| 10^3 | 1.64084 | 10.645 |
| 10^4 | 2.68386 | 6.51 |
| 10^5 | 3.07802 | 5.676 |
| 10^6 | 3.4112 | 5.122 |
| 10^7 | 3.42795 | 5.099 |
Ran against naive Kruskal's, NetworkX, and our streaming implementation of connected components on graphs with density n^2 * 0.95. Time is measured in wall-clock seconds.
| n | Naive | NetworkX | Streaming |
|---|---|---|---|
| 10^3 | 1s | 183 (100K) | |
| 5*10^3 | 61 (1.5G) | | |
| 6*10^3 | 1m 31s (2.9G) | | |
| 7*10^3 | 2m 6s (3.6G) | | |
| 10^4 | 4m 58s (6.1G) | 30000+ (2.3G) | |
| 1.1*10^4 | 6m 4s (7.1G) | | |
| 1.2*10^4 | 7m 24s (11.8G) | | |
| 1.3*10^4 | 8m 43s (13.1G) | | |
| 1.4*10^4 | 10m 9s (14.7G) | | |
| 1.5*10^4 | 11m 50s (16.2G) | | |
| 1.6*10^4 | 13m 29s (17.8G) | | |
| 2*10^4 | (21.9G) | | |
| n | Naive | NetworkX | Streaming |
|---|---|---|---|
| 10^3 | | | |
| 5*10^3 | 26m 1s | | |
| 6*10^3 | 51m 53s | | |
| 7*10^3 | 12h+ (^C) | | |
| 10^4 | 28h+ (^C) | | |
| 10^5 | | | |
| n | Naive | NetworkX | Streaming |
|---|---|---|---|
| 1.2*10^4 | 1h 15m 44s | | |
| 1.3*10^4 | 1h 27m 22s | | |
| 1.4*10^4 | 2h 10m 20s | | |
| 1.5*10^4 | 5h 4m 12s | | |
| 1.6*10^4 | 10h 46m 47s* | | |
| 10^5 | | | |
*: Dubious, since the program did not terminate on its own, but still gave output that signaled the end of execution.
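For reference, the "naive Kruskal's" baseline above is, presumably, a single union-find pass over the edge list (Kruskal's algorithm without the sorting or MST bookkeeping). A minimal C++ sketch of that idea, not the code actually used in these experiments:

```cpp
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Union-find (disjoint set union) with path halving and union by size.
struct DSU {
  std::vector<uint32_t> parent, size;
  explicit DSU(uint32_t n) : parent(n), size(n, 1) {
    std::iota(parent.begin(), parent.end(), 0);
  }
  uint32_t find(uint32_t x) {
    while (parent[x] != x) x = parent[x] = parent[parent[x]];
    return x;
  }
  void unite(uint32_t a, uint32_t b) {
    a = find(a); b = find(b);
    if (a == b) return;
    if (size[a] < size[b]) std::swap(a, b);
    parent[b] = a;
    size[a] += size[b];
  }
};

// Number of connected components of an n-vertex graph given as an edge list.
uint32_t num_components(uint32_t n, const std::vector<std::pair<uint32_t, uint32_t>>& edges) {
  DSU dsu(n);
  for (const auto& e : edges) dsu.unite(e.first, e.second);
  uint32_t components = 0;
  for (uint32_t v = 0; v < n; ++v)
    if (dsu.find(v) == v) ++components;
  return components;
}
```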
Values in the tables below are averages over 10 trials (a different random graph per trial). In each experiment we profile both ingestion and algorithm execution (this is necessary given the way the bipartiteness-to-CC reduction must be implemented). For random graph generation: the probability of an edge appearing was set to 0.03, the geometric insertion/removal parameter was set to 0.5, and there is no limit on how many times an edge may appear in the stream. Experiments were conducted on the lab machine.
| n | sketch-based CC | bipartiteness reduction | BGL is_bipartite() |
|---|---|---|---|
| 500 | 50 | 264 | 1.2 |
| 750 | 173 | 779.8 | 2.9 |
| 1000 | 332.5 | 1391.7 | 5.5 |
| 1250 | 667.9 | 2584.7 | 8 |
| 1500 | 1041.6 | 3935.9 | 13.9 |
| 1750 | 1439.6 | 5341.5 | 17 |
| 2000 | 1895.9 | 6971.4 | 28.3 |
| 2250 | 2812.5 | 10074 | 34.4 |
| 2500 | 3492.5 | 12402.5 | 41.1 |
| 2750 | 4248.3 | 15011.1 | 49.1 |
| 3000 | 5392 | 18781.4 | 59.8 |
| 3250 | 6334.3 | 21999.4 | 71.8 |
| 3500 | 7357 | 25484.6 | 87.7 |
| 3750 | 8453.8 | 29244 | 104.9 |
| 4000 | 9622.4 | 33257.5 | 117.4 |
| 4250 | 12440.7 | 42065.6 | 143.4 |
| 4500 | 13948.8 | 47178.1 | 169.5 |
| 4750 | 15566.2 | 52632.4 | 195.4 |
| 5000 | 17235.3 | 58344 | 223.2 |
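For context, the "bipartiteness reduction" column above times connected components on a transformed stream. One standard way to do this (whether or not it is exactly the variant implemented here) is the bipartite double cover: each vertex v becomes two copies v_0 and v_1, each edge {u, v} becomes the pair of edges {u_0, v_1} and {u_1, v_0}, and the input graph is bipartite iff the double cover has exactly twice as many components as the original. A minimal sketch of the stream transformation, using a hypothetical update type rather than the project's actual format:

```cpp
#include <cstdint>
#include <utility>

// Hypothetical edge-update type: an endpoint pair plus an insert/delete flag.
struct Update {
  uint32_t u, v;
  bool is_delete;
};

// Map one update on an n-vertex graph to two updates on the 2n-vertex
// bipartite double cover: copy 0 of u <-> copy 1 of v, and vice versa.
std::pair<Update, Update> double_cover(const Update& upd, uint32_t n) {
  Update a{upd.u, upd.v + n, upd.is_delete};  // edge u_0 -- v_1
  Update b{upd.u + n, upd.v, upd.is_delete};  // edge u_1 -- v_0
  return {a, b};
}

// G is bipartite iff cc(double cover of G) == 2 * cc(G).
bool is_bipartite(uint32_t cc_of_G, uint32_t cc_of_double_cover) {
  return cc_of_double_cover == 2 * cc_of_G;
}
```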
Each entry in the table is the average peak memory usage in bytes. Data were recorded using the valgrind massif tool with the flags --depth=1 and --ignore-fn=main enabled.
| n | sketch-based CC | bipartiteness reduction | BGL is_bipartite() |
|---|---|---|---|
| 500 | 22269440.0 | 83413488.0 | 413521.0 |
| 750 | 45939440.0 | 168471488.0 | 811405.8 |
| 1000 | 61225440.0 | 224601488.0 | 1373709.0 |
| 1250 | 93391440.0 | 338251488.0 | 2198597.8 |
| 1500 | 122613440.0 | 441789488.0 | 2982961.8 |
| 1750 | 143035440.0 | 515407488.0 | 3832357.8 |
| 2000 | 163457440.0 | 589025488.0 | 5116566.6 |
| 2250 | 220455440.0 | 785907488.0 | 6926131.4 |
| 2500 | 244941440.0 | 873221488.0 | 8719897.8 |
| 2750 | 269427440.0 | 960535488.0 | 10233761.0 |
| 3000 | 319257440.0 | 1133097488.0 | 11682636.2 |
| 3250 | 345855440.0 | 1227515488.0 | 13144881.8 |
| 3500 | 372453440.0 | 1321933488.0 | 14737891.4 |
| 3750 | 399051440.0 | 1416351488.0 | 16751365.0 |
| 4000 | 425649440.0 | 1510769488.0 | 19643921.8 |
| 4250 | 534119440.0 | 1878547488.0 | 23607701.8 |
| 4500 | 565533440.0 | 1989045488.0 | 28013207.4 |
| 4750 | 596947440.0 | 2099543488.0 | 31976669.8 |
| 5000 | 628361440.0 | 2210041488.0 | 35195520.2 |
| n | sketch-based CC | bipartiteness reduction | BGL is_bipartite() |
|---|---|---|---|
| 500 | 22381520.0 | 83781592.0 | 485070.4 |
| 750 | 46131520.0 | 169071608.0 | 964816.0 |
| 1000 | 61481536.0 | 225401640.0 | 1638487.2 |
| 1250 | 93731584.0 | 339351816.0 | 2604870.4 |
| 1500 | 123021552.0 | 443110040.0 | 3557333.6 |
| 1750 | 143511536.0 | 516948104.0 | 4610290.4 |
| 2000 | 164001536.0 | 590786152.0 | 6132981.6 |
| 2250 | 221139664.0 | 788031800.0 | 8215536.8 |
| 2500 | 245701680.0 | 875581832.0 | 10311092.0 |
| 2750 | 270263712.0 | 963131880.0 | 12146176.0 |
| 3000 | 320170000.0 | 1135931224.0 | 13953377.6 |
| 3250 | 346844016.0 | 1230585288.0 | 15796678.4 |
| 3500 | 373518000.0 | 1325239384.0 | 17802533.6 |
| 3750 | 400191984.0 | 1419893608.0 | 20267727.2 |
| 4000 | 426866064.0 | 1514547688.0 | 23637584.8 |
| 4250 | 535479584.0 | 1882900600.0 | 28105636.8 |
| 4500 | 566973616.0 | 1993654728.0 | 33043212.8 |
| 4750 | 598467600.0 | 2104408152.0 | 37566742.4 |
| 5000 | 629961616.0 | 2215162056.0 | 41363148.0 |
Performance numbers using pthreads for the graph workers and without any sketch-level parallelism.
Runtime is given in seconds (s) or minutes (m). The number in brackets is the average update rate in updates per second (k = thousand, m = million).
| Nodes | 1000 | 5000 | 10000 | 15000 |
|---|---|---|---|---|
| Edge Updates | 10^6 | ~2.4 x 10^7 | ~9.5 x 10^7 | ~2.1 x 10^8 |
| Datastruct Size | 63MB | 590MB | 1.5GB | 2.5 GB |
| Runtime Standard | 13.5s [74k] | 11.6m [34k] | 57.4m [28k] | 138m [25k] |
| Runtime FBT +1 | 5.2s [191k] | 3.5m [112k] | 16.6m [95k] | 40.1m [87k] |
| Runtime FBT +2 | 2.9s [342k] | 1.9m [206k] | 9.2m [172k] | 22.6m [154k] |
| Runtime FBT +4 | 1.7s [578k] | 1.1m [353k] | 5.4m [292k] | 13.8m [253k] |
| Runtime FBT +8 | 1.1s [903k] | 41s [584k] | 3.3m [467k] | 9.1m [386k] |
| Runtime FBT +16 | 0.9s [1.1m] | 31s [773k] | 2.6m [614k] | 7.4m [470k] |
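To make the "pthread graph workers, no sketch-level parallelism" setup above concrete, here is a minimal sketch of a worker loop; the Batch type, WorkQueue, and apply_batch_to_sketch are hypothetical stand-ins, not the project's actual interfaces:

```cpp
#include <pthread.h>

#include <queue>
#include <utility>
#include <vector>

// Hypothetical batch of edge updates destined for one sketch.
struct Batch { int sketch_id; std::vector<std::pair<int, int>> updates; };

// Placeholder for the sketching code; applies a whole batch serially,
// i.e. with no sketch-level parallelism, matching the configuration above.
void apply_batch_to_sketch(const Batch& batch) { (void)batch; }

struct WorkQueue {
  std::queue<Batch> q;
  pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;
  bool done = false;  // set by the producer once the stream is exhausted
};

// Body of one graph worker: pop a batch, apply it, repeat until drained.
void* graph_worker(void* arg) {
  WorkQueue* wq = static_cast<WorkQueue*>(arg);
  while (true) {
    pthread_mutex_lock(&wq->lock);
    while (wq->q.empty() && !wq->done)
      pthread_cond_wait(&wq->nonempty, &wq->lock);
    if (wq->q.empty()) { pthread_mutex_unlock(&wq->lock); return nullptr; }
    Batch batch = std::move(wq->q.front());
    wq->q.pop();
    pthread_mutex_unlock(&wq->lock);
    apply_batch_to_sketch(batch);
  }
}
// Workers are launched with pthread_create(&tid, nullptr, graph_worker, &wq).
```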
These numbers measure raw ingestion into the buffer tree, without any graph streaming work being done.
| Nodes | Branch Factor | Buffer Size | Number of Updates | Runtime [rate /s] |
|---|---|---|---|---|
| 512 | 4 | 1 MB | 33.5 million | 2.6s [12.9m] |
| 512 | 16 | 1 MB | 33.5 million | 1.7s [19.5m] |
| 1024 | 16 | 2 MB | 268 million | 15.6s [17.2m] |
Testing the cluster on a simple word count Apache Spark program.
Input size: 600MB
4 worker nodes:
--- 166.97850894927979 seconds ---
--- 163.22055530548096 seconds ---
--- 164.37560272216797 seconds ---
--- 171.68829035758972 seconds ---
3 worker nodes:
--- 187.09018564224243 seconds ---
--- 184.9761152267456 seconds ---
--- 188.4610195159912 seconds ---
2 worker nodes:
--- 361.66504549980164 seconds ---
--- 362.92039012908936 seconds ---
--- 361.94735503196716 seconds ---
1 worker node:
--- 1067.6949362754822 seconds ---
--- 1059.5218110084534 seconds ---
--- 1053.6695115566254 seconds ---
0 worker nodes:
--- 2056.108319759369 seconds ---
--- 2004.2446208000183 seconds ---
--- 2048.037629365921 seconds ---
34 worker nodes:
--- 84.35914134979248 seconds ---
--- 82.6325478553772 seconds ---
--- 80.98088264465332 seconds ---
GraphX by default works for Scala only. There is a Python wrapper for GraphFrames, which is what I'm using. I was able to run their connected components algorithm on the Facebook friends graph sample that they provide, but this was done locally (not on the cluster).
- Use spark-submit instead of pyspark so I can run it on the cluster. (Could use advice from Tyler but can do this myself if he's busy)
- Set up a Hadoop file system on the cluster. (Need some advice from Tyler)
- Modify the Ansible playbook so the workers can downgrade their Scala/Spark versions themselves. Almost done with this.
- Figure out how to load our own input graphs into the algorithm. (Evan might have some suggestions on this).
- Get the final datasets from Ahmed onto the cluster and test GraphX on it. How big are the datasets? The master node of the cluster only has 350GB of storage and each worker has 30GB. I do not know if HDFS allows us to "spread" a large dataset across all the nodes.
- Make it so we can vary the number of nodes. There should be a way to do this efficiently/elegantly with spark-submit, but if not we can just use Ansible.
Experiments under a 16 GB RAM restriction, using Tyler's read buffering (buffer size 10^6) and a batch size of 10^6.
| Input Stream | RES (GiB) | SWAP (GiB) | Overall Throughput (updates / second) | Total Ingestion Time (seconds) | Total Runtime (seconds) |
|---|---|---|---|---|---|
| kron13 | 2.4 | 0 | 3.41272e+06 | 6.43542 | 6.45255 |
| kron15 | 2.8 | 0 | 3.19362e+06 | 92.6225 | 92.7011 |
| kron16 | 4.1 | 0 | 2.99069e+06 | 372.126 | 372.44 |
| kron17 | 7.3 | 0 | 2.91467e+06 | 1532.25 | 1533.1 |
| kron18 | 16.9 | 13.5 | 303291 | 58916.7 | 59369.1 |
| Input Stream | RES (GiB) | SWAP (GiB) | Overall Throughput (updates / second) | Total Ingestion Time (seconds) | Total Runtime (seconds) |
|---|---|---|---|---|---|
| kron13 | 0.577 | 0 | 141012 | 155.748 | 155.807 |
| kron15 | 6.0 | 0 | 135892 | 2176.73 | 2177.15 |
| kron16 | | | | | |
| kron17 | | | | | |
| kron18 | | | | | |
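Generically, the "read buffering" used above just means pulling the on-disk stream into memory in large chunks rather than one update at a time. A minimal sketch under that interpretation (the record layout and function here are hypothetical, not necessarily Tyler's implementation):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Hypothetical on-disk update record: two endpoints plus an insert/delete flag.
struct RawUpdate { uint32_t u, v; uint8_t is_delete; };

// Read the update stream in chunks of buffer_size records (10^6 in the
// experiments above) and hand each chunk to a callback, instead of issuing
// one tiny read per update.
template <typename Callback>
void read_stream_buffered(const char* path, std::size_t buffer_size, Callback&& on_chunk) {
  std::FILE* f = std::fopen(path, "rb");
  if (f == nullptr) return;
  std::vector<RawUpdate> buf(buffer_size);
  std::size_t got;
  while ((got = std::fread(buf.data(), sizeof(RawUpdate), buf.size(), f)) > 0)
    on_chunk(buf.data(), got);  // process `got` updates in one call
  std::fclose(f);
}
```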
Given the latest sketch update speed test results from Kenny, we can now estimate the input sizes at which our data structure becomes smaller than those of various alternatives (Boost, NetworkX). We can also estimate how long our algorithm will take to run on such input sizes, giving us a sense of how close we are to outperforming the current state of the art for in-memory computation.
The above plot displays the space curves for three competitors (NetworkX's edge-list data structure, and Boost's adjacency-list and adjacency-matrix data structures). All three have O(n^2) space complexity, but the constants differ. We also plot space curves for three variants of our sketching data structure: the currently implemented version, which uses 16 B per bucket; a trimmed version, which reduces the size of the error-checking variable c for 12 B per bucket; and a further trimmed version, which uses half the typical number of buckets (reducing the number of repetitions used to avoid failure with high probability) for an effective 6 B per bucket. We mark the input sizes at which each of our sketching data structures requires no more memory than the most compact alternative (the Boost adjacency matrix), and note the data structure sizes at these break-even points.
We can now estimate how long it will take for our algorithm to run on the input sizes specified by these break-even points.

Further areas to consider:
- What do we need to know to make similar comparisons for existing external-memory graph solutions?
- How does our algorithm perform on more "typical" inputs (larger node sets, much sparser graphs) that appear in graph benchmarks such as the GAP benchmark suite? It's likely that on such graphs we need a lot of space (which we hope can simply be on disk) but suffer little to no penalty in stream processing time per edge (and since the graphs are sparse, the edge stream should be much shorter). For example: on GAP's Kron benchmark graph with 135M nodes and 2B edges, we only expect about a 5x slowdown in sketch update speed compared to a graph on 100K nodes (assuming that l_0 sketching on longer vectors doesn't matter much; Kenny will verify). However, our data structure size for this graph will be about 60 TB (see the rough arithmetic below), forcing us to store this structure on disk and use WODS techniques to make sure that stream ingestion is I/O efficient. That said, each sketch will only be about 0.5 MB, so sketches are still easy to distribute.
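For reference, the arithmetic behind the roughly 60 TB figure, using the per-sketch size quoted above:

```latex
\underbrace{1.35 \times 10^{8}}_{\text{sketches, one per node}} \times 0.5\ \mathrm{MB/sketch}
  \approx 6.75 \times 10^{13}\ \mathrm{B} \approx 67\ \mathrm{TB}
```

which is consistent with the ~60 TB estimate above (the exact figure depends on the precise sketch size).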
We need to decide which distributed framework to target since there may be issues with us using Spark. Add details about different options (including what we know about Spark) here.
Notes on integrating C++ code into PySpark https://www.slideshare.net/databricks/integrating-existing-c-libraries-into-pyspark-with-esther-kundin
- People have done it before
- This guide might be helpful
- There are some performance considerations - they recommend not keeping state in the C++ code? Am I misinterpreting that?
- Seems like a fair bit of added complexity, though I'm not sure how much.
Another tutorial on using C++ with Spark. At a quick glance this one seems simpler. https://medium.com/@thomaspt748/integrate-c-c-libraries-dll-so-into-apache-spark-scala-in-hadoop-cluster-56cdb26e15ea
Thrill - a C++ native distributed big data framework https://github.com/thrill/thrill
- C++ native should make integration much easier
- far less used than Spark, not an industry standard
- EXPERIMENTAL
Apache Spark
- Built on top of Hadoop, requires knowledge and understanding of Hadoop
- Widely used in industry and lots of functionality/features
- Only available in Java/Scala/Python
Open MPI
- Available in C/C++
- API forces user to handle inter-process communication
MapReduce framework for C/C++ on Hadoop (MR4C)
- GitHub repo for the code: https://github.com/google/mr4c
- Article written about it: https://gigaom.com/2015/02/18/google-open-sources-a-mapreduce-framework-for-c/
- It hasn't been updated in a while. Could be complete, could be deprecated.
- Spark is known to run faster than MapReduce?
Naive impl: insert; periodically query all updates for each sketch, batch all of them, and delete them from the DS.
- Put
- Get
- Delete
Better impl: insert, and modify the flushing algorithm to batch updates according to some heuristic (e.g. all updates are for the same sketch, depth in the tree, etc.); a sketch of this is given after the comparison below.
- Put
- Low-level access to flush (e.g. a custom flush, or a flush hook)
Given a buffer tree with B^e = 16 and buffer size 4096 (n = 10^10 for naive, n = 10^5 for better), the naive implementation runs roughly 10-100x slower.
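A minimal sketch of the "better" approach above, assuming a hypothetical flush hook exposed by the buffer tree (the types and functions here are illustrative, not the library's actual API):

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical update type: the sketch it belongs to plus an edge payload.
struct Update {
  uint32_t sketch_id;
  std::pair<uint32_t, uint32_t> edge;
};

// Placeholder for the sketching code: apply many updates to one sketch at once.
void sketch_batch_update(uint32_t sketch_id,
                         const std::vector<std::pair<uint32_t, uint32_t>>& edges) {
  (void)sketch_id; (void)edges;
}

// Hypothetical flush hook: when the buffer tree flushes a buffer, group the
// flushed updates by destination sketch and issue one batched call per
// sketch, instead of the naive put / periodic get / delete cycle.
void on_flush(const std::vector<Update>& flushed) {
  std::unordered_map<uint32_t, std::vector<std::pair<uint32_t, uint32_t>>> by_sketch;
  for (const Update& u : flushed) by_sketch[u.sketch_id].push_back(u.edge);
  for (const auto& [id, edges] : by_sketch) sketch_batch_update(id, edges);
}
```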
Each node in the above tree represents a fully external-memory (EM) graph processing model (GPM) in which connected components (CC) is implemented. Each directed edge indicates that the destination node's GPM demonstrates a CC algorithm that is faster than the CC algorithm implemented in the source node's GPM. The requirement for such a demonstration is an experiment comparing the performance of the two algorithms across graphs of varying scale on identical hardware in the same environment. The sources for the studies supporting each edge are provided below in the references section.
- Although X-Stream is technically not a fully EM GPM, it is included because it is used as a performance milestone for many other fully EM GPMs.
- NXGraph demonstrates that it is faster than TurboGraph for real-world graphs up to the scale of the Twitter graph (1.47 billion edges); however, NXGraph's runtime eventually exceeds TurboGraph's for larger graphs. This reversal at different graph sizes is indicated by the two directed edges between NXGraph and TurboGraph.
- NXGraph claims speedups over GridGraph and MMap; however, these experiments only run a single iteration of PageRank on a single graph.
- MOSAIC and Graphene are the first fully EM GPMs to support computations on graphs with 1 trillion edges.
- MOSAIC is the first major GPM to begin exploiting the NVMe architecture. GraFBoost follows suit, while CLIP makes leveraging NVMe optional.
- Graphene excludes preprocessing time from its performance comparison, simply claiming that competitor preprocessing time is "similar or longer".
- CLIP compares itself to MOSAIC on a commodity PC, without the massively parallel Xeon Phi processors MOSAIC was intended to leverage for its graph computation. The MOSAIC paper contains CPU-only tests which still show it is superior to GridGraph; however, it should be understood that CLIP has only been shown to be superior to this CPU-only version of MOSAIC, and that for massive graphs (>256 GB representation size) the full MOSAIC framework may or may not outperform CLIP.
- Though LUMOS does not explicitly implement a CC algorithm, the original paper shows that it performs label propagation faster than GridGraph. Label propagation is the basis for most CC algorithms implemented by GPMs, including GridGraph's (a sketch of label propagation is given after this list).
- Similarly, V-Part compares BFS instead of CC against GraFBoost; however, these similar "traversal" algorithms are essentially just different versions of the aforementioned label propagation and thus exhibit similar runtimes across GPMs.
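Since several of the comparisons above are phrased in terms of label propagation, here is a minimal in-memory sketch of label-propagation connected components, the common primitive referred to above; disk-based GPMs implement the same idea while streaming the edge set from disk rather than holding it in memory:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Label-propagation connected components: every vertex starts with its own
// label and repeatedly adopts the minimum label among itself and its
// neighbors; once no label changes, vertices sharing a label share a component.
std::vector<uint32_t> label_propagation_cc(
    uint32_t n, const std::vector<std::pair<uint32_t, uint32_t>>& edges) {
  std::vector<uint32_t> label(n);
  for (uint32_t v = 0; v < n; ++v) label[v] = v;
  bool changed = true;
  while (changed) {
    changed = false;
    for (const auto& e : edges) {
      uint32_t m = std::min(label[e.first], label[e.second]);
      if (label[e.first] != m) { label[e.first] = m; changed = true; }
      if (label[e.second] != m) { label[e.second] = m; changed = true; }
    }
  }
  return label;
}
```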
Below are the graph data sets used by some of the most popular GPMs. Most other GPMs evaluate themselves on some subset of these graphs as well.
| Graph Name | # Vertices | # Edges |
|---|---|---|
| live-journal | 4.8M | 69M |
| netflix | 0.5M | 99M |
| domain | 26M | 0.37B |
| twitter-2010 | 42M | 1.5B |
| uk-2007-05 | 106M | 3.7B |
| uk-union | 133M | 5.4B |
| yahoo-web | 1.4B | 6.6B |
| Graph Name | # Vertices | # Edges |
|---|---|---|
| netflix | 0.5M | 0.1B |
| twitter | 41.7M | 1.4B |
| Friendster | 65.6M | 1.8B |
| sk-2005 | 50.6M | 1.9B |
| yahoo-web | 1.4B | 6.6B |
The netflix graph is bipartite.
| Graph Name | # Vertices | # Edges |
|---|---|---|
| LiveJournal | 4.85M | 69M |
| Twitter | 61.6M | 1.47B |
| UK | 106M | 3.74B |
| Yahoo | 1.41B | 6.64B |
| Graph Name | # Vertices | # Edges |
|---|---|---|
| rmat24 | 16.8M | 0.3B |
| twitter | 41.6M | 1.5B |
| rmat27 | 134.2M | 2.1B |
| uk2007-05 | 105.8M | 3.7B |
| hyperlink14 | 1.7246B | 64.4B |
| rmat-trillion | 4.2949B | 1T |
Note: rmat data sets are synthetic graphs.
The experimental evidence supporting (almost) each edge in the graph may be found in the original publication of its destination node GPM.
- GraphChi
- X-Stream
- TurboGraph
- LigraChi-g
- MMap
- PrefEdge
- BPP
- GridGraph
- GPSA
- PathGraph
- DS-GraphChi (GraphChi with dynamic shards)
- VENUS
- NXGraph
- AsyncStripe
- Graphene
- MOSAIC
- GraFBoost (Soft)
- Wonderland
- CLIP
- V-Part
- LUMOS
- The edge from DS-GraphChi to X-Stream is supported by the DS-GraphChi paper.
- The edge from LigraChi-g to GridGraph is supported by the Wonderland paper.
The above "imporvement tree" is incomplete. The LaTeX source code used to generate the above improvement tree using the tikz package is provided here for inclusion in other documents or expansion of the tree to include more GPMs and experiments.
% Node style
\tikzstyle{textcontainer}=[fill=none, draw=black, shape=rectangle]
% Edge style
\tikzstyle{diredge}=[->]
\begin{tikzpicture}
\begin{pgfonlayer}{nodelayer}
\node [style=textcontainer] (0) at (0, 0) {GraphChi};
\node [style=textcontainer] (1) at (24, -12) {TurboGraph};
\node [style=textcontainer] (2) at (0, -12) {X-Stream};
\node [style=textcontainer] (4) at (-12, -12) {LigraChi-g};
\node [style=textcontainer] (5) at (24, -18) {MMap};
\node [style=textcontainer] (6) at (17.25, -4.25) {BPP};
\node [style=textcontainer] (7) at (-24, -12) {PrefEdge};
\node [style=textcontainer] (8) at (0, -24) {GridGraph};
\node [style=textcontainer] (9) at (-12, -24) {GPSA};
\node [style=textcontainer] (10) at (-24, -24) {PathGraph};
\node [style=textcontainer] (11) at (-3, -6) {DS-GraphChi};
\node [style=textcontainer] (12) at (12, -24) {VENUS};
\node [style=textcontainer] (13) at (12, -12) {NXGraph};
\node [style=textcontainer] (14) at (12, -36) {AsyncStripe};
\node [style=textcontainer] (16) at (0, -36) {MOSAIC};
\node [style=textcontainer] (17) at (-24, -36) {Graphene};
\node [style=textcontainer] (18) at (24, -24) {GraFBoost (Soft)};
\node [style=textcontainer] (19) at (-12, -36) {Wonderland};
\node [style=textcontainer] (20) at (0, -48) {CLIP};
\node [style=textcontainer] (21) at (24, -30) {V-Part};
\node [style=textcontainer] (22) at (24, -36) {LUMOS};
\end{pgfonlayer}
\begin{pgfonlayer}{edgelayer}
\draw [style=diredge] (0) to (1);
\draw [style=diredge] (0) to (2);
\draw [style=diredge] (0) to (4);
\draw [style=diredge] (1) to (5);
\draw [style=diredge] (0) to (6);
\draw [style=diredge] (0) to (7);
\draw [style=diredge] (2) to (8);
\draw [style=diredge] (2) to (9);
\draw [style=diredge] (2) to (10);
\draw [style=diredge] (0) to (11);
\draw [style=diredge] (11) to (2);
\draw [style=diredge] (2) to (12);
\draw [style=diredge, bend right=15] (1) to (13);
\draw [style=diredge, bend right=15] (13) to (1);
\draw [style=diredge] (0) to (13);
\draw [style=diredge] (8) to (14);
\draw [style=diredge] (8) to (16);
\draw [style=diredge] (8) to (17);
\draw [style=diredge] (2) to (18);
\draw [style=diredge] (4) to (8);
\draw [style=diredge] (8) to (19);
\draw [style=diredge] (16) to (20);
\draw [style=diredge] (18) to (21);
\draw [style=diredge] (8) to (22);
\end{pgfonlayer}
\end{tikzpicture}
Open the image in a new tab to click around in the graph (not working right now for some reason; you can click around locally, though).
Specifically, the commands I used to monitor the CPU activity on the system were:
perf record -F 99 -a -g -- [command (e.g. ./experiment or sleep)]
perf script | ./stackcollapse-perf.pl > out.perf-folded
./flamegraph.pl out.perf-folded > perf-kernel.svg
This records activity across the entire system while the command is running, so we can record a fixed time window, e.g. 60 seconds by passing sleep 60 as the command.




