Getting Access to our Datasets

Unless you already have the hardware to run the analysis yourself, you'll probably want to use the existing datasets we make available.

Access methods are being discussed in #19, but not all of them are available yet. These are the methods that currently exist:

Read before running queries:

If you plan to create query output larger than 50GB, do not store the results on the boot drive (/home, /tmp). Use scratch space instead. If you would like longer-term storage for a project, let us know and we'll create a unix group and shared folder for you.

If your query is expected to exceed ~1TB of output, please make sure you get our approval before running it!

Caladan: Please do not use more than 128GB of memory (1GB per hardware thread); otherwise ArangoDB performance will suffer massively or processes will get OOM-killed. (512GB might seem like a lot, but we have already accidentally OOM'd Caladan a few times...)
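One way to stay under the memory budget is to have your own analysis process cap itself before it starts loading data. The following is a minimal sketch, not something our servers require: it assumes a Python analysis script on Linux, reads "128GB" as 128 GiB, and uses helper names (memory_budget_bytes, cap_address_space) that are made up for this example.

```python
import resource

GIB = 1024 ** 3
# Budget stated above; treating "128GB" as 128 GiB is an assumption.
CALADAN_MEMORY_BUDGET_GIB = 128


def memory_budget_bytes(gib: int = CALADAN_MEMORY_BUDGET_GIB) -> int:
    """Convert a budget in GiB to bytes."""
    return gib * GIB


def cap_address_space(limit_bytes: int) -> int:
    """Set this process's soft address-space limit (RLIMIT_AS).

    The soft limit is never set above the existing hard limit.
    Returns the limit that was actually applied.
    """
    _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    if hard != resource.RLIM_INFINITY:
        limit_bytes = min(limit_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, hard))
    return limit_bytes
```

Calling cap_address_space(memory_budget_bytes()) at the top of a script makes oversized allocations fail with MemoryError inside your process instead of pushing the whole server into the OOM killer. Note that RLIMIT_AS limits virtual address space for a single process, so it is only a rough guard and does not account for several processes running at once.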

Our servers

Caladan

Caladan is our lab's compute server which is oriented towards massively parallel operations and large memory datasets, but does not have enough storage to fit a complete WST database derived from all the GitHub repos.

Thus, only the smaller datasets (Stars 1k+, Stars 100+, and others) are available there. In most cases we recommend running queries against these smaller datasets first to debug them and review performance before running them on a larger database hosted on Seneca.

Until other access methods are available, you'll have to contact us to have an account created and access to the database granted. We welcome anyone with a verifiable @*.edu email address and/or research association to contact us to request access. Currently you should email @robobenklein (project lead) and @azhenley (advisor & PAIRS director) with an introduction of yourself or your group!

Caladan is only accessible via SSH with key-based authentication. The address is caladan.eecs.utk.edu.

If you wish to run analysis code locally (you should have a very good internet connection, 200 Mbps or more), you can use SSH TCP forwarding to access Arango: ssh caladan -L 8529:localhost:8529
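With the tunnel open, ArangoDB's HTTP API is reachable on localhost:8529. As a quick connectivity check, the sketch below asks the server for its version using only the Python standard library and HTTP Basic authentication. The function names are hypothetical, and depending on your permissions you may need to put /_db/<your database> in front of the path, so treat it as a starting point rather than the required way to connect.

```python
import base64
import json
import urllib.request

# Local end of the SSH tunnel described above.
DEFAULT_BASE_URL = "http://localhost:8529"


def build_version_request(username: str, password: str,
                          base_url: str = DEFAULT_BASE_URL) -> urllib.request.Request:
    """Build a GET request for ArangoDB's /_api/version with Basic auth."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(f"{base_url}/_api/version")
    request.add_header("Authorization", f"Basic {token}")
    return request


def fetch_server_version(username: str, password: str,
                         base_url: str = DEFAULT_BASE_URL,
                         timeout: float = 10.0) -> dict:
    """Send the request through the tunnel and decode the JSON reply."""
    request = build_version_request(username, password, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

If fetch_server_version raises a connection error, the tunnel is probably not running; an HTTP 401 usually means the DB credentials are wrong.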

For those without a good internet connection, we recommend running queries directly on the server and storing any results or working files you might need in scratch space, which is located at /mnt/atreides/scratch. Please avoid using your home directory to store query output: the home directory drive is slower and smaller than the scratch space, which is mounted on an all-SSD array.

While logged in to Caladan you'll still need an account to access ArangoDB; your DB username and password will be given to you by one of our lab members.
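The storage rules on this page (50GB boot-drive limit, ~1TB approval threshold, scratch at /mnt/atreides/scratch) can be encoded in a small helper that a script runs before it starts writing. This is only an illustrative sketch: the function names, the per-user subdirectory layout under scratch, and the use of decimal gigabytes are all assumptions, not conventions we enforce.

```python
import shutil
from pathlib import Path

GB = 10 ** 9  # decimal gigabytes; an assumption about how the limits are meant
SCRATCH_ROOT = Path("/mnt/atreides/scratch")
BOOT_DRIVE_LIMIT = 50 * GB    # above this, never write to /home or /tmp
APPROVAL_LIMIT = 1000 * GB    # roughly 1TB; above this, ask us first


def storage_advice(expected_bytes: int) -> str:
    """Classify an expected output size against the limits on this page."""
    if expected_bytes > APPROVAL_LIMIT:
        return "approval-required"
    if expected_bytes > BOOT_DRIVE_LIMIT:
        return "scratch-required"
    return "scratch-recommended"


def scratch_output_path(username: str, filename: str,
                        root: Path = SCRATCH_ROOT) -> Path:
    """Build an output path under scratch (per-user layout is hypothetical)."""
    return root / username / filename


def has_enough_space(directory: str, expected_bytes: int) -> bool:
    """Check that the filesystem holding `directory` has room for the output."""
    return shutil.disk_usage(directory).free >= expected_bytes
```

For example, storage_advice(200 * GB) returns "scratch-required", and has_enough_space("/mnt/atreides/scratch", 200 * GB) tells you whether the array currently has room for it.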

Seneca

Seneca is the server oriented towards storing the largest dataset we can create with WST at the moment. Since it's oriented towards storage, its drives are spinning disks and it is much slower to query, so be sure that your queries run as intended on Caladan first!

Its setup is similar, but of course database names are different and it is a separate physical server. Caladan is directly connected to Seneca via multiple 10G links, so if you are outputting a query result that might be more than a few GB, you should write it to Caladan's faster scratch space instead of transferring it directly over the 1 Gbit internet connection during the run.
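When a result is large, it helps to stream it to a file in scratch one record at a time rather than collecting everything in memory first. The sketch below writes any iterable of records as JSON Lines; write_jsonl is a made-up helper name, and the choice of JSON Lines as the output format is an assumption for illustration.

```python
import json
from pathlib import Path
from typing import Iterable


def write_jsonl(records: Iterable[dict], destination: Path) -> int:
    """Stream records to `destination` as JSON Lines.

    Records are written one at a time, so memory use stays flat even
    when the iterable is a lazy cursor. Returns the number written.
    """
    destination.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with destination.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record))
            handle.write("\n")
            count += 1
    return count
```

Running this on Caladan with a destination under /mnt/atreides/scratch keeps the heavy transfer on the 10G links, and you can copy the finished file to your own machine afterwards.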
