Relationship Between Temperature, Homeless Encampments, and Criminal Activities in New York City (2018 - 2019)
Data Source 1: U.S. Local Climatological Data: NY CITY CENTRAL PARK
Input Location in HDFS: /user/temperature/hw7/input/temp2018.csv and /user/temperature/hw7/input/temp2019.csv
- In `profiling_code/temperature`, for initial profiling:
  - run `chmod +x initial_profiling.sh`
  - run `./initial_profiling.sh`
  - run `hdfs dfs -cat /user/temperature/hw7/outputPart1_1/part-r-00000` to check the result
- In `etl_code/temperature`, for initial cleaning:
  - run `chmod +x part2.sh`
  - run `./part2.sh`
- In `profiling_code/temperature`, for profiling of the initial cleaning:
  - run `chmod +x part1.sh`
  - run `./part1.sh`
  - run `hdfs dfs -cat /user/temperature/hw7/outputPart1_1/part-r-00000` to check the result
- Turn the result into a csv file:
  - run `hdfs dfs -mv /user/temperature/hw7/outputPart2/part-r-00000 /user/temperature/cleaned_temperature.csv`
- In `profiling_code/temperature`, for further cleaning and profiling in pySpark (a hedged sketch of this script follows below):
  - run `module load python/gcc/3.7.9`
  - run `PYTHONSTARTUP=temperature_profile_clean.py pyspark --deploy-mode client`
  - run `exit()`
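`temperature_profile_clean.py` itself is not reproduced in this README. The block below is only a minimal sketch of the kind of pySpark profiling and cleaning such a script might do; the column names (`date`, `temperature`), the daily-averaging step, and the pandas-based local write are assumptions for illustration, not the actual implementation.

```python
# Hypothetical sketch only -- NOT the actual temperature_profile_clean.py.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In the pyspark shell a SparkSession is already available as `spark`.
spark = SparkSession.builder.getOrCreate()

# Assumed layout of the cleaned MapReduce output: one date column, one temperature column.
df = (spark.read.csv("/user/temperature/cleaned_temperature.csv", inferSchema=True)
           .toDF("date", "temperature"))

# Profiling: row count, missing temperatures, and the observed value range.
df.select(F.count("*").alias("rows"),
          F.sum(F.col("temperature").isNull().cast("int")).alias("null_temps"),
          F.min("temperature").alias("min_temp"),
          F.max("temperature").alias("max_temp")).show()

# Cleaning: drop incomplete rows and keep one averaged reading per day for the later join.
daily = (df.dropna()
           .groupBy("date")
           .agg(F.round(F.avg("temperature"), 1).alias("avg_temperature")))

# Write a small local csv so it can be uploaded with `hdfs dfs -put` in the next step.
daily.toPandas().to_csv("temperature_joined_ready.csv", index=False)
```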
- Upload the resulting csv file `temperature_joined_ready.csv` to HDFS:
  - run `hdfs dfs -put temperature_joined_ready.csv /user/temperature`
- In `ana_code`, for analysis of the temperature dataset (an illustrative sketch follows below):
  - run `module load python/gcc/3.7.9`
  - run `PYTHONSTARTUP=temperature_ana.py pyspark --deploy-mode client`
Data Source 2: NYPD Criminal Court Summons (Historic)
Input Location in HDFS: /user/summon/hw7/input/NYPD_Criminal_Court_Summons__Historic_.csv
- In `profiling_code/summon`, for initial profiling:
  - run `chmod +x initial_profiling.sh`
  - run `./initial_profiling.sh`
  - run `hdfs dfs -cat /user/summon/hw7/output/part-r-00000` to check the result
- In `etl_code/summon`, for initial cleaning:
  - run `chmod +x part2.sh`
  - run `./part2.sh`
  - the result is in `/user/summon/hw7/output1/part-r-00000`
- Move the resulting data from the initial cleaning step to `/user/summon/hw7/input` as input for post-cleaning profiling:
  - run `hdfs dfs -mv hw7/output1/part-r-00000 hw7/input`
- Rename the data file:
  - run `hdfs dfs -mv hw7/input/part-r-00000 hw7/input/cleaned_summon.csv`
- In `profiling_code/summon`, for profiling of the cleaned data:
  - run `chmod +x part1.sh`
  - run `./part1.sh`
  - run `hdfs dfs -cat /user/summon/hw7/output_on_cleaned/part-r-00000` to check the result
- Move the resulting csv file (moved and renamed above) to `/user/summon/hw8` on HDFS for the next step:
  - run `hdfs dfs -rm /user/summon/hw8/cleaned_summon.csv` to remove any previous copy
  - run `hdfs dfs -mv /user/summon/hw7/input/cleaned_summon.csv /user/summon/hw8`
- In `profiling_code/summon`, for further cleaning and profiling in pySpark (a hedged sketch follows below):
  - run `module load python/gcc/3.7.9`
  - run `PYTHONSTARTUP=summon_profile_clean.py pyspark --deploy-mode client`
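`summon_profile_clean.py` is likewise not reproduced here; the sketch below shows one plausible shape for it. The column names (`summons_date`, `offense`, `borough`) and the `MM/dd/yyyy` date format are assumptions about the cleaned summons extract.

```python
# Hypothetical sketch only -- NOT the actual summon_profile_clean.py.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed layout of the cleaned summons file.
raw = (spark.read.csv("/user/summon/hw8/cleaned_summon.csv", inferSchema=True)
            .toDF("summons_date", "offense", "borough"))

# Profiling: count missing values per column.
raw.select([F.sum(F.col(c).isNull().cast("int")).alias(c + "_nulls")
            for c in raw.columns]).show()

# Cleaning: parse the date, keep 2018-2019 only, and reduce to one row per day.
daily = (raw.withColumn("date", F.to_date("summons_date", "MM/dd/yyyy"))
            .filter(F.col("date").between("2018-01-01", "2019-12-31"))
            .groupBy("date")
            .agg(F.count("*").alias("summons_count")))

# Local csv, uploaded afterwards with `hdfs dfs -put`.
daily.toPandas().to_csv("summon_joined_ready.csv", index=False)
```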
- Exit Spark, then upload the resulting csv file `summon_joined_ready.csv` to HDFS:
  - run `hdfs dfs -put summon_joined_ready.csv /user/summon`

  [Using summon_joined_ready.csv, obtain merged_table.csv in the join-table step below]
- In `ana_code`, for analysis of the summon dataset alone (an illustrative sketch follows below):
  - run `module load python/gcc/3.7.9`
  - run `PYTHONSTARTUP=summon_ana.py pyspark --deploy-mode client`
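A hedged sketch of a standalone analysis `summon_ana.py` might run, assuming the merge-ready file carries `date` and `summons_count` columns:

```python
# Hypothetical sketch only -- NOT the actual summon_ana.py.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

summons = spark.read.csv("/user/summon/summon_joined_ready.csv",
                         header=True, inferSchema=True)

# Example analysis: total summonses per month, to see seasonality before joining with temperature.
(summons.withColumn("month", F.date_format(F.to_date("date"), "yyyy-MM"))
        .groupBy("month")
        .agg(F.sum("summons_count").alias("total_summons"))
        .orderBy("month")
        .show(24))
```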
Data Source 3: NYC OpenData: Homeless Encampments
Input Location in HDFS: /user/homeless/hw8/input/Homeless_Encampments.csv
- In `profiling_code/homeless`, for initial profiling:
  - run `chmod +x initial_profiling.sh`
  - run `./initial_profiling.sh`
- In `etl_code/homeless`, for initial cleaning:
  - run `chmod +x clean_dataset.sh`
  - run `./clean_dataset.sh`
- In `profiling_code/homeless`, for profiling of the initial cleaning:
  - run `hdfs dfs -mv /user/homeless/hw8/output/clean/part-r-00000 /user/homeless/hw8/input/cleaned_homeless.csv`
  - run `chmod +x sec_profiling.sh`
  - run `./sec_profiling.sh`
- In `profiling_code/homeless`, for further cleaning and profiling in pySpark (a hedged sketch follows below):
  - run `module load python/gcc/3.7.9`
  - run `PYTHONSTARTUP=homeless_profile_clean.py pyspark --deploy-mode client`
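As with the other two sources, `homeless_profile_clean.py` is not shown here; below is a hypothetical sketch. The 311-style created-date column, its `MM/dd/yyyy hh:mm:ss a` timestamp format, and the `borough` column are all assumptions.

```python
# Hypothetical sketch only -- NOT the actual homeless_profile_clean.py.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed layout of the cleaned 311 encampment extract.
raw = (spark.read.csv("/user/homeless/hw8/input/cleaned_homeless.csv", inferSchema=True)
            .toDF("created_date", "borough"))

# Cleaning: strip the time-of-day from the 311 timestamp, keep only 2018-2019,
# then reduce to one encampment-report count per calendar day.
daily = (raw.withColumn("date",
                        F.to_date(F.to_timestamp("created_date", "MM/dd/yyyy hh:mm:ss a")))
            .filter(F.col("date").between("2018-01-01", "2019-12-31"))
            .groupBy("date")
            .agg(F.count("*").alias("encampment_count")))

# Profiling: how many distinct days have at least one report.
print("days with encampment reports:", daily.count())

# Local csv, uploaded afterwards with `hdfs dfs -put`.
daily.toPandas().to_csv("homeless_joined_ready.csv", index=False)
```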
- Upload the resulting csv file `homeless_joined_ready.csv` to HDFS:
  - run `hdfs dfs -put homeless_joined_ready.csv /user/homeless`
- In `ana_code`, for analysis of the homeless dataset alone (an illustrative sketch follows below):
  - run `module load python/gcc/3.7.9`
  - run `PYTHONSTARTUP=homeless_anal.py pyspark --deploy-mode client`
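A hedged sketch of what `homeless_anal.py` might compute on its own, assuming `date` and `encampment_count` columns:

```python
# Hypothetical sketch only -- NOT the actual homeless_anal.py.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

enc = spark.read.csv("/user/homeless/homeless_joined_ready.csv",
                     header=True, inferSchema=True)

# Example analysis: average daily encampment reports by season,
# a first look at whether reports rise with warmer weather.
season = (F.when(F.month("date").isin(12, 1, 2), "winter")
           .when(F.month("date").isin(3, 4, 5), "spring")
           .when(F.month("date").isin(6, 7, 8), "summer")
           .otherwise("fall"))

(enc.withColumn("season", season)
    .groupBy("season")
    .agg(F.round(F.avg("encampment_count"), 1).alias("avg_daily_reports"))
    .show())
```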
This part is done after the individual merge-ready datasets are uploaded to the HDFS of user temperature.
- In `data_ingest`:
  - run `beeline`
  - run `!connect jdbc:hive2://hm-1.hpc.nyu.edu:10000/` and enter your username and password
  - run `use <netid>`
  - run the commands in `hive_command.sql` to merge the tables (an illustrative sketch of the join follows below)
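The actual merge lives in `hive_command.sql` and runs in Hive. Purely as an illustration of the kind of date join it presumably performs, here is an equivalent sketch in PySpark SQL; the view names, column names, and HDFS paths are assumptions.

```python
# Illustration only -- the real merge is the Hive code in hive_command.sql.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Register the three merge-ready csv files (paths assumed) as temp views.
for name, path in [("temperature", "/user/temperature/temperature_joined_ready.csv"),
                   ("summons", "/user/temperature/summon_joined_ready.csv"),
                   ("encampments", "/user/temperature/homeless_joined_ready.csv")]:
    spark.read.csv(path, header=True, inferSchema=True).createOrReplaceTempView(name)

# Inner join of all three sources on the shared date key.
merged = spark.sql("""
    SELECT t.date, t.avg_temperature, s.summons_count, e.encampment_count
    FROM temperature t
    JOIN summons s     ON t.date = s.date
    JOIN encampments e ON t.date = e.date
""")
merged.show(5)
```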
- In Peel, to export the Hive table as a csv file:
  - run `beeline -u jdbc:hive2://hm-1.hpc.nyu.edu:10000/ -n temperature --outputformat=csv2 --showHeader=false -e 'use temperature; select * from merged' | sed 's/[\\t]/,/g' > merged_table.csv`
- Upload the resulting csv file `merged_table.csv` to HDFS:
  - run `hdfs dfs -put merged_table.csv /user/temperature`
- In `ana_code`, for analysis of the merged datasets (an illustrative sketch follows below):
  - run `module load python/gcc/3.7.9`
  - run `PYTHONSTARTUP=merged_ana.py pyspark --deploy-mode client`
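Finally, a hedged sketch of the kind of question `merged_ana.py` can answer on the merged table; since the beeline export above used `--showHeader=false`, the column names below are assigned in the script and are assumptions about the column order.

```python
# Hypothetical sketch only -- NOT the actual merged_ana.py.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The export has no header row, so name the columns here (assumed order).
merged = (spark.read.csv("/user/temperature/merged_table.csv", inferSchema=True)
               .toDF("date", "avg_temperature", "summons_count", "encampment_count"))

# Example analysis: Pearson correlation of daily temperature with each activity measure.
print("temperature vs summonses:  ", merged.stat.corr("avg_temperature", "summons_count"))
print("temperature vs encampments:", merged.stat.corr("avg_temperature", "encampment_count"))
```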