Relationship Between Temperature, Homeless Encampments, and Criminal Activities in New York City (2018 - 2019)
Data Source 1: U.S. Local Climatological Data: NY CITY CENTRAL PARK
Input Location in HDFS: /user/temperature/hw7/input/temp2018.csv and /user/temperature/hw7/input/temp2019.csv
- In `profiling_code/temperature`, for initial profiling:
  - run `chmod +x initial_profiling.sh`
  - run `./initial_profiling.sh`
  - run `hdfs dfs -cat /user/temperature/hw7/outputPart1_1/part-r-00000` to check the result
- In `etl_code/temperature`, for initial cleaning:
  - run `chmod +x part2.sh`
  - run `./part2.sh`
- In `profiling_code/temperature`, for profiling of the initial cleaning:
  - run `chmod +x part1.sh`
  - run `./part1.sh`
  - run `hdfs dfs -cat /user/temperature/hw7/outputPart1_1/part-r-00000` to check the result
- Turn the result into a csv file:
  - run `hdfs dfs -mv /user/temperature/hw7/outputPart2/part-r-00000 /user/temperature/cleaned_temperature.csv`
- In `profiling_code/temperature`, for further cleaning and profiling in pySpark (a hedged sketch of this script follows below):
  - run `module load python/gcc/3.7.9`
  - run `PYTHONSTARTUP=temperature_profile_clean.py pyspark --deploy-mode client`
  - run `exit()`
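`temperature_profile_clean.py` itself is not reproduced in this README. The block below is only a minimal sketch of the kind of pySpark profiling and cleaning such a script might do; the column names (`date`, `temperature`), the daily-averaging step, and the pandas-based local write are assumptions for illustration, not the actual implementation.

```python
# Hypothetical sketch only -- NOT the actual temperature_profile_clean.py.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In the pyspark shell a SparkSession is already available as `spark`.
spark = SparkSession.builder.getOrCreate()

# Assumed layout of the cleaned MapReduce output: one date column, one temperature column.
df = (spark.read.csv("/user/temperature/cleaned_temperature.csv", inferSchema=True)
           .toDF("date", "temperature"))

# Profiling: row count, missing temperatures, and the observed value range.
df.select(F.count("*").alias("rows"),
          F.sum(F.col("temperature").isNull().cast("int")).alias("null_temps"),
          F.min("temperature").alias("min_temp"),
          F.max("temperature").alias("max_temp")).show()

# Cleaning: drop incomplete rows and keep one averaged reading per day for the later join.
daily = (df.dropna()
           .groupBy("date")
           .agg(F.round(F.avg("temperature"), 1).alias("avg_temperature")))

# Write a small local csv so it can be uploaded with `hdfs dfs -put` in the next step.
daily.toPandas().to_csv("temperature_joined_ready.csv", index=False)
```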
- Upload the resulting csv file `temperature_joined_ready.csv` to HDFS:
  - run `hdfs dfs -put temperature_joined_ready.csv /user/temperature`
- In `ana_code`, for analysis of the temperature dataset (an illustrative sketch follows below):
  - run `module load python/gcc/3.7.9`
  - run `PYTHONSTARTUP=temperature_ana.py pyspark --deploy-mode client`
Data Source 2: NYPD Criminal Court Summons (Historic)
Input Location in HDFS: /user/summon/hw7/input/NYPD_Criminal_Court_Summons__Historic_.csv
- In `profiling_code/summon`, for initial profiling:
  - run `chmod +x initial_profiling.sh`
  - run `./initial_profiling.sh`
  - run `hdfs dfs -cat /user/summon/hw7/output/part-r-00000` to check the result
- In `etl_code/summon`, for initial cleaning:
  - run `chmod +x part2.sh`
  - run `./part2.sh`
  - the result is in `/user/summon/hw7/output1/part-r-00000`
- Move the resulting data from the initial cleaning step to `/user/summon/hw7/input` as input for post-cleaning profiling:
  - run `hdfs dfs -mv hw7/output1/part-r-00000 hw7/input`
- Rename the data file:
  - run `hdfs dfs -mv hw7/input/part-r-00000 hw7/input/cleaned_summon.csv`
- In `profiling_code/summon`, for profiling of the cleaned data:
  - run `chmod +x part1.sh`
  - run `./part1.sh`
  - run `hdfs dfs -cat /user/summon/hw7/output_on_cleaned/part-r-00000` to check the result
- Move the resulting csv file (moved and renamed above) to `/user/summon/hw8` on HDFS for the next step:
  - run `hdfs dfs -rm /user/summon/hw8/cleaned_summon.csv` to remove any previous copy
  - run `hdfs dfs -mv /user/summon/hw7/input/cleaned_summon.csv /user/summon/hw8`
- In `profiling_code/summon`, for further cleaning and profiling in pySpark (a hedged sketch follows below):
  - run `module load python/gcc/3.7.9`
  - run `PYTHONSTARTUP=summon_profile_clean.py pyspark --deploy-mode client`
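`summon_profile_clean.py` is likewise not reproduced here; the sketch below shows one plausible shape for it. The column names (`summons_date`, `offense`, `borough`) and the `MM/dd/yyyy` date format are assumptions about the cleaned summons extract.

```python
# Hypothetical sketch only -- NOT the actual summon_profile_clean.py.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed layout of the cleaned summons file.
raw = (spark.read.csv("/user/summon/hw8/cleaned_summon.csv", inferSchema=True)
            .toDF("summons_date", "offense", "borough"))

# Profiling: count missing values per column.
raw.select([F.sum(F.col(c).isNull().cast("int")).alias(c + "_nulls")
            for c in raw.columns]).show()

# Cleaning: parse the date, keep 2018-2019 only, and reduce to one row per day.
daily = (raw.withColumn("date", F.to_date("summons_date", "MM/dd/yyyy"))
            .filter(F.col("date").between("2018-01-01", "2019-12-31"))
            .groupBy("date")
            .agg(F.count("*").alias("summons_count")))

# Local csv, uploaded afterwards with `hdfs dfs -put`.
daily.toPandas().to_csv("summon_joined_ready.csv", index=False)
```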
- Exit Spark, then upload the resulting csv file `summon_joined_ready.csv` to HDFS:
  - run `hdfs dfs -put summon_joined_ready.csv /user/summon`

  [Using summon_joined_ready.csv, obtain merged_table.csv in the join-table step below]
- In `ana_code`, for analysis of the summon dataset alone (an illustrative sketch follows below):
  - run `module load python/gcc/3.7.9`
  - run `PYTHONSTARTUP=summon_ana.py pyspark --deploy-mode client`
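A hedged sketch of a standalone analysis `summon_ana.py` might run, assuming the merge-ready file carries `date` and `summons_count` columns:

```python
# Hypothetical sketch only -- NOT the actual summon_ana.py.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

summons = spark.read.csv("/user/summon/summon_joined_ready.csv",
                         header=True, inferSchema=True)

# Example analysis: total summonses per month, to see seasonality before joining with temperature.
(summons.withColumn("month", F.date_format(F.to_date("date"), "yyyy-MM"))
        .groupBy("month")
        .agg(F.sum("summons_count").alias("total_summons"))
        .orderBy("month")
        .show(24))
```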
Data Source 3: NYC OpenData: Homeless Encampments
Input Location in HDFS: /user/homeless/hw8/input/Homeless_Encampments.csv
- In `profiling_code/homeless`, for initial profiling:
  - run `chmod +x initial_profiling.sh`
  - run `./initial_profiling.sh`
- In `etl_code/homeless`, for initial cleaning:
  - run `chmod +x clean_dataset.sh`
  - run `./clean_dataset.sh`
- In `profiling_code/homeless`, for profiling of the initial cleaning:
  - run `hdfs dfs -mv /user/homeless/hw8/output/clean/part-r-00000 /user/homeless/hw8/input/cleaned_homeless.csv`
  - run `chmod +x sec_profiling.sh`
  - run `./sec_profiling.sh`
- In `profiling_code/homeless`, for further cleaning and profiling in pySpark (a hedged sketch follows below):
  - run `module load python/gcc/3.7.9`
  - run `PYTHONSTARTUP=homeless_profile_clean.py pyspark --deploy-mode client`
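As with the other two sources, `homeless_profile_clean.py` is not shown here; below is a hypothetical sketch. The 311-style created-date column, its `MM/dd/yyyy hh:mm:ss a` timestamp format, and the `borough` column are all assumptions.

```python
# Hypothetical sketch only -- NOT the actual homeless_profile_clean.py.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed layout of the cleaned 311 encampment extract.
raw = (spark.read.csv("/user/homeless/hw8/input/cleaned_homeless.csv", inferSchema=True)
            .toDF("created_date", "borough"))

# Cleaning: strip the time-of-day from the 311 timestamp, keep only 2018-2019,
# then reduce to one encampment-report count per calendar day.
daily = (raw.withColumn("date",
                        F.to_date(F.to_timestamp("created_date", "MM/dd/yyyy hh:mm:ss a")))
            .filter(F.col("date").between("2018-01-01", "2019-12-31"))
            .groupBy("date")
            .agg(F.count("*").alias("encampment_count")))

# Profiling: how many distinct days have at least one report.
print("days with encampment reports:", daily.count())

# Local csv, uploaded afterwards with `hdfs dfs -put`.
daily.toPandas().to_csv("homeless_joined_ready.csv", index=False)
```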
- Upload the resulting csv file `homeless_joined_ready.csv` to HDFS:
  - run `hdfs dfs -put homeless_joined_ready.csv /user/homeless`
- In `ana_code`, for analysis of the homeless dataset alone (an illustrative sketch follows below):
  - run `module load python/gcc/3.7.9`
  - run `PYTHONSTARTUP=homeless_anal.py pyspark --deploy-mode client`
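A hedged sketch of what `homeless_anal.py` might compute on its own, assuming `date` and `encampment_count` columns:

```python
# Hypothetical sketch only -- NOT the actual homeless_anal.py.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

enc = spark.read.csv("/user/homeless/homeless_joined_ready.csv",
                     header=True, inferSchema=True)

# Example analysis: average daily encampment reports by season,
# a first look at whether reports rise with warmer weather.
season = (F.when(F.month("date").isin(12, 1, 2), "winter")
           .when(F.month("date").isin(3, 4, 5), "spring")
           .when(F.month("date").isin(6, 7, 8), "summer")
           .otherwise("fall"))

(enc.withColumn("season", season)
    .groupBy("season")
    .agg(F.round(F.avg("encampment_count"), 1).alias("avg_daily_reports"))
    .show())
```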
This part is done after the individual merge-ready datasets are uploaded to the HDFS of user temperature.
- In `data_ingest`:
  - run `beeline`
  - run `!connect jdbc:hive2://hm-1.hpc.nyu.edu:10000/` and enter your username and password
  - run `use <netid>`
  - run the commands in `hive_command.sql` to merge the tables (an illustrative sketch of the join follows below)
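The actual merge lives in `hive_command.sql` and runs in Hive. Purely as an illustration of the kind of date join it presumably performs, here is an equivalent sketch in PySpark SQL; the view names, column names, and HDFS paths are assumptions.

```python
# Illustration only -- the real merge is the Hive code in hive_command.sql.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Register the three merge-ready csv files (paths assumed) as temp views.
for name, path in [("temperature", "/user/temperature/temperature_joined_ready.csv"),
                   ("summons", "/user/temperature/summon_joined_ready.csv"),
                   ("encampments", "/user/temperature/homeless_joined_ready.csv")]:
    spark.read.csv(path, header=True, inferSchema=True).createOrReplaceTempView(name)

# Inner join of all three sources on the shared date key.
merged = spark.sql("""
    SELECT t.date, t.avg_temperature, s.summons_count, e.encampment_count
    FROM temperature t
    JOIN summons s     ON t.date = s.date
    JOIN encampments e ON t.date = e.date
""")
merged.show(5)
```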
- In Peel, to export the Hive table as a csv file:
  - run `beeline -u jdbc:hive2://hm-1.hpc.nyu.edu:10000/ -n temperature --outputformat=csv2 --showHeader=false -e 'use temperature; select * from merged' | sed 's/[\\t]/,/g' > merged_table.csv`
- Upload the resulting csv file `merged_table.csv` to HDFS:
  - run `hdfs dfs -put merged_table.csv /user/temperature`
- In `ana_code`, for analysis of the merged datasets (an illustrative sketch follows below):
  - run `module load python/gcc/3.7.9`
  - run `PYTHONSTARTUP=merged_ana.py pyspark --deploy-mode client`
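Finally, a hedged sketch of the kind of question `merged_ana.py` can answer on the merged table; since the beeline export above used `--showHeader=false`, the column names below are assigned in the script and are assumptions about the column order.

```python
# Hypothetical sketch only -- NOT the actual merged_ana.py.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The export has no header row, so name the columns here (assumed order).
merged = (spark.read.csv("/user/temperature/merged_table.csv", inferSchema=True)
               .toDF("date", "avg_temperature", "summons_count", "encampment_count"))

# Example analysis: Pearson correlation of daily temperature with each activity measure.
print("temperature vs summonses:  ", merged.stat.corr("avg_temperature", "summons_count"))
print("temperature vs encampments:", merged.stat.corr("avg_temperature", "encampment_count"))
```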