Nashville Housing Data Cleaning Project
Project Overview
This project focuses on data cleaning using SQL to prepare the Nashville Housing dataset for further analysis. The goal was to transform messy, incomplete real-world housing data into a clean, well-structured format that would support accurate insights and business decision-making.
Business Objective
- Ensure data consistency and integrity for future housing market analysis.
- Address missing values, duplicate records, and formatting inconsistencies.
- Standardize address fields to enable easier geographic analysis.
- Prepare a clean and analysis-ready database.
Data Source
- Dataset: Nashville Housing Data (Excel file).
- Source: Provided as part of a portfolio data cleaning exercise.
Methodology & Step-by-Step Cleaning Process
Initial Exploration
- Queried the full dataset to assess structure, completeness, and potential data quality issues.
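A minimal sketch of this exploratory pass, assuming SQL Server and a table named NashvilleHousing (the table name is not stated in the source):

```sql
-- Preview the raw data to get a feel for columns and formats.
-- NashvilleHousing is an assumed table name for this sketch.
SELECT TOP (100) *
FROM NashvilleHousing;

-- Gauge completeness, e.g. how many PropertyAddress values are missing.
SELECT COUNT(*) AS TotalRows,
       SUM(CASE WHEN PropertyAddress IS NULL THEN 1 ELSE 0 END) AS MissingAddresses
FROM NashvilleHousing;
```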
Standardized Date Format
- Converted the SaleDate field from datetime to a date-only format for clarity and consistency.
- Created a new column SaleDateConverted.
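A minimal sketch of this step in T-SQL, assuming a table named NashvilleHousing:

```sql
-- Add the new date-only column (run this batch before the UPDATE below).
ALTER TABLE NashvilleHousing
ADD SaleDateConverted DATE;
GO

-- Populate it by stripping the time portion from SaleDate.
UPDATE NashvilleHousing
SET SaleDateConverted = CONVERT(DATE, SaleDate);
```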
Filled Missing Property Addresses
- Identified NULL values in PropertyAddress.
- Used a self-join on ParcelID to populate missing addresses from matching records, as sketched below.
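A minimal sketch of the self-join, assuming a table named NashvilleHousing and a UniqueID row identifier (both names are assumptions):

```sql
-- Rows that share a ParcelID should share a PropertyAddress, so a NULL
-- address can be copied from another row with the same ParcelID.
UPDATE a
SET a.PropertyAddress = ISNULL(a.PropertyAddress, b.PropertyAddress)
FROM NashvilleHousing AS a
JOIN NashvilleHousing AS b
    ON  a.ParcelID = b.ParcelID
    AND a.UniqueID <> b.UniqueID   -- UniqueID: assumed per-row identifier
WHERE a.PropertyAddress IS NULL;
```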
Split Property Address into Components
- Extracted street address and city into new columns: PropertySplitAddress and PropertySplitCity.
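A minimal sketch using SUBSTRING and CHARINDEX, assuming the address is stored as "street, city" and the table is named NashvilleHousing:

```sql
-- Add the target columns (run this batch before the UPDATE below).
ALTER TABLE NashvilleHousing
ADD PropertySplitAddress NVARCHAR(255),
    PropertySplitCity    NVARCHAR(255);
GO

-- Everything before the comma is the street address; everything after is the city.
UPDATE NashvilleHousing
SET PropertySplitAddress = SUBSTRING(PropertyAddress, 1, CHARINDEX(',', PropertyAddress) - 1),
    PropertySplitCity    = LTRIM(SUBSTRING(PropertyAddress, CHARINDEX(',', PropertyAddress) + 1, LEN(PropertyAddress)))
WHERE CHARINDEX(',', PropertyAddress) > 0;   -- skip rows without a comma
```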
Split Owner Address into Components
- Used the PARSENAME function to separate OwnerAddress into street address, city, and state, after replacing commas with periods since PARSENAME splits on periods (see the sketch below).
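A minimal sketch of the PARSENAME approach, assuming a table named NashvilleHousing; the OwnerSplitAddress, OwnerSplitCity, and OwnerSplitState column names are illustrative, not taken from the source:

```sql
-- Add the target columns (run this batch before the UPDATE below).
ALTER TABLE NashvilleHousing
ADD OwnerSplitAddress NVARCHAR(255),
    OwnerSplitCity    NVARCHAR(255),
    OwnerSplitState   NVARCHAR(255);
GO

-- PARSENAME splits on periods and counts parts from the right,
-- so commas are swapped for periods and part 3 = street, 2 = city, 1 = state.
UPDATE NashvilleHousing
SET OwnerSplitAddress = PARSENAME(REPLACE(OwnerAddress, ',', '.'), 3),
    OwnerSplitCity    = PARSENAME(REPLACE(OwnerAddress, ',', '.'), 2),
    OwnerSplitState   = PARSENAME(REPLACE(OwnerAddress, ',', '.'), 1);
```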
Standardized SoldAsVacant Values
- Standardized values in the SoldAsVacant field from 'Y'/'N' to 'Yes'/'No' for better readability.
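A minimal sketch using a CASE expression, assuming a table named NashvilleHousing:

```sql
-- Map the abbreviated codes to full words; leave existing 'Yes'/'No' values untouched.
UPDATE NashvilleHousing
SET SoldAsVacant = CASE
                       WHEN SoldAsVacant = 'Y' THEN 'Yes'
                       WHEN SoldAsVacant = 'N' THEN 'No'
                       ELSE SoldAsVacant
                   END;
```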
Removed Duplicate Records
- Identified duplicates using ROW_NUMBER() based on critical fields.
- Deleted duplicate entries while retaining the original record.
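A minimal sketch of the ROW_NUMBER() approach, assuming a table named NashvilleHousing; the columns in the PARTITION BY clause are illustrative, since the source only describes them as critical fields:

```sql
-- Number the rows within each group of potential duplicates;
-- row_num = 1 is kept, anything greater is deleted.
WITH RowNumCTE AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY ParcelID, PropertyAddress, SaleDate, SoldAsVacant
               ORDER BY ParcelID
           ) AS row_num
    FROM NashvilleHousing
)
DELETE FROM RowNumCTE
WHERE row_num > 1;
```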
Final Deliverables
- A cleaned Nashville Housing SQL table ready for analysis.
- Duplicates removed, missing addresses filled, addresses split into structured fields, and standardization applied across key categorical fields.
Data Limitations
- Some addresses with inconsistent formats were manually handled.
- Assumes ParcelID is a sufficiently reliable identifier for matching addresses across records.
Conclusion
This project demonstrates the importance of systematic data cleaning as a foundational step in the data analysis process. Through structured SQL techniques, including filling missing values, standardizing formats, decomposing addresses, and removing duplicates, the Nashville Housing dataset was transformed from a messy, inconsistent collection into a clean, reliable, and analysis-ready dataset.
Effective data cleaning ensures that any future analyses, visualizations, or business insights generated from this dataset will be accurate, trustworthy, and actionable. By following a disciplined approach, we have significantly enhanced the quality and usability of the data, laying the groundwork for informed decision-making in real estate market analyses, urban planning, and investment strategy.