A comprehensive data science toolkit for analyzing water quality parameters, performing statistical analysis, and building predictive models.
Supports exploratory data analysis, correlation studies, and machine learning-based water quality assessment.
Professional-grade water quality monitoring solution for researchers and environmental scientists.
🌊 Advancing water quality monitoring through data science. Built for environmental research and monitoring.
Tip
This project demonstrates advanced data analysis techniques for environmental monitoring, combining statistical analysis with machine learning for water quality assessment.
The water quality dataset contains 500 samples. The five measured parameters are summarized below:
Parameter | Range | Unit | Description |
---|---|---|---|
pH | 6.83 - 7.48 | pH units | Acidity/alkalinity measure |
Temperature | 20.3 - 23.6 | °C | Water temperature |
Turbidity | 3.1 - 5.1 | NTU | Water clarity measure |
Dissolved Oxygen | 6.0 - 9.9 | mg/L | Oxygen content in water |
Conductivity | 316 - 370 | µS/cm | Electrical conductivity |
Note
All parameters are within acceptable ranges for most water quality standards, making this dataset ideal for correlation and predictive modeling studies.
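The ranges in the table can double as a sanity check when new data is loaded. The following is a minimal sketch, assuming the CSV headers match the column names used in the usage examples of this README; `EXPECTED_RANGES` and `find_out_of_range` are hypothetical names introduced here for illustration, not part of the project.

```python
import pandas as pd

# Ranges copied from the parameter table; the column names are assumed
# to match the CSV headers used elsewhere in this README.
EXPECTED_RANGES = {
    "pH": (6.83, 7.48),
    "Temperature (°C)": (20.3, 23.6),
    "Turbidity (NTU)": (3.1, 5.1),
    "Dissolved Oxygen (mg/L)": (6.0, 9.9),
    "Conductivity (µS/cm)": (316, 370),
}


def find_out_of_range(df: pd.DataFrame, ranges: dict = EXPECTED_RANGES) -> pd.DataFrame:
    """Return the rows where any listed column falls outside its range."""
    mask = pd.Series(False, index=df.index)
    for column, (low, high) in ranges.items():
        if column in df.columns:
            # between() is inclusive; NaN compares False, so missing
            # values are flagged as out of range too.
            mask |= ~df[column].between(low, high)
    return df[mask]
```

Calling `find_out_of_range(df)` on a frame whose values all sit inside the documented ranges returns an empty DataFrame.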
📈 Statistical Summary
Statistical Summary of Water Quality Parameters:
Statistic | pH | Temperature (°C) | Turbidity (NTU) | Dissolved Oxygen (mg/L) | Conductivity (µS/cm) |
---|---|---|---|---|---|
count | 500.000000 | 500.000000 | 500.000000 | 500.000000 | 500.000000 |
mean | 7.161140 | 22.054400 | 4.169400 | 8.382200 | 344.362000 |
std | 0.107531 | 0.903123 | 0.397492 | 0.822396 | 13.038672 |
min | 6.830000 | 20.300000 | 3.100000 | 6.000000 | 316.000000 |
25% | 7.080000 | 21.200000 | 3.800000 | 7.800000 | 333.000000 |
50% | 7.160000 | 22.200000 | 4.200000 | 8.400000 | 344.000000 |
75% | 7.250000 | 22.900000 | 4.500000 | 9.100000 | 355.000000 |
max | 7.480000 | 23.600000 | 5.100000 | 9.900000 | 370.000000 |
Advanced statistical analysis capabilities including descriptive statistics, correlation analysis, and data quality assessment.
Core Analytics:
- 📈 Statistical Summaries: Complete descriptive statistics for all parameters
- 🔍 Data Quality Assessment: Missing value detection and data type analysis
- 📊 Distribution Analysis: Understanding parameter distributions and outliers
- 🎯 Correlation Studies: Strong correlation discovered (r=0.705) between pH and dissolved oxygen
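The data quality and outlier checks listed above could be sketched as follows. This is an illustration only: `data_quality_report` is a hypothetical helper, and the 1.5 × IQR rule is one common outlier convention, not necessarily the one used in the notebook.

```python
import pandas as pd


def data_quality_report(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing values, and IQR outliers per numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Tukey's rule: flag points more than 1.5 * IQR beyond the quartiles
    below = numeric.lt(q1 - 1.5 * iqr, axis="columns")
    above = numeric.gt(q3 + 1.5 * iqr, axis="columns")
    return pd.DataFrame({
        "dtype": numeric.dtypes.astype(str),
        "missing": numeric.isna().sum(),
        "outliers": (below | above).sum(),
    })
```

The result has one row per numeric column, so it can be printed next to `df.describe()` for a quick overview.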
Professional-grade visualizations for water quality parameter analysis and reporting.
Visualization Suite:
- 📊 Distribution Plots: Histograms with KDE curves for parameter distributions
- 🔗 Correlation Matrices: Heatmaps showing parameter relationships
- 📈 Scatter Plots: Relationship analysis with trend lines
- 🎭 Pair Plots: Comprehensive multi-parameter visualization
- 📉 Regression Plots: Linear relationship visualization with confidence intervals
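As an illustration of the correlation heatmap mentioned in the list above, here is a small sketch. `plot_correlation_heatmap` is a hypothetical helper; the color map and figure size are arbitrary choices, and any numeric identifier column in the CSV would need to be dropped first.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_correlation_heatmap(df: pd.DataFrame):
    """Draw an annotated heatmap of Pearson correlations between numeric columns."""
    corr = df.select_dtypes(include="number").corr()
    fig, ax = plt.subplots(figsize=(8, 6))
    # Fix the color scale to [-1, 1] so colors are comparable across datasets
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
    ax.set_title("Correlation Matrix of Water Quality Parameters")
    fig.tight_layout()
    return fig, corr
```

In a notebook, `fig, corr = plot_correlation_heatmap(df)` displays the figure and also returns the matrix for further inspection.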
Machine learning models for water quality parameter prediction and assessment.
ML Capabilities:
- 🎯 Linear Regression: Predict conductivity using multiple parameters
- 📊 Statistical Modeling: OLS regression with detailed statistical summaries
- 🔮 Parameter Prediction: Multi-feature models for water quality forecasting
- 📈 Model Evaluation: Comprehensive model performance assessment
- 🐍 Python-Powered: Built with pandas, scikit-learn, and seaborn
- 📓 Jupyter Integration: Interactive analysis environment
- 📊 Statistical Analysis: Advanced statistical modeling with statsmodels
- 🎨 Professional Plots: Publication-ready visualizations
- 🔍 Exploratory Data Analysis: Comprehensive EDA workflow
- 📈 Trend Analysis: Time-series and parameter trend identification
- 📋 Automated Reports: Structured analysis output
- 🔬 Research-Ready: Suitable for academic and professional research
Core Libraries:
- Python 3.x: Main programming language
- Pandas: Data manipulation and analysis
- NumPy: Numerical computing foundation
- Matplotlib: Static plotting and visualization
- Seaborn: Statistical data visualization
Advanced Analytics:
- Scikit-learn: Machine learning algorithms
- Statsmodels: Statistical modeling and testing
- Plotly: Interactive visualizations
- Jupyter: Interactive development environment
Visualization Stack:
- Seaborn: Statistical plotting with beautiful defaults
- Matplotlib: Publication-quality figures
- Plotly: Interactive and dynamic charts
graph TB
subgraph "Data Input"
A[Water Quality CSV] --> B[Load Dataset]
B --> C[Data Inspection]
end
subgraph "Exploratory Analysis"
C --> D[Statistical Summary]
D --> E[Distribution Analysis]
E --> F[Correlation Analysis]
end
subgraph "Visualization"
F --> G[Histograms & KDE]
G --> H[Scatter Plots]
H --> I[Pair Plots]
I --> J[Regression Plots]
end
subgraph "Modeling"
J --> K[Linear Regression]
K --> L[Statistical Models]
L --> M[Predictions]
M --> N[Model Evaluation]
end
subgraph "Results"
N --> O[Insights & Reports]
O --> P[Visualized Results]
end
graph LR
subgraph "Parameter Relationships"
A[pH: 7.16 ± 0.11] --> B[Strong Correlation]
B --> C[Dissolved O₂: 8.38 ± 0.82]
D[Temperature: 22.1 ± 0.9°C] --> E[Weak Correlation]
E --> F[Conductivity: 344 ± 13 µS/cm]
G[Turbidity: 4.17 ± 0.40 NTU] --> H[Moderate Correlation]
H --> I[Multiple Parameters]
end
Important
Ensure you have Python 3.x, pip, and Git installed.
1. Clone Repository
git clone https://github.com/ChanMeng666/water-quality-testing-data-analysis.git
cd water-quality-testing-data-analysis
2. Install Dependencies
# Install required packages
pip install pandas numpy matplotlib seaborn scikit-learn statsmodels plotly jupyter
3. Launch Jupyter
# Start Jupyter Notebook
jupyter notebook
# Or use JupyterLab
jupyter lab
4. Open Analysis
Open my_notebook.ipynb and run all cells to reproduce the analysis.
🎉 Success! You can now explore the water quality analysis.
# Create requirements.txt
echo "pandas>=1.3.0
numpy>=1.21.0
matplotlib>=3.4.0
seaborn>=0.11.0
scikit-learn>=1.0.0
statsmodels>=0.12.0
plotly>=5.0.0
jupyter>=1.0.0" > requirements.txt
# Install all dependencies
pip install -r requirements.txt
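To confirm that the installation succeeded, the installed versions can be listed from Python. A minimal sketch using only the standard library; `installed_versions` is a hypothetical helper, and the package list simply mirrors the requirements above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they appear in requirements.txt
REQUIRED = ["pandas", "numpy", "matplotlib", "seaborn",
            "scikit-learn", "statsmodels", "plotly", "jupyter"]


def installed_versions(packages=REQUIRED) -> dict:
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name:<14} {ver or 'NOT INSTALLED'}")
```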
1. Load and Explore Data:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Load dataset
df = pd.read_csv('Water Quality Testing.csv')
print(df.head())
print(df.describe())
2. Statistical Analysis:
# Correlation analysis
correlation = df['pH'].corr(df['Dissolved Oxygen (mg/L)'])
print(f'pH-DO Correlation: {correlation:.4f}')
# Distribution analysis
plt.figure(figsize=(10, 6))
sns.histplot(df['pH'], kde=True, bins=30)
plt.title('Distribution of pH Values')
plt.show()
3. Visualization:
# Scatter plot with trend line
plt.figure(figsize=(10, 6))
sns.regplot(x='pH', y='Dissolved Oxygen (mg/L)', data=df)
plt.title('pH vs Dissolved Oxygen Relationship')
plt.show()
Predictive Modeling:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
# Prepare features and target
features = ['pH', 'Temperature (°C)', 'Turbidity (NTU)', 'Dissolved Oxygen (mg/L)']
X = df[features]
y = df['Conductivity (µS/cm)']
# Train model
model = LinearRegression()
model.fit(X, y)
# Make predictions on the training data (in-sample, so the R² below is optimistic)
predictions = model.predict(X)
r2 = r2_score(y, predictions)
print(f'Model R² Score: {r2:.4f}')
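The example above scores the model on the same rows it was trained on. A hold-out split gives a more honest estimate; the sketch below assumes the same feature and target column names, and `evaluate_holdout`, the 80/20 split, and the seed are illustrative choices rather than the notebook's actual setup.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

FEATURES = ["pH", "Temperature (°C)", "Turbidity (NTU)", "Dissolved Oxygen (mg/L)"]
TARGET = "Conductivity (µS/cm)"


def evaluate_holdout(df: pd.DataFrame, features=FEATURES, target=TARGET,
                     test_size=0.2, seed=42) -> dict:
    """Fit on a training split and report R² and RMSE on the held-out rows."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], test_size=test_size, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    predictions = model.predict(X_test)
    return {
        "r2": r2_score(y_test, predictions),
        # RMSE is in the target's units (µS/cm for conductivity)
        "rmse": float(np.sqrt(mean_squared_error(y_test, predictions))),
    }
```

With the dataset loaded as `df`, `evaluate_holdout(df)` returns both metrics in one dictionary.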
Statistical Modeling:
import statsmodels.api as sm
# OLS Regression
X_with_const = sm.add_constant(df['pH'])
model = sm.OLS(df['Temperature (°C)'], X_with_const).fit()
print(model.summary())
Note
Major Finding: Strong positive correlation (r=0.705) between pH and dissolved oxygen levels. The correlation describes how the two parameters vary together in this dataset; on its own it does not establish ecosystem health.
Parameter Relationships:
- 🔵 pH ↔ Dissolved Oxygen: Strong positive correlation (r=0.705)
- 🟢 Temperature ↔ Conductivity: Moderate correlation
- 🟡 Turbidity ↔ Other Parameters: Various weak to moderate correlations
Model Performance:
- 📊 Conductivity Prediction: Multi-parameter linear model shows good predictive capability
- 📈 Statistical Significance: Most relationships show statistical significance (p<0.05)
This toolkit is perfect for:
- 🌊 Water Quality Assessment: Comprehensive parameter analysis
- 🏭 Environmental Impact Studies: Industrial discharge monitoring
- 🔬 Research Projects: Academic water quality research
- 📊 Regulatory Compliance: Meeting environmental standards
- 🌱 Ecosystem Health: Aquatic ecosystem monitoring
- 👨‍🎓 Data Science Education: Real-world dataset for learning
- 📚 Environmental Science: Practical water quality analysis
- 🔢 Statistics Teaching: Applied statistical analysis examples
- 🤖 Machine Learning: Environmental prediction modeling
🔍 Key Statistical Findings
Correlation Matrix Results:
Parameter Correlations (Selected):
pH ↔ Dissolved Oxygen: 0.705 (Strong)
pH ↔ Temperature: 0.151 (Weak)
Temperature ↔ Conductivity: Variable
Turbidity ↔ pH: Negative correlation
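A listing like the one above can be generated for every parameter pair. In the sketch below, the cut-offs of 0.3 and 0.7 for "Moderate" and "Strong" are assumptions chosen to agree with the labels shown (0.705 strong, 0.151 weak); `label_strength` and `correlation_pairs` are hypothetical helpers.

```python
import pandas as pd


def label_strength(r: float) -> str:
    """Label |r| using illustrative cut-offs (0.3 and 0.7); adjust as needed."""
    magnitude = abs(r)
    if magnitude >= 0.7:
        return "Strong"
    if magnitude >= 0.3:
        return "Moderate"
    return "Weak"


def correlation_pairs(df: pd.DataFrame) -> pd.DataFrame:
    """List each unique pair of numeric columns with Pearson r and a label."""
    corr = df.select_dtypes(include="number").corr()
    columns = list(corr.columns)
    rows = []
    for i, first in enumerate(columns):
        for second in columns[i + 1:]:
            r = corr.loc[first, second]
            rows.append({"pair": f"{first} ↔ {second}", "r": r,
                         "strength": label_strength(r)})
    # Strongest relationships first, regardless of sign
    return pd.DataFrame(rows).sort_values("r", key=abs, ascending=False,
                                          ignore_index=True)
```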
Distribution Characteristics:
- pH: Near-normal distribution (mean: 7.16)
- Temperature: Normal distribution (mean: 22.1°C)
- Dissolved Oxygen: Right-skewed distribution
- Conductivity: Normal distribution with some outliers
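Distribution shape can be checked numerically as well as visually. The sketch below reports skewness and a Shapiro-Wilk p-value per column; the choice of Shapiro-Wilk is an assumption made for this example, and `distribution_summary` is a hypothetical helper.

```python
import pandas as pd
from scipy import stats


def distribution_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Report skewness and a Shapiro-Wilk normality p-value per numeric column."""
    rows = {}
    for column in df.select_dtypes(include="number").columns:
        values = df[column].dropna()
        _, p_value = stats.shapiro(values)
        rows[column] = {
            # Positive skewness means a longer right tail, negative a longer left tail
            "skewness": stats.skew(values),
            # A small p-value is evidence against normality
            "shapiro_p": p_value,
        }
    return pd.DataFrame.from_dict(rows, orient="index")
```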
Model Performance:
- Linear Regression R²: ~0.65 for conductivity prediction
- OLS Model Significance: Most parameters show p<0.01
- Prediction Accuracy: Good for environmental monitoring standards
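A single R² value depends on which rows the model was scored on. One way to see how stable the figure is would be k-fold cross-validation, sketched below; the five folds, the shuffling seed, and the name `cross_validated_r2` are illustrative assumptions, not the notebook's configuration.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score


def cross_validated_r2(X, y, folds=5, seed=42):
    """Return the per-fold R² scores of a linear model under shuffled k-fold CV."""
    splitter = KFold(n_splits=folds, shuffle=True, random_state=seed)
    return cross_val_score(LinearRegression(), X, y, cv=splitter, scoring="r2")
```

The returned array has one score per fold, so `scores.mean()` and `scores.std()` summarize both the typical fit and its spread.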
Step-by-Step Execution:
# 1. Navigate to project directory
cd water-quality-testing-data-analysis
# 2. Start Jupyter
jupyter notebook
# 3. Open main notebook
# Click on 'my_notebook.ipynb'
# 4. Run all cells
# Kernel -> Restart & Run All
Analysis Structure:
Analysis Workflow:
├── Data Loading & Inspection
├── Descriptive Statistics
├── Distribution Analysis
├── Correlation Studies
├── Visualization Suite
├── Machine Learning Models
├── Statistical Modeling
└── Results & Insights
Extend the Analysis:
# Add new visualization
def create_custom_plot(df, param1, param2):
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x=param1, y=param2, alpha=0.6)
plt.title(f'{param1} vs {param2} Analysis')
return plt.gca()
# Add new statistical test
from scipy import stats
def parameter_significance_test(df, param1, param2):
correlation, p_value = stats.pearsonr(df[param1], df[param2])
return {
'correlation': correlation,
'p_value': p_value,
'significant': p_value < 0.05
}
We welcome contributions to improve water quality analysis capabilities!
1. Fork & Clone:
git clone https://github.com/ChanMeng666/water-quality-testing-data-analysis.git
cd water-quality-testing-data-analysis
2. Create Feature Branch:
git checkout -b feature/new-analysis-method
3. Make Improvements:
- Add new analysis techniques
- Improve visualizations
- Enhance documentation
- Add new datasets
4. Submit Pull Request:
- Provide clear description
- Include example outputs
- Update documentation
- 🔍 New Analysis Methods: Time series analysis, anomaly detection
- 📊 Enhanced Visualizations: Interactive plots, dashboard creation
- 🤖 Advanced Models: Neural networks, ensemble methods
- 📚 Documentation: Tutorials, use case examples
- 🌊 New Datasets: Additional water quality parameters
This project is licensed under the MIT License - see the LICENSE file for details.
Open Source Benefits:
- ✅ Commercial use allowed
- ✅ Modification allowed
- ✅ Distribution allowed
- ✅ Private use allowed
Chan Meng
LinkedIn: chanmeng666
GitHub: ChanMeng666
Website: chanmeng.live
🔧 Common Issues
Missing Dependencies:
# Install missing packages
pip install --upgrade pandas numpy matplotlib seaborn scikit-learn
Jupyter Not Starting:
# Reinstall Jupyter
pip install --upgrade jupyter
jupyter --version
Data Loading Errors:
- Ensure Water Quality Testing.csv is in the project directory
- Check file encoding (should be UTF-8)
- Verify CSV delimiter is comma
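The three checks above can be folded into a small loader that fails with a clear message. A minimal sketch: `load_water_quality` is a hypothetical helper, and treating a single parsed column as a delimiter problem is a heuristic, not a guarantee.

```python
from pathlib import Path

import pandas as pd


def load_water_quality(path="Water Quality Testing.csv") -> pd.DataFrame:
    """Load the CSV with explicit encoding and delimiter, failing with clear errors."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(
            f"{csv_path.resolve()} not found; run from the project directory"
        )
    # A non-UTF-8 file raises UnicodeDecodeError here
    df = pd.read_csv(csv_path, encoding="utf-8", sep=",")
    if df.shape[1] == 1:
        # A single column usually means the file uses a different delimiter
        raise ValueError("Only one column parsed; check that the delimiter is a comma")
    return df
```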
Visualization Problems:
# Reset matplotlib backend
# In Jupyter, use the IPython magic in a notebook cell:
%matplotlib inline
# In a standalone script, select a GUI backend instead:
import matplotlib
matplotlib.use('TkAgg')
Memory Issues:
- Use data sampling for large datasets
- Clear variables between analyses: del variable_name
- Restart Jupyter kernel periodically
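For files too large to load at once, pandas can stream the CSV in chunks. The sketch below computes a column mean that way; `chunked_mean` and the default chunk size are illustrative assumptions, and the 500-row dataset in this project does not need this.

```python
import pandas as pd


def chunked_mean(path, column, chunksize=10_000):
    """Compute a column mean by streaming the CSV in chunks instead of loading it whole."""
    total = 0.0
    count = 0
    # usecols keeps only the needed column in memory for each chunk
    for chunk in pd.read_csv(path, usecols=[column], chunksize=chunksize):
        values = chunk[column].dropna()
        total += values.sum()
        count += len(values)
    return total / count if count else float("nan")
```

When the data does fit in memory but plotting is slow, `df.sample(frac=0.1, random_state=0)` draws a reproducible 10% sample instead.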
Empowering environmental scientists and researchers worldwide
Made with ❤️ for the environmental science community