The goal of the NeurIPS Open Polymer Prediction 2025 competition was to predict five polymer properties from SMILES (Simplified Molecular Input Line Entry System) representations:
- Tg: Glass Transition Temperature
- Tc: Crystallization Temperature
- FFV: Fractional Free Volume
- Density: Polymer Density
- Rg: Radius of Gyration
The evaluation metric was weighted Mean Absolute Error (wMAE); FFV carried roughly 10 times the weight of the other properties, making it disproportionately important to the final score.
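For intuition, the metric can be sketched as a weight-normalized MAE per property. This is a simplified sketch: the actual competition metric also rescales each property by its value range and sample count, and the weights below are illustrative.

```python
# Sketch of a weighted MAE across properties (weights illustrative; the
# real competition metric also normalizes by property scale and count).
def weighted_mae(y_true, y_pred, weights):
    """y_true/y_pred: dict mapping property -> list of values."""
    total, wsum = 0.0, 0.0
    for prop, w in weights.items():
        t, p = y_true[prop], y_pred[prop]
        mae = sum(abs(a - b) for a, b in zip(t, p)) / len(t)
        total += w * mae
        wsum += w
    return total / wsum

# FFV weighted ~10x the other properties
WEIGHTS = {"Tg": 1.0, "Tc": 1.0, "FFV": 10.0, "Density": 1.0, "Rg": 1.0}
```

Because of the 10x weight, a 0.1 error on FFV hurts the score as much as a 1.0 error on any other property.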
After Transformer-based and gradient-boosting models failed to produce promising results, I developed a multi-model approach using different Graph Neural Network (GNN) architectures, each optimized for specific polymer properties based on their characteristics and data distributions.
- Location: `/MY_GNN/`
- Architecture: Custom Graph Neural Network designed specifically for polymer molecular graphs
- Design Philosophy: Rg and Density are directly related to molecular structure and spatial arrangement, requiring a custom architecture that can capture geometric and topological features effectively
- Key Features:
- Custom message passing mechanism
- Specialized node and edge feature representations
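The core of any message-passing GNN can be sketched in a few lines. The sum-aggregation update rule below is illustrative, not the competition code, which used learned transformations and specialized edge features.

```python
# Minimal sketch of one message-passing round on a molecular graph:
# each atom aggregates its neighbours' features and updates its own.
def message_passing_step(node_feats, edges):
    """node_feats: list of per-atom feature vectors; edges: list of (i, j) bonds."""
    dim = len(node_feats[0])
    messages = [[0.0] * dim for _ in node_feats]
    for i, j in edges:  # bonds are undirected: message flows both ways
        for d in range(dim):
            messages[i][d] += node_feats[j][d]
            messages[j][d] += node_feats[i][d]
    # update: combine each node's state with its aggregated messages
    return [[h + m for h, m in zip(hv, mv)]
            for hv, mv in zip(node_feats, messages)]
```

Stacking several such rounds lets information propagate across the molecular graph, so each atom's representation reflects its wider chemical environment.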
- Location: `/NIPS_GNN/`
- Base Repository: masashitsubaki/molecularGNN_smiles
- Architecture: Graph Neural Network based on learning representations of r-radius subgraphs (molecular fingerprints)
- Design Philosophy: FFV and Tg are thermal and mechanical properties that correlate well with local chemical environments and substructural patterns
- Key Features:
- Fingerprint representation learning over r-radius subgraphs
- Proven performance on molecular property prediction tasks
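The r-radius subgraph idea can be sketched as iterative relabelling: each atom's label is repeatedly hashed together with its neighbours' labels, so after r rounds the label identifies the atom's r-hop chemical environment. This is a toy version of the fingerprint extraction in molecularGNN_smiles; the repository's implementation differs in detail.

```python
# Toy r-radius fingerprint extraction by iterative neighbourhood hashing.
def r_radius_fingerprints(atom_labels, adjacency, r):
    """atom_labels: list of element symbols; adjacency: list of neighbour-index lists."""
    labels = list(atom_labels)
    for _ in range(r):
        labels = [
            hash((labels[i], tuple(sorted(labels[j] for j in adjacency[i]))))
            for i in range(len(labels))
        ]
    return labels
```

Atoms in identical local environments (e.g. the two terminal carbons of propane) receive identical fingerprints, which is what lets the model share learned embeddings across recurring substructures.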
- Location: `/DA_GNN/`
- Base Repository: hkqiu/DataAugmentation4SmallData
- Architecture: Neural network with data augmentation techniques for small datasets
- Modifications Made:
- Adjusted layer sizes for deeper polymer-specific features
- Modified augmentation strategies for chemical data
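One generic small-data augmentation is to replicate each training sample with small random perturbations of its input features. The sketch below illustrates that idea only; the repository's chemical augmentation strategy (and my modifications to it) differ in specifics.

```python
import random

# Illustrative small-data augmentation: replicate each sample with small
# Gaussian noise on its input features, keeping the label unchanged.
def augment(X, y, n_copies=5, sigma=0.01, seed=0):
    """X: list of feature vectors; y: list of targets."""
    rng = random.Random(seed)
    X_aug, y_aug = list(X), list(y)
    for xi, yi in zip(X, y):
        for _ in range(n_copies):
            X_aug.append([v + rng.gauss(0.0, sigma) for v in xi])
            y_aug.append(yi)
    return X_aug, y_aug
```

For a property like Tc with few labelled polymers, multiplying the effective dataset size this way can noticeably stabilize training, provided the noise scale is small relative to the feature ranges.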
- Input Format: SMILES strings representing polymer structures
- Graph Construction: Molecular graphs with atoms as nodes and bonds as edges
- Feature Engineering using the `rdkit` and `networkx` modules:
  - Atomic features (element type, hybridization, formal charge, etc.)
  - Bond features (bond type, conjugation, ring membership, etc.)
  - Global molecular descriptors where applicable
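A minimal version of this SMILES-to-graph step looks like the sketch below, assuming RDKit is installed. The feature sets here are a subset of what the competition pipeline extracted.

```python
# Sketch: parse a SMILES string into node (atom) and edge (bond) feature
# dicts with RDKit. Feature choices are illustrative.
from rdkit import Chem

def smiles_to_graph(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Could not parse SMILES: {smiles}")
    nodes = [
        {
            "element": atom.GetSymbol(),
            "hybridization": str(atom.GetHybridization()),
            "formal_charge": atom.GetFormalCharge(),
            "aromatic": atom.GetIsAromatic(),
        }
        for atom in mol.GetAtoms()
    ]
    edges = [
        {
            "atoms": (bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()),
            "bond_type": str(bond.GetBondType()),
            "conjugated": bond.GetIsConjugated(),
            "in_ring": bond.IsInRing(),
        }
        for bond in mol.GetBonds()
    ]
    return nodes, edges
```

For example, `smiles_to_graph("CCO")` (ethanol) yields three nodes and two edges.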
- Cross-Validation: 5-fold cross-validation for robust performance estimation
- Optimization: Adam optimizer with learning rate scheduling
- Loss Function: Mean Absolute Error (MAE) to match competition metric
- Early Stopping: Implemented to prevent overfitting
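Put together, the training loop follows a standard early-stopping pattern. In this schematic, `train_one_epoch` and `validate` are placeholders for the per-fold training and validation-MAE routines (the Adam optimizer and LR scheduling live inside them).

```python
# Schematic early-stopping loop: keep training until validation MAE has
# not improved for `patience` consecutive epochs.
def fit(train_one_epoch, validate, max_epochs=200, patience=20):
    best_val, best_epoch = float("inf"), -1
    for epoch in range(max_epochs):
        train_one_epoch(epoch)
        val_mae = validate(epoch)
        if val_mae < best_val:
            best_val, best_epoch = val_mae, epoch  # new best checkpoint
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop
    return best_val, best_epoch
```

Running this once per CV fold gives five validation MAEs whose mean is the cross-validation estimate used to compare models.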
- Best Public LB Score: 0.065 wMAE; the same solution scored 0.083 wMAE on the Private LB, which would have placed among the top 10 performers.
- Model Type: Ensemble of the three GNN approaches
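At inference time the ensemble is a per-property router: each target is predicted by the GNN specialized for it (per the directory layout below). The model callables here are placeholders standing in for the three trained pipelines.

```python
# Per-property routing for the final ensemble (model names match the
# repository directories; the callables are placeholders).
ROUTING = {
    "Rg": "MY_GNN", "Density": "MY_GNN",   # custom GNN
    "FFV": "NIPS_GNN", "Tg": "NIPS_GNN",   # molecularGNN_smiles
    "Tc": "DA_GNN",                        # data-augmentation GNN
}

def predict_all(models, smiles):
    """models: dict name -> callable(smiles, prop) -> float prediction."""
    return {prop: models[name](smiles, prop) for prop, name in ROUTING.items()}
```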
```
NeurIPS-Open-Polymer-Prediction-2025/
├── MY_GNN/               # Custom GNN for Rg and Density
│   ├── train.py          # Training script
│   ├── inference.py      # Inference script
│   └── trained_models/   # Trained models
├── NIPS_GNN/             # MolecularGNN for FFV and Tg
├── DA_GNN/               # Data augmentation GNN for Tc
├── notebooks/            # Submitted Jupyter notebooks
├── Datasets/             # Competition and external datasets, along with training scripts
└── README.md             # This file
```
Despite achieving a strong public leaderboard score of 0.065 with my GNN ensemble, I made a critical error on the final submission day that cost me the competition.
On the last day of the competition, influenced by discussion threads suggesting that models performing poorly on the public leaderboard (only ~8% of the test data) might perform better on the private leaderboard (the remaining ~92%), I submitted a different, inferior model (0.070 public LB score) instead of my best-performing GNN solution.
- Public Leaderboard Overfitting Concerns: The competition discussion was filled with warnings about leaderboard overfitting due to the small public test split
- Last-Minute Decision: I chose to submit what I thought was a "safer" model
- Private Leaderboard Results: The model I submitted performed significantly worse on the private leaderboard
- Statistical Truth: In most Kaggle competitions, the discrepancy between public and private leaderboards is typically only 5-10%
- Lesson Learned: Strong cross-validation and consistent public performance are usually the best predictors of private performance
This single decision transformed what could have been a successful competition result into a disappointing outcome, despite months of dedicated work and model development.
- Property-Specific Modeling: Different polymer properties benefit from different neural network architectures
- Data Augmentation Value: For properties with limited data (like Tc), augmentation techniques are crucial
- Ensemble Benefits: Combining specialized models for different properties improves overall performance
- Trust Your CV: Strong CV performance is usually the best predictor of final performance
- Avoid Last-Minute Changes: Stick to your best-validated approach rather than making strategy changes under pressure
- Discussion Forum Caution: While community discussions provide valuable insights, they can also lead to overthinking and poor decisions
This competition was both a technical challenge and a lesson in decision-making under pressure.
While my GNN ensemble achieved strong performance (0.065 wMAE), the final submission mistake serves as a reminder that technical excellence must be paired with sound strategic judgment.
The multi-model approach proved effective, with each specialized GNN architecture capturing different aspects of polymer structure-property relationships.
- "Compound-protein Interaction Prediction with End-to-end Learning of Neural Networks for Graphs and Sequences" (Tsubaki et al.)
- "Heat-Resistant Polymer Discovery by Utilizing Interpretable Graph Neural Network with Small Data" (Qiu et al.)