2021-2022, The Hong Kong Polytechnic University
License: MIT (see LICENSE.md)
Authors: JIANG Maiqi
This repository (Heterogeneous Information Networks Baselines) is a collection of baseline models for heterogeneous information networks (HINs). The models are implemented in PyTorch.
It is based on several repositories, including DiffMG, GAT, GCN, HAN, HGT, and MAGNN. The HGT repository also contains an RGCN model, so there are seven models in total.
This repository unifies the experiment and dataset settings and makes the models easier to use.
The experiments run on GPU, so you may need to download CUDA and cuDNN from NVIDIA:
cuda ~= 11.3
cudnn ~= 8.2.0
Linux is recommended. The code is written in Python, with the following environment:
python ~= 3.8.13
torch ~= 1.11.0 # Install the build that matches your CUDA version.
numpy ~= 1.21.5
scikit-learn ~= 1.0.2
scipy ~= 1.7.3
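After installation, you can quickly confirm that PyTorch sees your GPU (a minimal check, assuming torch is already installed):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"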
We suggest creating one conda environment per model, named after the model in lower case. Our sh files for Linux always contain:
conda activate (lower case model name) # (lower case model name) is the name of the environment.
conda deactivate
So, if you run these files, make sure you use the correct environment. If you are already in the right Python environment, you can simply delete these lines; if you use a different environment name, modify them accordingly.
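For example, a sketch of setting up an environment for GCN with the versions listed above (the cu113 wheel index matches CUDA 11.3; adjust it if your CUDA version differs):
conda create -n gcn python=3.8
conda activate gcn
pip install torch==1.11.0+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
pip install numpy==1.21.5 scikit-learn==1.0.2 scipy==1.7.3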
For some models, you may also need to install the PyG and DGL packages. Please check the original model repositories (in the Reference parts) for further environment details.
You can download the processed data from here; it contains three datasets: ACM, DBLP, and IMDB. The data is provided by this repository. If you use the data, please cite their work:
Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, Hyunwoo J. Kim, Graph Transformer Networks, In Advances in Neural Information Processing Systems (NeurIPS 2019).
After downloading the data, unzip it and copy the three folders (ACM/DBLP/IMDB) into ./data, which already contains these three folders.
Each dataset consists of edges.pkl, labels.pkl, and node_features.pkl. In addition, our repository provides each dataset's node-type file, node_types.npy; you can also generate it from edges.pkl yourself.
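A minimal sketch of inspecting these files (the comments describe the layout of the GTN data release as we understand it; verify against your download):

import pickle
import numpy as np

with open('./data/ACM/edges.pkl', 'rb') as f:
    edges = pickle.load(f)  # one sparse adjacency matrix per edge type
with open('./data/ACM/labels.pkl', 'rb') as f:
    labels = pickle.load(f)  # training/validation/test (node, label) splits
with open('./data/ACM/node_features.pkl', 'rb') as f:
    features = pickle.load(f)  # node feature matrix
node_types = np.load('./data/ACM/node_types.npy')  # integer type id per node
print(len(edges), np.asarray(features).shape, node_types.shape)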
If you want to process the data on your own, the original datasets are generated by the HAN repository. They offer the following citation:
@article{han2019,
title={Heterogeneous Graph Attention Network},
author={Wang, Xiao and Ji, Houye and Shi, Chuan and Wang, Bai and Cui, Peng and Yu, Philip S. and Ye, Yanfang},
journal={WWW},
year={2019}
}
The processed labels.pkl also provides the training/validation/test split, but that split is fixed. We provide our own five-fold cross-validation splits, stored in each dataset folder as labels_5_fold_cross_validation_{which fold for test}.pkl.
These splits use 100% of the training set. In addition, we provide other training percentages: 10% (0.1), 25% (0.25), and 50% (0.5), under the path ./data/{dataset}/{percentage: 0.1, 0.25 and 0.5}/{percentage}_labels_5_fold_cross_validation_{fold}.pkl.
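For reference, a small loop that enumerates all percentage split files for one dataset under this naming scheme (a sketch assuming the layout above):

import os

dataset = 'ACM'
for percentage in ['0.1', '0.25', '0.5']:
    for fold in range(5):
        path = f'./data/{dataset}/{percentage}/{percentage}_labels_5_fold_cross_validation_{fold}.pkl'
        print(path, os.path.exists(path))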
The splits are chosen randomly under a controlled random seed. We provide the split code in ./split/try_split.py and ./split/percentage.py. Here we explain how to use them if you want to build splits for further study.
If you are in a Linux environment, you can generate the splits directly:
cd split
bash -i split.sh
python percentage.py
If you are not on Linux, here is the replacement method:
cd split
python try_split.py --label_path ../data/IMDB/labels.pkl --output_label_path_style ../data/IMDB/labels_{k}_fold_cross_validation_{number}.pkl
python try_split.py --label_path ../data/ACM/labels.pkl --output_label_path_style ../data/ACM/labels_{k}_fold_cross_validation_{number}.pkl
python try_split.py --label_path ../data/DBLP/labels.pkl --output_label_path_style ../data/DBLP/labels_{k}_fold_cross_validation_{number}.pkl
python percentage.py
try_split.py provides several arguments for building your own splits. Note that changing them may change the format of the file names, so other code may need to be modified to use the new format. A hypothetical example follows the list below.
--seed SEED random seed, default 0
--label_path LABEL_PATH path to the original labels file
--output_label_path_style OUTPUT_LABEL_PATH_STYLE output label path template; it must contain the {k} and {number} placeholders
--k K k for k-fold cross-validation
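For example, a hypothetical 10-fold split with a different seed (the resulting file names differ from the defaults, so downstream code must be adapted):
python try_split.py --seed 42 --k 10 --label_path ../data/ACM/labels.pkl --output_label_path_style ../data/ACM/labels_{k}_fold_cross_validation_{number}.pkl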
As for percentage.py, by default it processes all three datasets directly. Alternatively, you can build your own split targeting a single labels file (see the sketch after this list).
--default DEFAULT if True, process all the default datasets with 5 folds and default paths; no other arguments are needed
--seed SEED random seed
--label_path LABEL_PATH path to the labels file
--output_label_path_style OUTPUT_LABEL_PATH_STYLE output label path template; it must contain the {percentage} and {ori_name} placeholders
--percentage PERCENTAGE the percentage of the training set to keep
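For example, a hypothetical single-file run keeping 25% of fold 0's training set on ACM (how --default accepts a false value depends on the script's argument parsing, so treat this as a sketch):
python percentage.py --default False --seed 0 --percentage 0.25 --label_path ../data/ACM/labels_5_fold_cross_validation_0.pkl --output_label_path_style ../data/ACM/{percentage}/{percentage}_{ori_name}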
You can simply train on all platforms as follows:
cd (model name)
bash -i (lower case model name).sh ./ 0 # arguments: {results folder} and {which GPU to use}
bash -i percentage.sh ./ 0 # arguments: {results folder} and {which GPU to use}
Results for each percentage are recorded in {results folder}/{percentage: 0.1, 0.25, 0.5}/{lower case model name}.csv, and the 100% results in {results folder}/results.csv.
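For example, to train HAN on GPU 0 and collect results under ./results (hypothetical values following the pattern above):
cd HAN
bash -i han.sh ./results 0
bash -i percentage.sh ./results 0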
Each model is also simple to run independently. For example, DiffMG has two parts, search and retrain, located at ./DiffMG/train_search_{Dataset}.py and ./DiffMG/train_{Dataset}.py. Here is an example of a single independent run on ACM:
cd DiffMG
python train_search_ACM.py --gpu 0 --seed 0 --label ../data/ACM/labels_5_fold_cross_validation_0.pkl --result ./
python train_ACM.py --gpu 0 --seed 0 --label ../data/ACM/labels_5_fold_cross_validation_0.pkl --result ./
You can change the hyperparameters of the search part (train_search_{Dataset}.py) using command-line arguments. Similarly, the retrain part (train_{Dataset}.py) can be modified by arguments.
Many results are collected across runs, so we provide a simple tool to calculate their average and deviation. A direct usage example:
cd {model name}/{percentage}
python ../../csvprocess/result.py --csv results.csv --output statistics.csv
More details for this tool:
python ./csvprocess/result.py --help
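If you want to script the same computation, here is a rough sketch of what the tool does (assuming a header row followed by all-numeric result rows; the real CSV layout is defined by the training scripts):

import csv
import statistics

with open('results.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)  # metric names
    rows = [[float(x) for x in row] for row in reader]

columns = list(zip(*rows))  # one tuple of values per metric
with open('statistics.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerow([statistics.mean(c) for c in columns])  # averages
    writer.writerow([statistics.stdev(c) for c in columns])  # standard deviations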
If you use the code in your research, please cite the following papers:
DiffMG:
@inproceedings{diffmg,
title={DiffMG: Differentiable Meta Graph Search for Heterogeneous Graph Neural Networks},
author={Ding, Yuhui and Yao, Quanming and Zhao, Huan and Zhang, Tong},
booktitle={Proceedings of the 27th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
year={2021}
}
GAT:
@article{
velickovic2018graph,
title="{Graph Attention Networks}",
author={Veli{\v{c}}kovi{\'{c}}, Petar and Cucurull, Guillem and Casanova, Arantxa and Romero, Adriana and Li{\`{o}}, Pietro and Bengio, Yoshua},
journal={International Conference on Learning Representations},
year={2018},
url={https://openreview.net/forum?id=rJXMpikCZ},
note={accepted as poster},
}
GCN:
@article{kipf2016semi,
title={Semi-Supervised Classification with Graph Convolutional Networks},
author={Kipf, Thomas N and Welling, Max},
journal={arXiv preprint arXiv:1609.02907},
year={2016}
}
HAN:
@article{han2019,
title={Heterogeneous Graph Attention Network},
author={Wang, Xiao and Ji, Houye and Shi, Chuan and Wang, Bai and Cui, Peng and Yu, Philip S. and Ye, Yanfang},
journal={WWW},
year={2019}
}
HGT:
@inproceedings{hgt,
author = {Ziniu Hu and
Yuxiao Dong and
Kuansan Wang and
Yizhou Sun},
title = {Heterogeneous Graph Transformer},
booktitle = {{WWW} '20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020},
pages = {2704--2710},
publisher = {{ACM} / {IW3C2}},
year = {2020},
url = {https://doi.org/10.1145/3366423.3380027},
doi = {10.1145/3366423.3380027},
timestamp = {Wed, 06 May 2020 12:56:16 +0200},
biburl = {https://dblp.org/rec/conf/www/HuDWS20.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
RGCN:
@inproceedings{schlichtkrull2018modeling,
title={Modeling relational data with graph convolutional networks},
author={Schlichtkrull, Michael and Kipf, Thomas N and Bloem, Peter and Berg, Rianne van den and Titov, Ivan and Welling, Max},
booktitle={European semantic web conference},
pages={593--607},
year={2018},
organization={Springer}
}
MAGNN:
@inproceedings{fu2020magnn,
title={MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding},
author={Xinyu Fu and Jiani Zhang and Ziqiao Meng and Irwin King},
booktitle = {WWW},
year={2020}
}