2021-2022, The Hong Kong Polytechnic University
License: MIT (see LICENSE.md)
Authors: JIANG Maiqi
This repository (Heterogeneous Information Networks Baselines) is a collection of baseline models for heterogeneous information networks (HINs). The models are implemented in PyTorch.
It is based on several repositories, including DiffMG, GAT, GCN, HAN, HGT, and MAGNN. The HGT repository also contains an RGCN model, so there are seven models in total.
This repository unifies the experiment and dataset settings and makes the models easier to use.
The experiments run on GPU, so you may need to download CUDA and cuDNN from NVIDIA:
cuda ~= 11.3
cudnn ~= 8.2.0
Linux is recommended. The code is written in Python, with the following environment:
python ~= 3.8.13
torch ~= 1.11.0 # Install the build that matches your CUDA version.
numpy ~= 1.21.5
scikit-learn ~= 1.0.2
scipy ~= 1.7.3
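After installation, you can quickly confirm that PyTorch sees your GPU (a minimal check, assuming torch is already installed):
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"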
We suggest creating one conda environment per model, named after the model in lower case. Our sh files for Linux always contain:
conda activate (lower case model name) # (lower case model name) is the name of the environment.
conda deactivate
So, if you run these files, make sure you use the correct environment. If you are already in the right Python environment, you can simply delete these lines; if you use a different environment name, modify them accordingly.
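For example, a sketch of setting up an environment for GCN with the versions listed above (the cu113 wheel index matches CUDA 11.3; adjust it if your CUDA version differs):
conda create -n gcn python=3.8
conda activate gcn
pip install torch==1.11.0+cu113 --extra-index-url https://download.pytorch.org/whl/cu113
pip install numpy==1.21.5 scikit-learn==1.0.2 scipy==1.7.3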
For some models, you may also need to install the PyG and DGL packages. Please check the original model repositories (in the Reference parts) for further environment details.
You can download the processed data from here; it contains three datasets: ACM, DBLP, and IMDB. The data is provided by this repository. If you use the data, please cite their work:
Seongjun Yun, Minbyul Jeong, Raehyun Kim, Jaewoo Kang, Hyunwoo J. Kim, Graph Transformer Networks, In Advances in Neural Information Processing Systems (NeurIPS 2019).
After downloading the data, unzip it and copy the three folders (ACM/DBLP/IMDB) into ./data, which already contains these three folders.
Each dataset consists of edges.pkl, labels.pkl, and node_features.pkl. In addition, our repository provides each dataset's node-type file, node_types.npy; you can also generate it from edges.pkl yourself.
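A minimal sketch of inspecting these files (the comments describe the layout of the GTN data release as we understand it; verify against your download):

import pickle
import numpy as np

with open('./data/ACM/edges.pkl', 'rb') as f:
    edges = pickle.load(f)  # one sparse adjacency matrix per edge type
with open('./data/ACM/labels.pkl', 'rb') as f:
    labels = pickle.load(f)  # training/validation/test (node, label) splits
with open('./data/ACM/node_features.pkl', 'rb') as f:
    features = pickle.load(f)  # node feature matrix
node_types = np.load('./data/ACM/node_types.npy')  # integer type id per node
print(len(edges), np.asarray(features).shape, node_types.shape)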
If you want to process the data on your own, the original datasets are generated by the HAN repository. They offer the following citation:
@article{han2019,
title={Heterogeneous Graph Attention Network},
author={Wang, Xiao and Ji, Houye and Shi, Chuan and Wang, Bai and Cui, Peng and Yu, Philip S. and Ye, Yanfang},
journal={WWW},
year={2019}
}
The processed labels.pkl also provides the training/validation/test split, but that split is fixed. We provide our own five-fold cross-validation splits, stored in each dataset folder as labels_5_fold_cross_validation_{which fold for test}.pkl.
These splits use 100% of the training set. In addition, we provide other training percentages: 10% (0.1), 25% (0.25), and 50% (0.5), under the path ./data/{dataset}/{percentage: 0.1, 0.25 and 0.5}/{percentage}_labels_5_fold_cross_validation_{fold}.pkl.
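For reference, a small loop that enumerates all percentage split files for one dataset under this naming scheme (a sketch assuming the layout above):

import os

dataset = 'ACM'
for percentage in ['0.1', '0.25', '0.5']:
    for fold in range(5):
        path = f'./data/{dataset}/{percentage}/{percentage}_labels_5_fold_cross_validation_{fold}.pkl'
        print(path, os.path.exists(path))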
The splits are chosen randomly under a controlled random seed. We provide the split code in ./split/try_split.py and ./split/percentage.py. Here we explain how to use them if you want to build splits for further study.
If you are in a Linux environment, you can generate the splits directly:
cd split
bash -i split.sh
python percentage.py
If you are not on Linux, here is the replacement method:
cd split
python try_split.py --label_path ../data/IMDB/labels.pkl --output_label_path_style ../data/IMDB/labels_{k}_fold_cross_validation_{number}.pkl
python try_split.py --label_path ../data/ACM/labels.pkl --output_label_path_style ../data/ACM/labels_{k}_fold_cross_validation_{number}.pkl
python try_split.py --label_path ../data/DBLP/labels.pkl --output_label_path_style ../data/DBLP/labels_{k}_fold_cross_validation_{number}.pkl
python percentage.py
try_split.py provides several arguments for building your own splits. Note that changing them may change the format of the file names, so other code may need to be modified to use the new format. A hypothetical example follows the list below.
--seed SEED random seed, default 0
--label_path LABEL_PATH path to the original labels file
--output_label_path_style OUTPUT_LABEL_PATH_STYLE output label path template; it must contain the {k} and {number} placeholders
--k K k for k-fold cross-validation
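For example, a hypothetical 10-fold split with a different seed (the resulting file names differ from the defaults, so downstream code must be adapted):
python try_split.py --seed 42 --k 10 --label_path ../data/ACM/labels.pkl --output_label_path_style ../data/ACM/labels_{k}_fold_cross_validation_{number}.pkl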
As for percentage.py, by default it processes all three datasets directly. Alternatively, you can build your own split targeting a single labels file (see the sketch after this list).
--default DEFAULT if True, process all the default datasets with 5 folds and default paths; no other arguments are needed
--seed SEED random seed
--label_path LABEL_PATH path to the labels file
--output_label_path_style OUTPUT_LABEL_PATH_STYLE output label path template; it must contain the {percentage} and {ori_name} placeholders
--percentage PERCENTAGE the percentage of the training set to keep
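For example, a hypothetical single-file run keeping 25% of fold 0's training set on ACM (how --default accepts a false value depends on the script's argument parsing, so treat this as a sketch):
python percentage.py --default False --seed 0 --percentage 0.25 --label_path ../data/ACM/labels_5_fold_cross_validation_0.pkl --output_label_path_style ../data/ACM/{percentage}/{percentage}_{ori_name}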
You can simply train on all platforms as follows:
cd (model name)
bash -i (lower case model name).sh ./ 0 # arguments: {results folder} and {which GPU to use}
bash -i percentage.sh ./ 0 # arguments: {results folder} and {which GPU to use}
Results for each percentage are recorded in {results folder}/{percentage: 0.1, 0.25, 0.5}/{lower case model name}.csv, and the 100% results in {results folder}/results.csv.
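For example, to train HAN on GPU 0 and collect results under ./results (hypothetical values following the pattern above):
cd HAN
bash -i han.sh ./results 0
bash -i percentage.sh ./results 0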
Each model is also simple to run independently. For example, DiffMG has two parts, search and retrain, located at ./DiffMG/train_search_{Dataset}.py and ./DiffMG/train_{Dataset}.py. Here is an example of a single independent run on ACM:
cd DiffMG
python train_search_ACM.py --gpu 0 --seed 0 --label ../data/ACM/labels_5_fold_cross_validation_0.pkl --result ./
python train_ACM.py --gpu 0 --seed 0 --label ../data/ACM/labels_5_fold_cross_validation_0.pkl --result ./
You can change the hyperparameters of the search part (train_search_{Dataset}.py) using command-line arguments. Similarly, the retrain part (train_{Dataset}.py) can be modified by arguments.
Many results are collected across runs, so we provide a simple tool to calculate their average and deviation. A direct usage example:
cd {model name}/{percentage}
python ../../csvprocess/result.py --csv results.csv --output statistics.csv
More details for this tool:
python ./csvprocess/result.py --help
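If you want to script the same computation, here is a rough sketch of what the tool does (assuming a header row followed by all-numeric result rows; the real CSV layout is defined by the training scripts):

import csv
import statistics

with open('results.csv', newline='') as f:
    reader = csv.reader(f)
    header = next(reader)  # metric names
    rows = [[float(x) for x in row] for row in reader]

columns = list(zip(*rows))  # one tuple of values per metric
with open('statistics.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerow([statistics.mean(c) for c in columns])  # averages
    writer.writerow([statistics.stdev(c) for c in columns])  # standard deviations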
If you use the code in your research, please cite the following papers:
DiffMG:
@inproceedings{diffmg,
title={DiffMG: Differentiable Meta Graph Search for Heterogeneous Graph Neural Networks},
author={Ding, Yuhui and Yao, Quanming and Zhao, Huan and Zhang, Tong},
booktitle={Proceedings of the 27th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining},
year={2021}
}
GAT:
@article{
velickovic2018graph,
title="{Graph Attention Networks}",
author={Veli{\v{c}}kovi{\'{c}}, Petar and Cucurull, Guillem and Casanova, Arantxa and Romero, Adriana and Li{\`{o}}, Pietro and Bengio, Yoshua},
journal={International Conference on Learning Representations},
year={2018},
url={https://openreview.net/forum?id=rJXMpikCZ},
note={accepted as poster},
}
GCN:
@article{kipf2016semi,
title={Semi-Supervised Classification with Graph Convolutional Networks},
author={Kipf, Thomas N and Welling, Max},
journal={arXiv preprint arXiv:1609.02907},
year={2016}
}
HAN:
@article{han2019,
title={Heterogeneous Graph Attention Network},
author={Wang, Xiao and Ji, Houye and Shi, Chuan and Wang, Bai and Cui, Peng and Yu, Philip S. and Ye, Yanfang},
journal={WWW},
year={2019}
}
HGT:
@inproceedings{hgt,
author = {Ziniu Hu and
Yuxiao Dong and
Kuansan Wang and
Yizhou Sun},
title = {Heterogeneous Graph Transformer},
booktitle = {{WWW} '20: The Web Conference 2020, Taipei, Taiwan, April 20-24, 2020},
pages = {2704--2710},
publisher = {{ACM} / {IW3C2}},
year = {2020},
url = {https://doi.org/10.1145/3366423.3380027},
doi = {10.1145/3366423.3380027},
timestamp = {Wed, 06 May 2020 12:56:16 +0200},
biburl = {https://dblp.org/rec/conf/www/HuDWS20.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
RGCN:
@inproceedings{schlichtkrull2018modeling,
title={Modeling relational data with graph convolutional networks},
author={Schlichtkrull, Michael and Kipf, Thomas N and Bloem, Peter and Berg, Rianne van den and Titov, Ivan and Welling, Max},
booktitle={European semantic web conference},
pages={593--607},
year={2018},
organization={Springer}
}
MAGNN:
@inproceedings{fu2020magnn,
title={MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding},
author={Xinyu Fu and Jiani Zhang and Ziqiao Meng and Irwin King},
booktitle = {WWW},
year={2020}
}