PLACEHOLDER README

Python environment setup with Conda

  1. Create a Python 3.8 environment, with conda or otherwise:
conda create -n cellotape python=3.8 -y
conda activate cellotape
  2. Install dependencies:
bash ./setup.sh

You must have the CUDA toolkit and driver installed for the CUDA version you use, and the CUDA_HOME environment variable must be set.
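
A quick way to confirm the CUDA_HOME requirement before running setup.sh is a small check like the one below. It is only a sketch: the function name check_cuda_home is hypothetical, and it assumes a standard toolkit layout in which the compiler lives at bin/nvcc under CUDA_HOME.

```python
import os
from pathlib import Path


def check_cuda_home(env=None):
    """Return a list of human-readable problems with the CUDA_HOME setting."""
    env = os.environ if env is None else env
    cuda_home = env.get("CUDA_HOME")
    if not cuda_home:
        return ["CUDA_HOME is not set"]

    problems = []
    root = Path(cuda_home)
    if not root.is_dir():
        problems.append(f"CUDA_HOME points to a missing directory: {root}")
    # Assumption: a standard toolkit layout, with the compiler at bin/nvcc.
    elif not (root / "bin" / "nvcc").exists():
        problems.append(f"nvcc not found under {root / 'bin'}")
    return problems


if __name__ == "__main__":
    for problem in check_cuda_home():
        print("WARNING:", problem)
```

An empty list means the variable is set and points at a directory containing nvcc; it does not verify that the toolkit version matches the installed driver.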

Download all artefacts

You must have unzip installed (sudo apt install unzip)

bash ./download_scripts/download_all.sh

1. Download TAG datasets

Get (A) and (B) by running the script for each dataset:

  • ogbn-arxiv | bash download_scripts/ogbn_arxiv_orig_download_data.sh
  • ogbn-products (subset) | bash download_scripts/ogbn_products_download_data.sh
  • arxiv_2023 | bash download_scripts/arxiv_2023_download_data.sh
  • Cora | bash download_scripts/cora_download_data.sh
  • PubMed | bash download_scripts/pubmed_download_data.sh

A. Original text attributes

Dataset | Description
ogbn-arxiv | The OGB provides the mapping from MAG paper IDs into the raw texts of titles and abstracts. Download the dataset here, unzip and move it to dataset/ogbn_arxiv_orig.
ogbn-products (subset) | The dataset is located under dataset/ogbn_products_orig.
arxiv_2023 | Download the dataset here, unzip and move it to dataset/arxiv_2023_orig.
Cora | Download the dataset here, unzip and move it to dataset/cora_orig.
PubMed | Download the dataset here, unzip and move it to dataset/PubMed_orig.

B. LLM responses

Dataset | Description
ogbn-arxiv | Download the dataset here, unzip and move it to gpt_responses/ogbn-arxiv.
ogbn-products (subset) | Download the dataset here, unzip and move it to gpt_responses/ogbn-products.
arxiv_2023 | Download the dataset here, unzip and move it to gpt_responses/arxiv_2023.
Cora | Download the dataset here, unzip and move it to gpt_responses/cora.
PubMed | Download the dataset here, unzip and move it to gpt_responses/PubMed.
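
After downloading, it can help to confirm that every directory named in sections A and B exists and is not empty. The sketch below does only that; the helper name missing_dirs is hypothetical, and it does not validate the files inside each directory.

```python
from pathlib import Path

# Directory names are taken from the tables in sections A and B.
EXPECTED_DIRS = [
    "dataset/ogbn_arxiv_orig",
    "dataset/ogbn_products_orig",
    "dataset/arxiv_2023_orig",
    "dataset/cora_orig",
    "dataset/PubMed_orig",
    "gpt_responses/ogbn-arxiv",
    "gpt_responses/ogbn-products",
    "gpt_responses/arxiv_2023",
    "gpt_responses/cora",
    "gpt_responses/PubMed",
]


def missing_dirs(repo_root="."):
    """Return the expected directories that are absent or empty under repo_root."""
    root = Path(repo_root)
    missing = []
    for rel in EXPECTED_DIRS:
        path = root / rel
        # A directory that exists but holds no entries counts as missing.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    for rel in missing_dirs():
        print("missing or empty:", rel)
```

Run it from the repository root; no output means all ten directories are present and populated.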

2. LM Stage / Generate Embeddings

To download embeddings

# python
import gdown
gdown.download_folder('https://drive.google.com/drive/folders/1hzTCaXh6qtZgoOC6_VPVZOBsA_fKcBft?usp=drive_link', quiet=False)

To just generate and save embeddings

# --dataset_name is one of: cora, pubmed, ogbn-arxiv, arxiv_2023, ogbn-products
python -m core.LMs.generate_embeddings \
--dataset_name ogbn-arxiv \
--lm_model_name Alibaba-NLP/gte-Qwen1.5-7B-instruct \
--add_instruction graph-aware   # adds task-specific instruction to text
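
To generate embeddings for every dataset in one go, the command above can be built programmatically. This is a minimal sketch: embedding_command is a hypothetical helper, and it reuses only the flags and values shown above as defaults. It prints the commands instead of running them.

```python
import shlex

# Dataset names from the comment in the command above.
DATASETS = ["cora", "pubmed", "ogbn-arxiv", "arxiv_2023", "ogbn-products"]


def embedding_command(dataset_name,
                      lm_model_name="Alibaba-NLP/gte-Qwen1.5-7B-instruct",
                      add_instruction="graph-aware"):
    """Build the argv list for one core.LMs.generate_embeddings run."""
    if dataset_name not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset_name!r}")
    return [
        "python", "-m", "core.LMs.generate_embeddings",
        "--dataset_name", dataset_name,
        "--lm_model_name", lm_model_name,
        "--add_instruction", add_instruction,
    ]


if __name__ == "__main__":
    for name in DATASETS:
        # Printed only; pass the list to subprocess.run(..., check=True) to execute it.
        print(shlex.join(embedding_command(name)))
```

Keeping the command as a list avoids shell-quoting problems if a model name or instruction ever contains spaces.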

To fine-tune using the original text attributes

WANDB_DISABLED=True TOKENIZERS_PARALLELISM=False CUDA_VISIBLE_DEVICES=0,1,2,3 python -m core.trainLM dataset ogbn-arxiv

To fine-tune using the GPT responses

WANDB_DISABLED=True TOKENIZERS_PARALLELISM=False CUDA_VISIBLE_DEVICES=0,1,2,3 python -m core.trainLM dataset ogbn-arxiv lm.train.use_gpt True
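
The two fine-tuning commands differ only in the lm.train.use_gpt override, and both set the same three environment variables. The sketch below makes that structure explicit; finetune_invocation is a hypothetical helper, and the comments on what each variable does reflect their usual meaning in the Hugging Face and CUDA tooling rather than anything stated in this repository.

```python
import os


def finetune_invocation(dataset, use_gpt=False, gpus=(0, 1, 2, 3)):
    """Return (env, argv) mirroring the fine-tuning commands above."""
    env = dict(os.environ)
    env["WANDB_DISABLED"] = "True"            # usually: skip Weights & Biases logging
    env["TOKENIZERS_PARALLELISM"] = "False"   # usually: avoid tokenizer fork warnings
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpus)  # GPUs the job may use

    argv = ["python", "-m", "core.trainLM", "dataset", dataset]
    if use_gpt:
        # Same override as the second command above.
        argv += ["lm.train.use_gpt", "True"]
    return env, argv


if __name__ == "__main__":
    env, argv = finetune_invocation("ogbn-arxiv", use_gpt=True)
    print(" ".join(argv))
    # To launch: import subprocess; subprocess.run(argv, env=env, check=True)
```

Adjust the gpus argument to match the devices available on your machine.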

3. Training the GNNs

To use different GNN models

python -m core.trainEnsemble gnn.model.name MLP
python -m core.trainEnsemble gnn.model.name GCN
python -m core.trainEnsemble gnn.model.name SAGE
python -m core.trainEnsemble gnn.model.name RevGAT gnn.train.lr 0.002 gnn.train.dropout 0.75

To use different types of features

# Our enriched features
python -m core.trainEnsemble gnn.train.feature_type TA_P_E

# Our individual features
python -m core.trainGNN gnn.train.feature_type TA
python -m core.trainGNN gnn.train.feature_type E
python -m core.trainGNN gnn.train.feature_type P

# OGB features
python -m core.trainGNN gnn.train.feature_type ogb

(Example) Use only the TA embeddings from the LLM embedding model

python -m core.trainEnsemble gnn.train.feature_type TA dataset arxiv_2023 seed 42 gnn.model.name SAGE
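
The example command combines four key-value overrides, so a sweep over models and seeds is just the cross product of those values. The sketch below generates such commands. sweep_commands is a hypothetical helper and the seed values are placeholders; RevGAT is left out because it is run above with its own learning rate and dropout.

```python
import itertools
import shlex

MODELS = ["MLP", "GCN", "SAGE"]   # model names listed in section 3
SEEDS = [0, 1, 2]                 # hypothetical seed values, for illustration only


def sweep_commands(dataset, feature_type, models=MODELS, seeds=SEEDS):
    """Yield one core.trainEnsemble argv list per (model, seed) pair."""
    for model, seed in itertools.product(models, seeds):
        yield [
            "python", "-m", "core.trainEnsemble",
            "gnn.train.feature_type", feature_type,
            "dataset", dataset,
            "seed", str(seed),
            "gnn.model.name", model,
        ]


if __name__ == "__main__":
    for argv in sweep_commands("arxiv_2023", "TA"):
        # Printed only; pass each list to subprocess.run(..., check=True) to execute it.
        print(shlex.join(argv))
```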

4. Reproducibility

Use run.sh to run the code and reproduce the published results.

This repository also provides the checkpoints for all trained models (*.ckpt) and the TAPE features (*.emb) used in the project. Please download them here.

arxiv-2023 dataset

The code for constructing and processing the arxiv-2023 dataset is provided here.

Running Tests:

PYTHONPATH=. pytest tests/