PaperWork - AI4S

Code Architect

TASK: IS2RE

Configuration

README_is2res.md

This README describes the files for the IS2RE/IS2RS tasks and all of their data splits.

This folder contains files organized as follows:

data/is2re/10k/train/data.lmdb
data/is2re/100k/train/data.lmdb
data/is2re/all/train/data.lmdb
data/is2re/all/val_id/data.lmdb
data/is2re/all/val_ood_ads/data.lmdb
data/is2re/all/val_ood_cat/data.lmdb
data/is2re/all/val_ood_both/data.lmdb
data/is2re/all/test_id/data.lmdb
data/is2re/all/test_ood_ads/data.lmdb
data/is2re/all/test_ood_cat/data.lmdb
data/is2re/all/test_ood_both/data.lmdb

There is an additional .lmdb-lock file alongside each .lmdb file.

data/is2re/N/M/data.lmdb is an LMDB file containing N PyTorch Geometric Data objects from adsorbate+catalyst systems in the corresponding M split. Each LMDB contains the following number of Data objects:

data/is2re/10k/train/data.lmdb:           10,000
data/is2re/100k/train/data.lmdb:         100,000
data/is2re/all/train/data.lmdb:          460,328
data/is2re/all/val_id/data.lmdb:          24,943
data/is2re/all/val_ood_ads/data.lmdb:     24,961
data/is2re/all/val_ood_cat/data.lmdb:     24,963
data/is2re/all/val_ood_both/data.lmdb:    24,987
data/is2re/all/test_id/data.lmdb:         24,948
data/is2re/all/test_ood_ads/data.lmdb:    24,930
data/is2re/all/test_ood_cat/data.lmdb:    24,965
data/is2re/all/test_ood_both/data.lmdb:   24,985
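The counts above can serve as a quick sanity check after downloading. The following is a minimal sketch, not part of the original README: the names EXPECTED_COUNTS and check_count are hypothetical, and the entry count passed in is assumed to come from whatever tool you use to inspect the LMDB.

```python
# Expected number of Data objects per LMDB, copied from the list above.
EXPECTED_COUNTS = {
    "data/is2re/10k/train/data.lmdb": 10_000,
    "data/is2re/100k/train/data.lmdb": 100_000,
    "data/is2re/all/train/data.lmdb": 460_328,
    "data/is2re/all/val_id/data.lmdb": 24_943,
    "data/is2re/all/val_ood_ads/data.lmdb": 24_961,
    "data/is2re/all/val_ood_cat/data.lmdb": 24_963,
    "data/is2re/all/val_ood_both/data.lmdb": 24_987,
    "data/is2re/all/test_id/data.lmdb": 24_948,
    "data/is2re/all/test_ood_ads/data.lmdb": 24_930,
    "data/is2re/all/test_ood_cat/data.lmdb": 24_965,
    "data/is2re/all/test_ood_both/data.lmdb": 24_985,
}


def check_count(path: str, n_entries: int) -> bool:
    """Return True if n_entries matches the documented count for path."""
    return EXPECTED_COUNTS[path] == n_entries


# Totals over the four validation and four test splits.
val_total = sum(n for p, n in EXPECTED_COUNTS.items() if "/val_" in p)
test_total = sum(n for p, n in EXPECTED_COUNTS.items() if "/test_" in p)
```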

Each Data object includes the following information for each corresponding system (assuming K atoms):

* sid - [1] System ID corresponding to each structure

* edge_index - [2 x J] Graph connectivity with index 0 corresponding to neighboring atoms and index 1 corresponding to center atoms. J is the total number of edges as determined by a nearest neighbor search.

* atomic_numbers - [K x 1] Atomic numbers of all atoms in the system

* pos - [K x 3] Initial structure positional information of all atoms in the system (x, y, z cartesian coordinates)

* natoms - [1] Total number of atoms in the system

* cell - [3 x 3] System unit cell (necessary for periodic boundary condition (PBC) calculations)

* cell_offsets - [J x 3] Offset matrix where each row gives the unit cell offset needed to find the corresponding neighbor in edge_index. For example, cell_offsets[0, :] = [0, 1, 0] corresponds to edge_index[:, 0] = [1, 0], representing node 1 as node 0’s neighbor located one unit cell over in the +y direction.

* tags - [K x 1] Atomic tag information: 0 - Fixed, sub-surface atoms; 1 - Free, surface atoms; 2 - Free, adsorbate atoms
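To make the edge_index and cell_offsets convention concrete, here is a small sketch in plain Python that reproduces the README's example on a made-up two-atom system. The helper name neighbor_vector is hypothetical, the toy coordinates are invented, and the code assumes the rows of cell are the lattice vectors, so that the Cartesian shift is the offset row multiplied by the cell matrix.

```python
import math


def neighbor_vector(pos, cell, edge_index, cell_offsets, e):
    """Vector from the center atom to the periodic image of its neighbor for edge e."""
    neighbor = edge_index[0][e]  # row 0: neighboring atom
    center = edge_index[1][e]    # row 1: center atom
    # Cartesian shift = offset (in unit cells) times the cell matrix.
    # Assumption: each row of `cell` is one lattice vector.
    shift = [sum(cell_offsets[e][k] * cell[k][d] for k in range(3)) for d in range(3)]
    return [pos[neighbor][d] + shift[d] - pos[center][d] for d in range(3)]


# Toy orthorhombic cell and two atoms (values are illustrative only).
cell = [[4.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 20.0]]
pos = [[1.0, 4.5, 10.0], [1.0, 0.5, 10.0]]
edge_index = [[1], [0]]     # edge 0: node 1 is a neighbor of node 0
cell_offsets = [[0, 1, 0]]  # the neighbor image lies one cell over in +y

vec = neighbor_vector(pos, cell, edge_index, cell_offsets, 0)
dist = math.sqrt(sum(c * c for c in vec))
```

Without the offset, the two atoms would appear 4.0 apart along y; with it, the neighbor image sits 1.0 away across the cell boundary, which is the distance a nearest neighbor search under PBC would report.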

Train/Val LMDBs additionally contain the following attributes:

* y_init - [1] Initial structure energy of the system

* y_relaxed - [1] Relaxed structure energy of the system

* pos_relaxed - [K x 3] Relaxed structure positional information of all atoms in the system (x, y, z cartesian coordinates)
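One way to see how tags, pos, and pos_relaxed relate is to measure how far atoms of each tag move during relaxation. The sketch below uses made-up values and hypothetical helper names. It assumes the attributes have been converted to plain Python lists and it ignores periodic wrapping, so it is only an illustration of the data layout.

```python
import math

FIXED, SURFACE, ADSORBATE = 0, 1, 2  # tag values as documented above


def displacements(pos, pos_relaxed):
    """Per-atom Euclidean distance between initial and relaxed positions."""
    return [math.dist(p, q) for p, q in zip(pos, pos_relaxed)]


def max_displacement_by_tag(tags, pos, pos_relaxed):
    """Largest displacement observed for each tag value."""
    out = {}
    for tag, d in zip(tags, displacements(pos, pos_relaxed)):
        out[tag] = max(out.get(tag, 0.0), d)
    return out


# Toy three-atom system (illustrative values, not taken from the dataset).
tags = [FIXED, SURFACE, ADSORBATE]
pos = [[0.0, 0.0, 0.0], [0.0, 0.0, 2.0], [0.0, 0.0, 4.0]]
pos_relaxed = [[0.0, 0.0, 0.0], [0.0, 0.0, 2.1], [0.3, 0.4, 4.0]]

moved = max_displacement_by_tag(tags, pos, pos_relaxed)
```

In this toy case the fixed sub-surface atom does not move, while the surface and adsorbate atoms do.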

These LMDB files require no additional processing and are ready to be used directly with the repository’s Datasets and DataLoaders. Move the data/ directory to your project root directory.
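For a quick look at a single entry without going through the repository's Dataset classes, a direct read along the following lines may work. This is a sketch under assumptions that the README does not state: that keys are the ASCII-encoded integer index and that values are pickled Data objects. The helper names lmdb_path and read_sample are hypothetical, and unpickling requires the third-party lmdb package plus a compatible PyTorch Geometric installation.

```python
import pickle
from pathlib import Path


def lmdb_path(root, size, split):
    """Build the path data/is2re/N/M/data.lmdb described above."""
    return Path(root) / "is2re" / size / split / "data.lmdb"


def read_sample(path, idx):
    """Read and unpickle entry idx from an LMDB file (assumed key/value layout)."""
    import lmdb  # third-party; imported lazily so the path helper works without it

    env = lmdb.open(
        str(path),
        subdir=False,   # data.lmdb is a single file, not a directory
        readonly=True,
        lock=False,
        readahead=False,
        meminit=False,
        max_readers=1,
    )
    try:
        with env.begin() as txn:
            raw = txn.get(f"{idx}".encode("ascii"))  # assumption: keys are "0", "1", ...
        if raw is None:
            raise KeyError(idx)
        return pickle.loads(raw)
    finally:
        env.close()


path = lmdb_path("data", "all", "val_id")
```

Calling read_sample(path, 0) should then return the first Data object if the assumed layout holds; for training, the repository's own Datasets and DataLoaders remain the intended route.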

File structure

Before launching: export PYTHONPATH=.

Cloud launch script: scripts/threedimargen/train_oc20.sh

Local launch script: scripts/threedimargen/train_oc20_local.sh

Training entry point: sfm/tasks/threedimargen/train_threedimargendiff.py

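Dataset construction: sfm/data/threedimargen_data/dataset.py, lines 1191 and 1294

Main module: sfm/models/threedimargen/modules/threedimargendiff_modules.py, line 551 (training), lines 739 and 863 (sampling)

Going one level deeper leads to sfm/models/threedimargen/modules/diffusion. There is no need to read that part in detail; knowing how to call it is enough.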

Sampling launch script: scripts/threedimargen/gen_oc20.sh. Before launching: export PYTHONPATH=.

train_oc20_local.sh

torchrun $DISTRIBUTED_ARGS sfm/tasks/threedimargen/train_threedimargendiff.py \
    --dict_path sfm/data/threedimargen_data/dict_oc20_v2.txt \
    --save_dir ./outputs/${NAME} \
    --train_data_path /datadisk/oc20/is2re/train/ \
    --total_num_epochs 50 \
    --warmup_num_steps 1000 \
    --max_lr 1e-4 \
    --weight_decay 0.1 \
    --max_sites 500 \
    --train_batch_size 32 \
    --gradient_accumulation_steps 1 \
    --log_interval 1 \
    --strategy Zero1 \
    --no_niggli_reduced \
    --diff_mul 4 \
    --diff_depth 3 \
    --scale_coords 100 \
    --model_type threedimargen_100m \
    --diff_type diffloss \
    --target oc20 \
    --attn_implementation sdpa \