PaperWork - AI4S

Code Architect

TASK: IS2RE

Configuration

README_is2res.md

This readme describes the file details for the IS2RE/IS2RS tasks and all its subsequent data splits.

This folder contains files organized as follows:

data/is2re/10k/train/data.lmdb
data/is2re/100k/train/data.lmdb
data/is2re/all/train/data.lmdb
data/is2re/all/val_id/data.lmdb
data/is2re/all/val_ood_ads/data.lmdb
data/is2re/all/val_ood_cat/data.lmdb
data/is2re/all/val_ood_both/data.lmdb
data/is2re/all/test_id/data.lmdb
data/is2re/all/test_ood_ads/data.lmdb
data/is2re/all/test_ood_cat/data.lmdb
data/is2re/all/test_ood_both/data.lmdb

There is an additional .lmdb-lock file present alongside each .lmdb file.

data/is2re/N/M/data.lmdb is an LMDB file containing N PyTorch Geometric Data objects from adsorbate+catalyst systems in the corresponding M split. Each LMDB contains the following number of Data objects:

data/is2re/10k/train/data.lmdb: 10,000
data/is2re/100k/train/data.lmdb: 100,000
data/is2re/all/train/data.lmdb: 460,328
data/is2re/all/val_id/data.lmdb: 24,943
data/is2re/all/val_ood_ads/data.lmdb: 24,961
data/is2re/all/val_ood_cat/data.lmdb: 24,963
data/is2re/all/val_ood_both/data.lmdb: 24,987
data/is2re/all/test_id/data.lmdb: 24,948
data/is2re/all/test_ood_ads/data.lmdb: 24,930
data/is2re/all/test_ood_cat/data.lmdb: 24,965
data/is2re/all/test_ood_both/data.lmdb: 24,985

Each Data object includes the following information for each corresponding system (assuming K atoms):

* sid - 1 System ID corresponding to each structure

* edge_index - [2 x J] Graph connectivity with index 0 corresponding to neighboring atoms and index 1 corresponding to center atoms. J corresponds to the total edges as determined by a nearest neighbor search.

* atomic_numbers - [K x 1] Atomic numbers of all atoms in the system

* pos - [K x 3] Initial structure positional information of all atoms in the system (x, y, z cartesian coordinates)

* natoms - 1 Total number of atoms in the system

* cell - [3 x 3] System unit cell (necessary for periodic boundary condition (PBC) calculations)

* cell_offsets - [J x 3] offset matrix where each index corresponds to the unit cell offset necessary to find the corresponding neighbor in edge_index. For example, cell_offsets[0, :] = [0,1,0] corresponds to edge_index[:, 0]= [1,0] representing node 1 as node 0’s neighbor located one unit cell over in the +y direction.

* tags - [K x 1] Atomic tag information: 0 - Fixed, sub-surface atoms; 1 - Free, surface atoms; 2 - Free, adsorbate atoms
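The relationship between edge_index, cell and cell_offsets described above can be sketched with NumPy. This is a minimal illustration, not the repository's own code: the helper name `pbc_edge_vectors` and the toy values are hypothetical, and it assumes the rows of cell are the lattice vectors, so that `cell_offsets @ cell` gives the Cartesian shift of each periodic image.

```python
import numpy as np

def pbc_edge_vectors(pos, cell, edge_index, cell_offsets):
    """Vector from each center atom to its (possibly periodic-image) neighbor.

    Hypothetical helper. Assumes the rows of `cell` are the lattice vectors.
    """
    neighbor, center = edge_index      # row 0: neighbors, row 1: centers
    shift = cell_offsets @ cell        # [J x 3] Cartesian shift per edge
    return pos[neighbor] + shift - pos[center]

# Toy system (made-up values): two atoms in a 4 x 4 x 4 cubic cell.
cell = 4.0 * np.eye(3)
pos = np.array([[0.0, 3.5, 0.0],      # node 0
                [0.0, 0.5, 0.0]])     # node 1
edge_index = np.array([[1], [0]])     # node 1 is a neighbor of node 0
cell_offsets = np.array([[0, 1, 0]])  # the neighbor sits one cell over in +y

vectors = pbc_edge_vectors(pos, cell, edge_index, cell_offsets)
distances = np.linalg.norm(vectors, axis=1)
print(vectors, distances)
```

Without the offset, the two atoms would appear 3 units apart along y; with it, the periodic image of node 1 is 1 unit from node 0, which is why the nearest neighbor search records the offset.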

Train/Val LMDBs additionally contain the following attributes:

* y_init - 1 Initial structure energy of the system

* y_relaxed - 1 Relaxed structure energy of the system

* pos_relaxed - [K x 3] Relaxed structure positional information of all atoms in the system (x, y, z cartesian coordinates)

This LMDB file requires no additional processing and is ready to be used directly with the repository’s Datasets and DataLoaders. Move the data/ directory to your project root directory.

File structure

Before launching, run export PYTHONPATH=.

Cloud launch script: scripts/threedimargen/train_oc20.sh

Local launch script: scripts/threedimargen/train_oc20_local.sh

Training entry point: sfm/tasks/threedimargen/train_threedimargendiff.py

Dataset construction: sfm/data/threedimargen_data/dataset.py, lines 1191 and 1294

Main module: sfm/models/threedimargen/modules/threedimargendiff_modules.py, line 551 (training), lines 739 and 863 (sampling)

One level deeper is sfm/models/threedimargen/modules/diffusion. There is no need to read that part closely; knowing how to call it is enough.

Sampling launch script: scripts/threedimargen/gen_oc20.sh. Before launching, run export PYTHONPATH=.

train_oc20_local.sh

torchrun $DISTRIBUTED_ARGS sfm/tasks/threedimargen/train_threedimargendiff.py \
--dict_path sfm/data/threedimargen_data/dict_oc20_v2.txt \
--save_dir ./outputs/${NAME} \
--train_data_path /datadisk/oc20/is2re/train/ \
--total_num_epochs 50 \
--warmup_num_steps 1000 \
--max_lr 1e-4 \
--weight_decay 0.1 \
--max_sites 500 \
--train_batch_size 32 \
--gradient_accumulation_steps 1 \
--log_interval 1 \
--strategy Zero1 \
--no_niggli_reduced \
--diff_mul 4 \
--diff_depth 3 \
--scale_coords 100 \
--model_type threedimargen_100m \
--diff_type diffloss \
--target oc20 \
--attn_implementation sdpa

train_threedimargendiff.py


My questions

1

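The cell is 3x3 information, but during data loading and processing only cell[0] is read, i.e. a 1x3 slice.

Could this mean the available information is not fully used?

The purpose of this step is to transform the spatial coordinates of the atoms and forces into the lattice coordinate system.

Below are the steps I found for converting Cartesian coordinates to lattice coordinates:

[Images: image-20250202223026928, image-20250202222957578]

In other words, a complete coordinate transformation should be \[ \overrightarrow{pos}_{new} = \overrightarrow{pos} \cdot A^{-1} \] where pos should have shape [K x 3], i.e. (x, y, z).

A should have shape [3 x 3], i.e. the cell.

The resulting new pos then has shape [K x 3].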
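The conversion above can be sketched with NumPy as follows. This is only an illustration under stated assumptions, not the repository's preprocessing: the function names `cart_to_frac` and `frac_to_cart` and the toy cell are hypothetical, and the rows of the cell are assumed to be the lattice vectors. The `reshape(3, 3)` also accepts a cell stored with a leading batch dimension ([1 x 3 x 3]); if the data is stored that way, cell[0] would already be the full 3x3 matrix, which is worth checking by printing cell.shape.

```python
import numpy as np

def cart_to_frac(pos, cell):
    """Cartesian -> lattice (fractional) coordinates: pos_new = pos @ A^{-1}.

    Hypothetical helper. Assumes the rows of `cell` are the lattice vectors.
    """
    cell = np.asarray(cell, dtype=float).reshape(3, 3)  # accepts [3x3] or [1x3x3]
    return np.asarray(pos, dtype=float) @ np.linalg.inv(cell)

def frac_to_cart(frac, cell):
    """Inverse transform: pos = frac @ A."""
    cell = np.asarray(cell, dtype=float).reshape(3, 3)
    return np.asarray(frac, dtype=float) @ cell

# Toy non-orthogonal cell and two atoms (made-up values).
cell = np.array([[4.0, 0.0, 0.0],
                 [2.0, 3.0, 0.0],
                 [0.0, 0.0, 10.0]])
pos = np.array([[3.0, 1.5, 5.0],
                [4.0, 0.0, 0.0]])

frac = cart_to_frac(pos, cell)    # shape [K x 3], same as pos
back = frac_to_cart(frac, cell)   # round trip recovers pos
print(frac.shape, frac)
```

The output keeps the [K x 3] shape of the input, so a pos of any other shape after preprocessing would point to a step other than this transform.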

However, when running test.ipynb I found that pos in the Data object has shape [1 x 2]. What is going on here?

So I suspect there is a problem in how pos and force are processed.

2

I don't understand this.

3

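I don't quite understand this either; it seems a bit odd.

This is related to question 1.

First, pos should have shape [K x 3], so len(pos) should return K. Does that mean that, during tokenization, each atom nominally has three values (x, y, z) in the lattice coordinate system, and a single mask_idx entry corresponds to all three?

Mainly, I don't understand mask_idx.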

4

Expected: lattice [3 x 3] and pos [K x 3], with the two concatenated.

But the earlier processing of lattice and pos does not seem to produce these shapes.
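For reference, the shapes expected from such a concatenation can be checked with a small NumPy sketch. The values are made up and the variable names (`sequence`, `is_lattice_row`) are hypothetical; this only shows what a [3 x 3] lattice stacked on a [K x 3] pos would look like, not what the dataset code actually does.

```python
import numpy as np

K = 5                                           # number of atoms (toy value)
lattice = 4.0 * np.eye(3)                       # [3 x 3]
pos = np.random.default_rng(0).random((K, 3))   # [K x 3]

# Stack lattice rows first, then one row per atom: result is [(3 + K) x 3].
sequence = np.concatenate([lattice, pos], axis=0)

# Hypothetical bookkeeping: which rows are lattice vectors and which are atoms.
is_lattice_row = np.arange(sequence.shape[0]) < 3

print(sequence.shape, int(is_lattice_row.sum()))
```

Printing the shapes of lattice and pos right before the concatenation in the dataset code, and comparing them with these, is a quick way to locate where the shapes diverge.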