PaperWork-AI4S-TASK:is2res
PaperWork - AI4S
Code Architect
TASK: IS2RE
Configuration
README_is2res.md
This readme describes the file details for the IS2RE/IS2RS tasks and all its subsequent data splits.
This folder contains files organized as follows:
data/is2re/10k/train/data.lmdb
An additional .lmdb-lock file is present alongside
each .lmdb file.
data/is2re/N/M/data.lmdb is an LMDB file containing N PyTorch
Geometric Data objects from adsorbate+catalyst systems in the
corresponding M split. Each LMDB shall contain the following number of
Data objects:
data/is2re/10k/train/data.lmdb: 10,000
Each Data object includes the following information for each corresponding system (assuming K atoms):
* sid - [1] System ID corresponding
to each structure
* edge_index - [2 x J] Graph connectivity with index 0
corresponding to neighboring atoms and index 1 corresponding to center
atoms. J corresponds to the total edges as determined by a nearest
neighbor search.
* atomic_numbers - [K x 1] Atomic numbers of all atoms
in the system
* pos - [K x 3] Initial structure positional information
of all atoms in the system (x, y, z cartesian coordinates)
* natoms - [1] Total number of atoms
in the system
* cell - [3 x 3] System unit cell
(necessary for periodic boundary condition (PBC)
calculations)
* cell_offsets - [J x 3] offset matrix
where each index corresponds to the unit cell offset necessary to find
the corresponding neighbor in edge_index. For
example, cell_offsets[0, :] = [0,1,0] corresponds to
edge_index[:, 0]= [1,0] representing node 1 as node 0’s
neighbor located one unit cell over in the +y direction.
* tags - [K x 1] Atomic tag
information: 0 - Fixed, sub-surface atoms; 1 - Free,
surface atoms; 2 - Free, adsorbate atoms
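The cell_offsets example above can be made concrete with a small sketch. It assumes NumPy arrays rather than PyTorch tensors, a made-up cubic 10 Å cell, and two made-up atom positions; the helper name `pbc_edge_vectors` is hypothetical and not part of the dataset tooling. It follows the convention stated above: row 0 of edge_index holds the neighbor, row 1 the center atom.

```python
import numpy as np

# Toy system (assumed values): cubic cell, two atoms near opposite y faces.
cell = np.diag([10.0, 10.0, 10.0])          # [3 x 3], lattice vectors as rows
pos = np.array([[1.0, 9.0, 1.0],            # node 0 (center)
                [1.0, 1.0, 1.0]])           # node 1 (neighbor)
edge_index = np.array([[1],                 # row 0: neighbor indices
                       [0]])                # row 1: center indices
cell_offsets = np.array([[0, 1, 0]])        # neighbor image is one cell over in +y


def pbc_edge_vectors(pos, cell, edge_index, cell_offsets):
    """Vector from each center atom to its (possibly periodic-image) neighbor."""
    neighbor, center = edge_index
    shift = cell_offsets @ cell             # [J x 3] Cartesian shift of the image
    return pos[neighbor] + shift - pos[center]


vectors = pbc_edge_vectors(pos, cell, edge_index, cell_offsets)
distances = np.linalg.norm(vectors, axis=1)
print(vectors)    # [[0. 2. 0.]]
print(distances)  # [2.]
```

Without the offset the two atoms would appear 8 Å apart; with it, the neighbor's periodic image is 2 Å away, which is why cell and cell_offsets have to be used together.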
Train/Val LMDBs additionally contain the following attributes:
* y_init - [1] Initial structure
energy of the system
* y_relaxed - [1] Relaxed structure
energy of the system
* pos_relaxed - [K x 3] Relaxed structure positional
information of all atoms in the system (x, y, z cartesian
coordinates)
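The shapes listed above can be checked mechanically before a sample is passed into a model. The sketch below is only an illustration: it assumes each sample has been converted to a plain dict of NumPy arrays keyed by the attribute names above, and `check_sample_shapes` is a hypothetical helper, not something the repository provides.

```python
import numpy as np


def check_sample_shapes(sample):
    """Return a list of shape problems for one system; empty means consistent."""
    k = int(sample["natoms"])                # K atoms
    j = sample["edge_index"].shape[1]        # J edges
    problems = []

    def expect(name, ok):
        if not ok:
            problems.append(f"{name}: unexpected shape {sample[name].shape}")

    expect("edge_index", sample["edge_index"].shape == (2, j))
    expect("pos", sample["pos"].shape == (k, 3))
    expect("cell", sample["cell"].shape[-2:] == (3, 3))
    expect("cell_offsets", sample["cell_offsets"].shape == (j, 3))
    for name in ("atomic_numbers", "tags"):
        expect(name, sample[name].shape[0] == k)
    if "pos_relaxed" in sample:              # only present in Train/Val
        expect("pos_relaxed", sample["pos_relaxed"].shape == (k, 3))
    return problems


# Toy sample with made-up values, K = 2 atoms and J = 1 edge.
toy = {
    "natoms": 2,
    "edge_index": np.array([[1], [0]]),
    "atomic_numbers": np.array([78, 8]),
    "pos": np.zeros((2, 3)),
    "cell": np.eye(3) * 10.0,
    "cell_offsets": np.array([[0, 1, 0]]),
    "tags": np.array([1, 2]),
}
print(check_sample_shapes(toy))                              # []
print(check_sample_shapes(dict(toy, pos=np.zeros((1, 2)))))  # ['pos: unexpected shape (1, 2)']
```

A check like this makes it easy to see whether an unexpected shape comes from the stored data or from a later processing step.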
This LMDB file requires no additional processing and is ready to be
used directly with the repository’s Datasets and DataLoaders. Move
the data/ directory to your project root directory.
File structure
Before launching, run export PYTHONPATH=.
Cloud launch script: scripts/threedimargen/train_oc20.sh
Local launch script: scripts/threedimargen/train_oc20_local.sh
Training entry point: sfm/tasks/threedimargen/train_threedimargendiff.py
Dataset construction: sfm/data/threedimargen_data/dataset.py, lines 1191 and 1294
Main module: sfm/models/threedimargen/modules/threedimargendiff_modules.py
Line 551 (training), lines 739 and 863 (sampling)
Going one level deeper leads to sfm/models/threedimargen/modules/diffusion. There is no need to read that part closely; knowing how to call it is enough.
Sampling launch script: scripts/threedimargen/gen_oc20.sh. Before launching, run
export PYTHONPATH=.
train_oc20_local.sh
torchrun $DISTRIBUTED_ARGS sfm/tasks/threedimargen/train_threedimargendiff.py \
train_threedimargendiff.py
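My questions
1
cell carries 3x3 information, but during data loading and processing only cell[0] is read, i.e. 1x3 of information.
Does this mean the available information is not fully used?
The purpose of this step is to convert the spatial coordinates of the atoms and forces into the lattice coordinate system.
Below are the steps I found for converting Cartesian coordinates into lattice coordinates:
In other words, a complete coordinate conversion should actually be
\[ \overrightarrow{pos}_{new} = \overrightarrow{pos} \cdot A^{-1} \]
where pos has shape [K x 3], i.e. (x, y, z),
A has shape [3 x 3], i.e. cell,
and the resulting new pos has shape [K x 3].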
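As a sanity check on the formula above, here is a minimal NumPy sketch of the full conversion. It assumes the lattice vectors are stored as the rows of cell, so that Cartesian = fractional · A; the cell and positions are made-up values, and `cart_to_frac` / `frac_to_cart` are hypothetical helper names, not functions from the repository. The `reshape(3, 3)` is there in case cell is stored with a leading dimension of 1 ([1 x 3 x 3]); if that were the case, cell[0] would already be the full 3x3 matrix, so printing cell.shape is worth doing before concluding that information is lost.

```python
import numpy as np


def cart_to_frac(pos, cell):
    """pos: [K x 3] Cartesian rows; cell: [3 x 3] with lattice vectors as rows."""
    pos = np.asarray(pos, dtype=float)
    cell = np.asarray(cell, dtype=float).reshape(3, 3)
    # pos_new = pos @ inv(cell); solving cell.T @ x = pos.T avoids the explicit inverse.
    return np.linalg.solve(cell.T, pos.T).T


def frac_to_cart(frac, cell):
    """Inverse mapping: fractional rows back to Cartesian rows."""
    cell = np.asarray(cell, dtype=float).reshape(3, 3)
    return np.asarray(frac, dtype=float) @ cell


# Made-up non-orthogonal cell and two atoms.
cell = np.array([[4.0, 0.0, 0.0],
                 [2.0, 3.0, 0.0],
                 [0.0, 0.0, 10.0]])
pos = np.array([[3.0, 1.5, 5.0],
                [6.0, 3.0, 2.5]])

frac = cart_to_frac(pos, cell)
print(frac)  # [[0.5 0.5 0.5], [1. 1. 0.25]] up to floating-point rounding
```

Because the second lattice vector has an x component, the conversion needs all three rows of the cell; a single 1x3 row could not reproduce these values.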
However, when running test.ipynb I found that pos in the Data object has shape [1x2]. Why is that?
I therefore suspect there is a problem in how pos and force are processed.
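2
I don't understand this.
3
I don't quite understand this either; it looks a bit odd.
It is related to question 1.
First, pos should have shape [K x 3], so len(pos) should return K. Can this be read as follows: during tokenization, each atom nominally has three coordinates (x, y, z) in the lattice coordinate system, and a single mask_idx then corresponds to those three values?
The main thing I don't understand is mask_idx.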
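To make the question concrete, the sketch below shows the two readings side by side. It does not reproduce the repository's actual masking logic: the `mask_idx` here is a hypothetical array of per-atom indices, and the values are made up. It only illustrates that indexing a [K x 3] array with one index per atom selects all three coordinates of that atom, and what the equivalent indices would be if every coordinate were its own token.

```python
import numpy as np

pos = np.arange(12, dtype=float).reshape(4, 3)   # K = 4 atoms, made-up coordinates
print(len(pos))                                  # 4, i.e. K

# Reading 1: one index per atom, each covering (x, y, z) together.
mask_idx = np.array([1, 3])                      # hypothetical per-atom mask positions
masked_rows = pos[mask_idx]                      # shape (2, 3)

# Reading 2: one token per coordinate, so each atom expands to three flat indices.
flat = pos.reshape(-1)                           # shape (12,)
coord_idx = (mask_idx[:, None] * 3 + np.arange(3)).reshape(-1)
print(coord_idx)                                 # [ 3  4  5  9 10 11]
```

Under both readings the same six values are selected; the difference is only whether the sequence length is K or 3K, which is what should be checked against the code at the referenced lines.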
4
Expected: lattice [3x3] and pos [Kx3], concatenated.
However, when lattice and pos were processed earlier, their shapes did not seem to look like this.
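The expected concatenation can be written down as a small sketch with an explicit shape guard, so that a mismatch such as a [1x2] pos fails loudly instead of propagating. This is an assumption about what the step is meant to do, not the repository's code; `pack` and `unpack` are hypothetical names and the values are made up.

```python
import numpy as np


def pack(lattice, pos):
    """Stack lattice [3 x 3] on top of pos [K x 3] into one [(3 + K) x 3] array."""
    lattice = np.asarray(lattice, dtype=float)
    pos = np.asarray(pos, dtype=float)
    if lattice.shape != (3, 3) or pos.ndim != 2 or pos.shape[1] != 3:
        raise ValueError(
            f"expected [3 x 3] and [K x 3], got {lattice.shape} and {pos.shape}"
        )
    return np.concatenate([lattice, pos], axis=0)


def unpack(coords):
    """Split the stacked array back into (lattice, pos)."""
    return coords[:3], coords[3:]


lattice = np.eye(3) * 10.0
pos = np.full((5, 3), 0.25)                 # K = 5 atoms, made-up coordinates
coords = pack(lattice, pos)
lattice_back, pos_back = unpack(coords)
print(coords.shape)                         # (8, 3)
```

Calling `pack(lattice, np.zeros((1, 2)))` raises a ValueError that names both shapes, which helps locate where the earlier processing diverges from the expected layout.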







