diff --git a/README.md b/README.md
index ec2f4c6..fb6bfdf 100644
--- a/README.md
+++ b/README.md
@@ -6,42 +6,42 @@
 - [Table of contents](#table-of-contents)
 - [About BUCToolkit](#about-batch-upscaled-catalysis-toolkit)
 - [Installation](#installation)
-  - [requirements](#requirements)
-  - [pip installation](#pip-installation)
-  - [Installation from the source](#installation-from-the-source)
+  - [Requirements](#requirements)
+  - [pip Installation](#pip-installation)
+  - [Installation from the Source Codes](#installation-from-the-source-codes)
 - [Usage](#usage)
-  - [Project structures](#project-structures)
-  - [Using as a Python package](#using-as-a-python-package)
-  - [Using as an executable program](#using-as-an-executable-program)
-  - [Input file template](#input-file-template)
+  - [Project Structure](#project-structure)
+  - [Using as a Python Package](#using-as-a-python-package)
+  - [Using as an Executable Program](#using-as-an-executable-program)
+  - [Input File Template](#input-file-template)
   - [Post-processing](#post-processing)
 - [Features](#features)
-  - [Flexible function interfaces](#flexible-function-interfaces)
-  - [Batched parallel scheme](#batched-parallel-scheme)
+  - [Flexible Function Interfaces](#flexible-function-interfaces)
+  - [Highly Customizable Algorithms](#highly-customizable-algorithms)
+  - [Batch Parallelism Scheme](#batch-parallelism-scheme)
 - [Contact Us](#contact-us)
 - [License](#license)

 ## About Batch-Upscaled Catalysis Toolkit
 BUCToolkit is a PyTorch-based high-performance AI4Science software package of computational chemistry,
-which can perform ***structural optimizations*** (both minimization and transition state search),
+which is capable of performing ***structural optimizations*** (both minimization and transition state search),
 ***molecular dynamics*** with/without constraints, and ***Monte Carlo simulations*** by
-using any python function with an interface of `func(X, *args, **kwargs)` that returns energy and
-`grad_func(X, *args, **kwargs)` that returns energy gradient (i.e., the negative forces).
+using any python function with an interface of `func(X, *args, **kwargs)` and `grad_func(X, *args, **kwargs)`
+that return energy and energy gradient respectively (i.e., the negative forces).
 The most typical input functions are PyTorch-based **deep-learning models** (of molecular or crystal potentials).
-For them, BUCToolkit also provided training and prediction APIs.
+For them, BUCToolkit provides training and prediction APIs as well.

-All above functions support **multi-structure batch parallelism** for both **regular batches**
-(structures with the same atom numbers) and **irregular batches** (structures with different atom numbers).
+All the functions above support **multi-structure batch parallelism** for both **regular batches**
+(structures sharing the same atom numbers) and **irregular batches** (structures with different atom numbers).
 These core functions are highly optimized by operator fusing, cudaGraphs replaying,
 asynchronized dumping/logging by cuda-stream pipelines, and in-place memory calculations.
-(see section [Features](#features) for details),
+(See section [Features](#features) for details).

-Various tools for handling catalyst structure files and data format to preprocess and postprocess
-are also included.
+Various tools capable of handling catalyst structure files and data formats for preprocessing and postprocessing are also included.

-Manuals would be completed soon. You can find the current manuals in [Manual](Manual/).
+Manuals will be completed soon. The current manuals can be found in [Manual](Manual/).

-The project is still a beta version and may change in the future.
+Please note that the project is still a beta version and may change in the future.

 ## Installation
 ### Requirements
@@ -54,7 +54,7 @@ These following third-party libraries are optional:
 - **DGL** (Apache-2.3 License). Only parts of DGL models are currently supported.
 - **torch-geometric** (MIT License). The basic `Data` and `Batch` object have been built-in. For its other advanced functions, the whole torch-geometric can be installed.
-- **ASE** (LGPL-v2.1 License) [ASE](https://gitlab.com/ase/ase/-/tree/master?ref_type=heads). Some functions involving `ase.Atoms` object, format transformation for instance.
+- **[ASE (LGPL-v2.1 License)](https://gitlab.com/ase/ase/-/tree/master?ref_type=heads)**. Some functions involving `ase.Atoms` object, format transformation for instance.
 - **prompt-toolkit** (BSD-3-Clause License). For a better experience of CLI. Otherwise, the Python built-in `input(...)` will be used.
@@ -274,7 +274,7 @@ runner.run(
 BUCTookit can also be directly applied as a normal executable program.
 By setting some additional args in the input file (see [Input File Template](#input-file-template))
 to specify the data path, data type, model file, and task type,
-users can directly launch the tasks in the shell like:
+users can directly launch tasks in a shell like:
 ```shell
 buctoolkit -i './input_file.inp'
 ```
@@ -327,7 +327,7 @@ in the sub-CLI of the `edit` option.
 The input file should be in YAML format.
 Here is a completed input file template that contains all supported tasks.
-The variables start with "###" are the additions only required by
+The variables that start with "###" are the additional args only required by
 using BUCToolkit as an executable program, and those that start with "#" are normal comments.
 ```yaml
@@ -491,10 +491,11 @@ MODEL_CONFIG: # model hyperparameters used for `MODEL_NAME.__init__(**MODEL_CO
 ```

 ### Post-processing
-There are two outputs of BUCToolkit tasks, text log file and binary database file.
+There are two outputs of BUCToolkit tasks: a text log file and a binary database file.

 #### Log Files
-For API or executables, the output of log file is set by `REDIRECT: true` with `OUTPUT_PATH` and `OUTPUT_POSTFIX`, and the contents are controlled by `VERBOSE` in the input file. If `REDIRECT` is `false`, outputs will be printed to `sys.stdout`.
+For API or executables, the output of a log file is set by `REDIRECT: true` with `OUTPUT_PATH` and `OUTPUT_POSTFIX`,
+and the contents are controlled by `VERBOSE` in the input file. If `REDIRECT` is `false`, outputs will be printed to `sys.stdout`.
 Low-level functions are controlled by the logger system. For details, see `BUCToolkit/utils/setup_loggers.py`.
@@ -508,8 +509,8 @@ and reading. Its specific format is shown in the class `ArrayDumper` of `BUCTool
 To control the binary file output, args of `SAVE_PREDICTIONS: true` with a `PREDICTIONS_SAVE_FILE` should be set in the input file.
 For low-level functions, `output_file` is the related argument.
-For the binary output files from structure optimization, molecular dynamics, and Monte Carlo simulations,
-one can load & convert them in the shell as follows:
+For the binary output files from structural optimization, molecular dynamics, and Monte Carlo simulations,
+one can load & convert them in shell as follows:
 ```shell
 buctoolkit -c `$input_type` `$input_path` `$output_type` `$output_path`
 # `$input_path` can be one of "bs", "md", "mc", "opt", "outcar", "poscar", "cif", and "ase_traj"
@@ -518,7 +519,7 @@ buctoolkit -c `$input_type` `$input_path` `$output_type` `$output_path`
 This command will convert all files in `$input_path` with assumed format of `$input_type` into `$output_path`
 in the format of `$output_type`.
-For a finer control, the following python script can be used:
+For a finer control, the following Python script can be used:
 ```python
 import BUCToolkit as bt
 from BUCToolkit.io import read_opt_structures, read_md_traj, read_mc_traj
@@ -552,12 +553,12 @@ Wherein, the args of `indices` specify the selected parts to read and write inst
 ## Features
-BUCToolkit employed highly optimized PyTorch code, including fused operators, cudaGraphs replaying,
-asynchronized dumping/logging by cuda-stream pipelines, and in-place memory calculations.
+BUCToolkit employs highly optimized PyTorch code including fused operators, cudaGraphs replaying,
+asynchronized dumping/logging by cuda-stream pipelines, and in-place memory calculations.

-### Flexible function interfaces
+### Flexible Function Interfaces
 Major low-level functions use very flexible interfaces as follows
-(also see [Using Low-level Functions](#using-low-level-functions)):
+(see also [Using Low-level Functions](#using-low-level-functions)):
 ```
 function(
     func=func,
@@ -572,44 +573,44 @@ function(
     ...
 )
 ```
-where the `X` is the target variable to update (e.g., the atom positions for molecular dynamics
-and structure optimizations), `func_args` and `func_kwargs` are other necessary arguments and
-keyword arguments for the `func`. Hence, any `func`, as long as it can be wrapped as
+where the `X` is the target variable to update (e.g., the atom positions in molecular dynamics
+and structural optimizations), `func_args` and `func_kwargs` are other necessary arguments and
+keyword arguments for the `func`. Hence, any `func`, as long as able to be wrapped as
 `func(X, *args, **kwargs)`, is valid. For example, one may write a function that submits ab initio computations
 (e.g., VASP, Gaussian) and convert the results (energy and forces) into torch.Tensor format,
-and BUCToolkit functions can execute with these inputs normally.
+and BUCToolkit functions will be executed with these inputs normally.
 The `grad_func` has a similar design. The argument `is_grad_func_contain_y` controls two ways to calculate the gradient of `func`.
-`is_grad_func_contain_y = True` is to use auto-gradient format, that actually uses
+`is_grad_func_contain_y = True` is to use auto-gradient format, which actually uses
 `grad_func(X, y, *grad_func_args, **grad_func_kwargs)` internally
-(Note: user would not manually put `y` into the `grad_func_args`), otherwise, interfaces of
-`grad_func(X, *grad_func_args, **grad_func_kwargs)` are used. At last, `require_grad` controls the
+(Note: users would not manually put `y` into `grad_func_args`). Otherwise, interfaces of
+`grad_func(X, *grad_func_args, **grad_func_kwargs)` will be used. At last, `require_grad` controls the
 gradient context of PyTorch. When `require_grad = False`, computation of `func` and `grad_func` is
 under the context of `torch.no_grad` to reduce memory cost. Otherwise, gradient will be turned on explicitly by
 `torch.enable_grad`.

-### Highly customizable algorithms
+### Highly Customizable Algorithms
 All methods/algorithms are object-oriented modularized. They have `_Base*` abstract base classes
-that implement highly optimized main loop routines, and are specialized by modifying few methods like
+that implement highly optimized main loop routines, and are specialized by modifying several methods like
 `self.initialize*(...)` and `self._update*(...)` in subclasses. Hence, one can develop and implement any
 custom new algorithm by simply overriding these update methods without modifying the main loop process.

-### Batch parallelism scheme
-Most functions, including structure optimization, transition state search, molecular dynamics, and
-Monte Carlo simulation, support the parallel for **both regular batched samples
+### Batch Parallelism Scheme
+Most functions, such as structural optimization, transition state search, molecular dynamics and
+Monte Carlo simulation, support the parallel computing of **both regular batched samples
 (stacked samples with the same atom numbers) and irregular batched samples (concatenated samples with different atom numbers)**.
 Input Tensors (of atom coordinates, forces, fixation masks, etc.) should be 3-dimensional. For regular batches,
-their shapes are **(batch_size, n_atom, n_dim)**, where `n_dim` is usually be 3. For irregular batches, their
+their shapes are **(batch_size, n_atom, n_dim)**, where `n_dim` is usually 3. For irregular batches, their
 shapes are **(1, $\sum_{i}$n_atom$_{i}$, n_dim)**, where $i$ is the sample index, and users should provide another
 variable `batch_indices` that records atom numbers of each sample. For example,
-`batch_indices = [64, 56, 72, 83, 102]` means samples have 64, 56, 72, 83, 102 atoms, respectively, and
+`batch_indices = [64, 56, 72, 83, 102]` means that the samples have 64, 56, 72, 83, 102 atoms, respectively, and
 corresponding shapes of atom coordinates should be `(1, 377, 3)`.

-For structure optimization and transition state search, BUCToolkit applies a **dynamic samples approach**, that
-is dynamically removing the converged samples in one batch before starting next iteration steps
-by maintaining a convergence mask and `indexed_select`/`indexed_copy_` functions. It could significantly reduce
+For structural optimization and transition state search, BUCToolkit applies a **dynamic samples approach**
+which dynamically removes the converged samples in one batch before starting the next iteration step
+by maintaining a convergence mask and applying `indexed_select`/`indexed_copy_` functions. It could significantly reduce
 the waste of repeatedly calculating the converged data.

 ## Contact Us