Pipeline¶
The analysis is organized as a sequential pipeline of Jupyter notebooks. Each step builds on the outputs of the previous one, progressively narrowing the candidate space from all possible ABX₃ compositions down to a ranked shortlist of experimentally viable chalcogenide perovskites.

Notebooks¶
Run Steps 1 to 5 in order. Each notebook is self-contained in the sense that it loads the intermediate data it needs from the outputs of previous steps. Step 0 is listed first but depends on the results of the other steps.
Step 0 — Publication Figures¶
Generates all publication-quality figures. Run it after all other steps are complete, or independently using pre-computed results.
Step 1 — Tolerance Factor & Feature Engineering¶
- Load and normalize the chalcogenide perovskite dataset
- Generate SISSO-derived primary features from ionic radii, electronegativities, and oxidation states
- Train a decision tree classifier on tolerance factor features
- Evaluate the SISSO tolerance factor (τ*) against the Goldschmidt and Bartel tolerance factors
- Apply Platt scaling for probabilistic predictions
- Screen all possible ABX₃ compositions for structural stability
SISSO features are pre-cached
The derived features are stored in data/interim/features_sisso.csv so you can skip the SISSO derivation step if you don't have sissopp installed.
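The two baseline tolerance factors named in this step have well-known closed forms, so they can be sketched for a single composition. The example below is a minimal illustration, not the notebook's code: the function names are hypothetical, the SISSO-derived τ* is not reproduced, and the Shannon radii for BaZrS₃ are illustrative inputs rather than values read from the project's dataset.

```python
import math


def goldschmidt_t(r_a: float, r_b: float, r_x: float) -> float:
    """Goldschmidt tolerance factor: t = (r_A + r_X) / (sqrt(2) * (r_B + r_X))."""
    return (r_a + r_x) / (math.sqrt(2) * (r_b + r_x))


def bartel_tau(r_a: float, r_b: float, r_x: float, n_a: int) -> float:
    """Bartel tolerance factor:
    tau = r_X / r_B - n_A * (n_A - (r_A / r_B) / ln(r_A / r_B)),
    where n_A is the oxidation state of the A cation.
    The logarithm requires r_A > r_B.
    """
    ratio = r_a / r_b
    if ratio <= 1.0:
        raise ValueError("bartel_tau expects r_A > r_B")
    return r_x / r_b - n_a * (n_a - ratio / math.log(ratio))


# Illustrative Shannon radii in angstroms (assumed inputs):
# Ba2+ (12-coordinate), Zr4+ (6-coordinate), S2- (6-coordinate).
r_ba, r_zr, r_s = 1.61, 0.72, 1.84

t = goldschmidt_t(r_ba, r_zr, r_s)
tau = bartel_tau(r_ba, r_zr, r_s, n_a=2)
print(f"BaZrS3: t = {t:.3f}, tau = {tau:.3f}")
```

In the original formulation by Bartel et al., τ < 4.18 indicates a likely perovskite. That threshold was fitted to oxides and halides, so whether it transfers to chalcogenides is exactly the kind of question the comparison in this step addresses.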
Step 2 — CrystaLLM Structure Generation & Classification¶
- Parse CrystaLLM-generated CIF files for candidate compositions
- Classify generated structures as corner-sharing vs edge-sharing octahedral networks
- Filter for topologically valid ABX₃ perovskite geometries
- Assess structural diversity across generated candidates
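One common way to separate corner-sharing from edge-sharing networks is to count how many anions two neighbouring BX₆ octahedra have in common: one shared anion is a corner, two is an edge, three or more is a face. The sketch below assumes the first-shell anion indices of each B site have already been extracted from the CIF (for example with a structure-analysis library) and that periodic images are given distinct indices. The function name and input layout are hypothetical, and the notebook may use a different criterion.

```python
from itertools import combinations


def classify_connectivity(octahedra: dict[str, set[int]]) -> str:
    """Label a BX6 network by the largest number of anions
    shared between any pair of octahedra.

    `octahedra` maps a B-site label to the set of anion indices
    in its first coordination shell (assumed input format).
    """
    max_shared = 0
    for shell_a, shell_b in combinations(octahedra.values(), 2):
        max_shared = max(max_shared, len(shell_a & shell_b))

    labels = {0: "isolated", 1: "corner-sharing", 2: "edge-sharing"}
    return labels.get(max_shared, "face-sharing")


# Two octahedra sharing a single anion (index 5): corner-sharing.
corner = {"B1": {0, 1, 2, 3, 4, 5}, "B2": {5, 6, 7, 8, 9, 10}}
# Two octahedra sharing two anions (indices 4 and 5): edge-sharing.
edge = {"B1": {0, 1, 2, 3, 4, 5}, "B2": {4, 5, 6, 7, 8, 9}}

print(classify_connectivity(corner))
print(classify_connectivity(edge))
```

Taking the maximum over all pairs is deliberately strict: a single edge-sharing pair is enough to reject a structure as a pure corner-sharing perovskite network.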
Step 2.1 — Structure Relaxation¶
- Relax CrystaLLM-generated structures using the FairChem/OMat24 universal machine-learned interatomic potential, which is trained to approximate DFT energies and forces
- Compare relaxed vs unrelaxed geometries and validate perovskite topology post-relaxation
- Store relaxed CIF files in data/crystaLLM/relaxed_cif_files/
Runs on Google Colab
This notebook requires GPU access and is designed to run on Google Colab. A Hugging Face API key is needed to download the OMat24 model.
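A simple way to compare relaxed and unrelaxed geometries is the relative change in unit-cell volume. The sketch below computes the volume from the six lattice parameters using the standard triclinic formula; the function names and the example cell values are hypothetical and are not taken from structure_relaxation_results.csv.

```python
import math


def cell_volume(a: float, b: float, c: float,
                alpha: float, beta: float, gamma: float) -> float:
    """Unit-cell volume from lengths (angstroms) and angles (degrees):
    V = abc * sqrt(1 - cos^2(alpha) - cos^2(beta) - cos^2(gamma)
                   + 2 cos(alpha) cos(beta) cos(gamma)).
    """
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)


def relative_volume_change(v_before: float, v_after: float) -> float:
    """Fractional volume change on relaxation; positive means expansion."""
    return (v_after - v_before) / v_before


# Placeholder lattice parameters for one structure before and after relaxation.
v0 = cell_volume(5.00, 5.00, 5.00, 90.0, 90.0, 90.0)
v1 = cell_volume(5.05, 5.05, 5.05, 90.0, 90.0, 90.0)
print(f"Volume change: {100 * relative_volume_change(v0, v1):.2f} %")
```

A large volume change does not by itself mean the perovskite topology was lost, so this check complements rather than replaces the post-relaxation topology validation.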
Step 3 — Experimental Plausibility (GCNN)¶
3_Experimental_likelihood.ipynb
- Assess crystal-likeness (synthesizability) using a pre-trained GCNN model
- Generate synthesizability scores for all candidate structures
- Rank candidates by experimental plausibility
Step 4 — Bandgap Prediction (CrabNet)¶
- Train (or load) a CrabNet composition-based bandgap predictor
- Evaluate model accuracy on experimental halide perovskite and chalcogenide data
- Predict bandgaps for all surviving candidate compositions
- Filter candidates within the photovoltaic-relevant bandgap window
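The final filtering step amounts to keeping compositions whose predicted bandgap falls inside a chosen window. The sketch below shows that operation on a plain dictionary. The window bounds are left as parameters because this page does not state them; the 1.0 to 2.0 eV range and the candidate names in the usage lines are placeholders, not pipeline values.

```python
def filter_by_bandgap(predictions: dict[str, float],
                      e_min: float, e_max: float) -> dict[str, float]:
    """Keep compositions whose predicted bandgap (eV) lies in [e_min, e_max]."""
    if e_min > e_max:
        raise ValueError("e_min must not exceed e_max")
    return {formula: gap for formula, gap in predictions.items()
            if e_min <= gap <= e_max}


# Placeholder predictions in eV (not real model output).
predicted = {"candidate_A": 0.4, "candidate_B": 1.5, "candidate_C": 2.8}

# Assumed window; substitute the bounds used in the notebook.
shortlist = filter_by_bandgap(predicted, e_min=1.0, e_max=2.0)
print(sorted(shortlist))
```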
Step 4.1 — Encoder Comparison¶
- Compare different elemental encoding strategies for CrabNet
- Benchmark Pettifor-based vs default encoders
Step 4.2 — Training Data Size Ablation¶
- Analyse the effect of training set size on CrabNet bandgap prediction accuracy
- Determine the minimum data requirement for reliable composition-based predictions
Step 5 — Sustainability Analysis¶
- Calculate the Herfindahl–Hirschman Index (HHI) for element supply concentration
- Integrate ESG (Environmental, Social, Governance) risk scores
- Combine with supply chain risk and earth-abundance metrics
- Produce the final multi-objective sustainability ranking
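The Herfindahl–Hirschman Index is the sum of squared market shares, conventionally reported on a 0 to 10,000 scale when shares are expressed in percent. The sketch below computes it for the producer shares of a single element. The function name and the example shares are hypothetical, and how the notebook aggregates per-element values into a per-compound score is not covered here.

```python
def hhi(shares: list[float]) -> float:
    """Herfindahl-Hirschman Index on the 0-10,000 scale.

    `shares` holds each producer's share of supply in any consistent
    unit (percent, tonnes, ...); values are normalised to fractions first.
    """
    if not shares or any(s < 0 for s in shares):
        raise ValueError("shares must be a non-empty list of non-negative numbers")
    total = sum(shares)
    if total == 0:
        raise ValueError("shares must not all be zero")
    return 10_000 * sum((s / total) ** 2 for s in shares)


# Placeholder production shares (percent) for one element.
print(hhi([50, 50]))   # two equal producers
print(hhi([100]))      # a single producer, maximum concentration
```

Higher values indicate supply concentrated in fewer producers: a single producer gives 10,000, two equal producers give 5,000.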
Data Flow¶
| Step | Inputs | Outputs |
|---|---|---|
| 1 | data/raw/ ionic radii, electronegativities | data/interim/features_sisso.csv, screened compositions |
| 2 | CrystaLLM CIF files in data/crystaLLM/ | Classified structures, topology labels |
| 2.1 | CrystaLLM CIF files | Relaxed CIF files in data/crystaLLM/relaxed_cif_files/, structure_relaxation_results.csv |
| 3 | Candidate CIF structures | Crystal-likeness scores |
| 4 | Candidate compositions | Predicted bandgaps |
| 5 | All previous outputs, data/sustainability_data/ | Multi-objective sustainability ranking |