Pipeline

The analysis is organized as a sequential pipeline of Jupyter notebooks. Each step builds on the outputs of the previous one, progressively narrowing the candidate space from all possible ABX₃ compositions down to a ranked shortlist of experimentally viable chalcogenide perovskites.

[Figure: Screening pipeline workflow]

Notebooks

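Run the notebooks in numerical order. Step 0 is the exception: run it after the other steps, or on its own using the pre-computed results. Each notebook can be opened on its own and loads the intermediate data it needs from the outputs of earlier steps.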

Step 0 — Publication Figures

0_figures_paper.ipynb

Generates all publication-quality figures. Run it after all other steps are complete, or independently using the pre-computed results.

Step 1 — Tolerance Factor & Feature Engineering

1_get_SISSO_features.ipynb

Load and normalize the chalcogenide perovskite dataset
Generate SISSO-derived primary features from ionic radii, electronegativities, and oxidation states
Train a decision tree classifier on tolerance factor features
Evaluate the SISSO tolerance factor (τ*) against the Goldschmidt and Bartel tolerance factors
Apply Platt scaling for probabilistic predictions
Screen all possible ABX₃ compositions for structural stability
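
As a point of reference for the comparison above, the two baseline tolerance factors can be computed directly from ionic radii. The sketch below uses the standard Goldschmidt expression and the published form of the Bartel factor; it does not reproduce the SISSO-derived τ*, whose functional form is not given here. The function names are hypothetical, and the BaZrS₃ radii are illustrative Shannon values that should be checked against the radii in `data/raw/`.

```python
import math


def goldschmidt_t(r_a: float, r_b: float, r_x: float) -> float:
    """Goldschmidt factor: t = (r_A + r_X) / (sqrt(2) * (r_B + r_X))."""
    return (r_a + r_x) / (math.sqrt(2) * (r_b + r_x))


def bartel_tau(r_a: float, r_b: float, r_x: float, n_a: int) -> float:
    """Bartel factor: tau = r_X/r_B - n_A * (n_A - (r_A/r_B) / ln(r_A/r_B)).

    n_a is the oxidation state of the A cation. The expression assumes
    r_A > r_B, so the logarithm is positive and non-zero.
    """
    ratio = r_a / r_b
    if ratio <= 1.0:
        raise ValueError("expected r_A > r_B (A is the larger cation)")
    return r_x / r_b - n_a * (n_a - ratio / math.log(ratio))


# Illustrative radii in angstrom for BaZrS3 (assumed values, not pipeline data):
# Ba2+ (12-coordinate), Zr4+ (6-coordinate), S2- (6-coordinate).
r_a, r_b, r_x = 1.61, 0.72, 1.84

t = goldschmidt_t(r_a, r_b, r_x)
tau = bartel_tau(r_a, r_b, r_x, n_a=2)
print(f"t = {t:.3f}, tau = {tau:.3f}")  # t is about 0.95, tau about 4.11
```

In the Bartel scheme, τ < 4.18 is the published criterion for predicting a perovskite; the Goldschmidt factor is usually read against an approximate window around 0.8 to 1.0.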

SISSO features are pre-cached

The derived features are stored in data/interim/features_sisso.csv so you can skip the SISSO derivation step if you don't have sissopp installed.
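
One way to use the cache is to check for the file first and only fall back to the SISSO derivation when it is missing. This is a minimal sketch using the standard library; the helper name `load_cached_features` is hypothetical, and no assumption is made about the column layout of the CSV.

```python
import csv
from pathlib import Path


def load_cached_features(path="data/interim/features_sisso.csv"):
    """Return the cached rows as a list of dicts, or None if no cache exists."""
    cache = Path(path)
    if not cache.is_file():
        return None
    with cache.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


rows = load_cached_features()
if rows is None:
    # No cache found: this is where the SISSO derivation would need to run.
    print("Cache not found; the SISSO derivation step is required.")
else:
    print(f"Loaded {len(rows)} cached feature rows.")
```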

Step 2 — CrystaLLM Structure Generation & Classification

2_CrystaLLM_analysis.ipynb

Parse CrystaLLM-generated CIF files for candidate compositions
Classify generated structures as corner-sharing vs edge-sharing octahedral networks
Filter for topologically valid ABX₃ perovskite geometries
Assess structural diversity across generated candidates
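
The corner-sharing versus edge-sharing distinction can be illustrated by counting the anions that two BX₆ octahedra have in common: one shared anion is a corner, two an edge, three or more a face. The sketch below assumes the coordination shells have already been extracted from the CIF files and that every anion label is unique per periodic image; the function name and the input layout are hypothetical, and the notebook may use a different criterion.

```python
from itertools import combinations


def classify_connectivity(octahedra):
    """Classify an octahedral network from shared-anion counts.

    octahedra maps each B-site label to the set of anion labels in its
    first coordination shell (assumed unique per periodic image).
    """
    kinds = set()
    for shell_1, shell_2 in combinations(octahedra.values(), 2):
        shared = len(shell_1 & shell_2)
        if shared == 1:
            kinds.add("corner")
        elif shared == 2:
            kinds.add("edge")
        elif shared >= 3:
            kinds.add("face")
    if not kinds:
        return "isolated"
    if "face" in kinds:
        return "face-sharing"
    if "edge" in kinds:
        return "edge-sharing"
    return "corner-sharing"


# Two octahedra that share exactly one anion ("X6").
network = {
    "B1": {"X1", "X2", "X3", "X4", "X5", "X6"},
    "B2": {"X6", "X7", "X8", "X9", "X10", "X11"},
}
print(classify_connectivity(network))  # corner-sharing
```

Reporting the highest-order connection found is a deliberate simplification: a network containing any edge-sharing pair is flagged as edge-sharing, which suits a filter that only keeps purely corner-sharing perovskite frameworks.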

Step 2.1 — Structure Relaxation

2_1_StructureRelaxation.ipynb

Relax CrystaLLM-generated structures using the FairChem/OMat24 universal machine-learned interatomic potential, which is trained on DFT data
Compare relaxed vs unrelaxed geometries and validate perovskite topology post-relaxation
Store relaxed CIF files in data/crystaLLM/relaxed_cif_files/
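
A simple way to compare relaxed and unrelaxed geometries is the relative change in unit-cell volume, computed from the lattice parameters stored in each CIF file. The sketch below uses the general triclinic volume formula; the function names and the example lattice parameters are illustrative assumptions, not values or metrics taken from the notebook.

```python
import math


def cell_volume(a, b, c, alpha, beta, gamma):
    """Unit-cell volume from lengths (angstrom) and angles (degrees)."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)


def relative_volume_change(unrelaxed, relaxed):
    """Fractional volume change (V_relaxed - V_unrelaxed) / V_unrelaxed."""
    v0 = cell_volume(*unrelaxed)
    v1 = cell_volume(*relaxed)
    return (v1 - v0) / v0


# Hypothetical lattice parameters: (a, b, c, alpha, beta, gamma).
before = (5.0, 5.0, 5.0, 90.0, 90.0, 90.0)
after = (5.1, 5.1, 5.1, 90.0, 90.0, 90.0)
print(f"{relative_volume_change(before, after):+.1%}")  # about +6.1%
```

A large volume change alone does not show that the perovskite topology was lost, so this check complements rather than replaces the topology validation.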

Runs on Google Colab

This notebook requires GPU access and is designed to run on Google Colab. A Hugging Face access token is needed to download the OMat24 model.

Step 3 — Experimental Plausibility (GCNN)

3_Experimental_likelihood.ipynb

Assess crystal-likeness (synthesizability) using a pre-trained GCNN model
Generate synthesizability scores for all candidate structures
Rank candidates by experimental plausibility

Step 4 — Bandgap Prediction (CrabNet)

4_bandgap_prediction.ipynb

Train (or load) a CrabNet composition-based bandgap predictor
Evaluate model accuracy on experimental halide perovskite and chalcogenide data
Predict bandgaps for all surviving candidate compositions
Filter candidates within the photovoltaic-relevant bandgap window
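
The final filtering step amounts to keeping compositions whose predicted bandgap falls inside a chosen window. In this sketch the window bounds (1.0 to 1.8 eV), the candidate names, and the predicted values are all placeholders chosen for illustration; the window actually used in the notebook is not stated here.

```python
def filter_by_bandgap(predictions, low_ev=1.0, high_ev=1.8):
    """Keep candidates with low_ev <= predicted bandgap <= high_ev.

    predictions maps a composition label to a predicted bandgap in eV.
    The default bounds are assumed values, not the pipeline's settings.
    """
    if low_ev > high_ev:
        raise ValueError("low_ev must not exceed high_ev")
    return {
        name: gap for name, gap in predictions.items() if low_ev <= gap <= high_ev
    }


# Placeholder predictions in eV (not model output).
predicted = {"candidate_A": 0.6, "candidate_B": 1.4, "candidate_C": 2.3}
print(filter_by_bandgap(predicted))  # {'candidate_B': 1.4}
```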

Step 4.1 — Encoder Comparison

4_1_encoder_comparison.ipynb

Compare different elemental encoding strategies for CrabNet
Benchmark Pettifor-based vs default encoders

Step 4.2 — Training Data Size Ablation

4_2_data_size.ipynb

Analyse the effect of training set size on CrabNet bandgap prediction accuracy
Determine the minimum data requirement for reliable composition-based predictions
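
The general shape of a data-size ablation is a learning curve: train on growing subsets of the training data and record the error on a fixed test set. The sketch below shows only that loop. It swaps in a trivial mean-value predictor as a stand-in for CrabNet, and all function names and the toy bandgap values are hypothetical.

```python
import random
from statistics import mean


def mae(y_true, y_pred):
    """Mean absolute error between two equally long sequences."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


def fit_mean_baseline(train_targets):
    """Stand-in for model training: always predicts the training mean."""
    mu = mean(train_targets)
    return lambda n_samples: [mu] * n_samples


def learning_curve(train_targets, test_targets, fractions, seed=0):
    """Return (subset size, test MAE) for each training fraction."""
    shuffled = list(train_targets)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    curve = []
    for fraction in fractions:
        n = max(1, round(fraction * len(shuffled)))
        predict = fit_mean_baseline(shuffled[:n])
        curve.append((n, mae(test_targets, predict(len(test_targets)))))
    return curve


# Toy bandgaps in eV (placeholders, not pipeline data).
train = [1.0, 2.0, 3.0, 4.0]
test = [2.5, 2.5]
for size, error in learning_curve(train, test, fractions=[0.5, 1.0]):
    print(size, round(error, 3))
```

Because the subsets are nested prefixes of one shuffled list, each larger subset contains the smaller ones, which keeps the curve from being dominated by resampling noise. Repeating the loop over several seeds would give an error band.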

Step 5 — Sustainability Analysis

5_HHI_calculation.ipynb

Calculate the Herfindahl–Hirschman Index (HHI) for element supply concentration
Integrate ESG (Environmental, Social, Governance) risk scores
Combine with supply chain risk and earth-abundance metrics
Produce the final multi-objective sustainability ranking
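
The HHI itself is the sum of squared market shares. With shares expressed in percent it ranges from near 0 (many small producers) to 10,000 (a single producer). The sketch below shows that calculation for one element; the function name and the share values are illustrative, and how the notebook aggregates element-level values into a compound-level score is not covered here.

```python
def hhi(shares_percent):
    """Herfindahl-Hirschman Index from market shares given in percent."""
    if any(share < 0 for share in shares_percent):
        raise ValueError("market shares must be non-negative")
    return sum(share**2 for share in shares_percent)


# Hypothetical production shares (percent) for one element.
shares = [50, 30, 20]
print(hhi(shares))  # 3800
```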

Data Flow

| Step | Inputs | Outputs |
|------|--------|---------|
| 1 | `data/raw/` ionic radii, electronegativities | `data/interim/features_sisso.csv`, screened compositions |
| 2 | CrystaLLM CIF files in `data/crystaLLM/` | Classified structures, topology labels |
| 2.1 | CrystaLLM CIF files | Relaxed CIF files in `data/crystaLLM/relaxed_cif_files/`, `structure_relaxation_results.csv` |
| 3 | Candidate CIF structures | Crystal-likeness scores |
| 4 | Candidate compositions | Predicted bandgaps |
| 5 | All previous outputs, `data/sustainability_data/` | Multi-objective sustainability ranking |