1. Executive Summary

Raw Entries: 5,807
Cleaned Entries: 4,697
Removed: 1,110
Retention Rate: 80.9%

This report documents the systematic quality control (QC) process applied to the High-Fidelity Carbide Dataset (HF-CCD), derived from the Materials Project database. Through a rigorous five-stage pipeline, we curated a high-quality dataset suitable for machine learning applications in materials informatics.

Key Achievements
  • Systematic removal of 1,110 problematic entries (19.1%)
  • Elimination of 1,098 duplicate structures
  • Validation of crystallographic consistency across all entries
  • Complete descriptor coverage (0% missing values)
  • Statistical outlier detection and removal

2. Five-Stage QC Pipeline

The HF-CCD quality control pipeline consists of five sequential stages, each designed to address specific data quality issues common in high-throughput DFT databases:

Stage                            Purpose                                        Criteria
1. Physical Screening            Remove thermodynamically implausible entries   −20 ≤ Ef ≤ 5 eV/atom, Ehull ≤ 0.1 eV
2. Crystallographic Validation   Verify structural integrity                    Valid CIF parsing, 5 Å³ ≤ V ≤ 5000 Å³
3. Descriptor Completeness       Ensure ML-ready feature vectors                Missing ratio ≤ 30%
4. Deduplication                 Remove redundant entries                       Composition + space group + Nsites matching
5. Outlier Detection             Remove statistical anomalies                   |z-score| ≤ 4 for key properties
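The sequential structure of the pipeline can be sketched as a small runner that applies each stage in order and records how many entries survive. This is a minimal illustration, not the project's actual script: the function name `run_pipeline`, the toy stages, and the field names `ehull` and `cif` are assumptions made for the example.

```python
def run_pipeline(entries, stages):
    """Apply each (name, stage_fn) in order and log the surviving entry count."""
    log = [("Original", len(entries))]
    for name, stage_fn in stages:
        entries = stage_fn(entries)
        log.append((name, len(entries)))
    return entries, log


# Toy stages standing in for the real filters (hypothetical field names).
def toy_physical_qc(entries):
    return [e for e in entries if e["ehull"] <= 0.1]


def toy_geometry_qc(entries):
    return [e for e in entries if e["cif"] is not None]


toy_entries = [
    {"ehull": 0.00, "cif": "data_a"},
    {"ehull": 0.30, "cif": "data_b"},  # fails the Ehull criterion
    {"ehull": 0.05, "cif": None},      # has no parsable structure
]

cleaned, log = run_pipeline(
    toy_entries,
    [("Physical QC", toy_physical_qc), ("Geometry QC", toy_geometry_qc)],
)
```

Keeping the per-stage log alongside the cleaned entries makes it straightforward to produce a stage-by-stage table like the one in the next section.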

3. Stage-by-Stage Analysis

Stage           Remaining Entries   Removed   Removal Rate
Original        5,807
Physical QC     5,807               0         0.0%
Geometry QC     5,801               −6        −0.1%
Descriptor QC   5,801               0         0.0%
Deduplication   4,703               −1,098    −18.9%
Outlier QC      4,697               −6        −0.1%
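The figures in this table can be re-derived from the remaining-entry counts alone. The sketch below assumes the removal rate is expressed relative to the raw entry count; the report does not state the denominator, and both choices round to the same values here. The names `stages` and `stage_summary` are illustrative.

```python
# Entries remaining after each stage, copied from the table above.
stages = [
    ("Original", 5807),
    ("Physical QC", 5807),
    ("Geometry QC", 5801),
    ("Descriptor QC", 5801),
    ("Deduplication", 4703),
    ("Outlier QC", 4697),
]


def stage_summary(stages):
    """Return (name, remaining, removed, removal_rate_percent) for each stage.

    Assumption: rates are relative to the raw (first) count.
    """
    raw = stages[0][1]
    rows = []
    for (_, previous), (name, remaining) in zip(stages, stages[1:]):
        removed = previous - remaining
        rows.append((name, remaining, removed, round(100 * removed / raw, 1)))
    return rows


summary = stage_summary(stages)
total_removed = sum(row[2] for row in summary)
retention = round(100 * stages[-1][1] / stages[0][1], 1)
```

Here `total_removed` reproduces the 1,110 removed entries and `retention` the 80.9% retention rate quoted in the Executive Summary.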

Stage Details

Physical QC Stage:

Validates thermodynamic plausibility. Since the raw Materials Project data was already filtered during acquisition, no entries were removed. Formation energy range is set to −20 to 5 eV/atom to accommodate various stoichiometries.
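A minimal sketch of this screening step, using the thresholds from the criteria table. The field names `formation_energy_per_atom` and `energy_above_hull` are assumptions about how the entries are stored, and treating a missing value as a failure is a choice made for this example rather than documented pipeline behavior.

```python
EF_MIN, EF_MAX = -20.0, 5.0  # formation energy window, eV/atom
EHULL_MAX = 0.1              # energy above hull limit, as given in the criteria table


def passes_physical_qc(entry):
    """Return True if the entry satisfies both thermodynamic criteria."""
    ef = entry.get("formation_energy_per_atom")
    ehull = entry.get("energy_above_hull")
    if ef is None or ehull is None:
        return False  # assumption: entries lacking either value are rejected
    return EF_MIN <= ef <= EF_MAX and ehull <= EHULL_MAX


ok = passes_physical_qc({"formation_energy_per_atom": -0.6, "energy_above_hull": 0.02})
too_unstable = passes_physical_qc({"formation_energy_per_atom": -0.6, "energy_above_hull": 0.25})
incomplete = passes_physical_qc({"formation_energy_per_atom": -0.6})
```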

Geometry QC Stage:

Verifies CIF structural integrity and completeness. Removed entries include CIF files that fail to parse and structures with implausible unit cell volumes.
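The volume criterion can be checked directly from the six lattice parameters using the general (triclinic) cell-volume formula V = abc·sqrt(1 − cos²α − cos²β − cos²γ + 2·cosα·cosβ·cosγ). The sketch below covers only the volume part of this stage; CIF parsing itself is not shown, and the function names are illustrative.

```python
import math

V_MIN, V_MAX = 5.0, 5000.0  # allowed unit cell volume, in cubic angstroms


def cell_volume(a, b, c, alpha, beta, gamma):
    """Unit cell volume from edge lengths (angstroms) and angles (degrees)."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    radicand = 1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg
    if radicand <= 0:
        raise ValueError("cell angles do not define a valid unit cell")
    return a * b * c * math.sqrt(radicand)


def passes_volume_check(a, b, c, alpha, beta, gamma):
    """Return True if the cell is geometrically valid and within the volume window."""
    try:
        volume = cell_volume(a, b, c, alpha, beta, gamma)
    except ValueError:
        return False
    return V_MIN <= volume <= V_MAX
```

For an orthogonal cell the formula reduces to a·b·c, so `cell_volume(2, 3, 4, 90, 90, 90)` is 24 Å³ up to floating-point rounding.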

Deduplication Stage:

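This stage had the largest impact, removing 1,098 duplicates (18.9% of the raw entries). The duplicates likely originated from multiple DFT campaigns, symmetry standardization procedures, or repeated relaxation calculations. Removing them ensures that each composition-structure combination is counted once, so the dataset reflects the actual distribution of composition-structure space rather than how often a structure was recomputed.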
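Matching on composition, space group, and number of sites can be sketched as a key-based pass over the entries. The field names (`reduced_formula`, `space_group_number`, `nsites`, `material_id`) and the keep-first policy are assumptions for illustration; the report does not say which member of a duplicate group is retained.

```python
def deduplicate(entries):
    """Keep the first entry seen for each (composition, space group, Nsites) key."""
    seen = set()
    unique = []
    for entry in entries:
        key = (entry["reduced_formula"], entry["space_group_number"], entry["nsites"])
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique


# Toy records with made-up identifiers.
toy = [
    {"material_id": "a", "reduced_formula": "SiC", "space_group_number": 216, "nsites": 2},
    {"material_id": "b", "reduced_formula": "SiC", "space_group_number": 216, "nsites": 2},
    {"material_id": "c", "reduced_formula": "SiC", "space_group_number": 186, "nsites": 4},
]
kept_ids = [e["material_id"] for e in deduplicate(toy)]
```

Entries "a" and "b" share the same key, so only "a" survives, while "c" is kept because its space group and site count differ.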

Outlier QC Stage:

Statistical outliers are identified by Z-score analysis: an entry is removed when |z| > 4 for a key property. Only entries with extreme values that are unlikely to be physical are removed, which preserves naturally occurring variation.
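A minimal sketch of the Z-score filter for a single property, assuming the population standard deviation is used (the report does not specify population versus sample). The function name `zscore_keep_indices` is illustrative.

```python
import statistics


def zscore_keep_indices(values, threshold=4.0):
    """Return the indices of values whose |z-score| is within the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        return list(range(len(values)))  # all values identical, nothing to remove
    return [i for i, v in enumerate(values) if abs(v - mean) / std <= threshold]


# Toy data: 100 ordinary values plus one extreme value at the end.
toy_values = [0.0] * 50 + [1.0] * 50 + [100.0]
kept = zscore_keep_indices(toy_values)
```

With a threshold of 4, a value can only be flagged when the sample is reasonably large, because a single point cannot lie arbitrarily many standard deviations from the mean of a small sample.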

4. Material Family Distribution

The cleaned dataset encompasses 10 carbide material families. Distribution before and after cleaning:

Family    Before   After   Change   Share of Cleaned Set
Diamond   4,097    3,885   −212     82.7%
SiC       396      392     −4       8.3%
BoronC    436      191     −245     4.1%
MoC       200      90      −110     1.9%
TiC       138      41      −97      0.9%
ZrC       121      38      −83      0.8%
NbC       116      24      −92      0.5%
WC        149      24      −125     0.5%
TaC       87       6       −81      0.1%
HfC       67       6       −61      0.1%

Class Imbalance: The Diamond family dominates (82.7%), while HfC and TaC have very limited samples (0.1% each). Recommended mitigation strategies:
  • Use stratified sampling for train/test splits
  • Apply class weighting in loss functions
  • Consider oversampling minority classes
  • Evaluate family-specific model performance separately
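As one way to realize the class-weighting suggestion, the sketch below computes inverse-frequency weights per family from the "After" counts, using the common "balanced" heuristic weight = N / (K × n_family), where N is the total number of entries and K the number of families. Applying these as per-sample weights in a loss function is an assumption about how the dataset would be used.

```python
# Cleaned-dataset counts per family, copied from the table above.
family_counts = {
    "Diamond": 3885, "SiC": 392, "BoronC": 191, "MoC": 90, "TiC": 41,
    "ZrC": 38, "NbC": 24, "WC": 24, "TaC": 6, "HfC": 6,
}


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (number_of_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}


weights = balanced_class_weights(family_counts)
```

Under this scheme the Diamond family receives a weight of about 0.12 and HfC or TaC about 78, so weights this extreme may need capping in practice.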

5. Material Property Statistics

Statistical summary of key properties in the cleaned dataset (N = 4,697):

Property                     Count   Mean      Std Dev   Min       Max       Median    Q1        Q3
Band Gap (eV)                4,697   1.1771    1.5818    0.0000    7.2814    0.0530    0.0000    2.2598
Formation Energy (eV/atom)   4,697   −0.9015   1.0621    −3.7158   3.2236    −0.6344   −1.8445   −0.2023
Density (g/cm³)              4,697   5.1304    3.0839    1.0017    18.4608   3.8055    2.9801    6.9741

Band Gap Distribution:
  • Median band gap of 0.053 eV indicates predominantly metallic or near-metallic character
  • Maximum of 7.28 eV represents wide-bandgap semiconductors
  • Q1 = 0 eV means at least 25% of entries have zero band gap (metallic)
  • Distribution strongly skewed toward low band gaps, typical for carbide materials

Formation Energy Distribution:
  • Mean of −0.90 eV/atom indicates that, on average, entries are energetically favorable relative to their constituent elements
  • Range −3.72 to 3.22 eV/atom covers both stable and metastable phases
  • 75% of entries have formation energies more negative than −0.20 eV/atom

Density Distribution:
  • Mean density 5.13 g/cm³ reflects typical carbide ceramics range
  • Minimum 1.00 g/cm³ corresponds to very low-density carbon polymorphs (for comparison, graphite is about 2.2 g/cm³)
  • Maximum 18.46 g/cm³ represents heavy transition metal carbides (Ta, W, Re)
  • Wide range demonstrates good compositional space coverage

6. Quality Metrics

Final Dataset Size: 4,697
Duplicates: 0
Missing Values: 0.00%
Outlier Rate: 5.0%

Quality Assurance Verification
  1. ✔ Duplicate check: 0 duplicates (verified by composition + space group + Nsites)
  2. ✔ Missing values: All descriptor fields complete (0% missing)
  3. ✔ Total features: 58 numeric descriptors available for ML models
  4. ✔ Sample size: 4,697 entries sufficient for robust model training

Overall Quality Rating: EXCELLENT

The cleaned HF-CCD dataset is fully ML-ready, with complete feature coverage, zero duplicates, and statistically coherent thermodynamic distributions. All crystallographic structures are compatible with graph neural network frameworks (CGCNN, MEGNet, ALIGNN).
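The duplicate and missing-value checks from the verification list above could be re-run on the cleaned table with a short routine like the following. It is a sketch under assumptions: entries are dictionaries, missing values are `None` or NaN, and the field names in the toy data are hypothetical.

```python
import math
from collections import Counter


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def qa_summary(entries, descriptor_fields, key_fields):
    """Count duplicate keys and the fraction of missing descriptor cells."""
    n_cells = len(entries) * len(descriptor_fields)
    n_missing = sum(is_missing(e.get(f)) for e in entries for f in descriptor_fields)
    key_counts = Counter(tuple(e.get(f) for f in key_fields) for e in entries)
    n_duplicates = sum(count - 1 for count in key_counts.values())
    return {
        "n_entries": len(entries),
        "n_duplicates": n_duplicates,
        "missing_fraction": n_missing / n_cells if n_cells else 0.0,
    }


toy = [
    {"formula": "TiC", "sg": 225, "nsites": 2, "density": 4.9, "band_gap": 0.0},
    {"formula": "TiC", "sg": 225, "nsites": 2, "density": 4.9, "band_gap": None},
]
report = qa_summary(toy, ["density", "band_gap"], ["formula", "sg", "nsites"])
```

On the toy data the routine reports one duplicate and a missing fraction of 0.25; on a dataset meeting the criteria above both values would be zero.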

7. Key Findings

Major Insights from QC Process

  • Duplicate Prevalence: Removal of 1,098 duplicates (18.9%) reveals significant redundancy in the original Materials Project subset, highlighting the importance of systematic deduplication in large-scale DFT databases.
  • Structural Quality: Only 6 entries failed crystallographic validation (0.1%), indicating Materials Project's internal QC maintains high structural integrity.
  • Thermodynamic Coherence: No entries removed during physical screening — initial API query filters were appropriately conservative.
  • Statistical Stability: Only 6 outliers detected (0.1%), demonstrating minimal pathological numerical noise in the cleaned dataset.
  • Family Diversity: Despite class imbalance, the dataset spans 10 distinct carbide families providing sufficient chemical diversity for generalized model training.

Data Quality Recommendations

  • Handle Class Imbalance: Diamond family dominance may bias model predictions. Use stratified k-fold cross-validation and class-weighted loss functions.
  • Small-Sample Families: HfC and TaC have very limited data (0.1% each). Consider transfer learning or data augmentation strategies.
  • DFT Limitations: All properties are computed values and may contain systematic deviations from experimental measurements.

8. Conclusions & Recommendations

Main Conclusions
  1. Successfully established a high-quality carbide dataset containing 4,697 validated entries
  2. Achieved 80.9% data retention while ensuring quality standards
  3. Dataset encompasses 10 material families with good chemical diversity
  4. All key properties (band gap, formation energy, density) validated and suitable for predictive model training

Data Limitations
  • Class Imbalance: Diamond family dominates — consider oversampling or weighted loss functions
  • Small-Sample Families: HfC and TaC have limited data affecting prediction accuracy
  • Computational Data: DFT-derived values may exhibit systematic deviations from experimental values

Recommendations for Future Work

  • Model Training: Start with full-dataset training, followed by family-specific fine-tuning
  • Feature Engineering: Combine structural features from CIF files for richer geometric descriptors
  • Data Augmentation: For small-sample families, consider structural perturbation or transfer learning
  • Cross-Validation: Use stratified cross-validation to ensure representative sampling of all families
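The stratified cross-validation recommendation can be sketched without external libraries by shuffling the entries of each family and dealing them out to folds in turn. The function name `stratified_folds` and the toy label list are illustrative; in practice a library implementation of stratified k-fold splitting would typically be used instead.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign each sample a fold index so every family is spread evenly across folds."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    fold_of = [0] * len(labels)
    offset = 0  # rotate the starting fold so overall fold sizes stay balanced
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            fold_of[idx] = (offset + pos) % k
        offset = (offset + len(idxs)) % k
    return fold_of


# Toy example: a large family and a 6-member family such as HfC or TaC.
labels = ["Diamond"] * 20 + ["HfC"] * 6
folds = stratified_folds(labels, k=5)
hfc_per_fold = [
    sum(1 for lab, f in zip(labels, folds) if lab == "HfC" and f == i) for i in range(5)
]
```

With only 6 entries in a family and 5 folds, each test fold contains just one or two samples of that family, so family-specific metrics for HfC and TaC will be noisy.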

9. Appendix: Output Files

The quality control pipeline generated the following output files:

File Name               Format      Description
cleaned_materials.csv   CSV         Cleaned material data table (4,697 entries)
qc_report.json          JSON        QC statistical report (machine-readable)
cleaning_report.html    HTML        This detailed quality control report
plots/                  Directory   All visualization figures (PNG format)

Data Accessibility

Complete dataset and quality control scripts are available at:

  • Zenodo: https://doi.org/10.5281/zenodo.XXXXXXX
  • GitHub: https://github.com/jackman993/A-Curated-High-Fidelity-Carbide-Materials-Dataset-HF-CCD-and-Pipeline-v01.1

All data are released under the MIT License for unrestricted academic and commercial use.