Name: Islamic Primary Source Corpus (IPSC) V3.4
License: https://www.theogrid.ai/terms

v3.26 Staged (local) 2026-05-04

LLM PID Tiebreaker + Ship-Blocker Remediation

Batched 25-narrator prompts run through Eve-Theology f5/reasoner assigned 12,660 PIDs across the 6 collections ingested since v3.19 (PID resolution 47.4% → 69.5%). Source-collection caps were applied to 2,168 records in fabrication anthologies. NRS-Taqrīb reconciliation confirmed 6,034 entries and flagged 1,345 mismatches.
- 12,660 LLM-assigned PIDs (each carrying _pidTiebreakerVerdict provenance)
- 2,168 records capped via _naqd3Override (Mawduʿāt 424, Tanzīh 1,238, Fawāʾid 506)
- 7,364 new matn embeddings; cluster merge to 39.2% coverage
- All hard gates green: 33/33 regression + 7/7 cross-field verify
- Cumulative LLM cost since v3.5: ~$420
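
The batching and provenance steps described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's code: the batch size of 25 and the field name `_pidTiebreakerVerdict` come from the notes, while the function names, the `pid` key, and the inner keys of the verdict block are assumptions.

```python
from typing import Iterator

BATCH_SIZE = 25  # narrators per prompt, as stated in the release note


def batched(items: list[str], size: int = BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def attach_verdict(record: dict, pid: str, rationale: str, model: str) -> dict:
    """Return a copy of `record` with an assigned PID and its provenance.

    Only the field name `_pidTiebreakerVerdict` is taken from the changelog;
    the inner keys ("model", "rationale") are hypothetical.
    """
    updated = dict(record)  # shallow copy so the input record is not mutated
    updated["pid"] = pid
    updated["_pidTiebreakerVerdict"] = {"model": model, "rationale": rationale}
    return updated
```

Keeping the verdict on the record itself means any later audit can separate LLM-assigned PIDs from deterministically matched ones without consulting a side table.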

v3.13–v3.25 Staged (local) 2026-05-02 to 2026-05-03

Citation Cascade + 6 Phase 2-A Collection Ingests

v3.13 to v3.18 generalized citation grade promotion beyond Sahihayn-only. v3.19 introduced the OpenITI-to-IPSC ingestion pipeline (a token-Jaccard + containment narrator matcher with multi-tier confidence). v3.21 to v3.25 ingested 6 new primary collections (+8,241 records). v3.25.1 fixed 5 pipeline robustness bugs.
- Ibn al-Sunnī ʿAmal al-Yawm wa al-Laylah (770 records)
- Hannād Zuhd (1,429 records)
- al-Quḍāʿī Shihāb (1,497 records)
- al-ʿUqaylī Duʿafāʾ (2,103 records)
- Ibn al-Mubārak Jihād (268 records)
- Ibn ʿAdī al-Kāmil (2,174 records)
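
A token-Jaccard plus containment matcher of the kind named above might look roughly like this. It is a sketch under stated assumptions: whitespace tokenization, the tier names, and every threshold are illustrative and are not taken from the IPSC pipeline.

```python
def tokens(name: str) -> set[str]:
    """Lowercase, whitespace-split token set for a narrator name."""
    return set(name.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection over size of the union."""
    return len(a & b) / len(a | b)


def containment(a: set[str], b: set[str]) -> float:
    """Share of the smaller token set that appears in the larger one."""
    return len(a & b) / min(len(a), len(b))


def match_confidence(candidate: str, reference: str) -> str:
    """Map the two scores to a confidence tier (thresholds are assumed)."""
    a, b = tokens(candidate), tokens(reference)
    if not a or not b:
        return "none"
    j, c = jaccard(a, b), containment(a, b)
    if j >= 0.8:
        return "high"
    if c == 1.0 and j >= 0.5:
        return "medium"  # one name is a shortened form of the other
    if c >= 0.5:
        return "low"
    return "none"
```

Containment complements Jaccard because isnads often cite a narrator by a shortened name: the short form is fully contained in the full form even when the Jaccard score is modest.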

v3.12 Staged (local) 2026-05-02

Source-Data Items + Fresh Embeddings

After verifying the OpenITI 2025-1-9 source files, all 448,237 matn embeddings were refreshed (the previous embeddings had been computed from broken matn data). Kanz al-ʿUmmāl re-ingestion recovered 14,603 newly-attributed records via the OpenITI symbol parser.
- NAQD-3 fresh-embedding re-run: 1,851 findings (300 critical, 638 high)
- Previous near-zero N3-CC contradiction count was an artifact of broken embeddings
- Mudallis registry 105 → 107
- 21,264 semantic clusters at threshold 0.92
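
To illustrate what clustering at a similarity threshold of 0.92 means, here is a small sketch that assumes cosine similarity and single-link grouping via union-find. Only the threshold value comes from the notes; the similarity measure and linkage rule are assumptions, and the pairwise loop is quadratic, so it is suitable only for toy inputs.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def cluster(vectors: list[list[float]], threshold: float = 0.92) -> list[list[int]]:
    """Group vector indices whose similarity reaches the threshold (single-link)."""
    parent = list(range(len(vectors)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)

    groups: dict[int, list[int]] = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Because the grouping depends entirely on the embedding vectors, embeddings computed from damaged text will distort the clusters, which is consistent with the refresh described in this release.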

v3.11 Audit response 2026-05-02

Methodology recalibration cycle

External-examiner cycle drove a framing recalibration on one stylistic document. v3.11 added the _provenanceDisclosure manifest block so AI-involvement scope travels with every record, established the standing provenance-discipline rules, and closed NAQD-1 V1 (112 → 0).
- _provenanceDisclosure block added to corpus-v3/manifest.json
- Standing provenance-discipline rules established
- External framing recalibrated to match pipeline scope
- 11,380 ikhtilāt onset years backfilled
- 1,338 NRS undefined-tier entries explicitly flagged

v3.5–v3.10 Staged (local) 2026-04-26 to 2026-05-01

Corpus Integrity Push

Six release cycles surfaced and remediated multiple defect classes. v3.5 applied audit-driven corrections (256K stale temporal flags removed, 127K real issues found). The first-attempt PID validator in v3.7 failed regression (12 T10+ violations); the lesson was incorporated into the v3.9 multi-stage architecture (24,485 safe PID swaps). v3.8 recovered matn content from a broken-regex corruption that had survived two release cycles because regression did not sample matn content.
- v3.8 lesson: regression tests must include content sampling, not just structural checks
- v3.9 multi-stage validator: 5 structural pre-filters + LLM tiebreaker = 24,485 swaps
- v3.10 anonymous-narrator detector reclassified 3,787 chains as munqati
- 8 git tags v3.5 → v3.12 with 33/33 regression maintained throughout
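
The v3.8 lesson about content sampling can be sketched as a regression check that inspects matn text rather than only record structure. The record keys (`id`, `matn`), the sample size, and the Arabic-letter-ratio heuristic are all assumptions made for illustration; the actual regression suite is not described in these notes.

```python
import random
import re

# Basic Arabic letters, hamza through ya (U+0621 to U+064A)
ARABIC_LETTER = re.compile(r"[\u0621-\u064A]")


def sample_matn_failures(records: list[dict], sample_size: int = 200,
                         min_arabic_ratio: float = 0.5, seed: int = 0) -> list:
    """Return ids of sampled records whose matn looks empty or corrupted."""
    rng = random.Random(seed)  # fixed seed keeps the check reproducible
    sample = rng.sample(records, min(sample_size, len(records)))
    failures = []
    for rec in sample:
        matn = rec.get("matn", "")
        visible = [ch for ch in matn if not ch.isspace()]
        arabic = len(ARABIC_LETTER.findall(matn))
        if not visible or arabic / len(visible) < min_arabic_ratio:
            failures.append(rec.get("id"))
    return failures
```

A structural check would pass a record whose `matn` field is present but filled with regex debris; sampling the text itself is what lets a check like this catch that class of corruption.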

v3.4 Deployed (Azure) 2026-04-25

First Azure AI Search Deployment + Glossary v1

First IPSC corpus deployed to Azure AI Search: 10 tier indexes, plus 3 added with Glossary v1, for 13 indexes and 1,569,379 docs in total. 8 vector indexes plus glossary termEmbedding. 7,652 narrators carry classical scholarly quotations. Glossary v1 shipped with 730 canonical hadith-science terms across 3 tier indexes.
- Tier-separated indexes (Public / Research / Scholar) with atomic version cutover
- 11 vector search indexes total (matn x 3, narrator x 3, defect x 2, term x 3)
- 7,652 narrators with Taqrīb + Tahdhīb al-Kamāl classical quotations
- Glossary v1: 730 canonical hadith-science terms with multi-source classical citations
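
One common way to get an atomic version cutover is to have clients resolve stable aliases and to repoint every alias in a single step once all new indexes are built. The sketch below is an in-memory stand-in for that idea only; the class, the alias names, and the index names are hypothetical, and it does not call or represent any Azure AI Search API.

```python
class AliasTable:
    """In-memory stand-in for an alias layer (illustrative only)."""

    def __init__(self) -> None:
        self._targets: dict[str, str] = {}

    def resolve(self, alias: str) -> str:
        """Return the index name an alias currently points to."""
        return self._targets[alias]

    def cutover(self, new_targets: dict[str, str], existing: set[str]) -> None:
        """Repoint all given aliases at once, or change nothing at all."""
        missing = sorted(set(new_targets.values()) - existing)
        if missing:
            raise ValueError(f"indexes not built yet: {missing}")
        # Build the new mapping first, then swap it in with one assignment.
        self._targets = {**self._targets, **new_targets}
```

Validating every target before the swap is what keeps the tiers in step: a failed build for one tier leaves all aliases on the previous version.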

v3.2 Staged (local) 2026-04-22

Matn-Criticism Pipeline (Phases A–G)

Two-pass architecture: Pass 1 deterministic string-op scan (420,110 records, 11,441 flagged). Pass 2 multi-tier LLM reasoning: Haiku triage → Eve-Theology f5/reasoner detail → Opus 4.6 1M scholar-grade analysis on top 5,000 concerns. Reference databases: 304 Quran rulings, 271 anachronisms, 335 fabrication patterns, 1,301 mutawatir canon entries, prophetic linguistic baseline (39 words avg, sajʿ density 0.033).
- 815 chain-matn conflicts identified
- 808 scholar-ready defense documents generated
- Sahihayn matn-criticism cleanliness: 99.4%
- Sahihayn chain-grade convergence: 97.3% (later corrected to 95.9% in v3.8)
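
The two-pass shape described above (a cheap deterministic scan first, with only flagged records escalated to costlier analysis) can be sketched as follows. The record keys, the pattern table, and the ranking by hit count are assumptions for illustration; the real reference databases and triage criteria are not reproduced here.

```python
def deterministic_scan(records: list[dict], patterns: dict[str, str]) -> list[dict]:
    """Pass 1: plain substring checks, no model calls."""
    flagged = []
    for rec in records:
        hits = [label for label, needle in patterns.items() if needle in rec["matn"]]
        if hits:
            flagged.append({"id": rec["id"], "hits": hits})
    return flagged


def top_concerns(flagged: list[dict], limit: int) -> list[dict]:
    """Choose which flagged records go on to the most expensive tier.

    Ranking by number of pattern hits is an assumed heuristic.
    """
    ranked = sorted(flagged, key=lambda f: len(f["hits"]), reverse=True)
    return ranked[:limit]
```

The design point is cost control: string operations run over every record, while each later tier sees a progressively smaller set.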

v3.1 Staged (local) 2026-04 (early)

Cross-Reference Layers

Matn criticism pipeline analyzed 437,740 matns. 87,844 hadith cross-linked to al-Dāraquṭnī’s ilal works. Teacher-student graph constructed: 7,973 nodes, 889,913 directed edges (later corrected to 386,520 explicit edges in the v3.11 methodology recalibration). 8,258 hawala chain-switch records split. 1,279,676 Quran term cross-references + 6,447 direct quotations.
- 437,740 matns analyzed by criticism pipeline
- 87,844 hadith linked to al-Dāraquṭnī’s ilal works
- 8,258 hawala chain-switch markers identified and split
- 45 garbage-collector PIDs cleared (99,477 positions cleaned)
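
A teacher-student graph can be derived from parsed isnads by emitting one directed edge per pair of adjacent chain positions. The sketch below assumes each chain is a list of PIDs ordered from the collector back toward the source, so each position narrates from the one after it; that ordering, the function name, and the PIDs in the usage example are assumptions.

```python
from collections import defaultdict


def build_teacher_student_edges(chains: list[list[str]]) -> dict[tuple[str, str], int]:
    """Count explicit (teacher, student) edges across all chains."""
    edges: dict[tuple[str, str], int] = defaultdict(int)
    for chain in chains:
        # Adjacent positions: chain[i] is the student of chain[i + 1].
        for student, teacher in zip(chain, chain[1:]):
            if student != teacher:  # skip self-loops from duplicated positions
                edges[(teacher, student)] += 1
    return dict(edges)
```

For example, `build_teacher_student_edges([["P3", "P2", "P1"], ["P4", "P2", "P1"]])` yields three distinct edges, with `("P1", "P2")` attested twice.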

v3.0 Staged (local) 2026-03-20

Initial corpus release

449,415 hadith from 86 classical works. 27,099 NRS entries with 12-tier reliability assessments anchored to Ibn Ḥajar's Taqrīb al-Tahdhīb. Five-condition Ibn al-Salāh grading engine. Person ID resolution pipeline (6 iterations). Structured isnad parsing (1,820,033 chain positions). Bag-of-words matn clustering (54,270 clusters).
- 449,415 hadith / 27,099 NRS entries / 86 classical works
- Five-condition framework: ittisāl al-sanad, ʿadālah, dabt, no shudhūdh, no ʿillah
- 12-tier narrator classification (Sahābi → Kadhdhāb)
- 6,983 Companions identified
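
The five-condition framework listed above can be represented as a simple checklist. This sketch only reports which conditions fail; it does not assign grades, and the class and field names are assumed ASCII renderings of the five terms rather than the engine's actual schema.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Conditions:
    ittisal: bool      # continuous chain (ittisāl al-sanad)
    adalah: bool       # narrator uprightness (ʿadālah)
    dabt: bool         # narrator precision (dabt)
    no_shudhudh: bool  # no anomaly against stronger reports
    no_illah: bool     # no hidden defect


def failed_conditions(c: Conditions) -> list[str]:
    """List the unmet conditions in declaration order."""
    return [name for name, ok in asdict(c).items() if not ok]


def meets_all_five(c: Conditions) -> bool:
    """True only when every one of the five conditions holds."""
    return not failed_conditions(c)
```

Returning the list of failed conditions, rather than a bare boolean, keeps the reason for a downgrade visible to later review.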
