The Islamic Primary Source Corpus (IPSC) V3 is a computationally parsed and graded dataset of 449,415 hadith drawn from 38,787+ collection-level source entries across 86 classical works. Each hadith record carries a structured chain of transmission (isnad), a separated body text (matn), a computational authenticity grade, and metadata linking to narrator reliability assessments, hidden defect records, and textual parallels.
The corpus provides structured, machine-readable hadith data with a transparent and reproducible grading methodology. Every grade is derived from documented inputs — narrator reliability tiers, chain continuity classification, defect cross-links, and corroboration counts — so that any scholar can trace the reasoning behind any individual grade.
No pre-existing scholarly grades are imported as authoritative. The corpus grades hadith independently from the chain data, then compares its output against existing scholarly opinions where available.
Nine Major Collections
| Collection | Compiler | Death (AH) |
|---|---|---|
| Sahih al-Bukhari | al-Bukhari | 256 |
| Sahih Muslim | Muslim ibn al-Hajjaj | 261 |
| Sunan Abi Dawud | Abu Dawud | 275 |
| Jami' al-Tirmidhi | al-Tirmidhi | 279 |
| Sunan al-Nasa'i | al-Nasa'i | 303 |
| Sunan Ibn Majah | Ibn Majah | 273 |
| Musnad Ahmad | Ahmad ibn Hanbal | 241 |
| Muwatta' Malik | Malik ibn Anas | 179 |
| Sunan al-Darimi | al-Darimi | 255 |
Supplementary Collections
Sahih Ibn Hibban, Sahih Ibn Khuzaymah, al-Mustadrak (al-Hakim), al-Sunan al-Kubra (al-Bayhaqi), Musannaf 'Abd al-Razzaq, Musannaf Ibn Abi Shaybah, al-Mu'jam al-Kabir/al-Awsat/al-Saghir (al-Tabarani), Musnad al-Bazzar, Musnad Abu Ya'la, Musnad al-Tayalisi, Sunan al-Daraqutni, Shu'ab al-Iman (al-Bayhaqi), Sharh Ma'ani al-Athar (al-Tahawi), Tafsir al-Tabari, and additional musnad, musannaf, and mu'jam works. Fabrication-detection references: Tanzih al-Shari'ah (Ibn 'Iraq) and al-La'ali al-Masnu'ah (al-Suyuti).
2.1 Arabic Text Normalization
- Harakat removal — all tashkil (fathah, dammah, kasrah, sukun, shaddah, tanwin) stripped
- Hamza normalization — أ , إ , آ normalized to bare alif ا
- Ta marbuta — terminal ة normalized to ه
- Alif maqsura — ى normalized to ي
- Whitespace — multiple spaces, zero-width joiners, non-breaking spaces collapsed
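The normalization rules above could be sketched in Python roughly as follows. This is a minimal illustration, not the corpus's actual pipeline. The function name `normalize_arabic`, the exact code-point ranges, deleting zero-width characters outright, and replacing every ة (on the assumption that it only occurs word-finally) are all assumptions.

```python
import re

# Assumed code-point sets; the source names the categories but not the ranges.
HARAKAT = re.compile("[\u064B-\u0652]")          # tanwin, fathah, dammah, kasrah, shaddah, sukun
HAMZA_ALIF = re.compile("[\u0622\u0623\u0625]")  # alif with madda, hamza above, hamza below
ZERO_WIDTH = re.compile("[\u200C\u200D]")        # zero-width non-joiner and joiner
WHITESPACE = re.compile("[\\s\u00A0]+")          # runs of whitespace, including non-breaking space


def normalize_arabic(text: str) -> str:
    """Apply the five normalization steps in the order listed above."""
    text = HARAKAT.sub("", text)                 # strip tashkil
    text = HAMZA_ALIF.sub("\u0627", text)        # hamza forms -> bare alif
    text = text.replace("\u0629", "\u0647")      # ta marbuta -> ha
    text = text.replace("\u0649", "\u064A")      # alif maqsura -> ya
    text = ZERO_WIDTH.sub("", text)              # drop zero-width characters
    return WHITESPACE.sub(" ", text).strip()     # collapse whitespace runs
```

Normalizing both the corpus text and any search query with the same function keeps string matching consistent, which is presumably why the steps are applied uniformly.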
2.2 Person ID (PID) Assignment
Each narrator position maps to at most one canonical Person ID. PIDs take the form PERSON-NNNNNN (six-digit, zero-padded) for Taqrib narrators and PERSON-6NNNNNNN (eight-digit, prefix 60) for supplementary sources. Ambiguous positions retain null rather than recording a potentially incorrect assignment.
2.3 NRS Database
The Narrator Reliability Score database contains 27,118 assessed entries (within a broader 37,046-entry narrator index). Sources, in precedence order:
- Taqrib al-Tahdhib — Ibn Hajar al-'Asqalani (d. 852 AH) — primary anchor
- Tahdhib al-Tahdhib — Ibn Hajar — detailed assessments
- Mizan al-I'tidal — al-Dhahabi (d. 748 AH)
- al-Thiqat — Ibn Hibban (d. 354 AH)
- al-Kamil fi Du'afa' al-Rijal — Ibn 'Adi (d. 365 AH)
- al-Jarh wa-l-Ta'dil — Ibn Abi Hatim (d. 327 AH)
- al-Tarikh al-Kabir — al-Bukhari (d. 256 AH)
- Tarikh Ibn Ma'in — Yahya ibn Ma'in (d. 233 AH)
- Lisan al-Mizan — Ibn Hajar
2.4 Resolution Approaches
2.5 Coverage
78.5% of narrator positions carry a resolved PID. An additional 4.9% are structural (collective/anonymous references). The remaining 16.6% are genuine nulls: ambiguous kunyahs, unresolvable relational references, single-name narrators with multiple candidates, and collective references. These are genuine disambiguation gaps; the system does not guess.
3.1 Taqrib Anchoring
Ibn Hajar's Taqrib al-Tahdhib serves as the primary and authoritative source. When a Taqrib verdict exists, it is never overridden by other sources, reflecting scholarly consensus that Ibn Hajar's Taqrib represents the most careful synthesis of the earlier critical tradition.
3.2 Source Hierarchy
- Taqrib al-Tahdhib — if available, final. Never overridden.
- Tahdhib al-Tahdhib — detailed discussion when Taqrib absent.
- Multiple non-Ibn-Hajar sources — 2+ independent critics agree, consensus adopted.
- Single source — adopted with reduced confidence.
When sources conflict, the weaker assessment prevails unless the stronger comes from a higher-hierarchy source.
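One possible reading of this hierarchy, sketched in Python. The function name `resolve_tier`, the basis labels, and the integer tier encoding (1 to 12, larger meaning weaker) are hypothetical. The text does not say how partial agreement among several critics is handled, so the tie-breaking below (the weaker assessment prevails among same-level sources) is an assumption.

```python
from collections import Counter
from typing import Optional


def resolve_tier(taqrib: Optional[int], tahdhib: Optional[int],
                 others: list[int]) -> Optional[tuple[int, str]]:
    """Return (tier, basis) following the four-step hierarchy, or None if unassessed."""
    if taqrib is not None:
        return taqrib, "taqrib"            # final, never overridden
    if tahdhib is not None:
        return tahdhib, "tahdhib"          # used only when Taqrib is absent
    if len(others) == 1:
        return others[0], "single-source"  # adopted with reduced confidence
    if len(others) >= 2:
        agreed = [tier for tier, n in Counter(others).items() if n >= 2]
        if agreed:
            # Assumption: if more than one group agrees, take the weaker one.
            return max(agreed), "consensus"
        # Same-level critics disagree: the weaker assessment prevails.
        return max(others), "conflict-weaker"
    return None
```

For example, `resolve_tier(None, None, [3, 7])` yields tier 7, because neither critic outranks the other and the weaker assessment wins.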
3.3 Twelve-Tier System
| Tier | Arabic | Transliteration | English | Grading Impact |
|---|---|---|---|---|
| T1 | صحابي | Sahabi | Companion | Automatic pass — beyond jarh wa-ta'dil |
| T2 | ثقة متقن | Thiqah mutqin | Very reliable, precise | Supports sahih |
| T3 | ثقة | Thiqah | Reliable | Supports sahih |
| T4 | صدوق | Saduq | Truthful | Supports hasan |
| T5 | صدوق يهم | Saduq yahim | Truthful but errs | Supports hasan |
| T6 | مقبول | Maqbul | Acceptable when supported | Da'if alone; hasan with corroboration |
| T7 | ضعيف / مجهول | Da'if / majhul | Weak / unknown | Da'if |
| T8 | ضعيف جدا | Da'if jiddan | Very weak | Da'if (eligible for taqwiyah) |
| T9 | متروك | Matruk | Abandoned | Very weak — also for anonymous narrators |
| T10 | متروك | Matruk (severe) | Abandoned (severe) | Very weak — corroboration blocked |
| T11 | متهم بالكذب | Muttaham bi-l-kadhib | Accused of lying | Very weak — corroboration blocked |
| T12 | كذاب / وضاع | Kadhdhab / wadda' | Liar / fabricator | Mawdu' (fabricated) — corroboration blocked |
Key principle: T1–T3 support sahih. T4–T6 support hasan. T7–T8 produce da'if. T9 and weaker produce very weak or fabricated grades, and from T10 onward the grade cannot be strengthened by corroboration, because the deficiency lies in 'adalah (moral integrity), not merely dabt (precision).
The grading engine implements the classical five-condition framework of Ibn al-Salah (Muqaddimah):
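- 'Adalah (narrator integrity) and dabt (narrator precision): narrator reliability tier (section 3.3)
- Ittisal (chain continuity): chainContinuity field
- Absence of shudhudh (anomaly): shudhudh flag
- Absence of 'illah (hidden defect): crossLinks_ilal and ilalDefectCount
4.1 Resolution Threshold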
A hadith is graded only when 50% or more of its narrator positions carry resolved PIDs. Below this, the grade is set to not-graded with computedConfidence: "low".
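The threshold check can be illustrated with a short Python sketch. The function names are hypothetical, and representing a chain as a list of PID strings with `None` for unresolved positions is an assumption about the data shape.

```python
from typing import Optional


def resolution_rate(pids: list[Optional[str]]) -> float:
    """Fraction of narrator positions that carry a resolved PID."""
    if not pids:
        return 0.0
    return sum(pid is not None for pid in pids) / len(pids)


def is_gradable(pids: list[Optional[str]], threshold: float = 0.5) -> bool:
    """A chain is graded only when at least `threshold` of positions are resolved."""
    return resolution_rate(pids) >= threshold
```

A four-position chain with two resolved PIDs sits exactly at 0.5 and is graded; a three-position chain with one resolved PID (about 0.33) would receive not-graded.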
4.2 Base Grade from Weakest Narrator
| Weakest Tier | Base Grade |
|---|---|
| T1–T3 | sahih |
| T4–T6 | hasan |
| T7–T8 | da'if |
| T9–T11 | very-weak |
| T12 | mawdu' (fabricated) |
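As a minimal sketch, the table can be expressed as a lookup on the weakest (numerically largest) tier in the chain. The function name `base_grade` and the integer tier encoding are assumptions; the grade strings follow the computedGrade values listed in the data format section, and T12 is read as mapping to mawdu' rather than very-weak.

```python
def base_grade(tiers: list[int]) -> str:
    """Map the weakest narrator tier in a chain (1 to 12) to a base grade."""
    weakest = max(tiers)  # raises ValueError for an empty chain
    if weakest <= 3:
        return "sahih"
    if weakest <= 6:
        return "hasan"
    if weakest <= 8:
        return "daif"
    if weakest <= 11:
        return "very-weak"
    return "mawdu"
```

A chain with tiers `[1, 3, 5]` therefore starts at hasan, since a single T5 narrator sets the ceiling regardless of how strong the others are.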
4.3 Quality Caps
- Uncertain resolutions below T4 are capped at T4 (saduq) — benefit of the doubt
- Original tier T8 or worse: cap rises to T6 (maqbul)
- Phase-5 relational resolutions (father/grandfather) capped at T4 regardless
4.4 Chain Continuity Adjustments
- Broken chain (munqati' / mu'allaq): downgraded one level
- Uncertain chain with all T1–T3 narrators: conservatively set to hasan
- Continuous chain: no adjustment
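These adjustments might look like the following in Python. The grade ladder ordering, the function names, and the set of continuity values treated as broken are assumptions; in particular, "munqati" is used here as a hypothetical label, since the source does not give the exact field value for a broken chain.

```python
LADDER = ["sahih", "hasan", "daif", "very-weak"]  # assumed ordering, best to worst
BROKEN = {"munqati", "muallaq"}                   # assumed values marking a broken chain


def downgrade_one(grade: str) -> str:
    """Move one step down the ladder; the lowest or an unknown grade is unchanged."""
    if grade not in LADDER:
        return grade
    return LADDER[min(LADDER.index(grade) + 1, len(LADDER) - 1)]


def adjust_for_continuity(grade: str, continuity: str, tiers: list[int]) -> str:
    """Apply the three continuity rules to a base grade."""
    if continuity in BROKEN:
        return downgrade_one(grade)
    if continuity == "uncertain" and max(tiers) <= 3:
        return "hasan"  # conservative cap for uncertain chains of T1-T3 narrators
    return grade        # continuous (or otherwise unlisted) chains are unchanged
```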
4.5 Mursal Cap
If chainContinuity = "mursal" and no companion PID is found at the terminal position, the grade is capped at da'if. With 2+ independent supporting chains, a mursal may reach hasan li-ghayrihi.
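A hedged sketch of the mursal cap follows. The function and parameter names are hypothetical, the rank ordering is an assumption, and "may reach hasan li-ghayrihi" is simplified here to "does reach" whenever two or more independent chains support a capped da'if.

```python
RANK = {"sahih": 0, "hasan": 1, "daif": 2, "very-weak": 3, "mawdu": 4}  # assumed order


def apply_mursal_cap(grade: str, continuity: str,
                     terminal_is_companion: bool, supporting_chains: int = 0) -> str:
    """Cap a mursal chain at da'if unless a companion PID closes the chain."""
    if continuity != "mursal" or terminal_is_companion:
        return grade
    # Grades already at or below da'if are left as they are.
    capped = grade if RANK[grade] >= RANK["daif"] else "daif"
    if capped == "daif" and supporting_chains >= 2:
        return "hasan-li-ghayrihi"
    return capped
```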
4.6 Taqwiyah (Mutual Strengthening)
- Da'if + 2+ independent chains → upgraded to hasan li-ghayrihi
- Hasan + 3+ independent chains → upgraded to sahih li-ghayrihi
- Independence requirement: supporting chains must not share a common bottleneck narrator (madar). For 10+ chains, a square-root discount is applied.
- Hard floor: taqwiyah blocked when weakest narrator is T10+. A liar corroborated by other liars does not become truthful.
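The upgrade rules and the hard floor can be sketched as below. The function name is hypothetical, and `independent_chains` is assumed to be computed beforehand, already excluding chains that share a madar. The square-root discount for 10+ chains is not modelled because the source does not give its exact formula.

```python
def apply_taqwiyah(grade: str, weakest_tier: int, independent_chains: int) -> str:
    """Upgrade a grade through corroboration, subject to the T10+ hard floor."""
    if weakest_tier >= 10:
        return grade  # hard floor: corroboration is blocked
    if grade == "daif" and independent_chains >= 2:
        return "hasan-li-ghayrihi"
    if grade == "hasan" and independent_chains >= 3:
        return "sahih-li-ghayrihi"
    return grade
```

Keeping the floor as the first check means no number of supporting chains can lift a chain whose weakest narrator is T10 or worse, which matches the principle stated in the last bullet.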
4.7 Defect Handling
- Single 'illah: flagged, confidence reduced, no automatic downgrade
- Two+ defect records: downgraded one level
- Shudhudh: if flagged and grade is sahih, reduced to hasan
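A minimal Python sketch of these three rules. The names, the grade ladder, and the note strings are assumptions, as is the order of application ('illah first, then shudhudh), which the source does not specify.

```python
LADDER = ["sahih", "hasan", "daif", "very-weak"]  # assumed ordering, best to worst


def downgrade_one(grade: str) -> str:
    """Move one step down the ladder; the lowest or an unknown grade is unchanged."""
    if grade not in LADDER:
        return grade
    return LADDER[min(LADDER.index(grade) + 1, len(LADDER) - 1)]


def apply_defects(grade: str, ilal_defect_count: int,
                  shudhudh: bool) -> tuple[str, list[str]]:
    """Return the adjusted grade plus human-readable notes."""
    notes: list[str] = []
    if ilal_defect_count >= 2:
        grade = downgrade_one(grade)
        notes.append("two or more defect records: downgraded one level")
    elif ilal_defect_count == 1:
        notes.append("single 'illah: flagged, confidence reduced")
    if shudhudh and grade == "sahih":
        grade = "hasan"
        notes.append("shudhudh flagged: sahih reduced to hasan")
    return grade, notes
```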
4.8 Hawala Handling
The hawala marker (ح) indicates a chain-switch — 8,751 records flagged. The grading engine grades the primary (first) chain only. The secondary chain is noted in autoGradeDetail but does not override.
4.9 Anonymous Narrator Penalty
Collective or anonymous references (nas, rajul, ba'd ashabihi) are assigned T9 (matruk/majhul) because no individual can be identified for reliability assessment.
| Classification | Meaning |
|---|---|
| continuous | Standard connected chain: each narrator heard directly from the next, verified by temporal overlap and known teacher-student relationships |
| muttasil | Verified as connected to a companion; initially ambiguous, later confirmed |
| likely-continuous | Probable connection based on death-date overlap and generational proximity, without explicit documentation |
| scholarly-verified | Continuity confirmed by classical scholarship (e.g., al-Mizzi in Tahdhib al-Kamal) |
| mursal | Chain does not reach a companion through verified hearing: a tabi'i reports directly from the Prophet |
| muallaq | Suspended: one or more narrators at the beginning omitted by the compiler |
| compilation | Compiler's own chain or editorial arrangement |
| uncertain | Insufficient data to determine connectivity |
| parser-error | Chain text could not be reliably parsed; receives not-graded |
Continuity is determined by checking adjacent narrator pairs. Chain break severity = impossible pairs / total pairs. Severity above 0.3 classifies the chain as broken.
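The severity ratio can be written out as a short Python sketch. The function names and the input shape (one boolean per adjacent narrator pair, `True` when the pair is plausible) are assumptions; how each pair is judged possible or impossible is outside this example.

```python
def chain_break_severity(pair_possible: list[bool]) -> float:
    """Impossible adjacent pairs divided by total adjacent pairs."""
    if not pair_possible:
        return 0.0
    impossible = sum(not ok for ok in pair_possible)
    return impossible / len(pair_possible)


def is_broken(pair_possible: list[bool], threshold: float = 0.3) -> bool:
    """A chain is classified as broken when severity exceeds the threshold."""
    return chain_break_severity(pair_possible) > threshold
```

For a five-narrator chain there are four adjacent pairs: one impossible pair gives a severity of 0.25 (not broken), while two give 0.5 (broken).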
6a. Transmission Formulas
Tasrih (explicit hearing): haddathana/haddathani, sami'tu, akhbarana/akhbarani, anba'ana — these explicitly indicate direct hearing.
'An'anah (ambiguous): the formula 'an ("from") does not explicitly state direct hearing. When the narrator is a known mudallis of severity 3+, 'an'anah triggers a chain-level flag.
6b. Tadlis Detection
The registry catalogues 48 narrators with severity levels 1–5, derived from Ibn Hajar's Tabaqat al-Mudalliseen.
| Level | Description | Treatment of 'an'anah |
|---|---|---|
| 1 | Rarely practiced tadlis | Accepted |
| 2 | Scholars tolerated due to status or rarity | Generally accepted |
| 3 | Scholars differed; significant number practiced frequently | Not accepted without tasrih |
| 4 | Scholars rejected their 'an'anah altogether | Not accepted |
| 5 | Weak narrators who also practiced tadlis | Not accepted |
Level 3+ with 'an'anah triggers a one-level downgrade. Note: ~388,000 positions (~21%) have no parsed transmission formula — a parser-level limitation.
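The trigger condition could be checked roughly as follows. The function name is hypothetical, and the `severity` key inside `_mudallis` is an assumption about how that object is laid out; `formulaType` and `_mudallis` are the field names given in the isnad structure.

```python
def has_tadlis_flag(positions: list[dict]) -> bool:
    """True if any position pairs 'an'anah with a mudallis of severity 3 or higher."""
    for pos in positions:
        mudallis = pos.get("_mudallis") or {}
        if pos.get("formulaType") == "ananah" and mudallis.get("severity", 0) >= 3:
            return True
    return False
```

Positions with no parsed formula carry no `formulaType` and are never flagged by this check, so the parser limitation noted above translates into possible missed detections rather than false positives.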
6c. Ikhtilat (Mental Deterioration)
Structured data on 29 mukhtalit narrators with onset year, pre/post student lists, and biographical sources. When detected, records are flagged with _ikhtilat: true.
6d. 'Ilal (Hidden Defects)
16,082 entries cross-linked from al-Daraqutni's al-'Ilal al-Waridah. Defect types: mursal, mawquf-as-marfu', tadlis, wahm (error), da'if chain. Two+ cross-links trigger a one-level downgrade.
6e. Attestation Levels
| Level | Chains | Count |
|---|---|---|
| gharib | 1 (solitary) | — |
| 'aziz | 2 | — |
| mashhur | 3+ | 12,209 clusters |
| mutawatir | Mass-transmitted | 1,161 clusters |
In addition, the corpus identifies 6,346 common-link clusters (chains converging on a single pivotal transmitter). Attestation is computed at the matn cluster level.
Matn criticism runs as a two-pass computational architecture, the first of its kind applied at corpus scale.
What It Checks
- Quran contradictions
- Fabrication patterns
- Anachronistic vocabulary
- Chain-matn conflicts
- Prophetic linguistic baseline deviation
Results
- 280 likely fabrications identified
- 91 chain-matn conflicts detected
- 6,822 known fabrications validated against reference works
Prophetic Linguistic Baseline
Average word count: 39 words. Saj' (rhyming prose) density: 0.033. These baselines help identify texts that deviate significantly from the established prophetic speech patterns.
8.1 Convergence with Scholarly Consensus
These rates were independently achieved — the system was not tuned to match Bukhari or Muslim.
8.2 Integrity Checks
| Check | Criterion | Result |
|---|---|---|
| Mursal graded sahih | Should never independently receive sahih | 0 violations |
| T10+ in Sahihayn | Abandoned narrators should not appear in Bukhari/Muslim | 0 violations |
| Grade consistency | computedGrade = autoGrade across 250-record test set | 100% agreement |
| Top-500 PID audit | Manual verification of 500 most frequent PIDs | 436/500 (87.2%) |
8.3 Grade Distribution
| Grade | Count | % |
|---|---|---|
| sahih | 47,441 | 10.6% |
| sahih li-ghayrihi | 111,996 | 24.9% |
| hasan | 71,160 | 15.8% |
| hasan li-ghayrihi | 52,208 | 11.6% |
| da'if | 33,645 | 7.5% |
| very-weak | 24,859 | 5.5% |
| not-graded | 98,769 | 22.0% |
Total graded: 350,646 (78.0%). Taqwiyah upgrades: 168,284. Quality caps applied: 139,531. Ilal-flagged: 7,009.
Narrator Resolution
- 16.6% null PID rate — genuine disambiguation gaps (ambiguous kunyahs, relational references, single-name narrators)
- Death year approximation — for ~39,799 entries, estimated from tabaqah rather than documented sources
Textual Verification
- Arabic matn text not collated against critical printed editions
- Hawala records flagged but not physically split (8,751 records)
Enrichment Coverage
- Matn cluster coverage: 61% — hadith without cluster assignments do not benefit from cross-chain attestation
- Mudallis registry covers 48 of ~150+ documented mudalliseen
- Ikhtilat database covers 29 narrators — classical sources document dozens more
- Shudhudh detection is flag-based, not comprehensive
Methodological Scope
- No fiqhi (jurisprudential) context — grading is purely chain-based
- Single-chain grading — the corpus does not perform full takhrij
JSONL format — one JSON object per line. Primary index: ipsc-hadith-v3.jsonl (449,415 records).
Core Fields
| Field | Type | Description |
|---|---|---|
| id | string | Unique identifier (e.g., bukhari-sahih-000001) |
| workId | string | Collection identifier |
| collection | string | Human-readable collection name |
| hadithNumber | string | Number within collection |
| arabicText | string | Full Arabic text (isnad + matn) |
| isnad | string | Separated chain of transmission (Arabic) |
| matn | string | Separated body text (Arabic) |
| hadithType | string | marfu, mawquf, maqtu, tafsir, mawdu |
Isnad Structure
| Field | Type | Description |
|---|---|---|
| position | number | Ordinal position (1 = compiler's source) |
| name | string | Narrator name as it appears in Arabic |
| canonicalPersonId | string or null | Resolved PID or null |
| formula | string | Transmission formula text |
| formulaType | string | tasrih or ananah |
| _nrs | object | Embedded NRS: tier, label, deathAH, isCompanion |
| _mudallis | object | Severity and requiresTasrih (if applicable) |
| _resolvedBy | string | Resolution method |
Grading Fields
| Field | Type | Description |
|---|---|---|
| computedGrade | string | Final grade: sahih, sahih-li-ghayrihi, hasan, hasan-li-ghayrihi, daif, very-weak, mawdu, not-graded |
| autoGrade | string | V3 final regrade pass |
| chainContinuity | string | Connectivity classification |
| computedConfidence | string | high, medium, or low |
| gradingNotes | array | Human-readable grading explanations |
Enrichment Flags
| Field | Type | Description |
|---|---|---|
| _ikhtilat | boolean | Mukhtalit narrator in chain |
| crossLinks_ilal | array | Defect record references |
| ilalDefectCount | number | Count of known defects |
| shudhudh | boolean | Textual anomaly detected |
| isCompound | boolean | Contains hawala chain-switch marker |
| attestationLevel | string | gharib, aziz, mashhur, or mutawatir |
| resolutionRate | number | Fraction of positions with resolved PIDs |
Supplementary Indexes
| Index | Records | Description |
|---|---|---|
| ipsc-narrators-v3 | 37,046 | Full narrator database (NRS + biographical) |
| ipsc-ilal-v3 | 16,082 | Hidden defect records from al-Daraqutni |
| ipsc-matn-clusters-v3 | 54,270 | Matn cluster records with English summaries |
| ipsc-presentation-v3 | 6 | Presentation-layer summary statistics |
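Because every index is JSONL, records can be streamed one line at a time without loading the whole file. The sketch below tallies grades using only the Python standard library. The function names are hypothetical, and the file path in the usage comment assumes the primary index sits in the working directory.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def grade_counts(lines: Iterable[str]) -> Counter:
    """Tally records by their computedGrade field (None if the field is absent)."""
    return Counter(rec.get("computedGrade") for rec in iter_records(lines))


# Usage, assuming the primary index is available locally:
# with open("ipsc-hadith-v3.jsonl", encoding="utf-8") as fh:
#     print(grade_counts(fh))
```

Passing an open file handle works because iterating over a text file yields its lines, so memory use stays flat even for the full corpus.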
Individual Hadith
IPSC V3 Corpus, TheoAI / Eve Theology LLC, 2026. Hadith [id], graded [computedGrade]. Narrator resolution via NRS v3 (27,118 entries). Methodology: docs/methodology-v3.md.
Corpus-Level
TheoAI / Eve Theology LLC. Islamic Primary Source Corpus (IPSC), Version 3. 2026. 449,415 hadith across 86 classical works, computationally graded via chain analysis with 27,118 narrator reliability entries.
Methodology
TheoAI / Eve Theology LLC. "IPSC V3 Technical Methodology." 2026. Covers narrator resolution pipeline, twelve-tier assessment reconciliation, five-condition grading algorithm, tadlis detection, ikhtilat flagging, and ilal cross-linking.
Specific Grading Decision
Hadith [id] graded [grade] per IPSC V3 methodology: weakest narrator [worstNarrator] at tier [worstTier] ([worstLabel]), chain continuity: [chainContinuity], supporting chains: [supportingChains]. Taqwiyah: [applied/not applied]. See autoGradeDetail for full provenance.