30 changes: 24 additions & 6 deletions docs/handbook-internals.md
@@ -47,17 +47,35 @@ same thing:
structured AND/OR tree of unit-code references in `rule` JSONB. The
`description` field is **empty 99.9% of the time** — do not render
it. The rule tree is the authoritative source.
- **`enrolment_rules`** are program-level constraints ("must be
- **`enrolment_rules`** are mostly program-level constraints ("must be
enrolled in Bachelor of IT", "must have 48cp in Art, Design and
Architecture"). They ship as HTML prose only — no structured tree —
and they always have a populated `description`. You can't evaluate
these programmatically without NLP; just render the HTML.
and they always have a populated `description`. Most you can't
evaluate programmatically without NLP; just render the HTML.

**The leaky exception:** ~2,340 unit-years (Science, Engineering,
Pharmacy, Education and others) put their *unit-level* PREREQUISITE /
PROHIBITION / CO-REQUISITE refs *here* instead of in `requisites`,
as `<strong>PREREQUISITE</strong>: <a href=".../units/MTH1030">…`
prose. So a unit with an empty `requisites` tree is **not**
necessarily requisite-free — check `enrolment_rules` too. The ingest
extractor (`packages/ingest/src/parse.ts`) and migration `0007`
pull these into `requisite_refs`. Gotchas that bit the first pass:
one description can carry several labels (121 mix PREREQ +
PROHIBITION), the unit links use *both* the `handbook.monash.edu`
and legacy `www.monash.edu/pubs/.../units/CODE.html` hosts, the same
prose links to `/courses/` and `/aos/` (which must **not** become
unit edges), and some units list themselves. Extraction is
anchor-only and per-`<strong>`-section; plain-text codes ("…or
MTH1040") are deliberately left unparsed (NLP-only; risks reading
course codes like `4531`/`M6011` as units).
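The extraction rules above can be sketched in TypeScript. This is a minimal illustration, not the code in `packages/ingest/src/parse.ts`: the names `extractEdges`, `classify`, `RequisiteEdge` and `LABELS` are hypothetical, and the regular expressions are copied from the ones migration `0007` uses so the two stay comparable.

```typescript
// Hypothetical sketch; the real extractor lives in packages/ingest/src/parse.ts.
type RequisiteType = "prerequisite" | "prohibition" | "corequisite";

interface RequisiteEdge {
  requisiteType: RequisiteType;
  requiresUnitCode: string;
}

// Label patterns mirror the CASE arms in migration 0007.
const LABELS: { pattern: RegExp; type: RequisiteType }[] = [
  { pattern: /^<strong[^>]*>\s*PREREQUISITE/i, type: "prerequisite" },
  { pattern: /^<strong[^>]*>\s*PROHIBITION/i, type: "prohibition" },
  { pattern: /^<strong[^>]*>\s*CO-?REQUISITE/i, type: "corequisite" },
];

// Returns the requisite type a section is labelled with, or null if unlabelled.
function classify(segment: string): RequisiteType | null {
  for (const label of LABELS) {
    if (label.pattern.test(segment)) return label.type;
  }
  return null;
}

function extractEdges(unitCode: string, description: string): RequisiteEdge[] {
  const self = unitCode.toUpperCase();
  const seen: Record<string, boolean> = {};
  const edges: RequisiteEdge[] = [];
  // Split just before every <strong so each label owns only its own section.
  for (const segment of description.split(/(?=<strong)/)) {
    const type = classify(segment);
    if (type === null) continue; // text before the first label, or an unknown label
    // Anchor-only: match /units/CODE hrefs; /courses/ and /aos/ never match.
    const unitLink = /\/units\/([A-Za-z][A-Za-z0-9]+)/g;
    let match: RegExpExecArray | null;
    while ((match = unitLink.exec(segment)) !== null) {
      const code = match[1].toUpperCase();
      const key = type + ":" + code;
      if (code === self || seen[key]) continue; // drop self-references and duplicates
      seen[key] = true;
      edges.push({ requisiteType: type, requiresUnitCode: code });
    }
  }
  return edges;
}
```

For a description whose PREREQUISITE section links `MTH1030` and mentions `MTH1040` only as plain text, this yields a single prerequisite edge to `MTH1030`. Matching on the `/units/` path segment rather than on the host is what lets one pattern cover both the current and the legacy URL forms.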

For graph-shaped queries on requisites ("what requires X?", "what
unlocks after X?"), use `requisite_refs` — it's the flat edge view of
the trees. Use `requisites.rule` only when you need AND/OR semantics
for validation ("does this student's set of completed units satisfy
this block?").
the trees, **plus** the `enrolment_rules`-derived edges above. Use
`requisites.rule` only when you need AND/OR semantics for validation
("does this student's set of completed units satisfy this block?") —
note the rule tree does *not* include the `enrolment_rules` edges.
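A rough TypeScript sketch of the two access patterns follows. The JSONB layout of `requisites.rule` is not shown in this document, so the `RuleNode` shape here is an assumption made purely for illustration, as are the names `satisfies`, `whatRequires` and `RequisiteRef`.

```typescript
// ASSUMED tree shape: the real `rule` JSONB layout may differ.
type RuleNode =
  | { kind: "unit"; code: string }
  | { kind: "and" | "or"; children: RuleNode[] };

// Validation needs AND/OR semantics, so it walks the tree.
function satisfies(rule: RuleNode, completed: string[]): boolean {
  if (rule.kind === "unit") return completed.indexOf(rule.code) !== -1;
  if (rule.kind === "and") {
    return rule.children.every((child) => satisfies(child, completed));
  }
  return rule.children.some((child) => satisfies(child, completed));
}

// Hypothetical row type for the flat edge view (requisite_refs).
interface RequisiteRef {
  unitCode: string;
  requisiteType: string;
  requiresUnitCode: string;
}

// Graph queries only need the edges: "what requires X?"
function whatRequires(edges: RequisiteRef[], code: string): string[] {
  const result: string[] = [];
  for (const edge of edges) {
    if (
      edge.requisiteType === "prerequisite" &&
      edge.requiresUnitCode === code &&
      result.indexOf(edge.unitCode) === -1
    ) {
      result.push(edge.unitCode);
    }
  }
  return result;
}
```

The edge query deliberately ignores grouping: an edge says a unit is mentioned somewhere in a requisite, not that it is mandatory, which is why validation has to go back to the tree.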

## Graph shape: what references what

49 changes: 49 additions & 0 deletions packages/db/drizzle/0007_backfill_enrolment_rule_refs.sql
@@ -0,0 +1,49 @@
-- Backfill requisite_refs for units that record their PREREQUISITE,
-- PROHIBITION, or CO-REQUISITE relationships as HTML prose in
-- enrolment_rules rather than in the structured requisites field.
-- (~2,340 unit-years across Science, Engineering, Pharmacy, Education and
-- others, all seven handbook years.)
--
-- Extraction is anchor-based and high-precision:
--   * The description is split into sections at each <strong> label, so a
--     description carrying several labels attributes each unit link to its
--     OWN section rather than the whole blob. This matters: 121 descriptions
--     mix PREREQUISITE and PROHIBITION, and 81/32 mix CO-REQUISITE with
--     PREREQUISITE/PROHIBITION -- classifying the whole blob would mislabel
--     ~126 edges (e.g. tag a prohibited unit as a prerequisite).
--   * Only /units/CODE hrefs are taken, across every handbook URL host the
--     corpus uses (handbook.monash.edu/<year>/units/CODE plus the legacy
--     www[3].monash.edu/pubs/.../units/CODE.html). The /courses/ and /aos/
--     links that appear in the same prose ("incompatible with course
--     versions E3001, ...") are intentionally ignored.
--   * Self-references are dropped (a unit listing itself, e.g. CHM3990's own
--     corequisite -- 105 such artifacts in the corpus).
--
-- NOT extracted: plain-text codes with no anchor ("...or MTH1040",
-- "LAW1100 or LAW1101"). Parsing those needs NLP and would mistake course
-- codes (4531, M6011) for units. See docs/handbook-internals.md.
--
-- This is kept in lockstep with the ingest extractor in
-- packages/ingest/src/parse.ts so a re-ingest reproduces exactly these rows.
-- ON CONFLICT DO NOTHING makes the insert idempotent, so it is safe to re-run
-- and never duplicates a structured-requisite row (the two sources are
-- effectively disjoint: only a single incidental overlap across the whole
-- 2020-2026 corpus).

--> statement-breakpoint

INSERT INTO requisite_refs (year, unit_code, requisite_type, requires_unit_code)
SELECT DISTINCT
  er.year,
  er.unit_code,
  (CASE
    WHEN seg ~* '^<strong[^>]*>\s*PREREQUISITE' THEN 'prerequisite'
    WHEN seg ~* '^<strong[^>]*>\s*PROHIBITION' THEN 'prohibition'
    WHEN seg ~* '^<strong[^>]*>\s*CO-?REQUISITE' THEN 'corequisite'
  END)::requisite_type,
  upper(m[1])
FROM enrolment_rules er,
  regexp_split_to_table(er.description, '(?=<strong)') AS seg,
  regexp_matches(seg, '/units/([A-Za-z][A-Za-z0-9]+)', 'g') AS m
WHERE seg ~* '^<strong[^>]*>\s*(PREREQUISITE|PROHIBITION|CO-?REQUISITE)'
  AND er.unit_code <> upper(m[1])
ON CONFLICT (year, unit_code, requisite_type, requires_unit_code) DO NOTHING;