NOTE: Don't assign yourself unless you have confirmed with Matthew that you've got a working environment.
🧠 Context
Pages under https://carleton.ca/scs/** follow a consistent layout where the meaningful page content is contained within a specific section of the HTML (<div id="content"> or similar). However, our current ingestion logic does not account for this, and as a result, it may pick up irrelevant navigation bars, side menus, or other layout elements.
See the green section. We don't want the navbar ingested each time.

To improve quality and consistency, we should restrict ingestion for these pages to only the main content section.
🛠 Implementation Plan
- In WebpageIngestionService, detect if the source URL starts with https://carleton.ca/scs/.
- If it matches:
  - Parse the HTML and extract only the content within the main section (typically <div id="content">).
  - Use this content for chunking instead of the full page body.
- Add a test with a sample HTML page from carleton.ca/scs to verify that only the expected content is ingested.
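The plan above could be sketched roughly as follows. This is only an illustration, assuming Python and the standard-library html.parser; the actual language and HTML library used by WebpageIngestionService are not stated in this ticket. The names SCS_PREFIX, _ContentDivExtractor and extract_scs_main_content are hypothetical, the id is assumed to be exactly "content", and returning None so the caller can fall back to the existing full-page logic is also an assumption.

```python
from html.parser import HTMLParser
from typing import Optional

# Prefix taken from the ticket; the constant name is hypothetical.
SCS_PREFIX = "https://carleton.ca/scs/"


class _ContentDivExtractor(HTMLParser):
    """Collects text found inside <div id="content"> (hypothetical helper)."""

    def __init__(self, target_id: str = "content") -> None:
        super().__init__(convert_charrefs=True)
        self.target_id = target_id
        self.depth = 0   # <div> nesting depth inside the target div; 0 means outside
        self.skip = 0    # > 0 while inside <script> or <style>
        self.parts = []  # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            if tag == "div":
                self.depth += 1
            elif tag in ("script", "style"):
                self.skip += 1
        elif tag == "div" and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0:
            if tag == "div":
                self.depth -= 1
            elif tag in ("script", "style") and self.skip > 0:
                self.skip -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth > 0 and self.skip == 0 and text:
            self.parts.append(text)


def extract_scs_main_content(url: str, html: str) -> Optional[str]:
    """Return main-content text for SCS pages, or None to signal a fallback.

    None is returned when the URL is outside the SCS prefix or when no
    <div id="content"> is found (assumed fallback behaviour).
    """
    if not url.startswith(SCS_PREFIX):
        return None
    parser = _ContentDivExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts) if parser.parts else None
```

A caller would pass the returned text to chunking when it is not None and keep the current behaviour otherwise. A sample-page test can then check that navbar and footer text never appears in the result.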
✅ Acceptance Criteria
- When ingesting pages from https://carleton.ca/scs/**, extract content only from the main content section of the page (e.g., <div id="content">).
- Exclude headers, navigation, footers, sidebars, or any boilerplate elements.
- The chunk(s) should contain only the relevant main body content.