Ingest SCS page chunks from main section #4

## **NOTE:** Don't assign yourself unless you have confirmed with Matthew that you have a working environment


## 🧠 Context

Pages under `https://carleton.ca/scs/**` follow a consistent layout in which the meaningful page content sits inside a specific section of the HTML (`<div id="content">` or similar). Our current ingestion logic does not account for this, so it may pick up irrelevant navigation bars, side menus, or other layout elements.

See the green section in the screenshot below. We don't want the navbar ingested each time.

![Image](https://github.com/user-attachments/assets/9ce9bccf-b07b-49be-a104-0891abf7099e)

To improve quality and consistency, we should restrict ingestion for these pages to only the main content section.

---

## 🛠 Implementation Plan

1. In `WebpageIngestionService`, detect if the source URL starts with `https://carleton.ca/scs/`.
2. If it matches:

   * Parse the HTML and extract only the content within the main section (typically `<div id="content">`).
   * Use this content for chunking instead of the full page body.
3. Add a test with a sample HTML page from `carleton.ca/scs` to verify that only the expected content is ingested.
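
The steps above could be sketched roughly as follows. This is a minimal illustration only, not the actual `WebpageIngestionService` code: it uses Python's standard-library `html.parser`, and the names `SCS_PREFIX`, `MainContentExtractor`, and `extract_text_for_chunking` are hypothetical. It also assumes the main section is a `<div>` with `id="content"`, which should be checked against real SCS pages.

```python
from html.parser import HTMLParser

# Assumption: prefix taken from the issue text; adjust if the real rule differs.
SCS_PREFIX = "https://carleton.ca/scs/"


class MainContentExtractor(HTMLParser):
    """Collects text found inside the first <div id="..."> container."""

    def __init__(self, container_id="content"):
        super().__init__()
        self.container_id = container_id
        self.depth = 0      # nesting depth of <div> tags inside the container
        self.done = False   # True once the container has been closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth > 0:
            # A nested <div> inside the container.
            self.depth += 1
        elif dict(attrs).get("id") == self.container_id:
            # Entering the main content container.
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        # Keep only non-empty text that appears inside the container.
        if self.depth > 0 and data.strip():
            self.parts.append(data.strip())


def extract_text_for_chunking(url, html):
    """Return main-section text for SCS pages.

    Returns None when the URL is not an SCS page or no container is found,
    so the caller can fall back to its existing full-page path (hypothetical).
    """
    if not url.startswith(SCS_PREFIX):
        return None
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts) if parser.parts else None
```

Returning `None` when no container is found is one possible design choice: it lets the ingestion path keep its current behaviour for pages that do not match the expected layout, rather than producing empty chunks. The test in step 3 could feed a saved SCS page through this function and assert that navbar and footer text are absent.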

---

## ✅ Acceptance Criteria

* When ingesting pages from `https://carleton.ca/scs/**`, extract content only from the main content section of the page (e.g., `<div id="content">`).
* Exclude headers, navigation, footers, sidebars, or any boilerplate elements.
* The chunk(s) should contain only the relevant main body content.



