NOTE: Don't assign yourself unless you have confirmed with Matthew that you've got a working environment.
🧠 Context
When ingesting FAQ pages from the Carleton Computer Science Society (CCSS) site (i.e., URLs like https://ccss.carleton.ca/resources/faq/questions/**), the current logic creates multiple chunks—including one just for the footer text (© 2025 Carleton Computer Science Society), which pollutes the index.
These pages should be treated as structured, self-contained documents. Rather than splitting them up, we should ingest the entire page as a single chunk and explicitly exclude generic or boilerplate content like the footer.
🛠 Implementation Plan
- In WebpageIngestionService, detect if the source URL starts with https://ccss.carleton.ca/resources/faq/questions/.
- If it matches, bypass the default chunking logic and instead:
  - Strip out the footer and boilerplate content.
  - Store the entire cleaned-up page content as one chunk.
- Add a unit test to ensure:
  - The page is ingested as a single chunk.
  - The chunk does not contain the © 2025 Carleton Computer Science Society text.
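The plan above could be sketched roughly as follows. This is a minimal illustration in Python, not the actual WebpageIngestionService code: the names `CCSS_FAQ_PREFIX`, `FOOTER_PATTERN`, `is_ccss_faq_url`, `strip_boilerplate`, `default_chunker`, and `chunk_page` are all hypothetical, `default_chunker` is only a stand-in for the existing chunking logic, and matching any four-digit year in the footer (rather than only 2025) is an assumption.

```python
import re

# URL prefix taken from the issue text; the constant name is hypothetical.
CCSS_FAQ_PREFIX = "https://ccss.carleton.ca/resources/faq/questions/"

# Assumption: the footer sits on its own line. Any four-digit year is matched
# so the rule keeps working after 2025.
FOOTER_PATTERN = re.compile(
    r"^[ \t]*©[ \t]*\d{4}[ \t]+Carleton Computer Science Society[ \t]*$",
    re.MULTILINE,
)


def is_ccss_faq_url(url: str) -> bool:
    """Return True when the URL points at a CCSS FAQ question page."""
    return url.startswith(CCSS_FAQ_PREFIX)


def strip_boilerplate(text: str) -> str:
    """Remove the copyright footer line and tidy the leftover whitespace."""
    without_footer = FOOTER_PATTERN.sub("", text)
    # Collapse runs of three or more newlines left behind by the removal.
    collapsed = re.sub(r"\n{3,}", "\n\n", without_footer)
    return collapsed.strip()


def default_chunker(text: str) -> list[str]:
    """Stand-in for the existing chunking logic: one chunk per paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def chunk_page(url: str, text: str) -> list[str]:
    """Return a single cleaned chunk for CCSS FAQ pages, default chunks otherwise."""
    if is_ccss_faq_url(url):
        cleaned = strip_boilerplate(text)
        return [cleaned] if cleaned else []
    return default_chunker(text)
```

A unit test along the lines described above would call `chunk_page` with a FAQ URL and a page body ending in the footer, then assert that exactly one chunk comes back and that it does not contain the footer text.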
✅ Acceptance Criteria
- If the URL matches the pattern https://ccss.carleton.ca/resources/faq/questions/**, ingest the page as a single chunk.
- Do not split the content into multiple chunks.
- Exclude footer content such as © 2025 Carleton Computer Science Society from the chunk.
- The resulting chunk should contain only the meaningful FAQ content.