Ingest CCSS FAQ pages as a single chunk #3

@MathyouMB

NOTE: Don't assign yourself unless you have confirmed with Matthew that you've got a working environment.

🧠 Context

When ingesting FAQ pages from the Carleton Computer Science Society (CCSS) site (URLs matching https://ccss.carleton.ca/resources/faq/questions/**), the current logic splits each page into multiple chunks, including one that contains only the footer text (© 2025 Carleton Computer Science Society). That footer-only chunk pollutes the index.

These pages should be treated as structured, self-contained documents. Rather than splitting them up, we should ingest the entire page as a single chunk and explicitly exclude generic or boilerplate content like the footer.


🛠 Implementation Plan

  1. In WebpageIngestionService, detect if the source URL starts with https://ccss.carleton.ca/resources/faq/questions/.

  2. If it matches, bypass the default chunking logic and instead:

    • Strip out the footer and boilerplate content.
    • Store the entire cleaned-up page content as one chunk.

  3. Add a unit test to ensure:

    • The page is ingested as a single chunk.
    • The chunk does not contain the © 2025 Carleton Computer Science Society text.
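
The plan above could be sketched roughly as follows. This is only an illustration in Python, not the project's actual code: the function names (`is_ccss_faq_url`, `strip_boilerplate`, `default_chunk`, `chunk_page`) are hypothetical, the paragraph-splitting `default_chunk` is a stand-in for whatever WebpageIngestionService currently does, and the footer is assumed to sit on its own line in the form "© <year> Carleton Computer Science Society".

```python
import re

# URL prefix taken from the issue; pages under it are treated as FAQ pages.
CCSS_FAQ_PREFIX = "https://ccss.carleton.ca/resources/faq/questions/"

# Assumption: the footer occupies its own line, e.g.
# "© 2025 Carleton Computer Science Society". The year is matched loosely
# so the pattern keeps working after the copyright year changes.
FOOTER_PATTERN = re.compile(
    r"^[ \t]*©[ \t]*\d{4}[ \t]+Carleton Computer Science Society[ \t]*$",
    re.MULTILINE,
)


def is_ccss_faq_url(url: str) -> bool:
    """Return True when the source URL is a CCSS FAQ question page."""
    return url.startswith(CCSS_FAQ_PREFIX)


def strip_boilerplate(text: str) -> str:
    """Remove the footer line and tidy up leftover blank lines."""
    without_footer = FOOTER_PATTERN.sub("", text)
    # Collapse runs of three or more newlines left behind by the removal.
    return re.sub(r"\n{3,}", "\n\n", without_footer).strip()


def default_chunk(text: str) -> list[str]:
    """Hypothetical stand-in for the existing chunker: split on blank lines."""
    return [part.strip() for part in text.split("\n\n") if part.strip()]


def chunk_page(url: str, text: str) -> list[str]:
    """Return one cleaned chunk for CCSS FAQ pages, default chunks otherwise."""
    if is_ccss_faq_url(url):
        cleaned = strip_boilerplate(text)
        return [cleaned] if cleaned else []
    return default_chunk(text)
```

Keeping the URL check and the footer removal in separate small functions means the unit test from step 3 can exercise each one on its own. Any other boilerplate (navigation, headers) would need its own patterns, which the issue does not specify.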

✅ Acceptance Criteria

  • If the URL matches the pattern https://ccss.carleton.ca/resources/faq/questions/**, ingest the page as a single chunk.
  • Do not split the content into multiple chunks.
  • Exclude footer content such as © 2025 Carleton Computer Science Society from the chunk.
  • The resulting chunk should contain only the meaningful FAQ content.
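
As a rough aid for reviewing a pull request against these criteria, a small checker along the following lines could be run over the chunks produced for a given URL. It is a sketch under stated assumptions: `acceptance_violations` is a hypothetical helper, chunks are assumed to be plain strings, and the footer is matched as the literal text quoted in this issue. The "only meaningful FAQ content" criterion is not something this check can verify beyond rejecting empty chunks.

```python
# URL prefix and footer text are copied from the issue description.
CCSS_FAQ_PREFIX = "https://ccss.carleton.ca/resources/faq/questions/"
FOOTER_TEXT = "© 2025 Carleton Computer Science Society"


def acceptance_violations(url: str, chunks: list[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means pass.

    Hypothetical helper: chunks are assumed to be plain strings.
    """
    # The criteria only apply to CCSS FAQ question pages.
    if not url.startswith(CCSS_FAQ_PREFIX):
        return []

    problems = []
    if len(chunks) != 1:
        problems.append(f"expected exactly 1 chunk, got {len(chunks)}")
    for index, chunk in enumerate(chunks):
        if FOOTER_TEXT in chunk:
            problems.append(f"chunk {index} contains the footer text")
        if not chunk.strip():
            problems.append(f"chunk {index} is empty")
    return problems
```

For example, passing the FAQ URL with two chunks, one of which is the footer, would report both the chunk count and the footer as problems, while a single footer-free chunk returns an empty list.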

Metadata

Assignees: No one assigned
Labels: enhancement (New feature or request)
Status: Ready
