After source re-ingestion, clean up the source's old chunks #1

@MathyouMB

NOTE: Don't assign yourself unless you have confirmed with Matthew that you've got a working environment.

🧠 Context

Currently, re-ingesting a previously indexed source will append new chunks without removing the old ones. This results in duplicate content in our system. To address this, we need to clean up any old chunks associated with a source after ingestion completes.

Chunks can be considered stale if they were created more than 2 days before the source's latest last_synced_at timestamp.
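
The staleness rule can be written as a small predicate. This is a minimal sketch: the names `STALE_AFTER` and `is_stale` are illustrative and not part of the codebase, and it assumes both timestamps are comparable `datetime` values (both naive or both timezone-aware).

```python
from datetime import datetime, timedelta

# Threshold taken from the issue text; the constant name is hypothetical.
STALE_AFTER = timedelta(days=2)


def is_stale(chunk_created_at: datetime, source_last_synced_at: datetime) -> bool:
    """Return True when the chunk predates the latest sync by more than 2 days."""
    return chunk_created_at < source_last_synced_at - STALE_AFTER


last_synced = datetime(2024, 1, 10, 12, 0)
print(is_stale(datetime(2024, 1, 7, 12, 0), last_synced))  # True: 3 days older
print(is_stale(datetime(2024, 1, 9, 12, 0), last_synced))  # False: 1 day older
```

Note that the comparison is strict, so a chunk created exactly 2 days before `last_synced_at` is kept.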


🛠 Implementation Plan

  1. Update src/ingestion/services/webpage_ingestion_service.py to schedule a follow-up cleanup task once ingestion completes.

  2. Create a new task in src/ingestion/tasks (or update an existing relevant one) that:

    • Accepts the source_id.
    • Queries for chunks tied to the source.
    • Deletes any chunk where chunk.created_at < source.last_synced_at - timedelta(days=2).

  3. Add logging to confirm that cleanup ran and how many chunks were deleted.

  4. Write a test that verifies old chunks are deleted after ingestion.
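
The plan above could be sketched roughly as follows. This is not the project's actual task code: `Source`, `Chunk`, and `cleanup_stale_chunks` are hypothetical stand-ins, and plain in-memory collections replace whatever database and task queue the project really uses.

```python
import logging
from dataclasses import dataclass
from datetime import datetime, timedelta

logger = logging.getLogger("ingestion.cleanup")  # hypothetical logger name
STALE_AFTER = timedelta(days=2)  # threshold from the issue text


@dataclass
class Source:  # hypothetical stand-in for the real source model
    id: int
    last_synced_at: datetime


@dataclass
class Chunk:  # hypothetical stand-in for the real chunk model
    id: int
    source_id: int
    created_at: datetime


def cleanup_stale_chunks(
    source_id: int, sources: dict[int, Source], chunks: list[Chunk]
) -> list[Chunk]:
    """Delete stale chunks of one source from `chunks` in place; return them."""
    source = sources[source_id]
    cutoff = source.last_synced_at - STALE_AFTER
    # Only chunks of this source that are older than the cutoff are stale.
    stale = [c for c in chunks if c.source_id == source_id and c.created_at < cutoff]
    stale_ids = {c.id for c in stale}
    chunks[:] = [c for c in chunks if c.id not in stale_ids]
    logger.info(
        "Cleanup for source %s: deleted %d chunk(s) created before %s",
        source_id, len(stale), cutoff.isoformat(),
    )
    return stale


# Minimal check in the spirit of step 4: old chunks go, everything else stays.
now = datetime(2024, 1, 10, 12, 0)
sources = {1: Source(id=1, last_synced_at=now), 2: Source(id=2, last_synced_at=now)}
chunks = [
    Chunk(id=1, source_id=1, created_at=now - timedelta(days=5)),  # stale
    Chunk(id=2, source_id=1, created_at=now),                      # fresh
    Chunk(id=3, source_id=2, created_at=now - timedelta(days=5)),  # other source
]
deleted = cleanup_stale_chunks(1, sources, chunks)
print([c.id for c in deleted])  # [1]
print([c.id for c in chunks])   # [2, 3]
```

In the real task, the list filtering would presumably become a single filtered delete query, so that the number of deleted rows can be logged directly.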


✅ Acceptance Criteria

  • After ingestion completes for a source, schedule a second task to delete old chunks.
  • Old chunks are defined as those where chunk.created_at < source.last_synced_at - 2 days.
  • Only delete chunks for the currently ingested source.
  • This logic should live in src/ingestion/tasks as a follow-up to WebpageIngestionService.
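
The first and third criteria (schedule the cleanup only after ingestion completes, and only for the source that was just ingested) might look like the sketch below. `IngestionServiceSketch` is a hypothetical stand-in, not the real `WebpageIngestionService`, and the injected `schedule_cleanup` callable stands in for whatever task-queue call the project actually uses.

```python
from datetime import datetime, timezone
from typing import Callable


class IngestionServiceSketch:
    """Hypothetical stand-in; not the real WebpageIngestionService."""

    def __init__(self, schedule_cleanup: Callable[[int], None]) -> None:
        # Assumption: the scheduler is injected so it can be faked in tests.
        self._schedule_cleanup = schedule_cleanup
        self.last_synced_at: dict[int, datetime] = {}

    def ingest(self, source_id: int) -> None:
        # ... fetching the page and storing new chunks is omitted here ...
        self.last_synced_at[source_id] = datetime.now(timezone.utc)
        # Schedule the follow-up only once ingestion has finished,
        # and only for the source that was just ingested.
        self._schedule_cleanup(source_id)


scheduled: list[int] = []
service = IngestionServiceSketch(schedule_cleanup=scheduled.append)
service.ingest(42)
print(scheduled)  # [42]
```

Injecting the scheduler keeps the service testable without a running task queue; a test can assert that exactly one cleanup was scheduled for the ingested source.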

Metadata

    Assignees: No one assigned
    Labels: enhancement (New feature or request)
    Status: Ready