NOTE: Don't assign yourself unless you have confirmed with Matthew that you've got a working environment.
🧠 Context
Currently, re-ingesting a previously indexed source will append new chunks without removing the old ones. This results in duplicate content in our system. To address this, we need to clean up any old chunks associated with a source after ingestion completes.
Chunks can be considered stale if they were created more than 2 days before the source's latest last_synced_at timestamp.
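The staleness rule above can be sketched as a small predicate. This is only an illustration: the names STALE_AFTER and is_stale are hypothetical and do not come from the codebase, and the sketch assumes both timestamps are timezone-aware datetimes in the same timezone.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the 2-day window from this ticket, kept as a named constant.
STALE_AFTER = timedelta(days=2)


def is_stale(chunk_created_at: datetime, source_last_synced_at: datetime) -> bool:
    """Return True if the chunk predates the source's last sync by more than 2 days."""
    return chunk_created_at < source_last_synced_at - STALE_AFTER


# Example: a source last synced on 10 January (UTC).
last_synced_at = datetime(2024, 1, 10, tzinfo=timezone.utc)
old_chunk_is_stale = is_stale(datetime(2024, 1, 7, tzinfo=timezone.utc), last_synced_at)
fresh_chunk_is_stale = is_stale(datetime(2024, 1, 9, tzinfo=timezone.utc), last_synced_at)
```

Note that the comparison is strict, so a chunk created exactly 2 days before last_synced_at is kept.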
🛠 Implementation Plan
- Update src/ingestion/services/webpage_ingestion_service.py to schedule a follow-up cleanup task once ingestion completes.
- Create a new task in src/ingestion/tasks (or update an existing relevant one) that:
  - Accepts the source_id.
  - Queries for chunks tied to the source.
  - Deletes any chunk where chunk.created_at < source.last_synced_at - timedelta(days=2).
- Add logging to confirm cleanup activity and any chunks deleted.
- Write a test that verifies old chunks are deleted after ingestion.
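As a rough sketch of the cleanup step in the plan above, the logic might look like the following. It is framework-agnostic on purpose: the Chunk dataclass, the cleanup_stale_chunks function, and the in-memory list standing in for the chunk store are all hypothetical, since this ticket does not specify the ORM or task queue in use. The real task would run the equivalent filter and delete against the database.

```python
import logging
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

logger = logging.getLogger(__name__)


@dataclass
class Chunk:
    # Hypothetical stand-in for the real chunk model.
    id: int
    source_id: int
    created_at: datetime


def cleanup_stale_chunks(source_id, last_synced_at, chunks, max_age=timedelta(days=2)):
    """Split chunks into (kept, deleted) for one source.

    A chunk is deleted only if it belongs to source_id and was created
    before last_synced_at - max_age. Chunks of other sources are never touched.
    """
    cutoff = last_synced_at - max_age
    kept, deleted = [], []
    for chunk in chunks:
        if chunk.source_id == source_id and chunk.created_at < cutoff:
            deleted.append(chunk)
        else:
            kept.append(chunk)
    # Log the cleanup activity and how many chunks were removed.
    logger.info(
        "Cleanup for source %s: deleted %d chunk(s) created before %s",
        source_id,
        len(deleted),
        cutoff.isoformat(),
    )
    return kept, deleted


# Usage, also the shape a test could take: one stale and one fresh chunk
# for source 1, plus an old chunk for source 2 that must survive.
synced = datetime(2024, 1, 10, tzinfo=timezone.utc)
store = [
    Chunk(id=1, source_id=1, created_at=synced - timedelta(days=5)),
    Chunk(id=2, source_id=1, created_at=synced - timedelta(hours=1)),
    Chunk(id=3, source_id=2, created_at=synced - timedelta(days=30)),
]
kept, deleted = cleanup_stale_chunks(1, synced, store)
```

Returning both lists keeps the function easy to assert on in a test. In the real task, the same cutoff would more likely be pushed into a single filtered delete query rather than a Python loop.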
✅ Acceptance Criteria
- After ingestion completes for a source, schedule a second task to delete old chunks.
- Old chunks are defined as those where chunk.created_at < source.last_synced_at - 2 days.
- Only delete chunks for the currently ingested source.
- This logic should live in src/ingestion/tasks as a follow-up to WebpageIngestionService.