feat(backend): add projection-based Firestore cache foundation (DD-007) #7958

Open
Git-on-my-level wants to merge 2 commits into BasedHardware:main from Git-on-my-level:fix/firestore-cache-projections

Conversation

@Git-on-my-level

Collaborator

Summary

Adds the DD-007 projection-based Firestore cache foundation for reducing Firestore read amplification without taking the unsafe shortcut of caching the full users/{uid} document.

This is the first PR in the long-term Firestore read-reduction architecture recommended by the DD-007 committee review.

Why this approach

The default Firestore database currently costs roughly $3,807/mo net at ~10.7B reads/month. DD-007 found repeated reads of the same user document on WebSocket startup and other hot paths.

The naive fix would be a full user-doc Redis cache. This PR explicitly avoids that because users/{uid} mixes low-risk preferences with correctness-critical fields:

| Field family | Why not full-doc cache |
| --- | --- |
| subscription / entitlement | stale cache could grant/block paid access incorrectly |
| BYOK | stale state could accept removed keys or misroute provider billing |
| data protection / migration | stale level could write data under wrong encryption/storage policy |
| privacy consent | stale opt-out could continue recording/training/syncing |
| account/security state | stale decisions are security-sensitive |

Instead this PR introduces a typed projection cache and only applies it to low-risk projections.

Changes

1. New reusable cache foundation

Files:

  • backend/database/firestore_cache.py
  • backend/database/firestore_cache_metrics.py

Features:

  • synchronous Redis read-through cache
  • disabled by default via FIRESTORE_CACHE_ENABLED=false
  • per-namespace override support, e.g. FIRESTORE_CACHE_USER_LANGUAGE_ENABLED=true
  • deterministic versioned keys: fs:v{global}:{namespace}:v{policy}:{entity}
  • typed JSON envelope with version, kind, created_at, fresh_until, payload
  • TTL jitter to avoid synchronized cache expiry
  • Redis fail-open behavior — on Redis read/write errors, Firestore remains source of truth
  • datetime round-trip serialization to preserve return shapes
  • low-cardinality Prometheus metrics only (namespace, result) — no UID/PII labels
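The key format, envelope fields, TTL jitter, and datetime round-trip listed above can be sketched with the standard library alone. This is a minimal illustration, not the code in firestore_cache.py: the helper names (`build_key`, `jittered_ttl`, `make_envelope`, `read_envelope`), the 10% jitter ratio, and the `__datetime__` tagging scheme are all assumptions made for the example.

```python
import json
import random
from datetime import datetime, timezone
from typing import Any, Tuple

GLOBAL_VERSION = "1"  # assumption: stands in for FIRESTORE_CACHE_GLOBAL_VERSION


def build_key(namespace: str, policy_version: int, entity: str) -> str:
    """Build a key shaped like fs:v{global}:{namespace}:v{policy}:{entity}."""
    return f"fs:v{GLOBAL_VERSION}:{namespace}:v{policy_version}:{entity}"


def jittered_ttl(base_ttl: int, jitter_ratio: float = 0.1) -> int:
    """Add a random extra of up to jitter_ratio * base_ttl seconds (assumed scheme)."""
    return base_ttl + random.randint(0, int(base_ttl * jitter_ratio))


def _encode(obj: Any) -> Any:
    # json.dumps calls this only for values it cannot serialize natively.
    if isinstance(obj, datetime):
        return {"__datetime__": obj.isoformat()}
    raise TypeError(f"not JSON serializable: {type(obj).__name__}")


def _decode(obj: dict) -> Any:
    # Turn the tagged dict back into a datetime; leave every other dict alone.
    if set(obj) == {"__datetime__"}:
        return datetime.fromisoformat(obj["__datetime__"])
    return obj


def make_envelope(kind: str, payload: Any, ttl: int, now: float) -> str:
    """Serialize the payload with the envelope fields named in the PR description."""
    envelope = {
        "version": 1,
        "kind": kind,
        "created_at": now,
        "fresh_until": now + ttl,
        "payload": payload,
    }
    return json.dumps(envelope, default=_encode)


def read_envelope(raw: str, now: float) -> Tuple[bool, Any]:
    """Return (is_fresh, payload); unknown versions and expired entries count as misses."""
    envelope = json.loads(raw, object_hook=_decode)
    if envelope.get("version") != 1 or now >= envelope["fresh_until"]:
        return False, None
    return True, envelope["payload"]


# Usage: a datetime inside the payload survives the JSON round trip.
stamp = datetime(2026, 6, 15, 12, 0, tzinfo=timezone.utc)
raw = make_envelope("user_ai_profile", {"updated_at": stamp}, ttl=300, now=1000.0)
is_fresh, payload = read_envelope(raw, now=1100.0)
```

Putting both version segments in the key means a bump to either the global or the per-policy version makes old entries unreachable without an explicit delete.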

2. Low-risk user projection caches only

File: backend/database/users.py

Cached projections:

| Function | Namespace | TTL | Notes |
| --- | --- | --- | --- |
| get_user_language_preference() | user_language | 300s | low-risk listen startup setting |
| get_user_transcription_preferences() | user_transcription_prefs | 120s | low-risk startup prefs + language |
| get_ai_user_profile() | user_ai_profile | 300s | read-mostly profile metadata |

Not cached:

  • get_user_subscription() / get_user_valid_subscription()
  • BYOK state
  • get_data_protection_level()
  • private cloud sync flag
  • recording/training consent
  • full get_user_profile()

3. Invalidation hooks

| Setter/update path | Invalidates |
| --- | --- |
| set_user_language_preference() | user_language, user_transcription_prefs |
| set_user_transcription_preferences() | user_transcription_prefs |
| set_user_custom_stt_usage() | user_transcription_prefs |
| update_ai_user_profile() | user_ai_profile |

Important: update_ai_user_profile() reads existing profile directly from Firestore (not from the cached getter) to avoid stale cached read-modify-write overwrites.
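To illustrate why the merge path must not go through the cached getter, here is a small sketch using in-memory dicts in place of Firestore and Redis. The function names match the ones mentioned in this PR, but every body is an assumption, as are `update_via_cached_getter` and the profile fields `tone` and `name`, which are hypothetical.

```python
from typing import Dict

firestore: Dict[str, dict] = {}  # stand-in for the profile data in users/{uid}
cache: Dict[str, dict] = {}      # stand-in for the Redis projection cache


def get_ai_user_profile(uid: str) -> dict:
    """Cached getter: serve from the cache, fill it on a miss."""
    if uid in cache:
        return cache[uid]
    value = dict(firestore.get(uid, {}))
    cache[uid] = value
    return value


def _get_ai_user_profile_from_firestore(uid: str) -> dict:
    """Direct read that never consults the cache."""
    return dict(firestore.get(uid, {}))


def update_ai_user_profile(uid: str, changes: dict) -> None:
    """Safe read-modify-write: merge onto the source of truth, then invalidate."""
    current = _get_ai_user_profile_from_firestore(uid)
    current.update(changes)
    firestore[uid] = current
    cache.pop(uid, None)


def update_via_cached_getter(uid: str, changes: dict) -> None:
    """Unsafe variant: merges onto a possibly stale cached copy."""
    current = dict(get_ai_user_profile(uid))
    current.update(changes)
    firestore[uid] = current
    cache.pop(uid, None)


def simulate(update_fn) -> dict:
    """Warm the cache, let another writer change Firestore, then run update_fn."""
    firestore.clear()
    cache.clear()
    firestore["u1"] = {"tone": "formal"}
    get_ai_user_profile("u1")                             # cache now holds the old copy
    firestore["u1"] = {"tone": "formal", "name": "Ada"}   # write that missed invalidation
    update_fn("u1", {"tone": "casual"})
    return dict(firestore["u1"])


safe_result = simulate(update_ai_user_profile)
unsafe_result = simulate(update_via_cached_getter)
```

In this simulation the unsafe variant writes back a document without `name`, because it merged onto the stale cached copy. The direct read keeps both fields.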

4. Architecture docs

File: backend/docs/firestore-cache-architecture.md

Documents:

  • why a projection cache rather than a full user-doc cache
  • policy classes and what not to cache
  • runtime flags and rollback
  • cache key/envelope design
  • metrics and invalidation expectations
  • rollout plan

Cost impact

This PR is the foundation + first safe projection rollout, not the entire DD-007 savings package.

| Phase | Expected impact |
| --- | --- |
| This PR: low-risk projection cache foundation | modest immediate reduction; architecture leverage |
| Follow-up: collapse /v4/listen repeated user reads into one typed startup projection | larger user-doc read reduction |
| Follow-up: shadow-mode entitlement/BYOK/data-protection caches | only after mismatch metrics prove safety |
| DD-007 total opportunity | $760–$2,290/mo |

This PR is intentionally conservative: it prioritizes correctness and long-term architecture over grabbing all savings in one risky full-doc cache.

Validation

cd backend && python3 -m pytest tests/unit/test_firestore_cache.py -v --tb=short

Result: 7 passed in 0.06s

Independent subagent review: caught stale cached read-modify-write in update_ai_user_profile(). Fixed by adding _get_ai_user_profile_from_firestore(). Final review: APPROVED.

Rollback

Runtime rollback: FIRESTORE_CACHE_ENABLED=false

Emergency abandon: FIRESTORE_CACHE_GLOBAL_VERSION=2

No data migration required — Firestore remains source of truth.
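For reference, the flag resolution behind the runtime rollback might look like the sketch below. The precedence (a per-namespace variable wins over the global one) and the rule that only the string `true` enables the cache are assumptions; the PR description only states that the cache is off by default and that per-namespace overrides exist.

```python
import os


def is_enabled(namespace: str) -> bool:
    """Resolve the cache flag for one namespace, reading env vars on every call."""
    # Assumed naming: FIRESTORE_CACHE_{NAMESPACE}_ENABLED, matching the
    # FIRESTORE_CACHE_USER_LANGUAGE_ENABLED example in the description.
    override = os.getenv(f"FIRESTORE_CACHE_{namespace.upper()}_ENABLED")
    if override is not None:
        return override.strip().lower() == "true"
    # No override set: fall back to the global flag, which defaults to off.
    return os.getenv("FIRESTORE_CACHE_ENABLED", "false").strip().lower() == "true"
```

Under this assumed precedence, setting FIRESTORE_CACHE_ENABLED=false would not disable a namespace whose override is still `true`, so a rollback would also need to clear the overrides. Whether the real implementation behaves this way is not shown here.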

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 88ca73c5e0


@@ -0,0 +1,158 @@
import json


P1 Badge Add the cache test to backend/test.sh

backend/AGENTS.md requires that new test files be added to test.sh; I checked backend/test.sh, and it enumerates unit tests explicitly but does not include this new tests/unit/test_firestore_cache.py file. As a result, the Firestore cache coverage added here will not run in CI unless the runner is updated.


@greptile-apps

greptile-apps Bot commented Jun 15, 2026

Contributor

Greptile Summary

Introduces a typed, projection-based Redis read-through cache for Firestore under backend/database/, then applies it to three low-risk user projections (language preference, transcription preferences, AI profile). The implementation is deliberately conservative: disabled by default, fail-open on all Redis errors, bounded by versioned cache keys, and explicitly excludes entitlement/BYOK/privacy fields from caching.

  • firestore_cache.py: reusable cache foundation with per-namespace env-var overrides, TTL jitter, payload-size guard, Prometheus metrics, and datetime round-trip serialization; _GLOBAL_VERSION is captured at import time rather than call time, so the documented emergency rollback via FIRESTORE_CACHE_GLOBAL_VERSION requires a process restart (unlike the runtime-effective FIRESTORE_CACHE_ENABLED flag).
  • users.py: three getters are wrapped with get_or_fetch; all four corresponding setter paths carry explicit invalidate() calls; update_ai_user_profile reads directly from Firestore (bypassing the cache) to avoid stale read-modify-write corruption.
  • Tests: 7 unit tests cover disable, hit, miss, Redis error, datetime round-trip, and key-collision cases; two guardrail tests open database/users.py via bare relative path, which fails when pytest is invoked outside the backend/ directory.

Confidence Score: 4/5

Safe to merge; the cache is off by default and Firestore remains the authoritative source of truth in all failure modes.

The core logic is sound: projections are narrow, invalidation hooks cover all write paths, and the fail-open design means a Redis outage cannot degrade correctness. The two items worth fixing before enabling in production are the _GLOBAL_VERSION import-time capture (the emergency version-bump rollback requires a redeploy, not just a config change, which is undocumented) and the test guardrails using bare relative file paths that break outside the backend/ working directory.

Files to review closely: backend/database/firestore_cache.py for the _GLOBAL_VERSION import-time capture and the now or time.time() guard; backend/tests/unit/test_firestore_cache.py for the relative-path fragility in the two source-inspection tests.

Important Files Changed

| Filename | Overview |
| --- | --- |
| backend/database/firestore_cache.py | New projection-based Redis read-through cache; well-structured fail-open behavior and TTL jitter, but _GLOBAL_VERSION is fixed at import time (inconsistent with the runtime-readable is_enabled() flag) and the now or time.time() falsy guard should use is None. |
| backend/database/firestore_cache_metrics.py | Clean Prometheus metrics module with low-cardinality labels only (namespace, result); no PII exposure, buckets are reasonable for payload-size distribution. |
| backend/database/users.py | Three low-risk projections wired to the cache (language, transcription prefs, AI profile); all setter paths include invalidation hooks; update_ai_user_profile correctly reads directly from Firestore to avoid stale read-modify-write. |
| backend/tests/unit/test_firestore_cache.py | Good coverage of disable, hit, miss, Redis error, datetime round-trip, and key-collision paths; two guardrail tests use bare relative file paths that break when pytest is not run from backend/. |
| backend/docs/firestore-cache-architecture.md | Comprehensive architecture doc covering projection rationale, rollout plan, key design, metrics, and rollback; rollback section should clarify that FIRESTORE_CACHE_GLOBAL_VERSION takes effect only after a process restart, unlike the runtime-effective FIRESTORE_CACHE_ENABLED flag. |

Sequence Diagram

sequenceDiagram
    participant Caller
    participant users.py
    participant firestore_cache
    participant Redis
    participant Firestore

    Caller->>users.py: get_user_language_preference(uid)
    users.py->>firestore_cache: get_or_fetch(policy, uid, fetch_fn)

    alt cache disabled (default)
        firestore_cache->>Firestore: fetch_fn() — field projection [language]
        Firestore-->>firestore_cache: value
        firestore_cache-->>users.py: value
    else Redis down
        firestore_cache->>Redis: GET key
        Redis-->>firestore_cache: error
        firestore_cache->>Firestore: fetch_fn() (fail-open)
        Firestore-->>firestore_cache: value
        firestore_cache-->>users.py: value
    else cache miss
        firestore_cache->>Redis: GET key
        Redis-->>firestore_cache: nil
        firestore_cache->>Firestore: fetch_fn()
        Firestore-->>firestore_cache: value
        firestore_cache->>Redis: "SET key envelope ex=ttl+jitter"
        firestore_cache-->>users.py: value
    else cache hit
        firestore_cache->>Redis: GET key
        Redis-->>firestore_cache: envelope
        firestore_cache-->>users.py: envelope.payload
    end

    users.py-->>Caller: language string

    Note over Caller,Firestore: On write path
    Caller->>users.py: set_user_language_preference(uid, lang)
    users.py->>Firestore: "set({language: lang}, merge=True)"
    users.py->>firestore_cache: invalidate(USER_LANGUAGE_CACHE, uid)
    firestore_cache->>Redis: DEL key
    users.py->>firestore_cache: invalidate(USER_TRANSCRIPTION_PREFS_CACHE, uid)
    firestore_cache->>Redis: DEL key
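
The read path in the diagram above can be sketched as follows. This is a simplified illustration under stated assumptions: the real `get_or_fetch` takes `(policy, uid, fetch_fn)` per the diagram, whereas this version takes the key, TTL, and enabled flag directly, and `FakeRedis` is a hypothetical in-memory stand-in used only to make the example runnable.

```python
import json
from typing import Any, Callable, Dict, Optional


class FakeRedis:
    """Hypothetical in-memory stand-in exposing get/set/delete like a Redis client."""

    def __init__(self, fail: bool = False) -> None:
        self.store: Dict[str, str] = {}
        self.fail = fail

    def _check(self) -> None:
        if self.fail:
            raise ConnectionError("redis unavailable")

    def get(self, key: str) -> Optional[str]:
        self._check()
        return self.store.get(key)

    def set(self, key: str, value: str, ex: Optional[int] = None) -> None:
        self._check()
        self.store[key] = value  # expiry is ignored in this stand-in

    def delete(self, key: str) -> None:
        self._check()
        self.store.pop(key, None)


def get_or_fetch(redis: Any, enabled: bool, key: str, ttl: int,
                 fetch_fn: Callable[[], Any]) -> Any:
    """Read-through lookup that falls back to fetch_fn on any Redis error."""
    if not enabled:
        return fetch_fn()            # cache disabled: always read Firestore
    try:
        raw = redis.get(key)
    except Exception:
        return fetch_fn()            # fail-open on a read error
    if raw is not None:
        return json.loads(raw)["payload"]   # cache hit
    value = fetch_fn()               # cache miss: read the source of truth
    try:
        redis.set(key, json.dumps({"payload": value}), ex=ttl)
    except Exception:
        pass                         # fail-open on a write error as well
    return value


def invalidate(redis: Any, key: str) -> None:
    """Best-effort delete; an error here must not break the write path."""
    try:
        redis.delete(key)
    except Exception:
        pass
```

Every branch ends in either a cached payload or a fresh `fetch_fn()` result, so a Redis outage changes latency and read volume but not the returned value.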

Reviews (1): Last reviewed commit: "fix(backend): make Firestore cache keys ..."


logger = logging.getLogger(__name__)

_GLOBAL_VERSION = os.getenv('FIRESTORE_CACHE_GLOBAL_VERSION', '1')


P2 Global version captured at import time

_GLOBAL_VERSION is read once when the module is imported. The architecture docs advertise FIRESTORE_CACHE_GLOBAL_VERSION=2 as a "runtime emergency rollback", but in practice it takes effect only after a full process restart (pod redeploy). By contrast, is_enabled() reads its env vars on every call, so the FIRESTORE_CACHE_ENABLED kill-switch works without a restart. This asymmetry is not documented in the rollback section, and an operator following the playbook during an incident could reasonably believe changing the env var is sufficient without a redeploy.




def _set(policy: CachePolicy, key: str, payload: Any, now: Optional[float] = None) -> None:
    now = now or time.time()


P2 now = now or time.time() is a falsy-value check on an Optional[float]. A timestamp of exactly 0.0 (January 1, 1970 UTC) would incorrectly be treated as None and replaced with the current time, corrupting fresh_until. Use an explicit is None guard to match the Optional[float] contract.

Suggested change
-    now = now or time.time()
+    now = now if now is not None else time.time()
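
A short check makes the difference concrete. The two helper names below are hypothetical and exist only to put the original and suggested expressions side by side.

```python
import time
from typing import Optional


def resolve_now_falsy(now: Optional[float] = None) -> float:
    """Original form: any falsy value, including 0.0, is replaced."""
    return now or time.time()


def resolve_now_explicit(now: Optional[float] = None) -> float:
    """Suggested form: only None is replaced."""
    return now if now is not None else time.time()


epoch_kept = resolve_now_explicit(0.0)   # stays 0.0
epoch_lost = resolve_now_falsy(0.0)      # becomes the current time
```

This matters mostly for tests that inject a fixed clock, since production callers are unlikely to pass a timestamp of exactly 0.0.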

Comment on lines +127 to +128
def test_users_module_only_wires_safe_projection_caches():
    source = open('database/users.py').read()


P2 Both test_users_module_only_wires_safe_projection_caches and test_ai_profile_update_bypasses_cached_getter_for_merge_safety open database/users.py with a bare relative path. This only works when pytest is invoked from the backend/ directory. Running from the repo root or any other location raises FileNotFoundError, making the guardrail tests unreliable in CI unless the working directory is always pinned. Anchor the path to the test file's location instead. Because the test file lives at backend/tests/unit/test_firestore_cache.py, backend/ is three levels up, so use pathlib.Path(__file__).resolve().parents[2] / "database" / "users.py" (two .parent steps would stop at backend/tests/).

Suggested change
-def test_users_module_only_wires_safe_projection_caches():
-    source = open('database/users.py').read()
+def test_users_module_only_wires_safe_projection_caches():
+    import pathlib
+    # parents[2] is backend/ for a test file located in backend/tests/unit/
+    source = (pathlib.Path(__file__).resolve().parents[2] / 'database' / 'users.py').read_text()

Comment on lines +150 to +151
def test_ai_profile_update_bypasses_cached_getter_for_merge_safety():
    source = open('database/users.py').read()


P2 Same relative-path fragility as the prior test — anchoring to the test file's parent avoids a working-directory dependency.

Suggested change
-def test_ai_profile_update_bypasses_cached_getter_for_merge_safety():
-    source = open('database/users.py').read()
+def test_ai_profile_update_bypasses_cached_getter_for_merge_safety():
+    import pathlib
+    # parents[2] is backend/ for a test file located in backend/tests/unit/
+    source = (pathlib.Path(__file__).resolve().parents[2] / 'database' / 'users.py').read_text()
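
The directory arithmetic is easy to get wrong by one level, so here is a quick check using the file locations named in this PR. It uses `PurePosixPath` so it runs without touching the filesystem; the `/repo` prefix and the helper name `backend_file` are hypothetical.

```python
from pathlib import PurePosixPath


def backend_file(test_file: str, *parts: str) -> PurePosixPath:
    """Resolve a path under backend/ from a test file in backend/tests/unit/."""
    # parents[0] = backend/tests/unit, parents[1] = backend/tests, parents[2] = backend
    return PurePosixPath(test_file).parents[2].joinpath(*parts)


test_file = "/repo/backend/tests/unit/test_firestore_cache.py"

three_up = backend_file(test_file, "database", "users.py")
two_up = PurePosixPath(test_file).parent.parent / "database" / "users.py"
```

Here `three_up` points at backend/database/users.py, while `two_up` lands in backend/tests/database/users.py, which is not where the PR places the module.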
