A production-grade Copilot-style conversational API built in C# / .NET 8 with Azure OpenAI, Redis, and SQL, designed for reliability, cost control, and LLM inference at scale.
flowchart TD
A[Client<br/>POST /api/chat] --> B[Rate Limiter<br/>Redis token bucket · 100 req/min]
B -->|429 if exceeded| Z1[HTTP 429]
B --> C[Idempotency Check<br/>SHA-256 key · 60s TTL]
C -->|duplicate requestId| Z2[Return cached response]
C --> D[CopilotController<br/>Validates input · orchestrates pipeline]
D -->|invalid input| Z3[HTTP 400]
D --> E[Memory Service<br/>Last 20 turns · 24h TTL · token trimmed]
D --> F[Semantic Cache<br/>Redis hash lookup]
F -->|cache hit| Z4[Return cached response · skip LLM]
F --> G[Prompt Builder<br/>System + history + query · token budget enforced]
G --> H[LLM Orchestrator<br/>Retry + exponential backoff · graceful degradation]
H -->|GPT-4o healthy| I[Azure OpenAI GPT-4o]
H -->|GPT-4o throttled| J[Azure OpenAI GPT-4o-mini<br/>lower-cost fallback]
H -->|both fail| Z5[degraded: true · partial response]
I --> K[Response Formatter<br/>SSE streaming · JSON fallback]
J --> K
K --> L[Audit Logger<br/>SQL · tokens · latency · cost estimate]
L --> M[Client receives response]
Every inbound request passes through a strict pipeline:
- Rate limiter checks a Redis token bucket, 100 requests per minute per user. Returns HTTP 429 immediately if exceeded.
- Idempotency check looks up the requestId in Redis. If the same request was seen in the last 60 seconds, the cached response is returned immediately, no duplicate LLM call.
- Semantic cache hashes the full prompt and checks Redis. A cache hit skips the LLM entirely, zero token cost.
- Memory service fetches the last 20 conversation turns from Redis, trimmed dynamically to fit within the model token budget.
- Prompt builder assembles system prompt + trimmed history + user message and enforces the token ceiling before the LLM call.
- LLM orchestrator calls GPT-4o with retry and exponential backoff. Falls back to GPT-4o-mini under throttling. Returns a graceful partial response if both fail — never a hard 500.
- Response formatter streams tokens back via SSE as they arrive, or returns full JSON if SSE is not requested.
- Audit logger writes token count, model used, latency, and a cost estimate to SQL on every request.
- Context-aware memory: Redis-backed session history capped at 20 turns with 24h TTL and dynamic token trimming. This prevents unbounded Redis growth and keeps prompts within the model context window.
- Semantic caching: Prompt hash lookup in Redis before every LLM call. This eliminates redundant inference calls and directly reduces token cost.
- Idempotent request handling: SHA-256 keyed deduplication with 60s TTL. This handles network retries and client double-submits without duplicate LLM calls.
- Graceful degradation: LLM orchestrator retries GPT-4o with exponential backoff, falls back to GPT-4o-mini, and returns a partial response if both fail. The service does not return a hard 500 during normal throttling and dependency failure paths.
- Cost visibility: Every request logs token count, model used, and a cost estimate to SQL. This enables cost-per-user analysis and cache hit rate tracking over time.
| Layer | Technology |
|---|---|
| Runtime | .NET 8 (C#) — current LTS, Microsoft-native |
| Web framework | ASP.NET Core Minimal API + Controllers |
| LLM primary | Azure OpenAI — GPT-4o (gpt-4o deployment name) |
| LLM fallback | Azure OpenAI — GPT-4o-mini (lower-cost fallback) |
| Cache / memory | Redis 7 via StackExchange.Redis |
| Audit database | SQLite (local dev) / PostgreSQL (prod-ready) |
| ORM | Entity Framework Core 8 |
| Token counting | Microsoft.ML.Tokenizers (tiktoken-compatible) |
| Containerisation | Docker + docker-compose |
| Streaming | Server-Sent Events (SSE) via IAsyncEnumerable |
| Testing | xUnit + Moq |
- .NET 8 SDK
- Docker and docker-compose
- Azure OpenAI resource with
gpt-4oandgpt-4o-minideployments
cp .env.example .env
# Open .env and fill in:
# AzureOpenAI__ApiKey — your Azure OpenAI API key
# AzureOpenAI__Endpoint — your Azure OpenAI endpoint URI
# AzureOpenAI__PrimaryDeployment=gpt-4o
# AzureOpenAI__FallbackDeployment=gpt-4o-mini
docker-compose up -d
dotnet run --project src/Four commands. That is the entire local setup.
Production note: In production, supply environment variables directly through your orchestrator (Kubernetes, Azure App Service, etc.) — do not deploy a
.envfile.
Request:
curl -X POST http://localhost:8080/api/chat \
-H "Content-Type: application/json" \
-d '{
"userId": "user-123",
"sessionId": "session-abc",
"requestId": "550e8400-e29b-41d4-a716-446655440000",
"message": "Summarise the last sprint retrospective"
}'Success response:
{
"response": "The last sprint retrospective covered...",
"cacheHit": false,
"degraded": false,
"modelUsed": "gpt-4o",
"promptTokens": 412,
"completionTokens": 187,
"latencyMs": 1243.5
}Rate limit response:
HTTP/1.1 429 Too Many Requests
Retry-After: 60
Content-Type: application/json
{
"error": "Rate limit exceeded",
"retryAfterSeconds": 60
}dotnet test tests/CopilotApi.Tests/The test suite covers:
- Rate limiter: token bucket logic, fail-open on Redis unavailability
- Idempotency: SHA-256 key hashing, cache hit and miss paths
- Memory service: bounded history, TTL reset, token budget trimming
- LLM orchestrator: retry policy, fallback activation, cost calculation, Responsible AI log compliance
The following shows a successful end-to-end request against the local /api/chat endpoint backed by Azure OpenAI GPT-4o:
Response includes model used, token counts, latency, and cost visibility fields — all logged to SQL via the audit logger.
| Variable | Required | Description |
|---|---|---|
AzureOpenAI__ApiKey |
Yes | Authenticates requests to the Azure OpenAI resource. Maps to AzureOpenAI:ApiKey via the ASP.NET Core double-underscore convention. |
AzureOpenAI__Endpoint |
Yes | Specifies the Azure OpenAI endpoint URI. Maps to AzureOpenAI:Endpoint. |
AzureOpenAI__PrimaryDeployment |
No | Overrides the primary deployment name from appsettings.json. Defaults to gpt-4o. |
AzureOpenAI__FallbackDeployment |
No | Overrides the fallback deployment name from appsettings.json. Defaults to gpt-4o-mini. |
DATABASE_PROVIDER |
No | Selects the SQL provider. Supported values are sqlite and postgresql. Defaults to sqlite. |
DATABASE_CONNECTION_STRING |
Yes | Supplies the SQLite or PostgreSQL connection string used by Entity Framework Core. |
Redis__ConnectionString |
No | Overrides the Redis connection string from appsettings.json. Defaults to localhost:6379. |
ASPNETCORE_ENVIRONMENT |
No | Sets the hosting environment. Use Development locally, Production in deployment. |
ASPNETCORE_URLS |
No | Sets the listen address. Defaults to http://+:8080 in Docker. |
copilot-plugin-api/
|-- docs/
| `-- images/
| `-- chat-success.png
|-- src/
| |-- Configuration/
| | |-- AzureOpenAIConfig.cs
| | |-- CostsConfig.cs
| | |-- MemoryConfig.cs
| | |-- PromptConfig.cs
| | |-- RateLimitConfig.cs
| | |-- RedisConfig.cs
| | `-- SemanticCacheConfig.cs
| |-- Controllers/
| | `-- CopilotController.cs
| |-- Data/
| | |-- Migrations/
| | | |-- 20260318035158_InitialCreate.cs
| | | |-- 20260318035158_InitialCreate.Designer.cs
| | | `-- AppDbContextModelSnapshot.cs
| | |-- AppDbContext.cs
| | |-- AuditLogConfiguration.cs
| | `-- AuditLogger.cs
| |-- Models/
| | |-- ChatRequest.cs
| | |-- ChatResponse.cs
| | |-- ConversationTurn.cs
| | |-- LlmResult.cs
| | `-- PromptTooLargeException.cs
| |-- Services/
| | |-- IdempotencyService.cs
| | |-- LlmOrchestratorService.cs
| | |-- MemoryService.cs
| | |-- PromptBuilderService.cs
| | |-- RateLimiterService.cs
| | `-- SemanticCacheService.cs
| |-- CopilotPluginApi.csproj
| `-- Program.cs
|-- tests/
| `-- CopilotApi.Tests/
| |-- CopilotApi.Tests.csproj
| |-- IdempotencyTests.cs
| |-- LlmOrchestratorTests.cs
| |-- MemoryServiceTests.cs
| `-- RateLimiterTests.cs
|-- .dockerignore
|-- .env.example
|-- .gitignore
|-- appsettings.Development.json.example
|-- appsettings.json
|-- CopilotPluginApi.sln
|-- docker-compose.yml
|-- Dockerfile
|-- LICENSE
`-- README.md
This project is architected for production but intentionally scoped for clarity. The next layer of hardening would include:
- Polly ResiliencePipeline replacing the manual retry loop in LlmOrchestratorService — circuit breaker, bulkhead isolation, and hedging policies
- OpenTelemetry distributed tracing across all service layers — trace ID propagated from client through to audit log
- Vector similarity caching via Azure AI Search or Qdrant — catches semantically equivalent prompts that hash differently
- Kubernetes deployment with HorizontalPodAutoscaler keyed on request queue depth — Redis-backed shared state makes horizontal scaling safe
- Prompt injection detection middleware — validates user input against known injection patterns before reaching the LLM
MIT
