initial commit

2026-03-25 00:05:57 +01:00
commit 25c7d598ca
63 changed files with 5257 additions and 0 deletions

@@ -0,0 +1,150 @@
# AgentRun lifecycle
#sympozium #agenty #lifecycle
## Full reconciliation flow
`AgentRunReconciler` is the **largest and most important controller** in the system (~900 lines of code). It manages the full lifecycle of a run, from Pending to completion (Succeeded or Failed).
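As a rough illustration of how the phases in this note fit together, the sketch below maps a run's status to the handler named in the sections that follow. It is not the controller's actual code: the `Phase` type, the `handlerFor` function, treating an empty status as Pending, and giving deletion priority over the phase are all assumptions made for the example.

```go
package main

import "fmt"

// Phase mirrors the status values named in this note (hypothetical type).
type Phase string

const (
	PhasePending   Phase = "Pending"
	PhaseRunning   Phase = "Running"
	PhaseServing   Phase = "Serving"
	PhaseSucceeded Phase = "Succeeded"
	PhaseFailed    Phase = "Failed"
)

// handlerFor returns the name of the handler that would process a run.
// Assumption: a pending deletion takes priority over the current phase,
// and an empty phase is treated like Pending.
func handlerFor(phase Phase, deleting bool) string {
	if deleting {
		return "reconcileDelete"
	}
	switch phase {
	case "", PhasePending:
		return "reconcilePending"
	case PhaseRunning:
		return "reconcileRunning"
	case PhaseServing:
		return "reconcileServing"
	case PhaseSucceeded, PhaseFailed:
		return "reconcileCompleted"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(handlerFor(PhasePending, false))
	fmt.Println(handlerFor(PhaseFailed, false))
	fmt.Println(handlerFor(PhaseRunning, true))
}
```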
## Phase: Pending → Running
```
reconcilePending():
├── 1. validatePolicy()
│ └── Checks SympoziumPolicy:
│ - Sandbox requirements
│ - Tool gating
│ - Feature gates
│ - AgentSandbox policy
├── 2. Agent Sandbox check
│ └── If agentSandbox.enabled → reconcilePendingAgentSandbox()
│ (creates a Sandbox CR instead of a Job)
├── 3. ensureAgentServiceAccount()
│ └── ServiceAccount "sympozium-agent" in the target namespace
├── 4. createInputConfigMap()
│ └── ConfigMap with the task, system prompt, and memory context
├── 5. Lookup SympoziumInstance
│ ├── Memory enabled? → prepend memory instructions
│ ├── Observability config → inject OTel env vars
│ ├── Skills inheritance → copy from instance if empty
│ └── MCP servers → resolve URLs from MCPServer CRs
├── 6. ensureMCPConfigMap()
│ └── ConfigMap with the MCP server configuration
├── 7. resolveSkillSidecars()
│ └── SkillPack CRDs → resolved sidecar specs
├── 8. Server mode check
│ └── mode=server → reconcilePendingServer() (Deployment+Service)
├── 9. Filter server-only sidecars (task mode)
├── 10. Memory server readiness check
│ └── If the memory skill is used → check that the Deployment exists
├── 11. Build Job
│ ├── PodBuilder.BuildAgentContainer()
│ ├── PodBuilder.BuildIPCBridgeContainer()
│ ├── Skill sidecar containers
│ ├── MCP bridge sidecar (if MCP servers are configured)
│ ├── Sandbox sidecar (if enabled)
│ ├── Memory volumes/init containers
│ ├── Secret mirroring (system → run namespace)
│ └── OTel tracing setup
├── 12. Create ephemeral RBAC
│ ├── Role + RoleBinding (namespace-scoped, ownerRef)
│ └── ClusterRole + ClusterRoleBinding (label-based)
├── 13. NetworkPolicy
│ └── deny-all + allow DNS + allow NATS
└── 14. Create Job → Status: Running
```
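The branching in steps 2, 8, and 9 above can be sketched as follows. This is a minimal illustration, not the controller's code: `RunSpec`, `Sidecar`, `workloadFor`, and `filterForTaskMode` are hypothetical names, and the field layout is assumed. The check order follows the numbered steps (Agent Sandbox first, then server mode, otherwise a Job).

```go
package main

import "fmt"

// RunSpec is a hypothetical, trimmed-down view of an AgentRun spec.
type RunSpec struct {
	AgentSandboxEnabled bool
	Mode                string // "task" or "server"
}

// Sidecar is a hypothetical resolved skill sidecar.
type Sidecar struct {
	Name       string
	ServerOnly bool
}

// workloadFor picks the workload kind in the order the steps are listed:
// the Agent Sandbox check (step 2) comes before the server mode check (step 8).
func workloadFor(spec RunSpec) string {
	switch {
	case spec.AgentSandboxEnabled:
		return "Sandbox"
	case spec.Mode == "server":
		return "Deployment+Service"
	default:
		return "Job"
	}
}

// filterForTaskMode drops server-only sidecars, as step 9 does for task mode.
func filterForTaskMode(sidecars []Sidecar) []Sidecar {
	kept := make([]Sidecar, 0, len(sidecars))
	for _, s := range sidecars {
		if !s.ServerOnly {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	fmt.Println(workloadFor(RunSpec{Mode: "task"}))
	fmt.Println(workloadFor(RunSpec{Mode: "server"}))
	fmt.Println(workloadFor(RunSpec{AgentSandboxEnabled: true, Mode: "server"}))

	sidecars := []Sidecar{{Name: "git"}, {Name: "web-ui", ServerOnly: true}}
	for _, s := range filterForTaskMode(sidecars) {
		fmt.Println(s.Name)
	}
}
```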
## Phase: Running
```
reconcileRunning():
├── Poll Job status (every 10s via requeue)
├── Pod Succeeded → extractResults():
│ ├── Read pod logs
│ ├── Extract result text
│ ├── Extract memory markers (__SYMPOZIUM_MEMORY__)
│ ├── Patch memory ConfigMap
│ ├── Extract token usage
│ └── Set status.result, completedAt, tokenUsage
│ → Status: Succeeded
├── Pod Failed →
│ ├── Read pod logs for error
│ ├── Set status.error, exitCode
│ └── Status: Failed
└── Timeout → failRun() → Status: Failed
```
## Phase: Succeeded/Failed
```
reconcileCompleted():
├── Clean up ephemeral RBAC
│ ├── Delete ClusterRole (label: agentrun=<name>)
│ └── Delete ClusterRoleBinding
├── Prune run history
│ └── Keep max 50 runs per instance (DefaultRunHistoryLimit)
└── Remove finalizer → AgentRun deletable
```
## Phase: Serving (server mode)
```
reconcileServing():
├── Check Deployment health
├── Check Service health
├── Reconcile HTTPRoute (Envoy Gateway)
└── Requeue every 30s
```
## Deletion handling
```
reconcileDelete():
├── Delete server-mode resources (Deployment, Service, HTTPRoute)
├── Delete ephemeral RBAC
├── Delete input ConfigMap
├── Delete MCP ConfigMap
├── Remove finalizer
└── AgentRun deleted
```
## OTel Tracing
Every reconciliation phase is traced:
- `agentrun.reconcile` - main span
- `agentrun.create_job` - Job creation
- The traceparent is propagated to the agent pod via an env var
- The TraceID is stored in `status.traceID`
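For reference, a `traceparent` value in the W3C Trace Context format has four dash-separated fields: version, trace ID, parent span ID, and flags. The sketch below builds and parses such a value with the standard library only. The function names and the `TRACEPARENT` variable name are assumptions for illustration; the note does not say which env var name the controller uses, and a real controller would more likely rely on an OpenTelemetry propagator.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// formatTraceparent builds a W3C Trace Context value:
// version "00", 16-byte trace ID, 8-byte parent span ID, 1-byte flags.
func formatTraceparent(traceID [16]byte, spanID [8]byte, sampled bool) string {
	flags := byte(0x00)
	if sampled {
		flags = 0x01
	}
	return fmt.Sprintf("00-%s-%s-%02x",
		hex.EncodeToString(traceID[:]), hex.EncodeToString(spanID[:]), flags)
}

// traceIDFrom extracts the trace ID field, e.g. to record it in a status.
func traceIDFrom(traceparent string) (string, bool) {
	parts := strings.Split(traceparent, "-")
	if len(parts) != 4 || len(parts[1]) != 32 {
		return "", false
	}
	return parts[1], true
}

func main() {
	traceID := [16]byte{0x4b, 0xf9, 0x2f, 0x35, 0x77, 0xb3, 0x4d, 0xa6,
		0xa3, 0xce, 0x92, 0x9d, 0x0e, 0x0e, 0x47, 0x36}
	spanID := [8]byte{0x00, 0xf0, 0x67, 0xaa, 0x0b, 0xa9, 0x02, 0xb7}

	tp := formatTraceparent(traceID, spanID, true)
	// Hypothetical env var name; the real one is not given in this note.
	fmt.Println("TRACEPARENT=" + tp)

	if id, ok := traceIDFrom(tp); ok {
		fmt.Println("traceID:", id)
	}
}
```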
## Metrics
- `sympozium.agent.runs` - counter (success/failure labels)
- `sympozium.agent.duration_ms` - run duration histogram
- `sympozium.errors` - error counter
---
Related: [[AgentRun]] | [[Cykl życia Agent Pod|Agent Pod lifecycle]] | [[Orchestrator - PodBuilder i Spawner|Orchestrator - PodBuilder and Spawner]]