nels.neuralnestsolutions.com · NELS operations home · v0.1 brainstorm for BH review · 2026-06-29
A web portal — same look, feel and login as PMS — that becomes NELS’s operational home. Three jobs:
You + the team. Reuse the PMS template, look-and-feel, and user accounts (shared login / same auth) — copy PMS UI components/features where they fit, so this is consistent and fast to build.
Surface what NELS already does, no new monitoring invented — just made visible:
patrol: disk/mem/load · service & container health · TLS cert expiry · FATAL log scan · threshold-alerts · prod-site uptime+cert · daily log-triage → Jira · Jira digests · heartbeat
Each shows a status light + last-checked time + a small history strip; click for detail/history.
PMS and DSM tiles → open in new tab. Trivial; folds into the top nav.
Summary view: one row/card per server with a single health light (green / amber / red) and the top risk flags driving it. Drill-down: click a server → its individual items, live + history.
| Summary (all servers) | Drill-down (one server) |
|---|---|
| Health light + risk score · top 1–2 risks (e.g. “swap 92%”, “cert 6d”, “uncapped build”) · last-seen | Disk · memory · swap · load vs cores · top processes · containers + memory caps · earlyoom/linger state · per-site cert expiry · recent FATAL/anomaly · cron health — each with a history chart |
Risk model (reuse the exact thresholds already in threshold-alert.sh / the insight alias): disk %, mem available, swap %, load/cores, OOM-risk (uncapped containers · earlyoom off · linger off), cert days-left, any service down → combine into a per-server risk score + colour. This is the “highlight health and risk of each server” you asked for.
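To make the "combine into a per-server risk score + colour" step concrete, here is a minimal Python sketch. Everything specific in it is an assumption for illustration: the `ServerMetrics` fields, the `assess` function, the amber/red cut-offs, and the rule that the score is the sum of severities. The real cut-offs would come from threshold-alert.sh, which this sketch does not reproduce.

```python
from dataclasses import dataclass


@dataclass
class ServerMetrics:
    # Hypothetical field names; one normalised sample for one server.
    disk_pct: float
    mem_available_pct: float
    swap_pct: float
    load1: float
    cores: int
    uncapped_containers: int
    earlyoom_on: bool
    linger_on: bool
    cert_days_left: int
    services_down: tuple = ()


def level(value, amber, red, higher_is_worse=True):
    """Return 0 (ok), 1 (amber) or 2 (red) for one metric."""
    if not higher_is_worse:
        # Flip the sign so the same >= comparison works for "low is bad".
        value, amber, red = -value, -amber, -red
    if value >= red:
        return 2
    if value >= amber:
        return 1
    return 0


def assess(m):
    """Combine per-metric severities into a score, a colour and top risks."""
    flags = []  # (severity, label), only for metrics that are not ok

    def add(severity, label):
        if severity:
            flags.append((severity, label))

    # Placeholder thresholds (amber, red): NOT the values in threshold-alert.sh.
    add(level(m.disk_pct, 80, 90), f"disk {m.disk_pct:.0f}%")
    add(level(m.mem_available_pct, 20, 10, higher_is_worse=False),
        f"mem avail {m.mem_available_pct:.0f}%")
    add(level(m.swap_pct, 50, 80), f"swap {m.swap_pct:.0f}%")
    add(level(m.load1 / m.cores, 1.0, 2.0), f"load {m.load1 / m.cores:.1f}/core")
    add(level(m.cert_days_left, 14, 7, higher_is_worse=False),
        f"cert {m.cert_days_left}d")

    # OOM risk: one point per missing safeguard, capped at red.
    oom = (m.uncapped_containers > 0) + (not m.earlyoom_on) + (not m.linger_on)
    add(min(oom, 2), "OOM risk")

    for service in m.services_down:
        add(2, f"{service} down")

    flags.sort(key=lambda f: -f[0])  # stable sort: worst first
    worst = max((sev for sev, _ in flags), default=0)
    return {
        "score": sum(sev for sev, _ in flags),
        "colour": ("green", "amber", "red")[worst],
        "top_risks": [label for _, label in flags[:2]],
    }


sample = ServerMetrics(disk_pct=55, mem_available_pct=35, swap_pct=92,
                       load1=1.2, cores=4, uncapped_containers=0,
                       earlyoom_on=True, linger_on=True, cert_days_left=6)
print(assess(sample))
# {'score': 4, 'colour': 'red', 'top_risks': ['swap 92%', 'cert 6d']}
```

Taking the colour from the single worst flag, rather than from the summed score, keeps one red item from being averaged away by otherwise healthy metrics; whether that matches the intended model is a point to confirm.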
Fleet to cover (confirm the set with you): admin/old box 72.60.42.2 · NELS/build box 82.25.105.233 · dev EC2 18.61.6.134 · prod EC2 16.112.36.165 · monitor box 10.20.1.148 · DB boxes · marketing host.
Three options weighed:
| Option | How | Trade-off |
|---|---|---|
| A · Agentless SSH pull | A NELS collector (extends the insight/threshold logic) SSHes each box on a schedule, normalises to JSON. NELS already has SSH to most boxes. | + no agents, reuses access · − SSH fan-out, secrets to manage |
| B · Push agent | A tiny cron on each box POSTs its metrics to the portal API. | + scales, no central SSH · − must deploy + maintain an agent everywhere |
| C · Reuse nns-monitor | Prometheus/Grafana/Loki already exist on 10.20.1.148 — query/embed for history. | + history already there · − coverage gaps (not every box scraped) |
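As a rough illustration of Option A's "normalises to JSON" step, the Python sketch below turns raw `/proc/meminfo` and `df -P /` text into one JSON-ready sample and keeps samples in a small SQLite table, so the latest row gives instant status and older rows give history. The SSH call itself is left out: the raw text is passed in as strings. The function names, the `samples` schema, the `build-box` label and the sample text are all hypothetical, not taken from the existing scripts.

```python
import json
import sqlite3
import time

# Hypothetical captured output, standing in for what an SSH pull would return.
SAMPLE_MEMINFO = (
    "MemTotal:        8000000 kB\n"
    "MemAvailable:    2000000 kB\n"
    "SwapTotal:       1000000 kB\n"
    "SwapFree:          80000 kB\n"
)
SAMPLE_DF = (
    "Filesystem 1024-blocks Used Available Capacity Mounted on\n"
    "/dev/sda1 100 41 59 41% /\n"
)


def parse_meminfo(text):
    """Turn /proc/meminfo text into {field_name: value_in_kB}."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def normalise(host, meminfo_text, df_text, ts=None):
    """Build one normalised sample from raw command output."""
    mem = parse_meminfo(meminfo_text)
    # `df -P /` prints a header row, then one data row; column 5 is e.g. "41%".
    disk_pct = int(df_text.strip().splitlines()[-1].split()[4].rstrip("%"))
    swap_total = mem.get("SwapTotal", 0)
    swap_used = swap_total - mem.get("SwapFree", 0)
    return {
        "host": host,
        "ts": int(time.time()) if ts is None else ts,
        "disk_pct": disk_pct,
        "mem_available_pct": round(100 * mem["MemAvailable"] / mem["MemTotal"], 1),
        "swap_pct": round(100 * swap_used / swap_total, 1) if swap_total else 0.0,
    }


def open_store(path=":memory:"):
    """Open (or create) the sample store; schema is an assumption."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS samples (host TEXT, ts INTEGER, metrics TEXT)"
    )
    return db


def record(db, sample):
    """Append one sample; the full sample is kept as a JSON string."""
    db.execute(
        "INSERT INTO samples VALUES (?, ?, ?)",
        (sample["host"], sample["ts"], json.dumps(sample)),
    )
    db.commit()


def latest(db, host):
    """Return the most recent sample for a host, or None if there is none."""
    row = db.execute(
        "SELECT metrics FROM samples WHERE host = ? ORDER BY ts DESC LIMIT 1",
        (host,),
    ).fetchone()
    return json.loads(row[0]) if row else None
```

A cron-driven collector could call `record(db, normalise(host, meminfo, df))` once per box per poll; the portal's summary view would then read `latest(db, host)`, and a history chart would select a range of `ts` values from the same table.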
insight/threshold-alert.sh already compute) runs on cron, pulls each reachable box + this box locally, and writes a small time-series store (lightweight: flat-file/SQLite, or push into the existing Prometheus on the monitor box so we don’t reinvent history). The portal API serves summary + drill-down + history from that store, and embeds existing Grafana panels where the monitor box already has the history. Instant status (last poll) + history (the store), exactly as you asked. No new agents in the MVP.
The NELS project in PMS already has the slots: BRD → FSD → UI mock → build → Test → Access. We fill them in order. Suggested MVP-first cut:
| Phase | Ship |
|---|---|
| MVP | Monitoring hub (current status) + PMS/DSM links + infra summary (health/risk) for the 2–3 key boxes · PMS look-and-feel + login · nginx + TLS on nels.neuralnestsolutions.com |
| v2 | Per-server drill-down + history charts · full fleet coverage · Grafana embeds |
| v3 | Trends/alerts surfaced inline · richer risk scoring · anything else you want once you’ve used it |
On your nod (+ answers above), I’ll turn this into the formal BRD in the NELS Documents, then FSD + a clickable UI mock — same flow PMS used — before any build. Nothing gets built until you’ve steered the scope.