NELS Portal — Brainstorm & Proposal

nels.neuralnestsolutions.com · NELS operations home · v0.1 brainstorm for BH review · 2026-06-29

1. What this is

A web portal — same look, feel and login as PMS — that becomes NELS’s operational home. Three jobs:

① Monitoring hub
Every check NELS already runs, surfaced in one place: current status + recent history.

② Jump-off
One-click links to PMS and DSM (open in a new tab).

③ Infra health & risk
Fleet-wide server health/risk at a glance; drill into any server for live status + history.

2. Audience & access

You + the team. Reuse the PMS template, look-and-feel, and user accounts (shared login / same auth) — copy PMS UI components/features where they fit, so this is consistent and fast to build.

3. The three areas in detail

① Monitoring hub

Surface what NELS already does, no new monitoring invented — just made visible:

patrol: disk/mem/load · service & container health · TLS cert expiry · FATAL log scan · threshold-alerts · prod-site uptime+cert · daily log-triage → Jira · Jira digests · heartbeat

Each shows a status light + last-checked time + a small history strip; click for detail/history.
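
To make the tile shape concrete, here is a minimal sketch of what one hub tile might carry: the latest light, the last-checked time, and a short history strip. Everything in it is an assumption for illustration (the `CheckTile` name, the 12-run strip length, the G/A/R strip encoding, and reusing green / amber / red for check lights); none of it is taken from existing NELS code.

```python
from collections import deque
from dataclasses import dataclass, field

STRIP_LEN = 12  # assumption: how many past runs the history strip shows
VALID = ("green", "amber", "red")


@dataclass
class CheckTile:
    """One monitoring-hub tile (hypothetical shape)."""
    name: str
    # Bounded deque: old runs fall off the left once STRIP_LEN is reached.
    runs: deque = field(default_factory=lambda: deque(maxlen=STRIP_LEN))

    def record(self, checked_at, status):
        """Store the outcome of one run of the check."""
        if status not in VALID:
            raise ValueError(f"unknown status: {status}")
        self.runs.append((checked_at, status))

    def view(self):
        """What the portal would render for this tile."""
        if not self.runs:
            return {"name": self.name, "light": "unknown",
                    "last_checked": None, "strip": ""}
        checked_at, status = self.runs[-1]
        # One letter per run, oldest first: G, A or R.
        strip = "".join(s[0].upper() for _, s in self.runs)
        return {"name": self.name, "light": status,
                "last_checked": checked_at, "strip": strip}


tile = CheckTile("TLS cert expiry")
tile.record(100, "green")
tile.record(200, "green")
tile.record(300, "amber")
print(tile.view())
```

The click-through detail/history view would read from the full store rather than from this bounded strip.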

② Jump-off links

PMS and DSM tiles → open in new tab. Trivial; folds into the top nav.

③ Infra health & risk dashboard (the centrepiece — your “insight”, fleet-wide, with memory)

Summary view: one row/card per server with a single health light (green / amber / red) and the top risk flags driving it. Drill-down: click a server → its individual items, live + history.

Summary (all servers)	Drill-down (one server)
Health light + risk score · top 1–2 risks (e.g. “swap 92%”, “cert 6d”, “uncapped build”) · last-seen	Disk · memory · swap · load vs cores · top processes · containers + memory caps · earlyoom/linger state · per-site cert expiry · recent FATAL/anomaly · cron health — each with a history chart

Risk model (reuse the exact thresholds already in threshold-alert.sh / the insight alias): disk %, mem available, swap %, load/cores, OOM-risk (uncapped containers · earlyoom off · linger off), cert days-left, any service down → combine into a per-server risk score + colour. This is the “highlight health and risk of each server” you asked for.
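
As a rough illustration of "combine into a per-server risk score + colour", the sketch below grades each input, sums the points, and keeps the top two flags for the summary row. The threshold numbers are placeholders and are NOT the values in threshold-alert.sh; the `Snapshot` fields, the 1/3 point weights, and the "any red flag makes the server red" rule are likewise assumptions to be replaced by the real logic.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One poll of one server (hypothetical field names)."""
    disk_pct: float
    mem_avail_mb: float
    swap_pct: float
    load1: float
    cores: int
    uncapped_containers: int
    earlyoom_on: bool
    linger_on: bool
    min_cert_days: int
    services_down: int


def grade(value, amber, red, higher_is_worse=True):
    """Return 0 (ok), 1 (amber) or 3 (red) points for one metric."""
    if not higher_is_worse:
        # Flip signs so the same >= comparisons work for "low is bad".
        value, amber, red = -value, -amber, -red
    if value >= red:
        return 3
    if value >= amber:
        return 1
    return 0


def risk_flags(s):
    """List of (points, label), worst first. Thresholds are placeholders."""
    load_ratio = s.load1 / s.cores
    candidates = [
        (grade(s.disk_pct, 80, 90), f"disk {s.disk_pct:.0f}%"),
        (grade(s.mem_avail_mb, 1024, 256, higher_is_worse=False),
         f"mem avail {s.mem_avail_mb:.0f}MB"),
        (grade(s.swap_pct, 50, 80), f"swap {s.swap_pct:.0f}%"),
        (grade(load_ratio, 1.0, 2.0), f"load {load_ratio:.1f}x cores"),
        (grade(s.min_cert_days, 14, 7, higher_is_worse=False),
         f"cert {s.min_cert_days}d"),
        (3 if s.services_down else 0, f"{s.services_down} service(s) down"),
        (1 if s.uncapped_containers else 0, "uncapped container"),
        (0 if s.earlyoom_on else 1, "earlyoom off"),
        (0 if s.linger_on else 1, "linger off"),
    ]
    flags = [c for c in candidates if c[0] > 0]
    # sorted() is stable, so ties keep the order declared above.
    return sorted(flags, key=lambda c: -c[0])


def summarise(s):
    """Per-server summary row: score, colour, top 1-2 risks."""
    flags = risk_flags(s)
    score = sum(points for points, _ in flags)
    if any(points >= 3 for points, _ in flags):
        colour = "red"
    elif flags:
        colour = "amber"
    else:
        colour = "green"
    return {"score": score, "colour": colour,
            "top_risks": [label for _, label in flags[:2]]}


box = Snapshot(disk_pct=40, mem_avail_mb=4000, swap_pct=92, load1=1.0,
               cores=4, uncapped_containers=0, earlyoom_on=True,
               linger_on=True, min_cert_days=6, services_down=0)
print(summarise(box))
```

Keeping the labels next to the points means the summary card and the colour always come from the same evaluation, so a red light can always be traced to a named risk.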

4. Where the data comes from (the “find a good way to pull it” question)

Fleet to cover (set to be confirmed with you): admin/old box 72.60.42.2 · NELS/build box 82.25.105.233 · dev EC2 18.61.6.134 · prod EC2 16.112.36.165 · monitor box 10.20.1.148 · DB boxes · marketing host.

Three options weighed:

Option	How	Trade-off
A · Agentless SSH pull	A NELS collector (extends the `insight`/threshold logic) SSHes each box on a schedule, normalises to JSON. NELS already has SSH to most boxes.	+ no agents, reuses access · − SSH fan-out, secrets to manage
B · Push agent	A tiny cron on each box POSTs its metrics to the portal API.	+ scales, no central SSH · − must deploy + maintain an agent everywhere
C · Reuse nns-monitor	Prometheus/Grafana/Loki already exist on `10.20.1.148` — query/embed for history.	+ history already there · − coverage gaps (not every box scraped)

Recommendation: hybrid. A NELS metrics-collector (Option A, building on what insight/threshold-alert.sh already compute) runs on cron, pulls from each reachable box over SSH and from this box locally, and writes to a small time-series store: either a lightweight flat-file/SQLite store, or the existing Prometheus on the monitor box, so we don’t reinvent history. The portal API serves summary + drill-down + history from that store, and embeds existing Grafana panels where the monitor box already has the history. That gives instant status (the last poll) plus history (the store), exactly as you asked. No new agents in the MVP.
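
A minimal sketch of the collector-plus-store idea, assuming the SQLite variant. The remote command `nels-metrics --json` is hypothetical (a stand-in for whatever wraps the insight/threshold logic and prints one flat JSON object of numeric metrics), and the `samples` table layout is an assumption, not an existing schema. The `ssh` options `BatchMode` and `ConnectTimeout` are standard OpenSSH client options.

```python
import json
import sqlite3
import subprocess
import time

# Hypothetical: a command on each box that prints one flat JSON object
# of numeric metrics, e.g. {"disk_pct": 41, "swap_pct": 3}.
REMOTE_CMD = "nels-metrics --json"


def pull(host, timeout=10):
    """Run the metrics command on one box over SSH; None if it fails."""
    try:
        out = subprocess.run(
            ["ssh", "-o", "BatchMode=yes",
             "-o", f"ConnectTimeout={timeout}", host, REMOTE_CMD],
            capture_output=True, text=True,
            timeout=timeout + 5, check=True,
        ).stdout
        return json.loads(out)
    except (subprocess.SubprocessError, OSError, json.JSONDecodeError):
        return None


def open_store(path):
    """Open (or create) the time-series store: one row per metric per poll."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS samples (
               ts INTEGER NOT NULL, host TEXT NOT NULL,
               metric TEXT NOT NULL, value REAL,
               PRIMARY KEY (ts, host, metric))"""
    )
    return db


def record(db, host, metrics, ts=None):
    """Write one poll; an unreachable box is stored as reachable=0."""
    ts = int(time.time()) if ts is None else ts
    rows = [(ts, host, "reachable", 0.0 if metrics is None else 1.0)]
    for name, value in (metrics or {}).items():
        rows.append((ts, host, name, float(value)))
    db.executemany("INSERT OR REPLACE INTO samples VALUES (?, ?, ?, ?)", rows)
    db.commit()


def latest(db, host):
    """Instant status: every metric from the most recent poll of a host."""
    cur = db.execute(
        "SELECT metric, value FROM samples WHERE host = ? AND ts = "
        "(SELECT MAX(ts) FROM samples WHERE host = ?)",
        (host, host),
    )
    return dict(cur.fetchall())


def history(db, host, metric, since_ts):
    """History: (ts, value) pairs for one metric, oldest first."""
    return db.execute(
        "SELECT ts, value FROM samples "
        "WHERE host = ? AND metric = ? AND ts >= ? ORDER BY ts",
        (host, metric, since_ts),
    ).fetchall()
```

A cron entry would then loop over the agreed host list, calling `record(db, host, pull(host))` for each box. Storing `reachable` as its own metric keeps "box down" visible in the same history as everything else.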

5. Build approach & phasing (mirrors the PMS doc lifecycle already set up for NELS)

The NELS project in PMS already has the slots: BRD → FSD → UI mock → build → Test → Access. We fill them in order. Suggested MVP-first cut:

Phase	Ship
MVP	Monitoring hub (current status) + PMS/DSM links + infra summary (health/risk) for the 2–3 key boxes · PMS look-and-feel + login · nginx + TLS on nels.neuralnestsolutions.com
v2	Per-server drill-down + history charts · full fleet coverage · Grafana embeds
v3	Trends/alerts surfaced inline · richer risk scoring · anything else you want once you’ve used it

6. Open questions for you (so the BRD lands right)

Server set: which boxes are in-scope for the infra dashboard (all of the above, or a subset)?
History depth: how far back should history go (days / weeks / months)? Drives the store choice.
Freshness: how “instant” — every 1 min, 5 min, 15 min? (Tighter = more load/SSH.)
Reuse vs standalone: OK to lean on the existing monitor box (Prometheus/Grafana) for history + embeds, rather than building a second metrics stack?
Login: shared PMS login, or its own accounts mirroring PMS?
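
To help weigh the freshness and history-depth questions together, a back-of-envelope sizing sketch follows. It assumes one SSH session per box per poll and one stored row per metric per poll. The inputs in the example (7 boxes, 20 metrics per box, 90 days) are purely illustrative, not agreed figures.

```python
def sizing(boxes, metrics_per_box, interval_min, retention_days):
    """Return (SSH sessions per day, rows retained in the store)."""
    polls_per_box_per_day = (24 * 60) // interval_min
    ssh_sessions_per_day = boxes * polls_per_box_per_day
    rows_retained = ssh_sessions_per_day * metrics_per_box * retention_days
    return ssh_sessions_per_day, rows_retained


# Illustrative inputs only: 7 boxes, 20 metrics each, 90 days of history.
for interval in (1, 5, 15):
    sessions, rows = sizing(boxes=7, metrics_per_box=20,
                            interval_min=interval, retention_days=90)
    print(f"{interval:>2} min: {sessions:>6} SSH sessions/day, "
          f"{rows:>10,} rows kept")
```

With these illustrative inputs, a 5-minute poll means 2,016 SSH sessions a day and about 3.6 million rows over 90 days; a 1-minute poll multiplies both by five.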

7. Next step

On your nod (+ answers above), I’ll turn this into the formal BRD in the NELS Documents, then FSD + a clickable UI mock — same flow PMS used — before any build. Nothing gets built until you’ve steered the scope.

NELS Portal brainstorm v0.1 · prepared by NELS for BH direction · delivered via NELS project Documents (PMS) · no build started.