Category: Development & Engineering. No API key required.

monitoring-operations

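Use when setting up OCI metrics, alarms, or log collection, or when troubleshooting missing data and silent alarms. Covers metric namespace naming, MQL dimension requirements, alarm missing-data handling, Service Connector IAM gaps, and Cloud Guard integration. Keywords: monitoring, alarms, metrics, MQL, namespace, logs, Service Connector, Log Analytics, Cloud Guard, missing data, oci_computeagent.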

Author: jakexiao (GitHub)

OCI Monitoring and Observability - Expert Knowledge

NEVER Do This

NEVER debug "missing metrics" within the first 15 minutes

  • Metrics are published every 1–5 minutes
  • Processing delay adds another 5–10 minutes
  • Total lag from event to visible metric: 10–15 minutes
  • Premature debugging creates false investigations
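
The waiting rule above can be turned into a small guard for runbooks or scripts. This is a minimal sketch: the function names are hypothetical, and the 15-minute ceiling is simply the upper end of the lag range quoted in the list.

```python
from datetime import datetime, timedelta, timezone

# Upper bound of the 10-15 minute lag quoted above.
MAX_METRIC_LAG = timedelta(minutes=15)


def ready_to_investigate(event_time: datetime, now: datetime,
                         max_lag: timedelta = MAX_METRIC_LAG) -> bool:
    """Return True once the expected publishing lag has fully elapsed."""
    return now - event_time >= max_lag


def minutes_to_wait(event_time: datetime, now: datetime,
                    max_lag: timedelta = MAX_METRIC_LAG) -> float:
    """Minutes left before a missing metric is worth investigating."""
    remaining = max_lag - (now - event_time)
    return max(remaining.total_seconds() / 60, 0.0)


started = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
checked = started + timedelta(minutes=6)
print(ready_to_investigate(started, checked))  # False
print(minutes_to_wait(started, checked))       # 9.0
```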

NEVER use equality (==) for alarm thresholds with sparse metrics

# WRONG - alarm never fires when metric has data gaps
MetricName[1m].mean() == 0

# RIGHT - handle missing data explicitly
MetricName[1m]{dataMissing=zero}.mean() > 0

NEVER omit the resourceId dimension in metric queries

# WRONG - no dimension filter, so the query returns a stream for every resource
CpuUtilization[1m].mean()

# RIGHT - filter by instance OCID
CpuUtilization[1m]{resourceId="<instance-ocid>"}.mean()

Querying without dimensions returns data for ALL resources in the compartment, which is usually not what's intended, and every query counts against the rate limit of 1000 requests/minute.
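
If queries are generated by a script, the filter can be enforced at build time. The sketch below only assembles query strings in the shape shown above; `build_mql` and `instance_query` are hypothetical helper names, and dimension values are inserted verbatim (no escaping of embedded quotes).

```python
from typing import Dict, Optional


def build_mql(metric: str, interval: str = "1m", statistic: str = "mean",
              dimensions: Optional[Dict[str, str]] = None) -> str:
    """Assemble metric[interval]{name="value", ...}.statistic()."""
    selector = ""
    if dimensions:
        pairs = ", ".join(f'{name}="{value}"' for name, value in dimensions.items())
        selector = "{" + pairs + "}"
    return f"{metric}[{interval}]{selector}.{statistic}()"


def instance_query(metric: str, instance_ocid: str, interval: str = "1m") -> str:
    """Build a per-instance query, refusing to produce an unfiltered one."""
    if not instance_ocid.strip():
        raise ValueError("resourceId is empty; refusing to build an unfiltered query")
    return build_mql(metric, interval, "mean", {"resourceId": instance_ocid})


print(instance_query("CpuUtilization", "<instance-ocid>"))
# CpuUtilization[1m]{resourceId="<instance-ocid>"}.mean()
```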

NEVER set alarm thresholds without a trigger delay

# BAD - fires on every transient CPU spike (alert fatigue)
CpuUtilization[1m].mean() > 80

# BETTER - fires only on sustained breach
CpuUtilization[5m].mean() > 80
# + set trigger delay (pendingDuration) to 5 minutes, e.g. PT5M: the breach must persist across 5 consecutive one-minute evaluations
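
To see why a trigger delay suppresses transient spikes, here is a simplified simulation. It assumes one evaluation per minute and an alarm that fires after a fixed number of consecutive breaching evaluations; it models the idea only and does not call any OCI API. The sample series are made up.

```python
from typing import Iterable, Optional


def first_firing_minute(per_minute_values: Iterable[float], threshold: float,
                        pending_minutes: int) -> Optional[int]:
    """Index of the minute at which the alarm would fire, or None.

    The alarm fires once the value has exceeded the threshold for
    `pending_minutes` evaluations in a row; any non-breaching value
    resets the streak.
    """
    streak = 0
    for minute, value in enumerate(per_minute_values):
        streak = streak + 1 if value > threshold else 0
        if streak >= pending_minutes:
            return minute
    return None


spike = [40, 95, 42, 41, 90, 43, 40]       # two isolated spikes
sustained = [40, 85, 88, 91, 86, 90, 87]   # continuous breach from minute 1

print(first_firing_minute(spike, 80, 1))      # 1 (fires on the first spike)
print(first_firing_minute(spike, 80, 5))      # None (spikes never last 5 minutes)
print(first_firing_minute(sustained, 80, 5))  # 5
```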

NEVER create alarms without notification destinations

# WRONG - alarm fires but nobody is notified
oci monitoring alarm create ... --destinations '[]'

# RIGHT - always link to a notification topic
oci monitoring alarm create ... --destinations '["<notification-topic-ocid>"]'

Cost impact: an undetected production outage can cost $5,000–50,000+ per hour.
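
When alarm creation is scripted, an empty destination list can be rejected before the CLI is ever invoked. A minimal sketch, assuming the topic OCIDs arrive as a Python list; `destinations_arg` is a hypothetical helper that only produces the JSON string passed to `--destinations`.

```python
import json
from typing import List


def destinations_arg(topic_ocids: List[str]) -> str:
    """Return the JSON value for --destinations, refusing an empty list."""
    cleaned = [ocid.strip() for ocid in topic_ocids if ocid and ocid.strip()]
    if not cleaned:
        raise ValueError("alarm needs at least one notification topic OCID")
    return json.dumps(cleaned)


print(destinations_arg(["<notification-topic-ocid>"]))
# ["<notification-topic-ocid>"]
```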

NEVER ignore Cloud Guard findings

  • Cloud Guard detects misconfigurations before they become incidents
  • Wire it: Cloud Guard → Notifications → email/Slack/PagerDuty
  • Unresolved findings fail CIS/SOC2/HIPAA audits

Metric Namespace Reference

OCI uses service-specific namespaces — using the wrong namespace returns no data with no error.

| Service | Namespace | Key Metrics |
|----------------|-------------------------|---------------------------------------|
| Compute | oci_computeagent | CpuUtilization, MemoryUtilization |
| Autonomous DB | oci_autonomous_database | CpuUtilization, StorageUtilization |
| Load Balancer | oci_lbaas | HttpRequests, UnHealthyBackendServers |
| Object Storage | oci_objectstorage | ObjectCount, BytesUploaded |

Common mistake: using oci_compute instead of oci_computeagent. Metrics in the oci_computeagent namespace are emitted by the agent on the instance (the oracle-cloud-agent service), so they appear only while that agent is running.
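
Because a wrong namespace fails silently, it can help to resolve namespaces from a fixed table and fail loudly on anything unknown. The sketch below covers only the four services listed in this section; the dictionary keys and the `namespace_for` helper are hypothetical names.

```python
# Namespaces copied from the table in this section; extend as needed.
NAMESPACES = {
    "compute": "oci_computeagent",
    "autonomous_db": "oci_autonomous_database",
    "load_balancer": "oci_lbaas",
    "object_storage": "oci_objectstorage",
}


def namespace_for(service: str) -> str:
    """Look up a metric namespace, raising instead of guessing."""
    key = service.strip().lower().replace(" ", "_")
    if key not in NAMESPACES:
        known = ", ".join(sorted(NAMESPACES))
        raise KeyError(f"unknown service {service!r}; known services: {known}")
    return NAMESPACES[key]


print(namespace_for("Compute"))        # oci_computeagent
print(namespace_for("Load Balancer"))  # oci_lbaas
```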

Alarm Missing Data Handling

| Setting | Behavior | Use When |
|--------------------------------|--------------------------------|--------------------------------------|
| treatMissingDataAsBreaching | Alarm fires if no data arrives | Critical services (silence = outage) |
| treatMissingDataAsNotBreaching | Alarm silent if no data | Optional or intermittent monitoring |
| {dataMissing=zero} in MQL | Treats gaps as 0 value | Request counters, throughput metrics |
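
The three behaviors in the table can be compared side by side with a toy evaluator. This is an illustrative model only: the policy names `"breaching"`, `"not_breaching"`, and `"zero"` are hypothetical labels for the three table rows, `None` stands for an interval with no data, and the request counts are made up.

```python
from typing import Callable, List, Optional

POLICIES = ("breaching", "not_breaching", "zero")


def breach_flags(samples: List[Optional[float]],
                 breaches: Callable[[float], bool],
                 missing: str) -> List[bool]:
    """Return one breach flag per interval; None in `samples` marks a gap."""
    if missing not in POLICIES:
        raise ValueError(f"missing must be one of {POLICIES}")
    flags = []
    for value in samples:
        if value is None and missing != "zero":
            # The gap itself decides the outcome; the condition is not evaluated.
            flags.append(missing == "breaching")
        else:
            # Either a real sample, or a gap filled with 0 before evaluation.
            flags.append(breaches(0.0 if value is None else value))
    return flags


requests = [120.0, None, 0.0, None]  # hypothetical per-minute request counts


def traffic_stopped(value: float) -> bool:
    return value < 1


print(breach_flags(requests, traffic_stopped, "not_breaching"))  # [False, False, True, False]
print(breach_flags(requests, traffic_stopped, "breaching"))      # [False, True, True, True]
print(breach_flags(requests, traffic_stopped, "zero"))           # [False, True, True, True]
```

Note that zero-filling only counts as a breach when the condition is satisfied by 0; with a condition such as `value > 80`, zero-filled gaps would never breach.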

Log Collection Troubleshooting

Logs not appearing in Log Analytics?
│
├─ Is logging enabled on the resource?
│  ├─ Compute: is Oracle Cloud Agent running? (systemctl status oracle-cloud-agent)
│  └─ Functions: is logging enabled in function configuration?
│
├─ Is Service Connector configured and ACTIVE?
│  ├─ Source: Log Group → Target: Log Analytics
│  └─ Check status: oci sch service-connector get --id <ocid>
│
├─ IAM policy for Service Connector?
│  ├─ "Allow any-user to use log-content in tenancy"
│  ├─ "Allow service loganalytics to READ log-content in tenancy"
│  └─ Missing EITHER policy causes silent failure
│
└─ 10–15 minute ingestion lag?
   └─ Wait before concluding logs are missing
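
The tree above is an ordered checklist, so it can be encoded as a function that returns the first unresolved item. A minimal sketch: every field in `LogPipelineCheck` is a hypothetical input that you would fill in by hand or from your own tooling, and nothing here queries OCI.

```python
from dataclasses import dataclass


@dataclass
class LogPipelineCheck:
    # Hypothetical inputs, one per branch of the tree.
    logging_enabled: bool
    connector_active: bool
    iam_policies_present: bool
    minutes_since_first_expected_log: float


def next_step(check: LogPipelineCheck, max_lag_minutes: float = 15) -> str:
    """Return the first unresolved item, in the same order as the tree."""
    if not check.logging_enabled:
        return "enable logging on the resource"
    if not check.connector_active:
        return "fix the Service Connector (source Log Group, target Log Analytics)"
    if not check.iam_policies_present:
        return "add the missing IAM policy"
    if check.minutes_since_first_expected_log < max_lag_minutes:
        return "wait for ingestion lag to pass"
    return "all checks passed; investigate further"


state = LogPipelineCheck(True, True, False, 3)
print(next_step(state))  # add the missing IAM policy
```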

Metric Query Performance

Unfiltered queries scan ALL resources in the compartment, which is slow and consumes rate-limit budget.

# Expensive: scans all instances
CpuUtilization[1m].mean()

# Optimized: filter to specific instance
CpuUtilization[1m]{resourceId="<instance-ocid>"}.mean()

Rate limit: 1000 metric queries/minute per tenancy. A dashboard with many unfiltered widgets can exhaust this.
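
A quick back-of-the-envelope check shows how fast a dashboard approaches that limit. The sketch assumes each widget issues exactly one query per refresh and that every open copy of the dashboard refreshes independently; both are simplifying assumptions, and the function names are hypothetical.

```python
RATE_LIMIT_PER_MINUTE = 1000  # tenancy-wide figure quoted above


def queries_per_minute(widgets: int, refresh_seconds: float,
                       open_copies: int = 1) -> float:
    """Queries issued per minute if every widget runs one query per refresh."""
    if refresh_seconds <= 0:
        raise ValueError("refresh_seconds must be positive")
    return widgets * open_copies * 60 / refresh_seconds


def budget_fraction(widgets: int, refresh_seconds: float,
                    open_copies: int = 1) -> float:
    """Share of the per-minute rate limit the dashboard would consume."""
    return queries_per_minute(widgets, refresh_seconds, open_copies) / RATE_LIMIT_PER_MINUTE


print(queries_per_minute(20, 30, open_copies=10))  # 400.0
print(budget_fraction(20, 30, open_copies=25))     # 1.0 (limit exhausted)
```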

Progressive Loading Reference

Load references/oci-monitoring-reference.md when:

  • Need the complete list of OCI service metric namespaces and metric names
  • Writing complex MQL expressions (composites, functions, grouping)
  • Implementing composite alarm conditions
  • Setting up Log Analytics workspace, APM, or Service Connector Hub in detail

Do NOT load for alarm threshold patterns, namespace gotchas, or log troubleshooting — this file covers those.