From aa4dc6ef13f80501d8eb14fc3089ea5452b800f9 Mon Sep 17 00:00:00 2001 From: songzhendong Date: Wed, 24 Jun 2026 20:33:43 +0800 Subject: [PATCH] Add PHP runtime PHM meter analyzer, e2e, and documentation. Introduce php-runtime.yaml MAL rules for six instance_php_* meters, MAL tests, PHP e2e meter assertions, and backend docs. Pin SW_AGENT_PHP_COMMIT to apache/skywalking-php@de311c9 (PHM #145). Use metrics_report_period=31 in e2e php.ini so meter gRPC streams idle-close and OAP persists PHM data. --- docs/en/changes/changes.md | 5 + docs/en/setup/backend/backend-meter.md | 1 + .../setup/backend/dashboards-php-runtime.md | 73 ++++++++++++++ docs/menu.yml | 2 + .../php-runtime.data.yaml | 96 +++++++++++++++++++ .../src/main/resources/application.yml | 2 +- .../meter-analyzer-config/php-runtime.yaml | 33 +++++++ test/e2e-v2/cases/php/e2e.yaml | 31 ++++++ test/e2e-v2/cases/php/php.ini | 3 + .../cases/storage/expected/config-dump.yml | 2 +- test/e2e-v2/script/env | 2 +- 11 files changed, 247 insertions(+), 3 deletions(-) create mode 100644 docs/en/setup/backend/dashboards-php-runtime.md create mode 100644 oap-server/analyzer/meter-analyzer-scripts-test/src/test/resources/scripts/mal/test-meter-analyzer-config/php-runtime.data.yaml create mode 100644 oap-server/server-starter/src/main/resources/meter-analyzer-config/php-runtime.yaml diff --git a/docs/en/changes/changes.md b/docs/en/changes/changes.md index 82debb79ee04..3ef88c95d8a0 100644 --- a/docs/en/changes/changes.md +++ b/docs/en/changes/changes.md @@ -245,6 +245,10 @@ admin-host only" entry above for the public REST retirement. #### OAP Server +* Add PHP runtime PHM meter analyzer (`php-runtime.yaml`) for SkyWalking PHP agent process + metrics (CPU, memory, virtual memory, thread count, open file descriptors sampled from + `/proc` on Linux). Registers six `meter_instance_php_*` metrics on the General Service + layer; `php-runtime` is included in the default `meterAnalyzerActiveFiles`. 
* Batch the BanyanDB schema fence per runtime-rule apply. A runtime-rule file changes dozens of rules at once, but the post-DDL fence (`SchemaWatcher.awaitRevisionApplied`) ran once per metric/downsampling, so a large file did `K×M` sequential ≤2s fences — on a laggy cluster that overran the apply's REST budget. The main-node apply path now uses `StorageManipulationOpt.withSchemaChangeDeferredFence()`: the installer records each resource's `mod_revision` without fencing and registers a single flush that the apply runs once on the file's max revision, collapsing the whole file to one barrier. The flush is one-shot — a reconciler tick reuses one opt across every rule file, so after a file flushes, the closure and accumulated revision reset and each file fences on its own DDL only. Drops still fence inline on the dropped resource's own delete revision — or, when that delete recorded no tombstone (`mod_revision == 0`), on a key-based deletion barrier (`AwaitSchemaDeleted`) — never on the shared opt's cumulative revision, so a tombstone-less delete in a multi-file tick is still confirmed removed. On the operator REST apply the single create/update fence runs on a configurable, generous budget (default 180s) in the background **before** the rule row is persisted and dispatch resumes — it gates the persist + local commit + peer resume so the durable commit point is only reached once the schema is confirmed cluster-wide, and writes never resume against an un-propagated schema (see the apply-status entry below); the reconciler tick keeps the short inline 2s fence (a background reconcile must not wait minutes per file). Peer / `withoutSchemaChange` applies are unaffected (no fence). * Add a runtime-rule apply-status query. 
The cluster main now tracks each structural apply through a phase machine (`SchemaApplyCoordinator`: pending → DDL → fencing → rolling-out → applied, with `degraded` for a committed-but-unconfirmed apply — the cluster schema fence did not confirm within the timeout, in which case the lagging data-node ids are surfaced as `fenceLaggards` and dispatch is resumed anyway, or the local commit-tail threw — and `failed` carrying the specific reason). The schema fence runs on a configurable, generous budget (`receiver-runtime-rule.deferredFenceTimeoutSeconds`, default 180s) and **gates everything durable or visible**: because an un-propagated write is silently dropped at the data node, the order after a successful DDL is suspend → DDL → **fence → persist → commit → resume**. The rule row (the durable commit point) is written only AFTER the fence confirms, so "durable" implies "schema propagated cluster-wide" — a main crash before persist leaves no row (peers/crash-recovery stay safely on the old content; the orphaned measure is inert), and any durable row is guaranteed fence-confirmed, so convergence never resumes dispatch against an unpropagated schema. The fence + persist + resume run in the background so they never block the HTTP response — `POST /addOrUpdate` returns its `applyId` immediately at `fencing` (accepted, not yet durable; dispatch for that rule still paused — a clean gap, not dropped writes), and the operator polls `GET /runtime/rule/status` to watch `fencing → rolling-out → applied` (or `degraded`/`failed`); on a genuine laggard, dispatch resumes after the budget so one stuck node can't park the metric forever. A `GetApplyStatus` admin-internal gRPC served by the main backs the query — by `applyId`, or by `catalog`+`name` (+ optional `contentHash`, the durable identity) once the handle is gone after a page refresh. 
When the live status is gone (apply-id evicted, main restarted, or the main is unreachable), the query degrades to the durable rule row: a matching `ACTIVE` row reports `applied` derived from the content hash (a durable row is, by the fence-then-persist order, already propagation-confirmed). Non-main nodes route the read to the deterministic main; status is in-memory by design, with the content hash reconstructing truth after a restart. * Push runtime-rule convergence to peers on commit. After a successful structural apply — and on the `commit_deferred` path, where the DB row is durable but this node's commit-tail threw — the main broadcasts a `NotifyApplied` admin-internal RPC so peers reconcile against the just-persisted DB row immediately, instead of waiting up to one refresh tick (~30s) to notice it. The fan-out runs off the REST response thread (fire-and-forget on a daemon executor) so an unreachable peer's per-call deadline never adds to the operator's apply latency. On the peer side the notify-triggered reconcile is coalesced: a burst of notifies (a multi-rule file, or several applies) collapses to a single queued full reconcile rather than one redundant `dao.getAll()` scan per notify. The notify is best-effort and idempotent (the peer runs its normal per-file-locked reconcile; a lost notify is harmless — the peer still self-converges on its next tick), so it tightens the cluster-convergence window without adding a hard dependency on the main being reachable. @@ -328,5 +332,6 @@ * Add WeChat / Alipay Mini Program monitoring setup documentation, plus a client-side-monitoring section in the security guide covering public-internet ingress (OTLP + `/v3/segments`) for mobile / browser / mini-program SDKs. 
* Improve downsampling documentation * Fix the docker-compose quickstart: OAP healthcheck no longer calls `curl` (absent from the JRE image) and probes the query port via bash `/dev/tcp`; the Horizon UI service maps the correct container port (8081) and mounts a `horizon.yaml` (binding `0.0.0.0`, OAP URLs, demo `admin`/`admin` login) instead of non-existent `SW_*_ADDRESS` env vars. +* Add PHP runtime metrics (PHM) dashboard documentation (agent setup, OAP `php-runtime` MAL rules, Horizon UI widgets). All issues and pull requests are [here](https://github.com/apache/skywalking/issues?q=milestone:11.0.0) diff --git a/docs/en/setup/backend/backend-meter.md b/docs/en/setup/backend/backend-meter.md index cf4ee971dc38..fa1bf25f5d4b 100644 --- a/docs/en/setup/backend/backend-meter.md +++ b/docs/en/setup/backend/backend-meter.md @@ -53,6 +53,7 @@ All following agents and components have built-in meters reporting to the OAP th 5. Java agent for thread-pool metrics 6. Rover(eBPF) agent for metrics used continues profiling 7. Satellite proxy self-observability metrics +8. PHP agent for PHM (PHP Health Metrics) runtime metrics — **Linux only** (`/proc` sampling) ## Configuration file diff --git a/docs/en/setup/backend/dashboards-php-runtime.md b/docs/en/setup/backend/dashboards-php-runtime.md new file mode 100644 index 000000000000..2c1c90510508 --- /dev/null +++ b/docs/en/setup/backend/dashboards-php-runtime.md @@ -0,0 +1,73 @@ +# PHP runtime metrics (PHM) + +The SkyWalking PHP agent can report **PHP Health Metrics (PHM)** through the native Meter protocol. +OAP parses them with MAL rules in `php-runtime.yaml` and stores +them as `meter_*` metrics on the **General Service** layer. + +Requires a PHP agent build that includes PHM (merged in `apache/skywalking-php` 1.2.0+). + +## Platform support + +PHM process meters are **Linux only**. 
In `grpc` / `kafka` reporter mode, the forked reporter worker
+samples the **parent PHP process** (`getppid()`) through `/proc` (`/proc/{pid}/status`, `stat`, and
+`fd`). They are not collected on macOS or Windows, and PHM does not run when `reporter_type =
+standalone`. Instance dashboard widgets stay hidden when no `meter_instance_php_*` data exists.
+
+## Data flow
+
+1. PHM is **On by default on Linux** when the agent is active (`skywalking_agent.enable = On`). Set
+   `skywalking_agent.metrics_enable = Off` to disable.
+2. The forked reporter worker boots `skywalking::metrics::Metricer` in `start_worker` and samples
+   `/proc` on `metrics_report_period` (default 30 seconds). No HTTP traffic is required.
+3. OAP loads `meter-analyzer-config/php-runtime.yaml` when `php-runtime` is listed in
+   `agent-analyzer.default.meterAnalyzerActiveFiles`.
+4. Horizon UI renders widgets on **General Service → Instance → Dashboard** when the corresponding
+   `meter_instance_php_*` metrics exist.
+
+## Agent setup
+
+```ini
+; Default On on Linux when the agent is active.
+; skywalking_agent.metrics_enable = Off
+
+skywalking_agent.metrics_report_period = 30
+```
+
+Refer to the PHP agent README and INI settings documentation for details.
+
+## OAP setup
+
+Ensure `php-runtime` is active (included by default when using the stock `application.yml`):
+
+```yaml
+agent-analyzer:
+  default:
+    meterAnalyzerActiveFiles: ...,php-runtime,...
+```
+
+## UI location
+
+**Layer:** General Service (`GENERAL`)
+
+**Path:** select a PHP service → **Instance** → **Dashboard**
+
+Widgets appear only when PHM data is present (`visibleWhen` checks each `meter_instance_php_*` expression).
+
+## Runtime metrics
+
+Agent meter names (reported by the PHP agent) are rewritten by OAP MAL `metricPrefix: meter`:
+
+| Unit | Agent meter name | OAP / UI metric name | Description | Data Source |
+|-------|------------------------------------------|--------------------------------------------|-----------------------------------------|----------------------|
+| % | instance_php_process_cpu_utilization | meter_instance_php_process_cpu_utilization | Process CPU utilization | SkyWalking PHP Agent |
+| MB | instance_php_memory_used_mb | meter_instance_php_memory_used_mb | Resident memory (VmRSS from /proc) | SkyWalking PHP Agent |
+| MB | instance_php_memory_peak_mb | meter_instance_php_memory_peak_mb | Peak resident memory (VmHWM from /proc) | SkyWalking PHP Agent |
+| MB | instance_php_virtual_memory_mb | meter_instance_php_virtual_memory_mb | Virtual memory (VmSize from /proc) | SkyWalking PHP Agent |
+| — | instance_php_thread_count | meter_instance_php_thread_count | OS thread count (Threads from /proc) | SkyWalking PHP Agent |
+| — | instance_php_open_fd_count | meter_instance_php_open_fd_count | Open file descriptor count | SkyWalking PHP Agent |
+
+## Customizations
+
+You can customize MAL expressions or dashboard panels. Metric definitions and expression rules are in
+`/meter-analyzer-config/php-runtime.yaml`. Instance dashboard widget templates ship from the
+SkyWalking Horizon UI bundle (`general.json` in apache/skywalking-horizon-ui).
diff --git a/docs/menu.yml b/docs/menu.yml index 542640bf65a5..8fee302d1416 100644 --- a/docs/menu.yml +++ b/docs/menu.yml @@ -260,6 +260,8 @@ catalog: path: "/en/setup/backend/backend-zabbix" - name: "Meter Analysis" path: "/en/setup/backend/backend-meter" + - name: "PHP runtime metrics (PHM)" + path: "/en/setup/backend/dashboards-php-runtime" - name: "Telegraf Metrics" path: "/en/setup/backend/telegraf-receiver" - name: "Apdex Threshold" diff --git a/oap-server/analyzer/meter-analyzer-scripts-test/src/test/resources/scripts/mal/test-meter-analyzer-config/php-runtime.data.yaml b/oap-server/analyzer/meter-analyzer-scripts-test/src/test/resources/scripts/mal/test-meter-analyzer-config/php-runtime.data.yaml new file mode 100644 index 000000000000..3a44d8b3a4bd --- /dev/null +++ b/oap-server/analyzer/meter-analyzer-scripts-test/src/test/resources/scripts/mal/test-meter-analyzer-config/php-runtime.data.yaml @@ -0,0 +1,96 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+ +script: oap-server/server-starter/src/main/resources/meter-analyzer-config/php-runtime.yaml +input: + instance_php_process_cpu_utilization: + - labels: + instance: test-instance + value: 100.0 + instance_php_memory_used_mb: + - labels: + instance: test-instance + value: 256.0 + instance_php_memory_peak_mb: + - labels: + instance: test-instance + value: 512.0 + instance_php_virtual_memory_mb: + - labels: + instance: test-instance + value: 1024.0 + instance_php_thread_count: + - labels: + instance: test-instance + value: 4.0 + instance_php_open_fd_count: + - labels: + instance: test-instance + value: 32.0 +expected: + meter_instance_php_process_cpu_utilization: + entities: + - scope: SERVICE_INSTANCE + instance: test-instance + layer: GENERAL + samples: + - labels: + instance: test-instance + value: 100.0 + meter_instance_php_memory_used_mb: + entities: + - scope: SERVICE_INSTANCE + instance: test-instance + layer: GENERAL + samples: + - labels: + instance: test-instance + value: 256.0 + meter_instance_php_memory_peak_mb: + entities: + - scope: SERVICE_INSTANCE + instance: test-instance + layer: GENERAL + samples: + - labels: + instance: test-instance + value: 512.0 + meter_instance_php_virtual_memory_mb: + entities: + - scope: SERVICE_INSTANCE + instance: test-instance + layer: GENERAL + samples: + - labels: + instance: test-instance + value: 1024.0 + meter_instance_php_thread_count: + entities: + - scope: SERVICE_INSTANCE + instance: test-instance + layer: GENERAL + samples: + - labels: + instance: test-instance + value: 4.0 + meter_instance_php_open_fd_count: + entities: + - scope: SERVICE_INSTANCE + instance: test-instance + layer: GENERAL + samples: + - labels: + instance: test-instance + value: 32.0 diff --git a/oap-server/server-starter/src/main/resources/application.yml b/oap-server/server-starter/src/main/resources/application.yml index d38269cfd303..97cd27ee766d 100644 --- a/oap-server/server-starter/src/main/resources/application.yml +++ 
b/oap-server/server-starter/src/main/resources/application.yml @@ -229,7 +229,7 @@ agent-analyzer: # Nginx and Envoy agents can't get the real remote address. # Exit spans with the component in the list would not generate the client-side instance relation metrics. noUpstreamRealAddressAgents: ${SW_NO_UPSTREAM_REAL_ADDRESS:6000,9000} - meterAnalyzerActiveFiles: ${SW_METER_ANALYZER_ACTIVE_FILES:datasource,threadpool,satellite,go-runtime,python-runtime,continuous-profiling,java-agent,go-agent,ruby-runtime} # Which files could be meter analyzed, files split by "," + meterAnalyzerActiveFiles: ${SW_METER_ANALYZER_ACTIVE_FILES:datasource,threadpool,satellite,go-runtime,python-runtime,continuous-profiling,java-agent,go-agent,ruby-runtime,php-runtime} # Which files could be meter analyzed, files split by "," slowCacheReadThreshold: ${SW_SLOW_CACHE_SLOW_READ_THRESHOLD:default:20,redis:10} # The slow cache read operation thresholds. Unit ms. slowCacheWriteThreshold: ${SW_SLOW_CACHE_SLOW_WRITE_THRESHOLD:default:20,redis:10} # The slow cache write operation thresholds. Unit ms. diff --git a/oap-server/server-starter/src/main/resources/meter-analyzer-config/php-runtime.yaml b/oap-server/server-starter/src/main/resources/meter-analyzer-config/php-runtime.yaml new file mode 100644 index 000000000000..ddc605ae77f3 --- /dev/null +++ b/oap-server/server-starter/src/main/resources/meter-analyzer-config/php-runtime.yaml @@ -0,0 +1,33 @@ +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. 
You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +expSuffix: instance(['service'], ['instance'], Layer.GENERAL) +metricPrefix: meter +metricsRules: + # CPU + - name: instance_php_process_cpu_utilization + exp: instance_php_process_cpu_utilization + # Memory + - name: instance_php_memory_used_mb + exp: instance_php_memory_used_mb + - name: instance_php_memory_peak_mb + exp: instance_php_memory_peak_mb + - name: instance_php_virtual_memory_mb + exp: instance_php_virtual_memory_mb + # Process resources + - name: instance_php_thread_count + exp: instance_php_thread_count + - name: instance_php_open_fd_count + exp: instance_php_open_fd_count diff --git a/test/e2e-v2/cases/php/e2e.yaml b/test/e2e-v2/cases/php/e2e.yaml index 0c6c3edff2f0..93e10f6fd185 100644 --- a/test/e2e-v2/cases/php/e2e.yaml +++ b/test/e2e-v2/cases/php/e2e.yaml @@ -107,3 +107,34 @@ verify: swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql instance list --service-name=php | yq e '.[0].name' - ) \ --service-name=php --dest-instance-name=provider1 --dest-service-name=e2e-service-provider expected: expected/metrics-has-value.yml + # PHP Health Metrics (PHM) — instance meters via native MeterReportService + - query: | + swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_instance_php_process_cpu_utilization --instance-name=$( \ + swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql instance list --service-name=php | yq e '.[0].name' - ) \ + --service-name=php + expected: expected/metrics-has-value.yml + - query: | + swctl --display yaml 
--base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_instance_php_memory_used_mb --instance-name=$( \ + swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql instance list --service-name=php | yq e '.[0].name' - ) \ + --service-name=php + expected: expected/metrics-has-value.yml + - query: | + swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_instance_php_memory_peak_mb --instance-name=$( \ + swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql instance list --service-name=php | yq e '.[0].name' - ) \ + --service-name=php + expected: expected/metrics-has-value.yml + - query: | + swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_instance_php_virtual_memory_mb --instance-name=$( \ + swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql instance list --service-name=php | yq e '.[0].name' - ) \ + --service-name=php + expected: expected/metrics-has-value.yml + - query: | + swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_instance_php_thread_count --instance-name=$( \ + swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql instance list --service-name=php | yq e '.[0].name' - ) \ + --service-name=php + expected: expected/metrics-has-value.yml + - query: | + swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_instance_php_open_fd_count --instance-name=$( \ + swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql instance list --service-name=php | yq e '.[0].name' - ) \ + --service-name=php + expected: expected/metrics-has-value.yml diff --git a/test/e2e-v2/cases/php/php.ini b/test/e2e-v2/cases/php/php.ini index ab601220113f..985702dad891 100644 --- a/test/e2e-v2/cases/php/php.ini +++ b/test/e2e-v2/cases/php/php.ini @@ -25,3 +25,6 @@ 
skywalking_agent.server_addr = oap:11800 skywalking_agent.skywalking_version = 9 skywalking_agent.runtime_dir = /tmp +skywalking_agent.metrics_enable = 1 +; CI: 31s (> skywalking-rs 30s stream idle) so OAP meter process() runs +skywalking_agent.metrics_report_period = 31 diff --git a/test/e2e-v2/cases/storage/expected/config-dump.yml b/test/e2e-v2/cases/storage/expected/config-dump.yml index 960e6efc809b..f847b69d6652 100644 --- a/test/e2e-v2/cases/storage/expected/config-dump.yml +++ b/test/e2e-v2/cases/storage/expected/config-dump.yml @@ -32,7 +32,7 @@ "admin-server.default.port": "17128", "admin-server.provider": "default", "agent-analyzer.default.forceSampleErrorSegment": "true", - "agent-analyzer.default.meterAnalyzerActiveFiles": "datasource,threadpool,satellite,go-runtime,python-runtime,continuous-profiling,java-agent,go-agent,ruby-runtime", + "agent-analyzer.default.meterAnalyzerActiveFiles": "datasource,threadpool,satellite,go-runtime,python-runtime,continuous-profiling,java-agent,go-agent,ruby-runtime,php-runtime", "agent-analyzer.default.noUpstreamRealAddressAgents": "6000,9000", "agent-analyzer.default.segmentStatusAnalysisStrategy": "FROM_SPAN_STATUS", "agent-analyzer.default.slowCacheReadThreshold": "default:20,redis:10", diff --git a/test/e2e-v2/script/env b/test/e2e-v2/script/env index 80df25305174..de74d3dd6408 100644 --- a/test/e2e-v2/script/env +++ b/test/e2e-v2/script/env @@ -24,7 +24,7 @@ SW_AGENT_CLIENT_JS_TEST_COMMIT=4f1eb1dcdbde3ec4a38534bf01dded4ab5d2f016 SW_KUBERNETES_COMMIT_SHA=da0e267f877b9b8e5f7728ae4ea7dc7723a2a073 SW_ROVER_COMMIT=79292fe07f17f98f486e0c4471213e1961fb2d1d SW_BANYANDB_COMMIT=c2d925e4eae4d77edda94e1fd438243483960150 -SW_AGENT_PHP_COMMIT=d1114e7be5d89881eec76e5b56e69ff844691e35 +SW_AGENT_PHP_COMMIT=de311c9cd084e21becade0742cd289bc0f43181d SW_PREDICTOR_COMMIT=54a0197654a3781a6f73ce35146c712af297c994 SW_CTL_COMMIT=85e5afdb3d55c6e5af66a472c3fe8ac024d11690