5 changes: 5 additions & 0 deletions docs/en/changes/changes.md
@@ -245,6 +245,10 @@
admin-host only" entry above for the public REST retirement.

#### OAP Server
* Add PHP runtime PHM meter analyzer (`php-runtime.yaml`) for SkyWalking PHP agent process
metrics (CPU, memory, virtual memory, thread count, open file descriptors sampled from
`/proc` on Linux). Registers six `meter_instance_php_*` metrics on the General Service
layer; `php-runtime` is included in the default `meterAnalyzerActiveFiles`.
* Batch the BanyanDB schema fence per runtime-rule apply. A runtime-rule file changes dozens of rules at once, but the post-DDL fence (`SchemaWatcher.awaitRevisionApplied`) ran once per metric/downsampling, so a large file did `K×M` sequential ≤2s fences — on a laggy cluster that overran the apply's REST budget. The main-node apply path now uses `StorageManipulationOpt.withSchemaChangeDeferredFence()`: the installer records each resource's `mod_revision` without fencing and registers a single flush that the apply runs once on the file's max revision, collapsing the whole file to one barrier. The flush is one-shot — a reconciler tick reuses one opt across every rule file, so after a file flushes, the closure and accumulated revision reset and each file fences on its own DDL only. Drops still fence inline on the dropped resource's own delete revision — or, when that delete recorded no tombstone (`mod_revision == 0`), on a key-based deletion barrier (`AwaitSchemaDeleted`) — never on the shared opt's cumulative revision, so a tombstone-less delete in a multi-file tick is still confirmed removed. On the operator REST apply the single create/update fence runs on a configurable, generous budget (default 180s) in the background **before** the rule row is persisted and dispatch resumes — it gates the persist + local commit + peer resume so the durable commit point is only reached once the schema is confirmed cluster-wide, and writes never resume against an un-propagated schema (see the apply-status entry below); the reconciler tick keeps the short inline 2s fence (a background reconcile must not wait minutes per file). Peer / `withoutSchemaChange` applies are unaffected (no fence).
* Add a runtime-rule apply-status query. The cluster main now tracks each structural apply through a phase machine (`SchemaApplyCoordinator`: pending → DDL → fencing → rolling-out → applied, with `degraded` for a committed-but-unconfirmed apply — the cluster schema fence did not confirm within the timeout, in which case the lagging data-node ids are surfaced as `fenceLaggards` and dispatch is resumed anyway, or the local commit-tail threw — and `failed` carrying the specific reason). The schema fence runs on a configurable, generous budget (`receiver-runtime-rule.deferredFenceTimeoutSeconds`, default 180s) and **gates everything durable or visible**: because an un-propagated write is silently dropped at the data node, the order after a successful DDL is suspend → DDL → **fence → persist → commit → resume**. The rule row (the durable commit point) is written only AFTER the fence confirms, so "durable" implies "schema propagated cluster-wide" — a main crash before persist leaves no row (peers/crash-recovery stay safely on the old content; the orphaned measure is inert), and any durable row is guaranteed fence-confirmed, so convergence never resumes dispatch against an unpropagated schema. The fence + persist + resume run in the background so they never block the HTTP response — `POST /addOrUpdate` returns its `applyId` immediately at `fencing` (accepted, not yet durable; dispatch for that rule still paused — a clean gap, not dropped writes), and the operator polls `GET /runtime/rule/status` to watch `fencing → rolling-out → applied` (or `degraded`/`failed`); on a genuine laggard, dispatch resumes after the budget so one stuck node can't park the metric forever. A `GetApplyStatus` admin-internal gRPC served by the main backs the query — by `applyId`, or by `catalog`+`name` (+ optional `contentHash`, the durable identity) once the handle is gone after a page refresh. 
When the live status is gone (apply-id evicted, main restarted, or the main is unreachable), the query degrades to the durable rule row: a matching `ACTIVE` row reports `applied` derived from the content hash (a durable row is, by the fence-then-persist order, already propagation-confirmed). Non-main nodes route the read to the deterministic main; status is in-memory by design, with the content hash reconstructing truth after a restart.
* Push runtime-rule convergence to peers on commit. After a successful structural apply — and on the `commit_deferred` path, where the DB row is durable but this node's commit-tail threw — the main broadcasts a `NotifyApplied` admin-internal RPC so peers reconcile against the just-persisted DB row immediately, instead of waiting up to one refresh tick (~30s) to notice it. The fan-out runs off the REST response thread (fire-and-forget on a daemon executor) so an unreachable peer's per-call deadline never adds to the operator's apply latency. On the peer side the notify-triggered reconcile is coalesced: a burst of notifies (a multi-rule file, or several applies) collapses to a single queued full reconcile rather than one redundant `dao.getAll()` scan per notify. The notify is best-effort and idempotent (the peer runs its normal per-file-locked reconcile; a lost notify is harmless — the peer still self-converges on its next tick), so it tightens the cluster-convergence window without adding a hard dependency on the main being reachable.
@@ -328,5 +332,6 @@
* Add WeChat / Alipay Mini Program monitoring setup documentation, plus a client-side-monitoring section in the security guide covering public-internet ingress (OTLP + `/v3/segments`) for mobile / browser / mini-program SDKs.
* Improve downsampling documentation
* Fix the docker-compose quickstart: OAP healthcheck no longer calls `curl` (absent from the JRE image) and probes the query port via bash `/dev/tcp`; the Horizon UI service maps the correct container port (8081) and mounts a `horizon.yaml` (binding `0.0.0.0`, OAP URLs, demo `admin`/`admin` login) instead of non-existent `SW_*_ADDRESS` env vars.
* Add PHP runtime metrics (PHM) dashboard documentation (agent setup, OAP `php-runtime` MAL rules, Horizon UI widgets).

All issues and pull requests are [here](https://github.com/apache/skywalking/issues?q=milestone:11.0.0)
1 change: 1 addition & 0 deletions docs/en/setup/backend/backend-meter.md
@@ -53,6 +53,7 @@ All following agents and components have built-in meters reporting to the OAP th
5. Java agent for thread-pool metrics
6. Rover (eBPF) agent for metrics used in continuous profiling
7. Satellite proxy self-observability metrics
8. PHP agent for PHM (PHP Health Metrics) runtime metrics — **Linux only** (`/proc` sampling)

## Configuration file

73 changes: 73 additions & 0 deletions docs/en/setup/backend/dashboards-php-runtime.md
@@ -0,0 +1,73 @@
# PHP runtime metrics (PHM)

The SkyWalking PHP agent can report **PHP Health Metrics (PHM)** through the native Meter protocol.
OAP parses them with MAL rules in `php-runtime.yaml` and stores
them as `meter_*` metrics on the **General Service** layer.

Requires a PHP agent build that includes PHM (`apache/skywalking-php` 1.2.0 or later).

## Platform support

PHM process meters are **Linux only**. In `grpc` / `kafka` reporter mode, the forked reporter worker
samples the **parent PHP process** (`getppid()`) through `/proc` (`/proc/{pid}/status`,
`/proc/{pid}/stat`, and `/proc/{pid}/fd`). The meters are not collected on macOS or Windows, and PHM
does not run when `reporter_type = standalone`. Instance dashboard widgets stay hidden when no
`meter_instance_php_*` data exists.
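To make the `/proc` sampling more concrete, here is a minimal Python sketch that pulls the fields named in the metrics table (`VmRSS`, `VmHWM`, `VmSize`, `Threads`) out of the text of `/proc/{pid}/status`. It is an illustration only, not the agent's own code. The function name `parse_proc_status`, the sample text, and the kB-to-MB conversion (dividing by 1024) are assumptions made for this example.

```python
from typing import Dict

# Memory fields in /proc/<pid>/status are reported in kB.
_KB_FIELDS = ("VmRSS", "VmHWM", "VmSize")


def parse_proc_status(text: str) -> Dict[str, float]:
    """Extract memory (converted to MB) and thread count from /proc/<pid>/status text."""
    result: Dict[str, float] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if not parts:
            continue
        if key in _KB_FIELDS:
            result[key] = int(parts[0]) / 1024.0  # kB -> MB (assumed divisor)
        elif key == "Threads":
            result[key] = float(parts[0])
    return result


# Hypothetical excerpt of a status file, used only to exercise the parser.
SAMPLE = (
    "Name:\tphp-fpm\n"
    "VmPeak:\t  300000 kB\n"
    "VmSize:\t  262144 kB\n"
    "VmHWM:\t   65536 kB\n"
    "VmRSS:\t   32768 kB\n"
    "Threads:\t4\n"
)
```

On a Linux host, the same function could be fed the contents of `/proc/<pid>/status` for the parent PHP process; on macOS or Windows that file does not exist, which is consistent with the Linux-only note above.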

## Data flow

1. PHM is **on by default on Linux** when the agent is active (`skywalking_agent.enable = On`). Set
`skywalking_agent.metrics_enable = Off` to disable it.
2. The forked reporter worker boots `skywalking::metrics::Metricer` in `start_worker` and samples
`/proc` every `metrics_report_period` seconds (default 30). No HTTP traffic is required.
3. OAP loads `meter-analyzer-config/php-runtime.yaml` when `php-runtime` is listed in
`agent-analyzer.default.meterAnalyzerActiveFiles`.
4. Horizon UI renders widgets on **General Service → Instance → Dashboard** when the corresponding
`meter_instance_php_*` metrics exist.
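Periodic sampling matters most for CPU utilization, which can only be derived from the difference between two samples. The Python sketch below shows one common way to do that from `/proc/{pid}/stat`. It is a hedged illustration, not the agent's actual algorithm: the function names, the default of 100 clock ticks per second, and the choice to report a percentage of a single core are all assumptions.

```python
def cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # Field 2 (comm) may contain spaces or parentheses, so split after the last ')'.
    fields = stat_line[stat_line.rindex(")") + 1:].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])


def cpu_utilization(prev_ticks: int, cur_ticks: int, elapsed_seconds: float,
                    ticks_per_second: int = 100) -> float:
    """Percent of one CPU core consumed between two samples (assumed definition)."""
    if elapsed_seconds <= 0:
        return 0.0
    cpu_seconds = (cur_ticks - prev_ticks) / ticks_per_second
    return 100.0 * cpu_seconds / elapsed_seconds


# Hypothetical stat lines taken 30 seconds apart (truncated after field 21).
FIRST = "1234 (php-fpm: pool www) S 1 1234 1234 0 -1 4194560 500 0 0 0 150 50 0 0 20 0 1 0"
SECOND = "1234 (php-fpm: pool www) S 1 1234 1234 0 -1 4194560 900 0 0 0 450 50 0 0 20 0 1 0"
```

With these two samples, `cpu_utilization(cpu_ticks(FIRST), cpu_ticks(SECOND), 30.0)` covers 300 ticks (3 CPU-seconds) over 30 seconds. In real use the tick rate would come from the system (for example `os.sysconf("SC_CLK_TCK")` in Python) rather than a hard-coded default.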

## Agent setup

```ini
; Default On on Linux when the agent is active.
; skywalking_agent.metrics_enable = Off

skywalking_agent.metrics_report_period = 30
```

Refer to the PHP agent README and INI settings documentation for details.

## OAP setup

Ensure `php-runtime` is active (included by default when using the stock `application.yml`):

```yaml
agent-analyzer:
  default:
    meterAnalyzerActiveFiles: ...,php-runtime,...
```
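Because the stock value is written as `${SW_METER_ANALYZER_ACTIVE_FILES:<default list>}`, setting that environment variable replaces the whole default list, so an override that omits `php-runtime` would deactivate the rules. The Python sketch below imitates that resolution to check whether `php-runtime` ends up active. It is a simplified stand-in for OAP's placeholder handling, and the function name `active_meter_files` is hypothetical.

```python
import re
from typing import List, Mapping

# Matches a single "${NAME:default}" placeholder (simplified; assumed syntax).
_PLACEHOLDER = re.compile(r"^\$\{(?P<name>[A-Za-z_][A-Za-z0-9_]*):(?P<default>.*)\}$")

# Default value as it appears in the stock application.yml after this change.
STOCK = (
    "${SW_METER_ANALYZER_ACTIVE_FILES:datasource,threadpool,satellite,go-runtime,"
    "python-runtime,continuous-profiling,java-agent,go-agent,ruby-runtime,php-runtime}"
)


def active_meter_files(raw_value: str, env: Mapping[str, str]) -> List[str]:
    """Resolve the placeholder against env and split the result on commas."""
    match = _PLACEHOLDER.match(raw_value.strip())
    if match:
        value = env.get(match.group("name"), match.group("default"))
    else:
        value = raw_value
    return [item.strip() for item in value.split(",") if item.strip()]
```

For example, `"php-runtime" in active_meter_files(STOCK, {})` holds with the stock default, while an environment that sets `SW_METER_ANALYZER_ACTIVE_FILES=datasource,threadpool` drops it.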

## UI location

**Layer:** General Service (`GENERAL`)

**Path:** select a PHP service → **Instance** → **Dashboard**

Widgets appear only when PHM data is present (`visibleWhen` checks each `meter_instance_php_*` expression).

## Runtime metrics

OAP prefixes each meter name reported by the PHP agent with `meter_` (MAL `metricPrefix: meter`), so agent meter names map to OAP / UI metric names as follows:

| Unit | Agent meter name | OAP / UI metric name | Description | Data Source |
|-------|------------------------------------------|--------------------------------------------|-----------------------------------------|----------------------|
| % | instance_php_process_cpu_utilization | meter_instance_php_process_cpu_utilization | Process CPU utilization | SkyWalking PHP Agent |
| MB | instance_php_memory_used_mb | meter_instance_php_memory_used_mb | Resident memory (VmRSS from /proc) | SkyWalking PHP Agent |
| MB | instance_php_memory_peak_mb | meter_instance_php_memory_peak_mb | Peak resident memory (VmHWM from /proc) | SkyWalking PHP Agent |
| MB | instance_php_virtual_memory_mb | meter_instance_php_virtual_memory_mb | Virtual memory (VmSize from /proc) | SkyWalking PHP Agent |
| — | instance_php_thread_count | meter_instance_php_thread_count | OS thread count (Threads from /proc) | SkyWalking PHP Agent |
| — | instance_php_open_fd_count | meter_instance_php_open_fd_count | Open file descriptor count | SkyWalking PHP Agent |
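
The renaming in the table is mechanical, which is handy when generating queries or dashboards for all six meters. A minimal Python sketch, assuming the prefix is joined with a single underscore as the table shows (the helper name `oap_metric_name` is hypothetical):

```python
METRIC_PREFIX = "meter"  # from `metricPrefix: meter` in php-runtime.yaml

AGENT_METERS = (
    "instance_php_process_cpu_utilization",
    "instance_php_memory_used_mb",
    "instance_php_memory_peak_mb",
    "instance_php_virtual_memory_mb",
    "instance_php_thread_count",
    "instance_php_open_fd_count",
)


def oap_metric_name(agent_meter_name: str, prefix: str = METRIC_PREFIX) -> str:
    """Return the OAP / UI metric name for an agent meter name."""
    return f"{prefix}_{agent_meter_name}"


OAP_METRICS = [oap_metric_name(name) for name in AGENT_METERS]
```

Each resulting name can be used directly as a query expression, for example `meter_instance_php_thread_count`.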

## Customizations

You can customize MAL expressions or dashboard panels. Metric definitions and expression rules are in
`/meter-analyzer-config/php-runtime.yaml`. Instance dashboard widget templates ship from the
SkyWalking Horizon UI bundle (`general.json` in apache/skywalking-horizon-ui).
2 changes: 2 additions & 0 deletions docs/menu.yml
@@ -260,6 +260,8 @@ catalog:
    path: "/en/setup/backend/backend-zabbix"
  - name: "Meter Analysis"
    path: "/en/setup/backend/backend-meter"
  - name: "PHP runtime metrics (PHM)"
    path: "/en/setup/backend/dashboards-php-runtime"
  - name: "Telegraf Metrics"
    path: "/en/setup/backend/telegraf-receiver"
  - name: "Apdex Threshold"
@@ -0,0 +1,96 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

script: oap-server/server-starter/src/main/resources/meter-analyzer-config/php-runtime.yaml
input:
  instance_php_process_cpu_utilization:
    - labels:
        instance: test-instance
      value: 100.0
  instance_php_memory_used_mb:
    - labels:
        instance: test-instance
      value: 256.0
  instance_php_memory_peak_mb:
    - labels:
        instance: test-instance
      value: 512.0
  instance_php_virtual_memory_mb:
    - labels:
        instance: test-instance
      value: 1024.0
  instance_php_thread_count:
    - labels:
        instance: test-instance
      value: 4.0
  instance_php_open_fd_count:
    - labels:
        instance: test-instance
      value: 32.0
expected:
  meter_instance_php_process_cpu_utilization:
    entities:
      - scope: SERVICE_INSTANCE
        instance: test-instance
        layer: GENERAL
    samples:
      - labels:
          instance: test-instance
        value: 100.0
  meter_instance_php_memory_used_mb:
    entities:
      - scope: SERVICE_INSTANCE
        instance: test-instance
        layer: GENERAL
    samples:
      - labels:
          instance: test-instance
        value: 256.0
  meter_instance_php_memory_peak_mb:
    entities:
      - scope: SERVICE_INSTANCE
        instance: test-instance
        layer: GENERAL
    samples:
      - labels:
          instance: test-instance
        value: 512.0
  meter_instance_php_virtual_memory_mb:
    entities:
      - scope: SERVICE_INSTANCE
        instance: test-instance
        layer: GENERAL
    samples:
      - labels:
          instance: test-instance
        value: 1024.0
  meter_instance_php_thread_count:
    entities:
      - scope: SERVICE_INSTANCE
        instance: test-instance
        layer: GENERAL
    samples:
      - labels:
          instance: test-instance
        value: 4.0
  meter_instance_php_open_fd_count:
    entities:
      - scope: SERVICE_INSTANCE
        instance: test-instance
        layer: GENERAL
    samples:
      - labels:
          instance: test-instance
        value: 32.0
@@ -229,7 +229,7 @@ agent-analyzer:
# Nginx and Envoy agents can't get the real remote address.
# Exit spans with the component in the list would not generate the client-side instance relation metrics.
noUpstreamRealAddressAgents: ${SW_NO_UPSTREAM_REAL_ADDRESS:6000,9000}
- meterAnalyzerActiveFiles: ${SW_METER_ANALYZER_ACTIVE_FILES:datasource,threadpool,satellite,go-runtime,python-runtime,continuous-profiling,java-agent,go-agent,ruby-runtime} # Which files could be meter analyzed, files split by ","
+ meterAnalyzerActiveFiles: ${SW_METER_ANALYZER_ACTIVE_FILES:datasource,threadpool,satellite,go-runtime,python-runtime,continuous-profiling,java-agent,go-agent,ruby-runtime,php-runtime} # Which files could be meter analyzed, files split by ","
slowCacheReadThreshold: ${SW_SLOW_CACHE_SLOW_READ_THRESHOLD:default:20,redis:10} # The slow cache read operation thresholds. Unit ms.
slowCacheWriteThreshold: ${SW_SLOW_CACHE_SLOW_WRITE_THRESHOLD:default:20,redis:10} # The slow cache write operation thresholds. Unit ms.

@@ -0,0 +1,33 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

expSuffix: instance(['service'], ['instance'], Layer.GENERAL)
metricPrefix: meter
metricsRules:
  # CPU
  - name: instance_php_process_cpu_utilization
    exp: instance_php_process_cpu_utilization
  # Memory
  - name: instance_php_memory_used_mb
    exp: instance_php_memory_used_mb
  - name: instance_php_memory_peak_mb
    exp: instance_php_memory_peak_mb
  - name: instance_php_virtual_memory_mb
    exp: instance_php_virtual_memory_mb
  # Process resources
  - name: instance_php_thread_count
    exp: instance_php_thread_count
  - name: instance_php_open_fd_count
    exp: instance_php_open_fd_count
31 changes: 31 additions & 0 deletions test/e2e-v2/cases/php/e2e.yaml
@@ -107,3 +107,34 @@ verify:
          swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql instance list --service-name=php | yq e '.[0].name' - ) \
          --service-name=php --dest-instance-name=provider1 --dest-service-name=e2e-service-provider
      expected: expected/metrics-has-value.yml
    # PHP Health Metrics (PHM) — instance meters via native MeterReportService
    - query: |
        swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_instance_php_process_cpu_utilization --instance-name=$( \
          swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql instance list --service-name=php | yq e '.[0].name' - ) \
          --service-name=php
      expected: expected/metrics-has-value.yml
    - query: |
        swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_instance_php_memory_used_mb --instance-name=$( \
          swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql instance list --service-name=php | yq e '.[0].name' - ) \
          --service-name=php
      expected: expected/metrics-has-value.yml
    - query: |
        swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_instance_php_memory_peak_mb --instance-name=$( \
          swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql instance list --service-name=php | yq e '.[0].name' - ) \
          --service-name=php
      expected: expected/metrics-has-value.yml
    - query: |
        swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_instance_php_virtual_memory_mb --instance-name=$( \
          swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql instance list --service-name=php | yq e '.[0].name' - ) \
          --service-name=php
      expected: expected/metrics-has-value.yml
    - query: |
        swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_instance_php_thread_count --instance-name=$( \
          swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql instance list --service-name=php | yq e '.[0].name' - ) \
          --service-name=php
      expected: expected/metrics-has-value.yml
    - query: |
        swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql metrics exec --expression=meter_instance_php_open_fd_count --instance-name=$( \
          swctl --display yaml --base-url=http://${oap_host}:${oap_12800}/graphql instance list --service-name=php | yq e '.[0].name' - ) \
          --service-name=php
      expected: expected/metrics-has-value.yml
3 changes: 3 additions & 0 deletions test/e2e-v2/cases/php/php.ini
@@ -25,3 +25,6 @@ skywalking_agent.server_addr = oap:11800
skywalking_agent.skywalking_version = 9
skywalking_agent.runtime_dir = /tmp

skywalking_agent.metrics_enable = 1
; CI: 31s (> skywalking-rs 30s stream idle) so OAP meter process() runs
skywalking_agent.metrics_report_period = 31
2 changes: 1 addition & 1 deletion test/e2e-v2/cases/storage/expected/config-dump.yml
@@ -32,7 +32,7 @@
"admin-server.default.port": "17128",
"admin-server.provider": "default",
"agent-analyzer.default.forceSampleErrorSegment": "true",
- "agent-analyzer.default.meterAnalyzerActiveFiles": "datasource,threadpool,satellite,go-runtime,python-runtime,continuous-profiling,java-agent,go-agent,ruby-runtime",
+ "agent-analyzer.default.meterAnalyzerActiveFiles": "datasource,threadpool,satellite,go-runtime,python-runtime,continuous-profiling,java-agent,go-agent,ruby-runtime,php-runtime",
"agent-analyzer.default.noUpstreamRealAddressAgents": "6000,9000",
"agent-analyzer.default.segmentStatusAnalysisStrategy": "FROM_SPAN_STATUS",
"agent-analyzer.default.slowCacheReadThreshold": "default:20,redis:10",