Monitoring Your Node

Day-to-day operation involves watching for issues before they become problems. This page covers what to monitor and how.

Key Health Indicators

Lotus Sync Status

Your Lotus node must stay synced. If it falls behind, your Piri node can't track epochs correctly and may miss proof windows.

Check Lotus sync:

lotus sync status

Look for the chain height and sync status. If your node is significantly behind the network height (the alert table under Alerts to Configure treats more than 100 epochs as critical), investigate immediately.
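As a rough cross-check on sync lag, the expected network height can be estimated from the wall clock, because Filecoin produces one epoch every 30 seconds starting from the network's genesis timestamp. The sketch below assumes Filecoin mainnet (genesis at Unix time 1598306400) and a local height you have already read, for example from `lotus sync status` or the `chain_current_epoch` metric. The function names are hypothetical, and the 100-epoch threshold is borrowed from the alert table on this page.

```python
import time
from typing import Optional

# Assumption: Filecoin mainnet. Other networks use a different genesis time.
MAINNET_GENESIS_UNIX = 1598306400  # 2020-08-24 22:00:00 UTC
EPOCH_SECONDS = 30                 # one epoch every 30 seconds


def expected_epoch(now_unix: float, genesis_unix: int = MAINNET_GENESIS_UNIX) -> int:
    """Estimate the current network epoch from wall-clock time."""
    return int(now_unix - genesis_unix) // EPOCH_SECONDS


def sync_lag(local_height: int, now_unix: Optional[float] = None) -> int:
    """Epochs by which the local node trails the estimated network height."""
    if now_unix is None:
        now_unix = time.time()
    return max(0, expected_epoch(now_unix) - local_height)


def is_critically_behind(local_height: int, now_unix: Optional[float] = None,
                         threshold: int = 100) -> bool:
    """True when the lag exceeds the threshold (100 epochs is illustrative)."""
    return sync_lag(local_height, now_unix) > threshold
```

The estimate depends on the machine's clock being accurate, so treat it as a sanity check alongside the height Lotus itself reports, not a replacement for it.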

Proof Set State

Your proof set state tells you whether proving is healthy:

piri client pdp proofset state

Watch for:

Challenge issued but not proven: Your node should generate proofs promptly
Fault state: Indicates a missed challenge window. Investigate the cause.
Epochs until next challenge: Shows how much time before the next proof is due
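To turn "epochs until next challenge" into wall-clock time, multiply by the 30-second Filecoin epoch duration. The helper below is a minimal sketch; its name and the sample epoch numbers are hypothetical, and the two inputs correspond to the `chain_current_epoch` and `next_challenge_window_start_epoch` metrics listed under Key Metrics.

```python
from datetime import timedelta

EPOCH_SECONDS = 30  # Filecoin epoch duration


def time_until_challenge(current_epoch: int, next_challenge_epoch: int) -> timedelta:
    """Convert the remaining epochs into a wall-clock duration.

    Returns a zero duration if the challenge window has already opened.
    """
    remaining = max(0, next_challenge_epoch - current_epoch)
    return timedelta(seconds=remaining * EPOCH_SECONDS)


# Hypothetical values: 240 epochs remaining is 240 * 30 s = 2 hours.
print(time_until_challenge(5_000_000, 5_000_240))  # 2:00:00
```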

Job Queue Health

Monitor your job queues for stuck or failed jobs:

Queue	Purpose
`replicator`	Data replication transfers
`aggregator`	Piece aggregation
`egress_tracker`	Retrieval event submission

A growing backlog or high failure rate indicates problems. Check logs for error details.
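One way to make "growing backlog" and "high failure rate" concrete is to look at a short series of queue-depth samples and at the ratio of failed to finished jobs. The sketch below is illustrative only: the function names are hypothetical, the samples are assumed to come from periodic readings of the `queued_jobs` metric, and the count of completed jobs is assumed to be available from your own bookkeeping.

```python
from typing import Sequence


def backlog_is_growing(samples: Sequence[int], min_samples: int = 3) -> bool:
    """True when every sample is strictly larger than the one before it."""
    if len(samples) < min_samples:
        return False  # too few readings to call it a trend
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


def failure_rate(failed: int, completed: int) -> float:
    """Fraction of finished jobs that failed; 0.0 when nothing has finished."""
    total = failed + completed
    return failed / total if total else 0.0


# Hypothetical readings taken a few minutes apart.
print(backlog_is_growing([4, 9, 15, 22]))  # True
print(backlog_is_growing([4, 9, 7, 22]))   # False
```

Requiring strictly increasing samples keeps short bursts from triggering a warning; a longer window or a slope threshold may suit busier nodes better.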

Disk Space

Monitor free space in your data directory:

df -h /path/to/data_dir

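Running out of space will cause failures. Set up alerts well before capacity is reached, for example a warning at 10% free and a critical alert at 5% free, as in the alert table under Alerts to Configure.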
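The same check can be scripted for a cron job or alerting hook. This is a minimal sketch using Python's standard `shutil.disk_usage`; the function names are hypothetical and the 10% and 5% thresholds mirror the alert table on this page.

```python
import shutil


def free_percent(path: str) -> float:
    """Percentage of the filesystem containing `path` that is still free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total * 100


def disk_severity(free_pct: float) -> str:
    """Map free space to a severity using the 10% / 5% thresholds."""
    if free_pct < 5:
        return "critical"
    if free_pct < 10:
        return "warning"
    return "ok"


# Usage, with your real data directory substituted for the placeholder:
#   disk_severity(free_percent("/path/to/data_dir"))
```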

System Resources

Basic system health:

CPU: Sustained high usage may indicate resource constraints
Memory: Watch for memory pressure or swapping
Network: Ensure bandwidth is sufficient for replication and retrieval

Telemetry

Piri emits OpenTelemetry metrics and traces for detailed observability.

Key Metrics

Metric	What It Tells You
`active_jobs`	Currently running jobs per queue
`queued_jobs`	Jobs waiting in queue
`failed_jobs`	Permanently failed jobs (investigate these)
`job_duration`	How long jobs take
`system_cpu_utilization`	CPU usage
`system_memory_used_bytes`	Memory usage
`piri_datadir_free_bytes`	Available disk space
`chain_current_epoch`	Current Filecoin epoch
`next_challenge_window_start_epoch`	When next challenge starts

Setting Up Metrics Collection

Configure a metrics endpoint:

[[telemetry.metrics]]
endpoint = "http://your-collector:4317"
insecure = true
publish_interval = "30s"

Send metrics to an OpenTelemetry Collector or any OTLP-compatible backend; from there they can be routed to Prometheus, Grafana, or similar tools.

To disable metrics reporting to Storacha, see Telemetry > How to Opt Out.

Logs

Piri logs operational events. Adjust log levels dynamically:

# Increase verbosity for a subsystem
piri client admin log set pdp debug

# List all log subsystems
piri client admin log list

When troubleshooting, increase verbosity for the relevant subsystem, reproduce the issue, then review logs.

Health Endpoint

Your node exposes a health endpoint:

curl https://your-node.example.com/health

Use this for load balancer health checks or uptime monitoring.
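For a simple uptime probe outside a load balancer, the endpoint can be polled from a script. The sketch below uses only the Python standard library; the function name is hypothetical, and it assumes that any 2xx response means the node is healthy, since the response body format is not described here.

```python
import http.client
import urllib.request


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True only if the endpoint answers with a 2xx status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # OSError covers URLError, HTTPError (4xx/5xx), timeouts, and
        # refused connections; HTTPException covers malformed responses.
        return False


# Usage, with your node's real hostname substituted for the placeholder:
#   is_healthy("https://your-node.example.com/health")
```

Setting an explicit timeout matters here: a node that accepts the connection but never answers should count as unhealthy, not hang the probe.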

Alerts to Configure

Recommended alerts:

Condition	Severity	Action
Lotus sync behind by >100 epochs	Critical	Check Lotus node immediately
Proof set in fault state	Critical	Investigate missed proof
Disk space <10% free	Warning	Expand storage or clean up
Disk space <5% free	Critical	Immediate action required
Failed jobs accumulating	Warning	Check logs for root cause
No proofs submitted in proving period	Critical	Verify node is running and healthy

Regular Checks

Daily:

Verify Lotus is synced
Check proof set state for faults
Review failed job counts

Weekly:

Review disk space trends
Check for software updates (`piri status upgrade-check`)
Verify wallet balance for gas

Monthly:

Review overall job success rates
Check egress payment status
Consider settling storage payments if they have accumulated

Troubleshooting

Proofs Not Submitting

Check Lotus sync status
Verify wallet has FIL for gas
Check proof set state for errors
Review PDP task logs

Replication Failing

Check replicator queue for stuck jobs
Verify network connectivity to source
Check disk space
Review replicator logs

High Job Failure Rate

Identify which queue is failing
Check logs for specific error messages
Verify external dependencies (Lotus, network, disk)
Check whether the issues are transient or persistent

For detailed troubleshooting, see the specific subsystem's documentation and logs.