
This guide covers common issues, diagnostic steps, and solutions for operating Walrus in production.

Quick Diagnostics

When encountering issues, start with these diagnostic commands:
# Check cluster health
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 metrics

# Check topic state
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs

# Test write operation
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 put test "diagnostic write"

# Test read operation
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 get test
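
To triage a whole cluster at once, the same checks can be scripted across every node. A minimal sketch, assuming the three-node layout on client ports 9091-9093 used throughout this guide:
#!/usr/bin/env bash
# Run the basic diagnostics against every node in a local 3-node cluster.
# Assumes client ports 9091-9093; adjust for your deployment.
for port in 9091 9092 9093; do
  echo "=== node on port $port ==="
  cargo run --bin walrus-cli -- --addr 127.0.0.1:$port metrics || echo "node on $port unreachable"
done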

Common Issues

No Raft Leader Elected

Symptoms:
  • METRICS shows "current_leader": null
  • All nodes report state: "Candidate" or state: "Follower"
  • Write operations fail with leader errors
Causes:
  1. Fewer than a quorum (a strict majority, i.e. more than 50%) of nodes are running
  2. Network partition preventing consensus
  3. Clock skew between nodes
  4. Bootstrap node didn’t initialize properly
Diagnostic steps:

1. Check node count

# Ensure at least 2 of 3 nodes are running (or 3 of 5)
ps aux | grep walrus
For a 3-node cluster, at least 2 nodes must be running for quorum.
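
The quorum arithmetic itself is easy to sanity-check; a quick sketch (the process count is only meaningful for a single-host test cluster):
# Quorum size for an n-node cluster is floor(n/2) + 1
n=3
echo "quorum: $(( n / 2 + 1 ))"   # 2 for a 3-node cluster, 3 for 5 nodes

# Count walrus processes running on this host
pgrep -c -f walrus
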
2. Check network connectivity

# From node 2, test connectivity to node 1
nc -zv 127.0.0.1 6001

# Check if Raft ports are reachable
telnet node1-addr 6001
3. Check bootstrap status

# Node 1 should show as initial leader
curl http://127.0.0.1:9091/metrics | jq '.state'
If node 1 was not bootstrapped correctly, restart with proper flags:
cargo run -- --node-id 1 --raft-port 6001 --client-port 9091
# Do NOT use --join on the first node
4. Check clock synchronization

# Install and enable NTP
sudo systemctl status chronyd  # or ntpd

# Check time on all nodes
date
Clock skew > 1 second can cause Raft consensus issues. Use NTP to synchronize clocks.
Solution:
# If cluster is stuck, restart all nodes in order:

# 1. Stop all nodes
pkill walrus

# 2. Start node 1 (bootstrap leader)
cargo run -- --node-id 1 --raft-port 6001 --client-port 9091 &

# 3. Wait 5 seconds for node 1 to initialize
sleep 5

# 4. Start node 2
cargo run -- --node-id 2 --raft-port 6002 --client-port 9092 --join 127.0.0.1:6001 &

# 5. Start node 3
cargo run -- --node-id 3 --raft-port 6003 --client-port 9093 --join 127.0.0.1:6001 &
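
Rather than checking by hand, you can poll until a leader appears. A sketch using the metrics output shown elsewhere in this guide:
# Poll node 1 until a leader is elected (give up after 30 attempts)
for i in $(seq 1 30); do
  leader=$(cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 metrics | jq -r '.current_leader')
  if [ "$leader" != "null" ] && [ -n "$leader" ]; then
    echo "leader elected: node $leader"
    break
  fi
  echo "no leader yet (attempt $i)"
  sleep 1
done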

Write Operations Failing

Symptoms:
  • PUT commands return ERR NotLeaderError
  • METRICS shows leader elected but writes still fail
Causes:
  1. Lease synchronization not complete
  2. Writing to a sealed segment
  3. Incorrect topic routing
Cause 1: Lease synchronization not complete. Leases are synchronized every 100ms, so wait and retry:
# Wait 200ms and retry
sleep 0.2
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 put logs "test"
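
Because lease propagation is time-bounded, a short retry loop usually suffices. A sketch that retries the same CLI write with a small backoff:
# Retry a write up to 5 times, backing off between attempts
for i in 1 2 3 4 5; do
  if cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 put logs "test" | grep -q OK; then
    echo "write succeeded on attempt $i"
    break
  fi
  sleep 0.2
done
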
Cause 2: Writing to a sealed segment. Verify the topic metadata:
# Verify topic metadata
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs
Look for:
  • current_segment: Active segment ID
  • leader_node: Node responsible for writes
If metadata is inconsistent, the cluster may need time to converge after a rollover.
Cause 3: Incorrect topic routing. Find the leader and send writes directly to it:
# Find the leader
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 metrics | jq '.current_leader'

# Connect directly to leader's client port
# If leader is node 2 (port 9092)
cargo run --bin walrus-cli -- --addr 127.0.0.1:9092 put logs "test"
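
The leader lookup can be scripted. A sketch, assuming the node-ID-to-client-port convention from this guide's examples (node N listens on 9090+N):
# Discover the current leader and write to it directly.
leader=$(cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 metrics | jq -r '.current_leader')
if [ "$leader" = "null" ] || [ -z "$leader" ]; then
  echo "no leader elected" >&2
else
  port=$(( 9090 + leader ))
  cargo run --bin walrus-cli -- --addr 127.0.0.1:$port put logs "test"
fi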

Read Operations Return EMPTY

Symptoms:
  • GET commands return EMPTY despite previous writes
  • Writes succeeded but reads find no data
Diagnostic:
# Check if topic was created
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs

# Verify writes were successful
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 put logs "test write"
# Should return: OK

# Try reading again
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 get logs
Causes & Solutions:
Cause: Read cursor is at the end of available data. Check:
# Get topic state
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs
If the cursor has already consumed all entries, write new data:
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 put logs "new entry"
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 get logs

Follower Replication Lag

Symptoms:
  • METRICS shows large gap between match_index and last_log_index
  • Follower consistently behind leader
Diagnostic:
# Check replication status on leader
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 metrics | jq '.replication'

# Example output showing lag:
# "2": {
#   "match_index": 100,    <- Follower at entry 100
#   "next_index": 101
# }
# "last_log_index": 500    <- Leader at entry 500
# LAG: 400 entries
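
Given that output shape, jq can compute the lag per follower. A sketch, assuming last_log_index is a top-level metrics field and .replication maps follower IDs to their match_index, as the example suggests:
# Compute per-follower lag from the leader's metrics
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 metrics \
  | jq '.last_log_index as $head
        | .replication
        | to_entries
        | map({node: .key, lag: ($head - .value.match_index)})'
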
Causes & Solutions:
Check network latency:
# Measure RTT between nodes
ping -c 10 node2-addr

# Check for packet loss
mtr node2-addr
Solution:
  • Improve network connectivity between nodes
  • Use dedicated network for Raft traffic
  • Consider using higher bandwidth links
Check CPU and disk I/O:
# On the lagging follower
top -n 1
iostat -x 1 5
Solution:
  • Reduce load on follower node
  • Add more nodes to distribute load
  • Upgrade hardware (faster disks, more CPU)
Cause: Large snapshots or batch metadata changes. Solution: Wait for replication to catch up:
# Monitor replication progress
watch -n 1 'cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 metrics | jq ".replication"'
Large lags (>1000 entries) may take several minutes to resolve. This is normal after cluster restarts or large metadata changes.

Segment Rollover Not Happening

Symptoms:
  • Segment has exceeded WALRUS_MAX_SEGMENT_ENTRIES but no rollover
  • STATE shows current segment with very high entry count
Diagnostic:
# Check if monitor is running
ps aux | grep walrus
# Look for "Monitor loop started" in logs

# Check current segment size
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs | jq '.current_segment'

# Check environment variable
echo $WALRUS_MAX_SEGMENT_ENTRIES
Causes & Solutions:
1. Monitor Loop Not Running

Check logs for “Monitor loop started”:
tail -f /var/log/walrus/node-1.log | grep "Monitor"
If missing, the monitor task may have panicked. Restart the node.
2. Wrong Node Checking

Only the current segment leader triggers rollover. Verify leadership:
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs | jq '.leader_node'
Check the leader node’s logs for rollover attempts.
3. Raft Consensus Failure

Rollover requires Raft consensus. Check if proposals are failing:
# Look for Raft errors in logs
tail -f /var/log/walrus/node-1.log | grep -i "raft.*error"
If Raft is unhealthy, resolve Raft issues first (see the "No Raft Leader Elected" section above).
4. Force Rollover Test

Lower the threshold temporarily to test rollover:
# Set very low threshold
export WALRUS_MAX_SEGMENT_ENTRIES=10

# Restart the leader node
# Write 20 entries and verify rollover happens
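
A sketch of that write-and-verify step, built from the CLI commands used earlier:
# Record the active segment, write past the lowered threshold, and re-check
before=$(cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs | jq '.current_segment')
for i in $(seq 1 20); do
  cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 put logs "rollover test $i"
done
after=$(cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs | jq '.current_segment')
echo "segment before: $before, after: $after"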

Disk Space Issues

Symptoms:
  • Writes start failing
  • “No space left on device” errors
  • WAL files consuming excessive disk space
Check disk usage:
# Check overall disk usage
df -h /path/to/data

# Check per-node usage
du -sh /path/to/data/node_*

# List largest WAL files
find /path/to/data -type f -name "[0-9]*" -exec ls -lh {} \; | sort -k5 -hr | head -20
Solutions:
Walrus currently has no automatic retention. You must manually delete old segments.
# Identify old sealed segments
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs | jq '.sealed_segments'

# Manually remove old segment files (be careful!)
# Only delete files for segments you no longer need

# Find segment files by timestamp
ls -lt /path/to/data/node_1/user_data/data_plane/
Future versions will include automatic retention policies. For now, implement external cleanup scripts.
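
A minimal sketch of such a cleanup script, selecting sealed-segment files by age. The data path and retention window are assumptions; it prints candidates only (dry run) until you uncomment the delete:
#!/usr/bin/env bash
# Age-based WAL cleanup sketch -- DRY RUN by default.
# Verify candidates against the sealed_segments list before deleting.
DATA_DIR=/path/to/data/node_1/user_data/data_plane
RETENTION_DAYS=7

# List candidates older than the retention window
find "$DATA_DIR" -type f -mtime +"$RETENTION_DAYS" -print

# Once verified, uncomment to actually delete:
# find "$DATA_DIR" -type f -mtime +"$RETENTION_DAYS" -delete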

io_uring Errors

Symptoms:
  • Batch operations failing with “Unsupported” errors
  • Lower than expected throughput
  • Errors mentioning “io_uring”
Check io_uring support:
# Check kernel version (need 5.1+)
uname -r

# Check if liburing is installed
ldconfig -p | grep liburing

# Check system limits
cat /proc/sys/fs/aio-max-nr
Solutions:
Upgrade kernel to 5.6 or later:
# On Ubuntu/Debian
sudo apt-get update
sudo apt-get install linux-generic-hwe-$(lsb_release -rs)

# Reboot into new kernel
sudo reboot
io_uring is stable in Linux 5.6+. Earlier versions (5.1-5.5) have limited support.
Install liburing:
# On Ubuntu/Debian
sudo apt-get install liburing-dev

# On RHEL/CentOS
sudo yum install liburing-devel

# Rebuild Walrus
cargo build --release
If io_uring is not available, disable the FD backend:
export WALRUS_DISABLE_IO_URING=1

# Or in code
use walrus_rust::disable_fd_backend;
disable_fd_backend();
This will reduce batch operation throughput by 3-10x.
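
If you want the fallback applied automatically, a small sketch that disables the backend on kernels older than 5.6:
# Disable the io_uring backend on kernels older than 5.6
major=$(uname -r | cut -d. -f1)
minor=$(uname -r | cut -d. -f2)
if [ "$major" -lt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -lt 6 ]; }; then
  export WALRUS_DISABLE_IO_URING=1
  echo "kernel $(uname -r): io_uring backend disabled"
fi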

High CPU Usage

Symptoms:
  • Walrus processes consuming >80% CPU
  • System slow or unresponsive
Diagnostic:
# Check CPU usage (pgrep -d, joins multiple PIDs with commas)
top -H -p $(pgrep -d, walrus)

# Profile with perf
sudo perf top -p $(pgrep -d, walrus)
Common Causes:
Monitor loop too frequent:
echo $WALRUS_MONITOR_CHECK_MS
# If very low (<1000), increase it
export WALRUS_MONITOR_CHECK_MS=10000

Performance Issues

Low Write Throughput

Expected: 100k-1M writes/sec without fsync, 5k-10k writes/sec with fsync. If throughput is below expected:
1. Verify io_uring is enabled

# Should NOT be set
echo $WALRUS_DISABLE_IO_URING

# Check logs for io_uring usage
tail -f /var/log/walrus/node-1.log | grep -i "uring"
2. Use batch operations

// BAD: Single writes in a loop
for entry in entries {
    wal.append_for_topic("logs", entry)?;
}

// GOOD: Batch write
wal.batch_append_for_topic("logs", &entries)?;
3. Reduce fsync frequency

# Current configuration
cargo run -- --node-id 1 ...

# With less frequent fsync
# (configure in code via FsyncSchedule)
4. Check disk performance

# Test disk write speed
dd if=/dev/zero of=/path/to/data/test bs=1M count=1000 conv=fdatasync

# Should see >500 MB/s on SSD
If disk is slow, consider:
  • Using NVMe SSD instead of SATA SSD
  • Checking I/O scheduler configuration
  • Ensuring no competing I/O workloads

High Read Latency

Expected: <1ms per read on SSD, <10ms for batch reads. If latency is higher:
  1. Check if you are reading from the wrong node: Sealed segments are only on their original leader
  2. Use batch reads: More efficient than single reads
  3. Check disk read performance: Use iostat -x 1 to monitor disk utilization

Data Integrity Issues

Checksum Errors

Symptoms:
  • InvalidData errors when reading
  • “Checksum mismatch” in logs
Causes:
  • Disk corruption
  • Incomplete writes (power failure without fsync)
  • Bug in Walrus storage layer
Recovery:
# Identify corrupted segment
# Check logs for file path and offset

# Option 1: Delete corrupted segment (data loss)
rm /path/to/corrupted/segment/file

# Option 2: Restore from backup
cp /backup/segment/file /path/to/data/

# Option 3: Replicate from another node
# Copy segment files from a healthy node
Always use FsyncSchedule::SyncEach or frequent fsync for critical data to prevent corruption on power failure.
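
When replicating a segment from a healthy node (option 3), verify the copy byte-for-byte. A sketch using scp and sha256sum; the host name and segment file name are placeholders:
# Copy a segment file from a healthy node and verify the transfer.
# node2-addr and SEGMENT are placeholders for your deployment.
scp node2-addr:/path/to/data/node_2/user_data/data_plane/SEGMENT \
    /path/to/data/node_1/user_data/data_plane/SEGMENT

# Compare checksums on both ends
ssh node2-addr sha256sum /path/to/data/node_2/user_data/data_plane/SEGMENT
sha256sum /path/to/data/node_1/user_data/data_plane/SEGMENT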

Cluster State Issues

Split Brain

Symptoms:
  • Multiple nodes claim to be leader
  • Writes succeed on multiple nodes for same segment
  • Inconsistent data across nodes
Split brain is a critical issue that can cause data inconsistency. Immediate action required.
Diagnostic:
# Check leader on all nodes
for port in 9091 9092 9093; do
  echo "Node on port $port:"
  cargo run --bin walrus-cli -- --addr 127.0.0.1:$port metrics | jq '{id, state, current_leader}'
done
Recovery:
  1. Stop all nodes immediately
  2. Backup data directories (see the sketch below)
  3. Identify the true leader (highest term, most recent data)
  4. Restart cluster from scratch using bootstrap node
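
A sketch of the backup step (step 2 above), assuming the per-node data layout used earlier in this guide:
# Stop everything, then snapshot each node's data directory before touching it
pkill walrus
ts=$(date +%Y%m%d-%H%M%S)
for n in 1 2 3; do
  tar czf "/backup/walrus-node_${n}-${ts}.tar.gz" "/path/to/data/node_${n}"
done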

Getting Help

If you’ve exhausted these troubleshooting steps:

GitHub Issues

Report bugs and request features

Discussions

Ask questions and share solutions

Information to Include

When reporting issues, provide the following (the sketch after this list collects most of it automatically):
  1. Walrus version: cargo --version and git commit hash
  2. Cluster configuration: Number of nodes, hardware specs
  3. Environment variables: Output of env | grep WALRUS
  4. Metrics output: METRICS and STATE commands
  5. Logs: Relevant error messages with context
  6. Reproduction steps: How to reproduce the issue
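
A sketch that bundles most of this into a single report file, assuming a node reachable on 127.0.0.1:9091:
#!/usr/bin/env bash
# Collect diagnostics for a bug report. Assumes a node on 127.0.0.1:9091.
out=walrus-report-$(date +%Y%m%d-%H%M%S).txt
{
  echo "== version ==";  cargo --version; git rev-parse HEAD
  echo "== env ==";      env | grep WALRUS
  echo "== metrics ==";  cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 metrics
  echo "== state ==";    cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs
} > "$out" 2>&1
echo "wrote $out"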

Next Steps

Configuration

Review configuration options

Monitoring

Set up monitoring and alerting