
This guide covers common issues, diagnostic steps, and solutions for operating Walrus in production.

Quick Diagnostics

When encountering issues, start with these diagnostic commands:
# Check cluster health
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 metrics

# Check topic state
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs

# Test write operation
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 put test "diagnostic write"

# Test read operation
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 get test
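
To triage a whole cluster at once, the same checks can be scripted across every node. A minimal sketch, assuming the three-node layout on client ports 9091-9093 used throughout this guide:
#!/usr/bin/env bash
# Run the basic diagnostics against every node in a local 3-node cluster.
# Assumes client ports 9091-9093; adjust for your deployment.
for port in 9091 9092 9093; do
  echo "=== node on port $port ==="
  cargo run --bin walrus-cli -- --addr 127.0.0.1:$port metrics || echo "node on $port unreachable"
done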

Common Issues

No Raft Leader Elected

Symptoms:
  • METRICS shows "current_leader": null
  • All nodes report state: "Candidate" or state: "Follower"
  • Write operations fail with leader errors
Causes:
  1. Fewer than a quorum (a strict majority, i.e. more than 50%) of nodes are running
  2. Network partition preventing consensus
  3. Clock skew between nodes
  4. Bootstrap node didn’t initialize properly
Diagnostic steps:

1. Check node count

# Ensure at least 2 of 3 nodes are running (or 3 of 5)
ps aux | grep walrus
For a 3-node cluster, at least 2 nodes must be running for quorum.
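
The quorum arithmetic itself is easy to sanity-check; a quick sketch (the process count is only meaningful for a single-host test cluster):
# Quorum size for an n-node cluster is floor(n/2) + 1
n=3
echo "quorum: $(( n / 2 + 1 ))"   # 2 for a 3-node cluster, 3 for 5 nodes

# Count walrus processes running on this host
pgrep -c -f walrus
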
2. Check network connectivity

# From node 2, test connectivity to node 1
nc -zv 127.0.0.1 6001

# Check if Raft ports are reachable
telnet node1-addr 6001
3. Check bootstrap status

# Node 1 should show as initial leader
curl http://127.0.0.1:9091/metrics | jq '.state'
If node 1 was not bootstrapped correctly, restart with proper flags:
cargo run -- --node-id 1 --raft-port 6001 --client-port 9091
# Do NOT use --join on the first node
4. Check clock synchronization

# Install and enable NTP
sudo systemctl status chronyd  # or ntpd

# Check time on all nodes
date
Clock skew > 1 second can cause Raft consensus issues. Use NTP to synchronize clocks.
Solution:
# If cluster is stuck, restart all nodes in order:

# 1. Stop all nodes
pkill walrus

# 2. Start node 1 (bootstrap leader)
cargo run -- --node-id 1 --raft-port 6001 --client-port 9091 &

# 3. Wait 5 seconds for node 1 to initialize
sleep 5

# 4. Start node 2
cargo run -- --node-id 2 --raft-port 6002 --client-port 9092 --join 127.0.0.1:6001 &

# 5. Start node 3
cargo run -- --node-id 3 --raft-port 6003 --client-port 9093 --join 127.0.0.1:6001 &
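
Rather than checking by hand, you can poll until a leader appears. A sketch using the metrics output shown elsewhere in this guide:
# Poll node 1 until a leader is elected (give up after 30 attempts)
for i in $(seq 1 30); do
  leader=$(cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 metrics | jq -r '.current_leader')
  if [ "$leader" != "null" ] && [ -n "$leader" ]; then
    echo "leader elected: node $leader"
    break
  fi
  echo "no leader yet (attempt $i)"
  sleep 1
done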

Write Operations Failing

Symptoms:
  • PUT commands return ERR NotLeaderError
  • METRICS shows leader elected but writes still fail
Causes:
  1. Lease synchronization not complete
  2. Writing to a sealed segment
  3. Incorrect topic routing
Cause 1: Lease synchronization not complete. Leases are synchronized every 100ms, so wait and retry:
# Wait 200ms and retry
sleep 0.2
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 put logs "test"
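
Because lease propagation is time-bounded, a short retry loop usually suffices. A sketch that retries the same CLI write with a small backoff:
# Retry a write up to 5 times, backing off between attempts
for i in 1 2 3 4 5; do
  if cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 put logs "test" | grep -q OK; then
    echo "write succeeded on attempt $i"
    break
  fi
  sleep 0.2
done
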
Cause 2: Writing to a sealed segment. Verify the topic metadata:
# Verify topic metadata
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs
Look for:
  • current_segment: Active segment ID
  • leader_node: Node responsible for writes
If metadata is inconsistent, the cluster may need time to converge after a rollover.
Cause 3: Incorrect topic routing. Find the leader and send writes directly to it:
# Find the leader
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 metrics | jq '.current_leader'

# Connect directly to leader's client port
# If leader is node 2 (port 9092)
cargo run --bin walrus-cli -- --addr 127.0.0.1:9092 put logs "test"
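
The leader lookup can be scripted. A sketch, assuming the node-ID-to-client-port convention from this guide's examples (node N listens on 9090+N):
# Discover the current leader and write to it directly.
leader=$(cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 metrics | jq -r '.current_leader')
if [ "$leader" = "null" ] || [ -z "$leader" ]; then
  echo "no leader elected" >&2
else
  port=$(( 9090 + leader ))
  cargo run --bin walrus-cli -- --addr 127.0.0.1:$port put logs "test"
fi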

Read Operations Return EMPTY

Symptoms:
  • GET commands return EMPTY despite previous writes
  • Writes succeeded but reads find no data
Diagnostic:
# Check if topic was created
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs

# Verify writes were successful
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 put logs "test write"
# Should return: OK

# Try reading again
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 get logs
Causes & Solutions:
Cause: Read cursor is at the end of available data. Check:
# Get topic state
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs
If the cursor has already consumed all entries, write new data:
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 put logs "new entry"
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 get logs

Follower Replication Lag

Symptoms:
  • METRICS shows large gap between match_index and last_log_index
  • Follower consistently behind leader
Diagnostic:
# Check replication status on leader
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 metrics | jq '.replication'

# Example output showing lag:
# "2": {
#   "match_index": 100,    <- Follower at entry 100
#   "next_index": 101
# }
# "last_log_index": 500    <- Leader at entry 500
# LAG: 400 entries
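
Given that output shape, jq can compute the lag per follower. A sketch, assuming last_log_index is a top-level metrics field and .replication maps follower IDs to their match_index, as the example suggests:
# Compute per-follower lag from the leader's metrics
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 metrics \
  | jq '.last_log_index as $head
        | .replication
        | to_entries
        | map({node: .key, lag: ($head - .value.match_index)})'
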
Causes & Solutions:
Check network latency:
# Measure RTT between nodes
ping -c 10 node2-addr

# Check for packet loss
mtr node2-addr
Solution:
  • Improve network connectivity between nodes
  • Use dedicated network for Raft traffic
  • Consider using higher bandwidth links
Check CPU and disk I/O:
# On the lagging follower
top -n 1
iostat -x 1 5
Solution:
  • Reduce load on follower node
  • Add more nodes to distribute load
  • Upgrade hardware (faster disks, more CPU)
Cause: Large snapshots or batch metadata changes. Solution: Wait for replication to catch up:
# Monitor replication progress
watch -n 1 'cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 metrics | jq ".replication"'
Large lags (>1000 entries) may take several minutes to resolve. This is normal after cluster restarts or large metadata changes.

Segment Rollover Not Happening

Symptoms:
  • Segment has exceeded WALRUS_MAX_SEGMENT_ENTRIES but no rollover
  • STATE shows current segment with very high entry count
Diagnostic:
# Check if monitor is running
ps aux | grep walrus
# Look for "Monitor loop started" in logs

# Check current segment size
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs | jq '.current_segment'

# Check environment variable
echo $WALRUS_MAX_SEGMENT_ENTRIES
Causes & Solutions:
1. Monitor Loop Not Running

Check logs for “Monitor loop started”:
tail -f /var/log/walrus/node-1.log | grep "Monitor"
If missing, the monitor task may have panicked. Restart the node.
2. Wrong Node Checking

Only the current segment leader triggers rollover. Verify leadership:
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs | jq '.leader_node'
Check the leader node’s logs for rollover attempts.
3. Raft Consensus Failure

Rollover requires Raft consensus. Check if proposals are failing:
# Look for Raft errors in logs
tail -f /var/log/walrus/node-1.log | grep -i "raft.*error"
If Raft is unhealthy, resolve Raft issues first (see the "No Raft Leader Elected" section above).
4. Force Rollover Test

Lower the threshold temporarily to test rollover:
# Set very low threshold
export WALRUS_MAX_SEGMENT_ENTRIES=10

# Restart the leader node
# Write 20 entries and verify rollover happens
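
A sketch of that write-and-verify step, built from the CLI commands used earlier:
# Record the active segment, write past the lowered threshold, and re-check
before=$(cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs | jq '.current_segment')
for i in $(seq 1 20); do
  cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 put logs "rollover test $i"
done
after=$(cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs | jq '.current_segment')
echo "segment before: $before, after: $after"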

Disk Space Issues

Symptoms:
  • Writes start failing
  • “No space left on device” errors
  • WAL files consuming excessive disk space
Check disk usage:
# Check overall disk usage
df -h /path/to/data

# Check per-node usage
du -sh /path/to/data/node_*

# List largest WAL files
find /path/to/data -type f -name "[0-9]*" -exec ls -lh {} \; | sort -k5 -hr | head -20
Solutions:
Walrus currently has no automatic retention. You must manually delete old segments.
# Identify old sealed segments
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs | jq '.sealed_segments'

# Manually remove old segment files (be careful!)
# Only delete files for segments you no longer need

# Find segment files by timestamp
ls -lt /path/to/data/node_1/user_data/data_plane/
Future versions will include automatic retention policies. For now, implement external cleanup scripts.
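
A minimal sketch of such a cleanup script, selecting sealed-segment files by age. The data path and retention window are assumptions; it prints candidates only (dry run) until you uncomment the delete:
#!/usr/bin/env bash
# Age-based WAL cleanup sketch -- DRY RUN by default.
# Verify candidates against the sealed_segments list before deleting.
DATA_DIR=/path/to/data/node_1/user_data/data_plane
RETENTION_DAYS=7

# List candidates older than the retention window
find "$DATA_DIR" -type f -mtime +"$RETENTION_DAYS" -print

# Once verified, uncomment to actually delete:
# find "$DATA_DIR" -type f -mtime +"$RETENTION_DAYS" -delete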

io_uring Errors

Symptoms:
  • Batch operations failing with “Unsupported” errors
  • Lower than expected throughput
  • Errors mentioning “io_uring”
Check io_uring support:
# Check kernel version (need 5.1+)
uname -r

# Check if liburing is installed
ldconfig -p | grep liburing

# Check system limits
cat /proc/sys/fs/aio-max-nr
Solutions:
Upgrade kernel to 5.6 or later:
# On Ubuntu/Debian
sudo apt-get update
sudo apt-get install linux-generic-hwe-$(lsb_release -rs)

# Reboot into new kernel
sudo reboot
io_uring is stable in Linux 5.6+. Earlier versions (5.1-5.5) have limited support.
Install liburing:
# On Ubuntu/Debian
sudo apt-get install liburing-dev

# On RHEL/CentOS
sudo yum install liburing-devel

# Rebuild Walrus
cargo build --release
If io_uring is not available, disable the FD backend:
export WALRUS_DISABLE_IO_URING=1

# Or in code
use walrus_rust::disable_fd_backend;
disable_fd_backend();
This will reduce batch operation throughput by 3-10x.
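
If you want the fallback applied automatically, a small sketch that disables the backend on kernels older than 5.6:
# Disable the io_uring backend on kernels older than 5.6
major=$(uname -r | cut -d. -f1)
minor=$(uname -r | cut -d. -f2)
if [ "$major" -lt 5 ] || { [ "$major" -eq 5 ] && [ "$minor" -lt 6 ]; }; then
  export WALRUS_DISABLE_IO_URING=1
  echo "kernel $(uname -r): io_uring backend disabled"
fi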

High CPU Usage

Symptoms:
  • Walrus processes consuming >80% CPU
  • System slow or unresponsive
Diagnostic:
# Check CPU usage (pgrep -d, joins multiple PIDs with commas)
top -H -p $(pgrep -d, walrus)

# Profile with perf
sudo perf top -p $(pgrep -d, walrus)
Common Causes:
Monitor loop too frequent:
echo $WALRUS_MONITOR_CHECK_MS
# If very low (<1000), increase it
export WALRUS_MONITOR_CHECK_MS=10000

Performance Issues

Low Write Throughput

Expected: 100k-1M writes/sec without fsync, 5k-10k writes/sec with fsync. If throughput is below expected:
1. Verify io_uring is enabled

# Should NOT be set
echo $WALRUS_DISABLE_IO_URING

# Check logs for io_uring usage
tail -f /var/log/walrus/node-1.log | grep -i "uring"
2. Use batch operations

// BAD: Single writes in a loop
for entry in entries {
    wal.append_for_topic("logs", entry)?;
}

// GOOD: Batch write
wal.batch_append_for_topic("logs", &entries)?;
3. Reduce fsync frequency

# Current configuration
cargo run -- --node-id 1 ...

# With less frequent fsync
# (configure in code via FsyncSchedule)
4. Check disk performance

# Test disk write speed
dd if=/dev/zero of=/path/to/data/test bs=1M count=1000 conv=fdatasync

# Should see >500 MB/s on SSD
If disk is slow, consider:
  • Using NVMe SSD instead of SATA SSD
  • Checking I/O scheduler configuration
  • Ensuring no competing I/O workloads

High Read Latency

Expected: <1ms per read on SSD, <10ms for batch reads. If latency is higher:
  1. Check if you are reading from the wrong node: Sealed segments are only on their original leader
  2. Use batch reads: More efficient than single reads
  3. Check disk read performance: Use iostat -x 1 to monitor disk utilization

Data Integrity Issues

Checksum Errors

Symptoms:
  • InvalidData errors when reading
  • “Checksum mismatch” in logs
Causes:
  • Disk corruption
  • Incomplete writes (power failure without fsync)
  • Bug in Walrus storage layer
Recovery:
# Identify corrupted segment
# Check logs for file path and offset

# Option 1: Delete corrupted segment (data loss)
rm /path/to/corrupted/segment/file

# Option 2: Restore from backup
cp /backup/segment/file /path/to/data/

# Option 3: Replicate from another node
# Copy segment files from a healthy node
Always use FsyncSchedule::SyncEach or frequent fsync for critical data to prevent corruption on power failure.
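
When replicating a segment from a healthy node (option 3), verify the copy byte-for-byte. A sketch using scp and sha256sum; the host name and segment file name are placeholders:
# Copy a segment file from a healthy node and verify the transfer.
# node2-addr and SEGMENT are placeholders for your deployment.
scp node2-addr:/path/to/data/node_2/user_data/data_plane/SEGMENT \
    /path/to/data/node_1/user_data/data_plane/SEGMENT

# Compare checksums on both ends
ssh node2-addr sha256sum /path/to/data/node_2/user_data/data_plane/SEGMENT
sha256sum /path/to/data/node_1/user_data/data_plane/SEGMENT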

Cluster State Issues

Split Brain

Symptoms:
  • Multiple nodes claim to be leader
  • Writes succeed on multiple nodes for same segment
  • Inconsistent data across nodes
Split brain is a critical issue that can cause data inconsistency. Immediate action required.
Diagnostic:
# Check leader on all nodes
for port in 9091 9092 9093; do
  echo "Node on port $port:"
  cargo run --bin walrus-cli -- --addr 127.0.0.1:$port metrics | jq '{id, state, current_leader}'
done
Recovery:
  1. Stop all nodes immediately
  2. Backup data directories (see the sketch below)
  3. Identify the true leader (highest term, most recent data)
  4. Restart cluster from scratch using bootstrap node
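
A sketch of the backup step (step 2 above), assuming the per-node data layout used earlier in this guide:
# Stop everything, then snapshot each node's data directory before touching it
pkill walrus
ts=$(date +%Y%m%d-%H%M%S)
for n in 1 2 3; do
  tar czf "/backup/walrus-node_${n}-${ts}.tar.gz" "/path/to/data/node_${n}"
done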

Getting Help

If you’ve exhausted these troubleshooting steps:

GitHub Issues

Report bugs and request features

Discussions

Ask questions and share solutions

Information to Include

When reporting issues, provide the following (the sketch after this list collects most of it automatically):
  1. Walrus version: cargo --version and git commit hash
  2. Cluster configuration: Number of nodes, hardware specs
  3. Environment variables: Output of env | grep WALRUS
  4. Metrics output: METRICS and STATE commands
  5. Logs: Relevant error messages with context
  6. Reproduction steps: How to reproduce the issue
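
A sketch that bundles most of this into a single report file, assuming a node reachable on 127.0.0.1:9091:
#!/usr/bin/env bash
# Collect diagnostics for a bug report. Assumes a node on 127.0.0.1:9091.
out=walrus-report-$(date +%Y%m%d-%H%M%S).txt
{
  echo "== version ==";  cargo --version; git rev-parse HEAD
  echo "== env ==";      env | grep WALRUS
  echo "== metrics ==";  cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 metrics
  echo "== state ==";    cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs
} > "$out" 2>&1
echo "wrote $out"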

Next Steps

Configuration

Review configuration options

Monitoring

Set up monitoring and alerting