When encountering issues, start with these diagnostic commands:
```bash
# Check cluster health
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 metrics

# Check topic state
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs

# Test write operation
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 put test "diagnostic write"

# Test read operation
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 get test
```
If reads return no data, first confirm the topic exists and that writes succeed:

```bash
# Check if topic was created
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs

# Verify writes were successful
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 put logs "test write"
# Should return: OK

# Try reading again
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 get logs
```
Causes & Solutions:
Cursor Position

Cause: The read cursor is at the end of the available data.

Check:

```bash
# Get topic state
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs
```

If the cursor has already consumed all entries, write new data:

```bash
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 put logs "new entry"
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 get logs
```

Wrong Node

Cause: Reading from a node that doesn't have the segment data.
Solution: Read from the segment leader node.

```bash
# Check segment leaders
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs | jq '.segment_leaders'

# Connect to the node that owns segment 1
cargo run --bin walrus-cli -- --addr 127.0.0.1:9092 get logs
```

Topic Doesn't Exist

Cause: The topic was never created.
Solution: Register the topic first.

```bash
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 register logs
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 put logs "first entry"
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 get logs
```
Symptoms:

- A segment has exceeded WALRUS_MAX_SEGMENT_ENTRIES but no rollover has occurred
- `state` shows the current segment with a very high entry count
Diagnostic:
```bash
# Check if the monitor is running
ps aux | grep walrus
# Look for "Monitor loop started" in the logs

# Check current segment size
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs | jq '.current_segment'

# Check the environment variable
echo $WALRUS_MAX_SEGMENT_ENTRIES
```
If disk usage is growing, first find out where the space is going:

```bash
# Check overall disk usage
df -h /path/to/data

# Check per-node usage
du -sh /path/to/data/node_*

# List largest WAL files
find /path/to/data -type f -name "[0-9]*" -exec ls -lh {} \; | sort -k5 -hr | head -20
```
Solutions:
Implement Retention
Walrus currently has no automatic retention. You must manually delete old segments.
```bash
# Identify old sealed segments
cargo run --bin walrus-cli -- --addr 127.0.0.1:9091 state logs | jq '.sealed_segments'

# Manually remove old segment files (be careful!)
# Only delete files for segments you no longer need

# Find segment files by timestamp
ls -lt /path/to/data/node_1/user_data/data_plane/
```
Future versions will include automatic retention policies. For now, implement external cleanup scripts.
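A minimal sketch of such a script, assuming the file layout shown above; the path, filename pattern, and retention window are placeholders to adapt, and the actual delete is left commented out:

```bash
#!/usr/bin/env bash
# Sketch only: verify the candidates against `state <topic>` output
# before enabling the delete. Path, pattern, and window are placeholders.
DATA_DIR=/path/to/data/node_1/user_data/data_plane
RETENTION_DAYS=7

# Dry run: list segment files older than the retention window
find "$DATA_DIR" -type f -name "[0-9]*" -mtime +"$RETENTION_DAYS" -print

# Once verified, uncomment to actually delete:
# find "$DATA_DIR" -type f -name "[0-9]*" -mtime +"$RETENTION_DAYS" -delete
```

Only schedule this (e.g. from cron) once the dry run lists nothing but segments you are sure are sealed and fully consumed.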
Add More Disk Space

```bash
# Resize partition
sudo resize2fs /dev/sdb

# Or mount an additional volume
sudo mkdir /data2
sudo mount /dev/sdc /data2

# Restart Walrus with the new data directory
cargo run -- --node-id 1 --data-dir /data2
```
Increase Segment Size

Fewer, larger segments can be more space-efficient:

```bash
export WALRUS_MAX_SEGMENT_ENTRIES=5000000  # 5M entries per segment
# Restart nodes with the new configuration
```
Expected: 100k-1M writes/sec without fsync, 5k-10k writes/sec with fsync.

If throughput is below expected:
1. Verify io_uring is enabled
```bash
# Should NOT be set
echo $WALRUS_DISABLE_IO_URING

# Check logs for io_uring usage
tail -f /var/log/walrus/node-1.log | grep -i "uring"
```
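io_uring also needs kernel support (it was introduced in Linux 5.1), so confirm the kernel version as well:

```bash
# io_uring requires Linux 5.1 or newer
uname -r
```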
2. Use batch operations
```rust
// BAD: Single writes in a loop
for entry in entries {
    wal.append_for_topic("logs", entry)?;
}

// GOOD: Batch write
wal.batch_append_for_topic("logs", &entries)?;
```
3. Reduce fsync frequency
```bash
# Current configuration
cargo run -- --node-id 1 ...

# With less frequent fsync
# (configure in code via FsyncSchedule)
```
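A sketch of what that in-code configuration might look like. The constructor and variant names here are assumptions for illustration, not the verified API; check the Walrus source for the real FsyncSchedule definition:

```rust
// Hypothetical sketch: `Walrus::with_fsync_schedule` and the
// `Milliseconds` variant are assumed names, not the verified API.
use walrus::{FsyncSchedule, Walrus};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Fsync on a timer instead of after every append: a small
    // durability window in exchange for much higher write throughput.
    let _wal = Walrus::with_fsync_schedule(FsyncSchedule::Milliseconds(500))?;
    Ok(())
}
```

Per the expectations above, fsync frequency is the main lever between the 5k-10k and 100k+ writes/sec ranges.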
4. Check disk performance
```bash
# Test disk write speed
dd if=/dev/zero of=/path/to/data/test bs=1M count=1000 conv=fdatasync
# Should see >500 MB/s on SSD
```
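dd gives a rough number; if fio is installed, a sync-heavy profile is closer to a WAL's write pattern (the file path and sizes here are placeholders):

```bash
# Sequential writes with an fsync after every write, similar to a
# WAL running in always-fsync mode
fio --name=walrus-test --filename=/path/to/data/fio-test \
    --rw=write --bs=1M --size=1G --fsync=1 --unlink=1
```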
Split brain is a critical issue that can cause data inconsistency. Immediate action required.
Diagnostic:
```bash
# Check leader on all nodes
for port in 9091 9092 9093; do
  echo "Node on port $port:"
  cargo run --bin walrus-cli -- --addr 127.0.0.1:$port metrics | jq '{id, state, current_leader}'
done
```
Recovery:
1. Stop all nodes immediately.
2. Back up the data directories (see the sketch after this list).
3. Identify the true leader: the node with the highest term and the most recent data.
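A minimal sketch of step 2, assuming the node data directories live under a common parent; the path is a placeholder:

```bash
# Copy each node's data directory aside before touching anything
for node in /path/to/data/node_*; do
  cp -a "$node" "${node}.bak.$(date +%Y%m%d%H%M%S)"
done
```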