Debugging Guide
This comprehensive debugging guide helps you diagnose and resolve issues with DeepTrace components. It covers systematic troubleshooting approaches, diagnostic tools, and common problem resolution strategies.
Debugging Methodology
1. Problem Identification
Start with these key questions:
- What is the expected behavior?
- What is the actual behavior?
- When did the problem start?
- What changed recently?
- Is the problem consistent or intermittent?
2. Information Gathering
Collect relevant information systematically:
# System information
uname -a
cat /etc/os-release
free -h
df -h
# DeepTrace version
deeptrace-agent --version
deeptrace-server --version
# Process status
ps aux | grep deeptrace
systemctl status deeptrace-agent
systemctl status deeptrace-server
3. Log Analysis
Enable comprehensive logging:
# agent.toml
[logging]
level = "debug"
format = "json"
output = "file"
file_path = "/var/log/deeptrace/agent-debug.log"
# server.toml
[logging]
level = "debug"
format = "json"
output = "file"
file_path = "/var/log/deeptrace/server-debug.log"
Component-Specific Debugging
Agent Debugging
eBPF Program Issues
Check eBPF Support:
# Verify kernel version
uname -r
# Check eBPF filesystem
ls -la /sys/fs/bpf/
# Verify BPF capabilities
grep CONFIG_BPF /boot/config-$(uname -r)
Debug Program Loading:
# Enable BPF runtime statistics (run time and run count shown by bpftool)
echo 1 | sudo tee /proc/sys/kernel/bpf_stats_enabled
# Check loaded programs
sudo bpftool prog list | grep deeptrace
# Monitor kernel messages
sudo dmesg -w | grep bpf
# Check program verification logs
journalctl -f | grep bpf
Common eBPF Errors:
BTF_KIND:0 Error
# Check BTF availability
ls -la /sys/kernel/btf/vmlinux
# Verify BTF format
bpftool btf dump file /sys/kernel/btf/vmlinux | head -20
# Fall back to non-CO-RE mode
export DEEPTRACE_EBPF_ENABLE_CO_RE=false
Permission Denied
# Check capabilities
getcap /usr/bin/deeptrace-agent
# Add required capabilities
sudo setcap cap_sys_admin,cap_bpf+ep /usr/bin/deeptrace-agent
# Or run with sudo (not recommended for production)
sudo deeptrace-agent --config agent.toml
Program Too Large
# Check program size limits
cat /proc/sys/kernel/bpf_jit_limit
# Increase the limit if needed
echo 1000000000 | sudo tee /proc/sys/kernel/bpf_jit_limit
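The prerequisite checks above can be automated. A minimal preflight sketch, assuming the agent binary path and CAP_BPF requirement shown in the examples above; the kernel-version threshold is approximate:

```python
#!/usr/bin/env python3
# ebpf_preflight.py - sanity-check eBPF prerequisites before starting the agent.
# Binary path, capability names, and BTF path are taken from this guide's
# examples; adjust for your installation.
import os
import platform
import subprocess

def ebpf_preflight(agent_binary="/usr/bin/deeptrace-agent"):
    """Return a mapping of prerequisite name -> pass/fail."""
    results = {}

    # Kernel version: modern eBPF features generally want 4.18+ (approximate)
    parts = (platform.release().split("-")[0].split(".") + ["0", "0"])[:2]
    try:
        major, minor = int(parts[0]), int(parts[1])
    except ValueError:
        major, minor = 0, 0
    results["kernel_4_18_plus"] = (major, minor) >= (4, 18)

    # BTF export, required for CO-RE mode
    results["btf_available"] = os.path.exists("/sys/kernel/btf/vmlinux")

    # BPF filesystem
    results["bpffs_mounted"] = os.path.isdir("/sys/fs/bpf")

    # Capabilities on the agent binary (getcap comes from libcap)
    try:
        out = subprocess.run(["getcap", agent_binary], capture_output=True,
                             text=True, check=False).stdout
        results["has_bpf_caps"] = "cap_bpf" in out or "cap_sys_admin" in out
    except FileNotFoundError:
        results["has_bpf_caps"] = False

    return results

if __name__ == "__main__":
    for check, ok in ebpf_preflight().items():
        print(f"{'OK  ' if ok else 'FAIL'} {check}")
```

Run it before starting the agent; any FAIL line points at the matching fix in the error list above.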
Span Collection Issues
Debug Span Collection:
# Enable span collection debugging
curl -X POST http://localhost:7899/config \
-H "Content-Type: application/json" \
-d '{"logging": {"level": "trace"}}'
# Monitor span collection rate
watch -n 1 'curl -s http://localhost:7899/status | jq .collection.spans_per_second'
# Check process filtering
curl http://localhost:7899/processes | jq '.processes[] | select(.status == "monitored")'
Span Collection Troubleshooting:
#!/usr/bin/env python3
# debug_span_collection.py
import requests
import time

def debug_span_collection():
    agent_url = "http://localhost:7899"

    # Get agent status
    status = requests.get(f"{agent_url}/status").json()
    print(f"Agent Status: {status['agent']['status']}")
    print(f"eBPF Programs: {status['ebpf']['programs_loaded']}")
    print(f"Spans Collected: {status['collection']['spans_collected']}")

    # Check monitored processes
    processes = requests.get(f"{agent_url}/processes").json()
    monitored = [p for p in processes['processes'] if p['status'] == 'monitored']
    print(f"Monitored Processes: {len(monitored)}")
    for proc in monitored[:5]:  # Show first 5
        print(f"  PID {proc['pid']}: {proc['name']} ({proc['spans_collected']} spans)")

    # Monitor span rate
    print("\nMonitoring span collection rate...")
    prev_count = status['collection']['spans_collected']
    time.sleep(10)
    new_status = requests.get(f"{agent_url}/status").json()
    new_count = new_status['collection']['spans_collected']
    rate = (new_count - prev_count) / 10
    print(f"Span Rate: {rate:.2f} spans/second")

    if rate == 0:
        print("WARNING: No spans being collected!")
        print("Check:")
        print("- eBPF programs are loaded")
        print("- Processes are being monitored")
        print("- Network activity is occurring")

if __name__ == "__main__":
    debug_span_collection()
Network Communication Issues
Debug Server Communication:
# Test server connectivity
curl -v http://localhost:7901/health
# Check agent-server communication
sudo tcpdump -i any -n port 7901
# Monitor failed requests
curl http://localhost:7899/metrics | grep failed_requests
# Check retry queue
curl http://localhost:7899/status | jq .sender.retry_queue_size
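On minimal hosts without curl or tcpdump, the same reachability checks can be sketched in Python. Ports 7899 (agent API) and 7901 (server) are the defaults used throughout this guide:

```python
#!/usr/bin/env python3
# check_connectivity.py - verify agent and server ports are reachable.
# Ports follow this guide's examples; adjust for your deployment.
import socket

def check_port(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for name, port in [("agent API", 7899), ("server ingest", 7901)]:
        ok = check_port("localhost", port)
        print(f"{name} (port {port}): {'reachable' if ok else 'UNREACHABLE'}")
```

An UNREACHABLE result narrows the problem to the network path or a stopped service, before any application-level debugging.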
Server Debugging
Elasticsearch Issues
Debug Elasticsearch Connection:
# Check Elasticsearch health
curl "http://localhost:9200/_cluster/health?pretty"
# Verify indices
curl "http://localhost:9200/_cat/indices/deeptrace*"
# Check index mappings
curl "http://localhost:9200/deeptrace-spans/_mapping?pretty"
# Monitor indexing performance
curl "http://localhost:9200/_cat/thread_pool/write?v"
Elasticsearch Troubleshooting Script:
#!/usr/bin/env python3
# debug_elasticsearch.py
import requests

def debug_elasticsearch():
    es_url = "http://localhost:9200"

    try:
        # Check cluster health
        health = requests.get(f"{es_url}/_cluster/health").json()
        print(f"Cluster Status: {health['status']}")
        print(f"Active Shards: {health['active_shards']}")
        print(f"Unassigned Shards: {health['unassigned_shards']}")

        # Check indices
        indices = requests.get(f"{es_url}/_cat/indices/deeptrace*?format=json").json()
        print(f"\nDeepTrace Indices: {len(indices)}")
        for idx in indices:
            print(f"  {idx['index']}: {idx['docs.count']} docs, {idx['store.size']}")

        # Check recent documents
        query = {
            "query": {
                "range": {
                    "timestamp": {
                        "gte": "now-1h"
                    }
                }
            },
            "size": 0
        }
        result = requests.get(
            f"{es_url}/deeptrace-spans/_search",
            json=query
        ).json()
        recent_docs = result['hits']['total']['value']
        print(f"\nRecent Documents (1h): {recent_docs}")

        if recent_docs == 0:
            print("WARNING: No recent documents found!")
            print("Check:")
            print("- Agent is sending data")
            print("- Index template is correct")
            print("- No indexing errors")
    except Exception as e:
        print(f"ERROR: Cannot connect to Elasticsearch: {e}")
        print("Check:")
        print("- Elasticsearch is running")
        print("- Network connectivity")
        print("- Authentication credentials")

if __name__ == "__main__":
    debug_elasticsearch()
Correlation Engine Issues
Debug Correlation Process:
# Check correlation status
curl http://localhost:7901/status | jq .correlation
# Monitor correlation jobs
curl http://localhost:7901/correlation/jobs
# Check algorithm performance
curl http://localhost:7901/analytics/services | jq '.services[] | {name, request_count, error_rate}'
Correlation Debugging:
#!/usr/bin/env python3
# debug_correlation.py
import requests

def debug_correlation():
    server_url = "http://localhost:7901"

    # Get server status
    status = requests.get(f"{server_url}/status").json()
    correlation = status['correlation']
    print(f"Correlation Algorithm: {correlation['algorithm']}")
    print(f"Spans Processed: {correlation['spans_processed']}")
    print(f"Traces Generated: {correlation['traces_generated']}")
    print(f"Correlation Rate: {correlation['correlation_rate']:.2f}%")

    # Check for recent traces
    traces = requests.get(f"{server_url}/traces?limit=10").json()
    print(f"\nRecent Traces: {len(traces['traces'])}")

    if len(traces['traces']) == 0:
        print("WARNING: No traces found!")
        print("Check:")
        print("- Spans are being received")
        print("- Correlation is running")
        print("- Algorithm parameters")

    # Analyze trace quality
    for trace in traces['traces'][:3]:
        print(f"\nTrace {trace['trace_id']}:")
        print(f"  Spans: {trace['span_count']}")
        print(f"  Services: {trace['service_count']}")
        print(f"  Duration: {trace['duration']}ms")
        print(f"  Has Errors: {trace['has_errors']}")

if __name__ == "__main__":
    debug_correlation()
Advanced Debugging Techniques
Performance Profiling
CPU Profiling
# Profile agent CPU usage
perf record -g -p $(pgrep deeptrace-agent) -- sleep 30
perf report
# Profile server CPU usage
perf record -g -p $(pgrep deeptrace-server) -- sleep 30
perf report
# Generate flame graphs
git clone https://github.com/brendangregg/FlameGraph
perf record -g -p $(pgrep deeptrace-agent) -- sleep 30
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > agent-flamegraph.svg
Memory Profiling
# Profile memory usage with Valgrind
valgrind --tool=massif --massif-out-file=agent.massif ./deeptrace-agent --config agent.toml
ms_print agent.massif > agent-memory-profile.txt
# Monitor memory usage over time
while true; do
    echo "$(date): $(ps -o rss= -p "$(pgrep -o deeptrace-agent)" | awk '{print $1/1024 " MB"}')"
    sleep 60
done
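The same memory trend can be captured from /proc without a shell loop. A Linux-only sketch; the process name matches this guide's examples:

```python
#!/usr/bin/env python3
# track_rss.py - sample a process's resident set size from /proc (Linux only)
import re
import time

def parse_vmrss(status_text):
    """Extract VmRSS (in KiB) from the contents of /proc/<pid>/status."""
    m = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else 0

def rss_kib(pid):
    """Read the current resident set size of pid in KiB."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vmrss(f.read())

def sample(pid, interval=60, count=5):
    """Print `count` RSS samples, `interval` seconds apart."""
    for _ in range(count):
        print(f"{time.strftime('%H:%M:%S')}: {rss_kib(pid) / 1024:.1f} MB")
        time.sleep(interval)
```

A steadily climbing RSS across samples, with no corresponding load increase, is the signal to move on to the Valgrind massif profile above.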
Network Profiling
# Monitor network traffic
sudo tcpdump -i any -w deeptrace-traffic.pcap host localhost and port 7901
# Analyze with Wireshark
wireshark deeptrace-traffic.pcap
# Monitor bandwidth usage
iftop -i any -f "port 7901"
eBPF Debugging
BPF Program Analysis
# Dump loaded programs
sudo bpftool prog dump xlated id $(sudo bpftool prog list | grep deeptrace | awk '{print $1}' | head -1)
# Show program statistics
sudo bpftool prog show --json | jq '.[] | select(.name | contains("deeptrace"))'
# Monitor map usage
sudo bpftool map show --json | jq '.[] | select(.name | contains("deeptrace"))'
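bpftool's `--json` output is also convenient to post-process in Python. A sketch; the `deeptrace` name filter mirrors the grep patterns above, and running bpftool requires root:

```python
#!/usr/bin/env python3
# bpf_inventory.py - summarize DeepTrace BPF programs and maps via bpftool JSON
import json
import subprocess

def filter_by_name(objects, name_filter="deeptrace"):
    """Keep only BPF objects whose name contains name_filter."""
    return [o for o in objects if name_filter in o.get("name", "")]

def bpftool_show(kind):
    """Parse `bpftool <kind> show --json` (kind is "prog" or "map"; needs root)."""
    out = subprocess.run(["bpftool", kind, "show", "--json"],
                         capture_output=True, text=True, check=True).stdout
    return json.loads(out)

if __name__ == "__main__":
    for prog in filter_by_name(bpftool_show("prog")):
        print(f"prog id={prog['id']} name={prog['name']} type={prog.get('type')}")
    for m in filter_by_name(bpftool_show("map")):
        print(f"map  id={m['id']} name={m['name']} type={m.get('type')}")
```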
Custom eBPF Debugging
// debug_ebpf.c - Custom debugging program
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);
} debug_events SEC(".maps");

struct debug_event {
    __u64 timestamp;
    __u32 pid;
    __u32 event_type;
    char comm[16];
};

SEC("kprobe/tcp_sendmsg")
int debug_tcp_sendmsg(struct pt_regs *ctx) {
    struct debug_event *event;

    event = bpf_ringbuf_reserve(&debug_events, sizeof(*event), 0);
    if (!event)
        return 0;

    event->timestamp = bpf_ktime_get_ns();
    event->pid = bpf_get_current_pid_tgid() >> 32;
    event->event_type = 1; // TCP_SENDMSG
    bpf_get_current_comm(&event->comm, sizeof(event->comm));

    bpf_ringbuf_submit(event, 0);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
Log Analysis Tools
Structured Log Analysis
#!/usr/bin/env python3
# analyze_logs.py
import json
import sys
from collections import defaultdict

def analyze_logs(log_file):
    errors = []
    warnings = []
    events = defaultdict(int)

    with open(log_file, 'r') as f:
        for line in f:
            try:
                log_entry = json.loads(line.strip())
                level = log_entry.get('level', '').upper()
                message = log_entry.get('message', '')
                timestamp = log_entry.get('timestamp', '')

                if level == 'ERROR':
                    errors.append((timestamp, message))
                elif level == 'WARN':
                    warnings.append((timestamp, message))

                # Count events by type
                if 'event_type' in log_entry:
                    events[log_entry['event_type']] += 1
            except json.JSONDecodeError:
                continue

    print("Log Analysis Results:")
    print(f"Errors: {len(errors)}")
    print(f"Warnings: {len(warnings)}")
    print(f"Event Types: {dict(events)}")

    if errors:
        print("\nRecent Errors:")
        for timestamp, message in errors[-5:]:
            print(f"  {timestamp}: {message}")

    if warnings:
        print("\nRecent Warnings:")
        for timestamp, message in warnings[-5:]:
            print(f"  {timestamp}: {message}")

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python3 analyze_logs.py <log_file>")
        sys.exit(1)
    analyze_logs(sys.argv[1])
Real-time Log Monitoring
#!/bin/bash
# monitor_logs.sh

LOG_FILE="/var/log/deeptrace/agent.log"
SLACK_WEBHOOK="${SLACK_WEBHOOK:-}"   # set to enable Slack alerts
ERROR_COUNT=0
WARNING_COUNT=0

tail -f "$LOG_FILE" | while read -r line; do
    if echo "$line" | grep -q '"level":"ERROR"'; then
        ERROR_COUNT=$((ERROR_COUNT + 1))
        echo -n "ERROR [$ERROR_COUNT]: "
        echo "$line" | jq -r '.message'

        # Alert on high error rate
        if [ "$ERROR_COUNT" -gt 10 ] && [ -n "$SLACK_WEBHOOK" ]; then
            echo "ALERT: High error rate detected!"
            # Send notification
            curl -X POST "$SLACK_WEBHOOK" -d "{\"text\":\"DeepTrace high error rate: $ERROR_COUNT errors\"}"
        fi
    elif echo "$line" | grep -q '"level":"WARN"'; then
        WARNING_COUNT=$((WARNING_COUNT + 1))
        echo -n "WARNING [$WARNING_COUNT]: "
        echo "$line" | jq -r '.message'
    fi
done
Debugging Checklists
Agent Not Collecting Spans
- eBPF programs loaded successfully
- Processes are being monitored
- Network activity is occurring
- Ring buffer is not full
- Process filters are correct
- Protocol filters are appropriate
- Sufficient privileges (CAP_BPF or root)
- Kernel version compatibility
Server Not Receiving Spans
- Agent can connect to server
- Server is listening on correct port
- Network connectivity between agent and server
- Authentication is configured correctly
- Server has sufficient resources
- Elasticsearch is accessible
- No firewall blocking traffic
Correlation Not Working
- Spans are being received by server
- Correlation algorithm is running
- Algorithm parameters are appropriate
- Sufficient spans for correlation
- Time synchronization between hosts
- Elasticsearch indices are healthy
- No correlation engine errors
Poor Performance
- Resource usage within limits
- eBPF programs are optimized
- Batch sizes are appropriate
- Network latency is acceptable
- Elasticsearch is tuned properly
- Sampling is configured if needed
- No memory leaks detected
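Parts of these checklists can be evaluated mechanically against the `/status` payload shown earlier. A sketch; the field names (and the assumed `"running"` status value) follow this guide's example responses and may differ in your version:

```python
#!/usr/bin/env python3
# run_checklist.py - evaluate the "agent not collecting spans" checklist
# against an agent /status payload. Field names and the "running" status
# value are assumptions based on this guide's examples.
import json
import urllib.request

def evaluate_agent_status(status):
    """Return checklist item -> pass/fail for checks derivable from /status."""
    return {
        "agent_running": status.get("agent", {}).get("status") == "running",
        "ebpf_programs_loaded": status.get("ebpf", {}).get("programs_loaded", 0) > 0,
        "spans_collected": status.get("collection", {}).get("spans_collected", 0) > 0,
    }

def fetch_status(url="http://localhost:7899/status"):
    """Fetch and parse the agent status endpoint."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)

if __name__ == "__main__":
    for item, ok in evaluate_agent_status(fetch_status()).items():
        print(f"{'PASS' if ok else 'FAIL'} {item}")
```

The remaining checklist items (privileges, kernel compatibility, filters) still need the manual commands from the sections above.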
Emergency Procedures
Service Recovery
#!/bin/bash
# emergency_recovery.sh
echo "DeepTrace Emergency Recovery"
echo "=========================="
# Stop services
echo "Stopping services..."
systemctl stop deeptrace-agent
systemctl stop deeptrace-server
# Clear problematic state
echo "Clearing state..."
rm -f /tmp/deeptrace-agent.pid
rm -f /tmp/deeptrace-server.pid
rm -rf /tmp/deeptrace-buffers/*
# Reset eBPF state: the agent's kprobe programs are unloaded automatically
# once the agent process (their last reference) exits; confirm nothing remains
echo "Resetting eBPF state..."
bpftool prog list | grep deeptrace && echo "WARNING: deeptrace BPF programs still loaded"
# Restart with minimal configuration
echo "Starting with minimal config..."
cp /etc/deeptrace/agent.toml /etc/deeptrace/agent.toml.backup
cp /etc/deeptrace/minimal-agent.toml /etc/deeptrace/agent.toml
systemctl start deeptrace-server
sleep 5
systemctl start deeptrace-agent
echo "Recovery complete. Check status with:"
echo " systemctl status deeptrace-agent"
echo " systemctl status deeptrace-server"
Data Recovery
#!/bin/bash
# recover_data.sh
BACKUP_DIR="/backup/deeptrace"
ES_URL="http://localhost:9200"
echo "DeepTrace Data Recovery"
echo "====================="
# Check Elasticsearch status
if ! curl -s "$ES_URL/_cluster/health" > /dev/null; then
echo "ERROR: Elasticsearch not accessible"
exit 1
fi
# List available backups
echo "Available backups:"
ls -la "$BACKUP_DIR"
read -p "Enter backup date (YYYY-MM-DD): " BACKUP_DATE
if [ -f "$BACKUP_DIR/deeptrace-$BACKUP_DATE.json" ]; then
echo "Restoring data from $BACKUP_DATE..."
# Restore indices
curl -X POST "$ES_URL/_bulk" \
-H "Content-Type: application/json" \
--data-binary "@$BACKUP_DIR/deeptrace-$BACKUP_DATE.json"
echo "Data recovery complete"
else
echo "ERROR: Backup file not found"
exit 1
fi
This debugging guide provides comprehensive tools and procedures for diagnosing and resolving DeepTrace issues. Use it systematically to identify root causes and implement effective solutions.