Common Issues
This guide covers the most frequently encountered issues when deploying and operating DeepTrace, along with step-by-step solutions and preventive measures.
Quick Diagnosis Checklist
Before diving into specific issues, run this quick diagnostic checklist:
# 1. Check all containers are running
sudo docker ps | grep -E "(deeptrace|elasticsearch)"
# 2. Verify network connectivity
curl -f http://localhost:7901/health
curl -f http://localhost:9200/_cluster/health
curl -f http://localhost:7899/status
# 3. Check logs for errors
sudo docker logs deeptrace_server --tail 50
sudo docker logs elasticsearch --tail 50
# 4. Verify eBPF programs are loaded
sudo bpftool prog list | grep deeptrace
# 5. Check system resources
free -h
df -h
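The endpoint checks above can be wrapped in one small script that prints a status line per service. The ports and paths come from the commands above; the script name and helper function are illustrative, not part of DeepTrace:

```shell
#!/bin/bash
# quick-diag.sh - hypothetical wrapper around the endpoint checks above.
# check_endpoint runs any probe command and reports OK/FAIL by exit status.
check_endpoint() {
  local name="$1"; shift
  if "$@" > /dev/null 2>&1; then
    echo "OK   $name"
  else
    echo "FAIL $name"
  fi
}

check_endpoint "server"        curl -fs --max-time 5 http://localhost:7901/health
check_endpoint "elasticsearch" curl -fs --max-time 5 http://localhost:9200/_cluster/health
check_endpoint "agent"         curl -fs --max-time 5 http://localhost:7899/status
```

Taking the probe as arguments (rather than hard-coding curl) keeps the helper reusable for container or bpftool checks as well.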
Installation Issues
1. Docker Installation Failures
Problem: Docker daemon not running
Symptoms:
Cannot connect to the Docker daemon at unix:///var/run/docker.sock
Solution:
# Start Docker service
sudo systemctl start docker
sudo systemctl enable docker
# Verify Docker is running
sudo systemctl status docker
# Test Docker functionality
sudo docker run hello-world
Problem: Permission denied accessing Docker
Symptoms:
permission denied while trying to connect to the Docker daemon socket
Solution:
# Add user to docker group
sudo usermod -aG docker $USER
# Apply group changes in the current shell (or log out and back in)
newgrp docker
# Verify access
docker ps
Problem: Docker registry connection issues
Symptoms:
Error response from daemon: Get https://47.97.67.233:5000/v2/: http: server gave HTTP response to HTTPS client
Solution:
# Configure insecure registry
sudo nano /etc/docker/daemon.json
# Add configuration:
{
  "insecure-registries": ["47.97.67.233:5000"]
}
# Restart Docker
sudo systemctl restart docker
2. Compilation Errors
Problem: Missing dependencies
Symptoms:
error: failed to run custom build command for `ebpf-common`
clang: error: no such file or directory: '/usr/include/linux/bpf.h'
Solution:
# Install required packages
sudo apt-get update
sudo apt-get install -y \
  build-essential \
  clang \
  llvm-18 \
  llvm-18-dev \
  libelf-dev \
  libclang-18-dev \
  linux-headers-$(uname -r)
# Verify installation
clang-18 --version
ls /usr/include/linux/bpf.h
Problem: BTF (BPF Type Format) issues
Symptoms:
libbpf: failed to find valid kernel BTF
libbpf: Error loading vmlinux BTF: -2
Solution:
# Check BTF availability
ls -la /sys/kernel/btf/vmlinux
# If missing, check kernel config
zgrep CONFIG_DEBUG_INFO_BTF /proc/config.gz
# For Ubuntu, install BTF-enabled kernel
sudo apt-get install linux-image-generic-hwe-22.04
# Reboot if kernel was updated
sudo reboot
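Because a missing BTF blob only surfaces once libbpf tries to load programs, it can help to fail fast at startup. A minimal guard, checking the same path as above (the script and function names are made up for illustration):

```shell
#!/bin/bash
# btf-check.sh - warn early if the kernel lacks BTF (needed by libbpf CO-RE).
btf_available() {
  # Takes the path as an argument so it can be tested; defaults to the
  # standard location checked above.
  local path="${1:-/sys/kernel/btf/vmlinux}"
  [ -r "$path" ]
}

if btf_available; then
  echo "kernel BTF found"
else
  echo "kernel BTF missing - install a CONFIG_DEBUG_INFO_BTF=y kernel" >&2
fi
```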
Problem: Rust compilation errors
Symptoms:
error: linking with `cc` failed: exit status: 1
/usr/bin/ld: cannot find -lbpf: No such file or directory
Solution:
# Install libbpf development libraries
sudo apt-get install libbpf-dev
# Or compile libbpf from source
git clone https://github.com/libbpf/libbpf.git
cd libbpf/src
make
sudo make install
sudo ldconfig
Runtime Issues
3. Agent Connection Problems
Problem: Agent fails to start
Symptoms:
curl http://localhost:7899/status
# curl: (7) Failed to connect to localhost port 7899: Connection refused
Diagnosis:
# Check if agent process is running
ps aux | grep deeptrace
# Check agent logs
sudo docker exec -it deeptrace_server cat /var/log/deeptrace/agent.log
# Verify eBPF programs
sudo bpftool prog list | grep deeptrace
Solution:
# Restart agent
sudo docker exec -it deeptrace_server python -m cli.src.cmd agent stop
sudo docker exec -it deeptrace_server python -m cli.src.cmd agent run
# Check for permission issues
sudo dmesg | grep -i bpf
# Verify kernel version compatibility
uname -r
# Should be 6.8.0 or later
Problem: Agent loses connection to server
Symptoms:
- Agent status shows "disconnected"
- No new spans appearing in Elasticsearch
- Network timeouts in logs
Diagnosis:
# Test network connectivity
telnet localhost 7901
# Check server status
curl http://localhost:7901/health
# Monitor network traffic
sudo netstat -tuln | grep 7901
Solution:
# Check firewall settings
sudo ufw status
sudo iptables -L
# Verify server configuration
sudo docker exec -it deeptrace_server cat /app/config/config.toml
# Restart networking components
sudo docker restart deeptrace_server
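After a restart the server takes a moment to come back, so a bounded retry loop beats a single curl. A sketch, with the probe command passed in as arguments (the helper name is illustrative):

```shell
#!/bin/bash
# wait-for.sh - retry a probe command until it succeeds or attempts run out.
wait_for() {
  local attempts="$1"; shift
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@" > /dev/null 2>&1; then
      return 0              # probe succeeded
    fi
    sleep 1                 # back off before the next attempt
  done
  return 1                  # gave up
}

# e.g. give the server a few seconds to come back after the restart:
wait_for 3 curl -fs http://localhost:7901/health || echo "server still down"
```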
4. Data Collection Issues
Problem: No spans being collected
Symptoms:
- Empty Elasticsearch indices
- Zero span count in dashboard
- No eBPF events in logs
Diagnosis:
# Check monitored processes
sudo docker exec -it deeptrace_server python -m cli.src.cmd agent list-processes
# Verify eBPF program attachment
sudo bpftool prog show | grep deeptrace
# Check system call activity
sudo strace -e trace=network -p $(pgrep your-app) -c
Solution:
# Add processes to monitoring
sudo docker exec -it deeptrace_server python -m cli.src.cmd agent add-process --name nginx
# Verify process filtering configuration
sudo docker exec -it deeptrace_server python -m cli.src.cmd config show agents.trace.pids
# Restart with debug logging (pass the variable into the container;
# setting it on the host side of `docker exec` has no effect)
sudo docker exec -it -e RUST_LOG=debug deeptrace_server python -m cli.src.cmd agent run
Problem: Incomplete span data
Symptoms:
- Spans missing payload data
- Incomplete network information
- Missing timing information
Diagnosis:
# Check payload capture settings
curl http://localhost:7899/config | jq '.capture'
# Monitor eBPF map usage
sudo bpftool map show | grep deeptrace
# Check for buffer overflows
dmesg | grep -i "ring buffer"
Solution:
# Increase buffer sizes
sudo docker exec -it deeptrace_server python -m cli.src.cmd config update \
--key "agents.sender.mem_buffer_size" --value 64
# Enable payload compression
sudo docker exec -it deeptrace_server python -m cli.src.cmd config update \
--key "agents.capture.enable_compression" --value true
# Adjust payload limits
sudo docker exec -it deeptrace_server python -m cli.src.cmd config update \
--key "agents.capture.max_payload_size" --value 2048
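The payload cap amounts to a byte truncation before the span is shipped. This standalone sketch mimics the effect with `head -c`; the function is illustrative, not DeepTrace code:

```shell
#!/bin/bash
# truncate_payload caps stdin at max_payload_size bytes, mirroring what a
# setting like agents.capture.max_payload_size does before spans are shipped.
truncate_payload() {
  local max_bytes="$1"
  head -c "$max_bytes"
}

# A 5000-byte payload capped at 2048 bytes:
printf 'x%.0s' {1..5000} | truncate_payload 2048 | wc -c   # → 2048
```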
5. Performance Issues
Problem: High CPU usage
Symptoms:
- System CPU usage > 80%
- Application performance degradation
- High eBPF program execution time
Diagnosis:
# Monitor CPU usage by process
htop
top -p "$(pgrep -d, deeptrace)"
# Check eBPF program performance
sudo bpftool prog show | grep run_time_ns
# Profile application performance
perf top -p $(pgrep your-app)
Solution:
# Implement sampling
sudo docker exec -it deeptrace_server python -m cli.src.cmd config update \
--key "agents.trace.sampling_rate" --value 0.1
# Reduce payload capture
sudo docker exec -it deeptrace_server python -m cli.src.cmd config update \
--key "agents.capture.max_payload_size" --value 512
# Optimize process filtering
sudo docker exec -it deeptrace_server python -m cli.src.cmd agent remove-process --name unnecessary-process
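A sampling rate of 0.1 means keeping roughly one trace in ten. A common way to make that decision deterministic per trace is to hash the trace ID, so every span of a trace gets the same verdict; this sketch illustrates the idea and is not the agent's actual logic:

```shell
#!/bin/bash
# sample_trace decides whether to keep a trace, given its ID and a sampling
# rate in percent. Hashing the ID (rather than rolling a die per span) keeps
# whole traces together: the same ID always yields the same decision.
sample_trace() {
  local trace_id="$1" rate_pct="$2"
  local hash
  hash=$(printf '%s' "$trace_id" | cksum | cut -d' ' -f1)
  [ $((hash % 100)) -lt "$rate_pct" ]
}

if sample_trace "trace-42" 10; then
  echo "keep trace-42"
else
  echo "drop trace-42"
fi
```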
Problem: High memory usage
Symptoms:
- System memory usage > 90%
- OOM (Out of Memory) errors
- Swap usage increasing
Diagnosis:
# Check memory usage by component
free -h
sudo docker stats
# Monitor Elasticsearch memory
curl http://localhost:9200/_nodes/stats/jvm
# Check for leaks by watching resident memory over time (valgrind cannot
# attach to an already running process, so sample RSS instead)
watch -n 5 'ps -o rss= -p "$(pgrep -d, deeptrace)"'
Solution:
# Reduce Elasticsearch heap size. ES_JAVA_OPTS must be set when the
# container is created (e.g. `docker run -e ES_JAVA_OPTS="-Xms1g -Xmx2g" ...`);
# exporting it via `docker exec` has no effect on the already-running JVM.
# Implement data retention
curl -X PUT "localhost:9200/_ilm/policy/deeptrace-policy" -H 'Content-Type: application/json' -d'
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "7d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}'
# Clean old indices
curl -X DELETE "localhost:9200/traces-$(date -d '7 days ago' +%Y.%m.%d)"
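The cleanup command relies on daily `traces-YYYY.MM.DD` indices, so the index to drop is pure date arithmetic. Factoring that out makes retention scripts easy to check; the helper name is illustrative (GNU date assumed):

```shell
#!/bin/bash
# old_index_name prints the daily index that is AGE_DAYS old, relative to an
# optional reference date (defaults to today).
old_index_name() {
  local age_days="$1" ref="${2:-today}"
  echo "traces-$(date -d "$ref $age_days days ago" +%Y.%m.%d)"
}

# Which index would a 7-day retention delete on 2024-01-10?
old_index_name 7 2024-01-10   # → traces-2024.01.03
```

A cron job can then do `curl -X DELETE "localhost:9200/$(old_index_name 7)"` without repeating the date math inline.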
6. Elasticsearch Issues
Problem: Elasticsearch cluster health is red
Symptoms:
curl http://localhost:9200/_cluster/health
# {"status":"red","timed_out":false}
Diagnosis:
# Check cluster status details
curl http://localhost:9200/_cluster/health?pretty
# Check node status
curl http://localhost:9200/_cat/nodes?v
# Check shard allocation
curl http://localhost:9200/_cat/shards?v
Solution:
# Restart Elasticsearch
sudo docker restart elasticsearch
# Check disk space
df -h
# Reallocate unassigned shards
curl -X POST "localhost:9200/_cluster/reroute?retry_failed=true"
# If disk space is low, delete old documents (a DELETE on the index URL
# ignores the request body; use the _delete_by_query API instead)
curl -X POST "localhost:9200/traces-*/_delete_by_query" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": {
      "@timestamp": {
        "lt": "now-7d"
      }
    }
  }
}'
Problem: Slow query performance
Symptoms:
- Dashboard loading slowly
- Query timeouts
- High Elasticsearch CPU usage
Diagnosis:
# Check slow queries
curl http://localhost:9200/_nodes/stats/indices/search
# Monitor query performance
curl http://localhost:9200/_cat/thread_pool/search?v
# Check index statistics (quote the URL so the shell does not treat & as a
# background operator)
curl "http://localhost:9200/_cat/indices?v&s=store.size:desc"
Solution:
# Optimize indices
curl -X POST "localhost:9200/traces-*/_forcemerge?max_num_segments=1"
# Add more replicas for read performance
curl -X PUT "localhost:9200/traces-*/_settings" -H 'Content-Type: application/json' -d'
{
  "index": {
    "number_of_replicas": 1
  }
}'
# Create index templates with optimized mappings
curl -X PUT "localhost:9200/_index_template/traces" -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["traces-*"],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "refresh_interval": "30s"
    }
  }
}'
Correlation and Assembly Issues
7. Poor Correlation Results
Problem: Low correlation accuracy
Symptoms:
- Traces with missing spans
- Incorrect parent-child relationships
- Fragmented traces
Diagnosis:
# Check correlation statistics
sudo docker exec -it deeptrace_server python -m cli.src.cmd asso stats
# Analyze correlation parameters
sudo docker exec -it deeptrace_server python -m cli.src.cmd asso config show
# Review sample traces
curl "http://localhost:9200/traces/_search?size=10&pretty"
Solution:
# Adjust correlation window
sudo docker exec -it deeptrace_server python -m cli.src.cmd asso config --window 2000
# Lower similarity threshold
sudo docker exec -it deeptrace_server python -m cli.src.cmd asso config --threshold 0.6
# Try different algorithm
sudo docker exec -it deeptrace_server python -m cli.src.cmd asso algo fifo
# Enable debug mode
sudo docker exec -it deeptrace_server python -m cli.src.cmd asso config --debug
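The window parameter bounds how far apart (in ms) two spans may be and still join the same trace. This toy awk sketch groups a sorted list of span timestamps the way a purely time-window-based correlator would; it illustrates the parameter's effect, not DeepTrace's actual algorithm:

```shell
#!/bin/bash
# group_spans reads one span timestamp (ms) per line, sorted ascending, and
# prints "group timestamp": a gap larger than the window starts a new group.
group_spans() {
  local window_ms="$1"
  awk -v w="$window_ms" '
    NR == 1                  { g = 1 }   # first span opens group 1
    NR > 1 && $1 - prev > w  { g++ }     # gap beyond window -> new group
    { print g, $1; prev = $1 }
  '
}

# With a 2000 ms window, 0/500 form one trace and 3000/3100 another:
printf '%s\n' 0 500 3000 3100 | group_spans 2000
```

Widening the window (e.g. from 1000 to 2000) merges fragments at the risk of gluing unrelated spans together, which is why tuning it alongside the similarity threshold matters.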
Problem: Correlation timeouts
Symptoms:
- Correlation process hangs
- High CPU usage during correlation
- Memory exhaustion
Diagnosis:
# Monitor correlation process
ps aux | grep correlation
htop -p $(pgrep correlation)
# Check memory usage
free -h
sudo docker stats deeptrace_server
Solution:
# Increase timeout values
sudo docker exec -it deeptrace_server python -m cli.src.cmd asso config --timeout 300
# Process in smaller batches
sudo docker exec -it deeptrace_server python -m cli.src.cmd asso config --batch-size 1000
# Add more memory to container
sudo docker update --memory 4g deeptrace_server
Network and Connectivity Issues
8. Port Conflicts
Problem: Port already in use
Symptoms:
Error starting userland proxy: listen tcp 0.0.0.0:7901: bind: address already in use
Diagnosis:
# Check what's using the port
sudo netstat -tuln | grep 7901
sudo lsof -i :7901
# Find the process
sudo fuser 7901/tcp
Solution:
# Kill conflicting process
sudo fuser -k 7901/tcp
# Or change DeepTrace port
sudo docker exec -it deeptrace_server python -m cli.src.cmd config update \
--key "server.port" --value 7902
# Restart with new configuration
sudo docker restart deeptrace_server
9. SSL/TLS Issues
Problem: Certificate validation errors
Symptoms:
SSL certificate problem: self signed certificate
Solution:
# For development, disable SSL verification
curl -k https://localhost:7901/health
# For production, install proper certificates
sudo docker exec -it deeptrace_server python -m cli.src.cmd cert install \
--cert /path/to/cert.pem \
--key /path/to/key.pem
Monitoring and Alerting
10. Setting Up Health Checks
Create monitoring scripts to detect issues early:
#!/bin/bash
# health-check.sh
# Check all services
# Check all services (each exposes its own health path)
services=("deeptrace_server:7901:/health" "elasticsearch:9200:/_cluster/health" "agent:7899:/status")
for service in "${services[@]}"; do
  name=$(echo "$service" | cut -d: -f1)
  port=$(echo "$service" | cut -d: -f2)
  path=$(echo "$service" | cut -d: -f3)
  if ! curl -f -s "http://localhost:${port}${path}" > /dev/null; then
    echo "ALERT: $name is not responding on port $port"
    # Send alert (email, Slack, etc.)
  fi
done
# Check disk space
disk_usage=$(df / | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$disk_usage" -gt 80 ]; then
  echo "ALERT: Disk usage is ${disk_usage}%"
fi
# Check memory usage
mem_usage=$(free | grep Mem | awk '{printf "%.0f", $3/$2 * 100.0}')
if [ "$mem_usage" -gt 80 ]; then
  echo "ALERT: Memory usage is ${mem_usage}%"
fi
11. Log Analysis
Set up centralized logging for better troubleshooting:
# Collect all logs
sudo docker logs deeptrace_server > deeptrace-server.log 2>&1
sudo docker logs elasticsearch > elasticsearch.log 2>&1
dmesg | grep -i bpf > kernel-bpf.log
# Analyze error patterns
grep -i error *.log
grep -i "failed\|timeout\|exception" *.log
# Monitor real-time logs
sudo docker logs -f deeptrace_server | grep -E "(ERROR|WARN|FATAL)"
Prevention Strategies
1. Regular Maintenance
#!/bin/bash
# maintenance.sh - Run weekly
# Clean old data
curl -X DELETE "localhost:9200/traces-$(date -d '30 days ago' +%Y.%m.%d)"
# Optimize indices
curl -X POST "localhost:9200/traces-*/_forcemerge?max_num_segments=1"
# Update system packages
sudo apt-get update && sudo apt-get upgrade -y
# Restart services
sudo docker restart deeptrace_server elasticsearch
2. Capacity Planning
Monitor these metrics regularly:
- CPU Usage: Keep below 70% average
- Memory Usage: Keep below 80% average
- Disk Usage: Keep below 75% average
- Network Bandwidth: Monitor for saturation
- Elasticsearch Heap: Keep below 75% of allocated memory
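Each of these thresholds reduces to one integer comparison; factoring it out keeps the alert logic in capacity scripts testable. The function name is illustrative:

```shell
#!/bin/bash
# over_threshold succeeds when used/total exceeds limit_pct percent.
# Integer math suffices here: capacity thresholds are coarse by design.
over_threshold() {
  local used="$1" total="$2" limit_pct="$3"
  [ $((used * 100 / total)) -gt "$limit_pct" ]
}

# e.g. 3600 MiB used of 4096 MiB against the 80% memory threshold:
if over_threshold 3600 4096 80; then
  echo "ALERT: memory above threshold"
fi
```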
3. Backup Strategy
#!/bin/bash
# backup.sh - Run daily
# Backup Elasticsearch data (assumes a snapshot repository named "backup"
# has already been registered via the _snapshot API)
curl -X PUT "localhost:9200/_snapshot/backup/snapshot_$(date +%Y%m%d)" -H 'Content-Type: application/json' -d'
{
  "indices": "traces-*",
  "ignore_unavailable": true,
  "include_global_state": false
}'
# Backup configuration
sudo docker exec deeptrace_server tar -czf /backup/config-$(date +%Y%m%d).tar.gz /app/config/