Server Architecture
The DeepTrace Server is the Python-based control and processing component of the DeepTrace distributed tracing system. It manages agents, retrieves correlated spans from Elasticsearch, assembles them into complete traces, and exposes management interfaces. This document describes the server's architecture, components, and operational principles as implemented.
Overview
The DeepTrace Server operates as a centralized control and processing system that:
- Manages Agent Lifecycle: Deploys, configures, and monitors distributed agents
- Processes Correlated Span Data: Retrieves correlated spans from Elasticsearch for assembly
- Performs Trace Assembly: Assembles correlated spans into complete distributed traces
- Provides Management Interface: Offers APIs and tools for system administration
Architecture Diagram
```mermaid
graph TB
    subgraph "DeepTrace Server"
        subgraph "Agent Management"
            AGENT_MGR[Agent Manager]
            SSH_CLIENT[SSH Client]
            DEPLOY[Deployment Controller]
        end
        subgraph "Data Processing"
            SPAN_POLLER[Span Poller]
            ASSEMBLER[Trace Assembler]
        end
        subgraph "Storage Interface"
            ES_CLIENT[Elasticsearch Client]
            DB_UTILS[Database Utils]
        end
        subgraph "Configuration"
            CONFIG_PARSER[Config Parser]
            TOML_CONFIG[TOML Configuration]
        end
    end
    subgraph "External Systems"
        AGENTS[Remote Agents]
        ES[(Elasticsearch)]
        SSH[SSH Hosts]
    end
    AGENT_MGR --> SSH_CLIENT
    SSH_CLIENT --> SSH
    SSH --> AGENTS
    DEPLOY --> AGENTS
    SPAN_POLLER --> ES_CLIENT
    ES_CLIENT --> ES
    SPAN_POLLER --> ASSEMBLER
    ASSEMBLER --> ES_CLIENT
    CONFIG_PARSER --> TOML_CONFIG
    CONFIG_PARSER --> AGENT_MGR
```
Core Components
1. Agent Management System
The server provides comprehensive agent lifecycle management:
Agent Class
- Purpose: Represents and manages individual agent instances
- Key Features:
- SSH-based remote command execution
- Configuration synchronization
- Code deployment and installation
- Process management (start/stop/restart)
- Health monitoring and status tracking
Agent Operations
```python
class Agent:
    def __init__(self, agent_config, elastic_config, server_config):
        # SSH connection management
        self.ssh_client = None
        self.host_ip = agent_config['agent_info']['host_ip']
        self.ssh_port = agent_config['agent_info']['ssh_port']
        self.user_name = agent_config['agent_info']['user_name']
        self.host_password = agent_config['agent_info']['host_password']

    def clone_code(self):
        # Git clone from repository
        repo_url = 'https://gitee.com/gytlll/DeepTrace.git'

    def install(self):
        # Run installation script
        command = "bash scripts/install_agent.sh"

    def sync_config(self):
        # Generate and deploy TOML configuration
        ...

    def run(self):
        # Start agent process
        command = "bash scripts/run_agent.sh"

    def stop(self):
        # Stop agent process
        command = "bash scripts/stop_agent.sh"
```
Configuration Management
- Dynamic Configuration: Generates agent-specific TOML configurations
- Hot Reload: Supports runtime configuration updates via API
- Template System: Uses server configuration to generate agent configs (see the sketch after this list)
- Validation: Ensures configuration consistency across agents
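A minimal sketch of that template-driven generation, assuming the third-party `toml` package and the field names shown later in the Configuration System section; the document layout produced here is illustrative, not the project's actual agent config format.

```python
import toml  # third-party 'toml' package (assumed); tomlkit would work similarly

def render_agent_config(agent_config, elastic_config, server_config):
    """Hypothetical sketch: merge server-side settings into a per-agent TOML document."""
    doc = {
        'agent': {
            'agent_name': agent_config['agent_info']['agent_name'],
        },
        'elastic': {
            # Agents report spans to the same Elasticsearch cluster as the server
            'elastic_password': elastic_config['elastic_password'],
        },
        'server': {
            'ip': server_config['ip'],
        },
    }
    return toml.dumps(doc)

# The rendered text could then be written to the agent host as part of sync_config().
```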
2. Data Processing Pipeline
The server implements a multi-stage data processing pipeline:
Span Polling
- Purpose: Continuously retrieves new spans from Elasticsearch
- Implementation: `poll_agents_new_spans()` function (see the sketch after this list)
- Features:
- Multi-agent span collection
- Configurable polling intervals
- Queue-based processing
- Error handling and retry logic
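A minimal sketch of such a polling loop, assuming a shared `queue.Queue` hand-off to the assembler. The function name, the `fetch_spans` callable, and the `timestamp` field are illustrative assumptions, not the real internals of `poll_agents_new_spans()`.

```python
import time

def poll_new_spans_sketch(agents, span_queue, interval, fetch_spans):
    """Illustrative polling loop mirroring poll_agents_new_spans(agents, queue, interval).

    `fetch_spans(agent, since_ts)` is a hypothetical callable that queries
    Elasticsearch for correlated spans newer than `since_ts`.
    """
    last_seen = {agent: 0 for agent in agents}
    while True:
        for agent in agents:
            try:
                spans = fetch_spans(agent, last_seen[agent])
                if spans:
                    last_seen[agent] = max(s['timestamp'] for s in spans)
                    span_queue.put(spans)  # hand off to the trace assembler
            except Exception as exc:
                # Log and retry on the next cycle instead of crashing the poller
                print(f"polling {agent.host_ip} failed: {exc}")
        time.sleep(interval)
```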
Trace Assembly Engine
```python
def span2trace(correlated_spans):
    # Step 1: Process correlated spans
    spans = process_correlated_spans(correlated_spans)
    # Step 2: Span merging
    span_list = span_merge(spans)
    # Step 3: Trace assembly
    trace_num = assemble_trace_from_spans(span_list, 'traces')
```
Processing Components
- Span Processing: Processes correlated spans from agents
- Span Merge: Consolidates related spans
- Trace Assembler: Builds complete trace structures from correlated spans (sketched below)
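Conceptually, assembly groups correlated spans by their trace identifier. A minimal sketch of that grouping, assuming each span dictionary carries `trace_id` and `start_time` fields (both assumed names, not taken from the implementation):

```python
from collections import defaultdict

def assemble_traces_sketch(span_list):
    """Illustrative grouping of spans into traces; field names are assumptions."""
    traces = defaultdict(list)
    for span in span_list:
        traces[span['trace_id']].append(span)
    # Order each trace's spans by start time so parent/child spans line up
    for spans in traces.values():
        spans.sort(key=lambda s: s.get('start_time', 0))
    return dict(traces)
```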
3. Storage Interface
The server provides comprehensive Elasticsearch integration:
Database Utilities
- Connection Management: Elasticsearch client initialization
- Index Management: Automatic index creation and management
- Bulk Operations: Efficient batch data operations
- Query Interface: Structured query building and execution
Key Functions
```python
def es_write_agent_config(agent_config, elastic_config, server_config):
    # Store agent configuration in Elasticsearch
    ...

def poll_agents_new_spans(agents, queue, interval):
    # Retrieve new spans from multiple agents
    ...

def check_db():
    # Verify database connectivity and health
    ...
```
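The bulk-operation utilities listed above batch writes to Elasticsearch. A minimal sketch using the official Python client's `helpers.bulk` API; the document shape and the use of `traces` as an index name are assumptions based on the assembly example above.

```python
from elasticsearch import Elasticsearch, helpers

def bulk_write_traces(es_client, traces, index='traces'):
    """Illustrative bulk write of assembled traces; the document shape is assumed."""
    actions = (
        {'_index': index, '_source': trace}
        for trace in traces
    )
    # helpers.bulk returns the success count plus a list of errors when raise_on_error=False
    success, errors = helpers.bulk(es_client, actions, raise_on_error=False)
    return success, errors

# Usage (connection details are illustrative):
# es = Elasticsearch("http://localhost:9200")
# bulk_write_traces(es, assembled_traces)
```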
4. Configuration System
The server uses a TOML-based configuration system:
Configuration Structure
```toml
[elastic]
elastic_password = "password"   # Elasticsearch authentication

[server]
ip = "server_ip"                # Server external IP

[[agents]]
[agents.agent_info]
agent_name = "agent1"           # Unique agent identifier
user_name = "username"          # SSH username
host_ip = "agent_ip"            # Agent host IP
ssh_port = 22                   # SSH port
host_password = "password"      # SSH password
```
Configuration Features
- Multi-Agent Support: Array-based agent configuration
- Environment Specific: Separate configs for different environments
- Validation: Schema validation and error handling (a loading and validation sketch follows this list)
- Dynamic Loading: Runtime configuration reloading
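A minimal sketch of loading and sanity-checking such a file with the standard-library `tomllib` module (Python 3.11+); the file path and the required-key check are illustrative, not the project's actual schema validation.

```python
import tomllib  # standard library in Python 3.11+; use the 'tomli' package on older versions

REQUIRED_AGENT_KEYS = {'agent_name', 'user_name', 'host_ip', 'ssh_port', 'host_password'}

def load_server_config(path='config.toml'):  # path is illustrative
    """Load the TOML configuration and do a basic sanity check on each agent entry."""
    with open(path, 'rb') as f:
        config = tomllib.load(f)
    for agent in config.get('agents', []):
        missing = REQUIRED_AGENT_KEYS - set(agent.get('agent_info', {}))
        if missing:
            raise ValueError(f"agent entry missing keys: {sorted(missing)}")
    return config
```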
Data Flow Architecture
1. Agent Management Flow
Configuration → Agent Creation → SSH Connection → Remote Operations
2. Span Processing Flow
Elasticsearch → Span Polling → Queue → Assembly → Storage
3. Deployment Flow
Config Parsing → Agent Initialization → Code Clone → Installation → Configuration Sync → Agent Start
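The deployment flow maps directly onto the `Agent` methods shown earlier. A minimal orchestration sketch (sequential, with error handling omitted; the loop itself is an assumption about orchestration, not documented behavior):

```python
def deploy_agents(agents):
    """Illustrative end-to-end deployment following the documented flow."""
    for agent in agents:
        agent.clone_code()   # Code Clone
        agent.install()      # Installation
        agent.sync_config()  # Configuration Sync
        agent.run()          # Agent Start
```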
Operational Modes
The server supports different operational modes:
Automatic Mode
- Default Operation: Continuous correlated span processing
- Background Processing: Automated trace assembly
- Health Monitoring: Continuous agent health checks
Manual Mode
- Interactive Control: Manual agent management
- Debug Mode: Enhanced logging and debugging
- Maintenance Mode: System maintenance operations
Management Interface
Command Line Interface
The server provides various management utilities:
Agent Management
```python
def install_agents(agents):
    # Parallel agent installation
    ...

def start_agents(agents):
    # Start all configured agents
    ...

def stop_agents(agents):
    # Stop all running agents
    ...

def update_agent_config(agents):
    # Hot reload agent configurations
    ...

def test_agents(agents):
    # Test agent connectivity and health
    ...
```
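Since `install_agents()` is described as parallel, the fan-out might look like the following sketch using `concurrent.futures`; the thread-pool approach and worker count are assumptions, not the documented mechanism.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_on_all_agents(agents, operation, max_workers=8):
    """Illustrative fan-out: apply an Agent method (e.g. Agent.install) to every agent."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(operation, agent): agent for agent in agents}
        for future in as_completed(futures):
            agent = futures[future]
            try:
                results[agent.host_ip] = future.result()
            except Exception as exc:
                results[agent.host_ip] = exc  # record the failure, keep going
    return results

# Usage sketch: run_on_all_agents(agents, Agent.install)
```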
Monitoring Functions
- Health Checks: Agent connectivity and status monitoring
- Performance Metrics: Processing statistics and performance data
- Error Tracking: Comprehensive error logging and tracking
- Resource Monitoring: System resource usage tracking
Deployment Architecture
Server Requirements
- Python Runtime: Python 3.x with required dependencies
- Network Access: SSH access to agent hosts
- Elasticsearch: Connection to Elasticsearch cluster
- Configuration: Proper TOML configuration files
Agent Deployment Process
- Code Distribution: Git clone from central repository
- Installation: Automated installation via scripts
- Configuration: Dynamic configuration generation and deployment
- Service Management: Systemd or process-based service management
- Health Monitoring: Continuous health and status monitoring
Security Considerations
Authentication and Authorization
- SSH Key Management: Secure SSH key-based authentication
- Elasticsearch Security: Secure Elasticsearch connections
- Configuration Security: Encrypted configuration storage
- Network Security: Secure network communications
Data Protection
- Encryption in Transit: TLS/SSL for all network communications
- Access Control: Role-based access control for server operations
- Audit Logging: Comprehensive audit trails for all operations
- Credential Management: Secure credential storage and rotation
Performance Characteristics
Processing Capacity
- Span Throughput: Processes thousands of correlated spans per minute
- Assembly Performance: Efficient trace assembly algorithms
- Storage Performance: Optimized Elasticsearch operations
- Agent Management: Concurrent agent operations
Scalability Features
- Horizontal Scaling: Multiple server instances for load distribution
- Agent Scaling: Support for hundreds of distributed agents
- Storage Scaling: Elasticsearch cluster scaling support
- Processing Scaling: Parallel processing capabilities
Troubleshooting and Monitoring
Common Issues
Agent Connectivity
```bash
# Test SSH connectivity
ssh user@agent_host

# Check agent status
sudo systemctl status deeptrace-agent

# View agent logs
sudo journalctl -u deeptrace-agent -f
```
Processing Issues
```python
# Check Elasticsearch connectivity
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust host, port, and credentials for your cluster
print(es.ping())

# Monitor span processing
log(f"Processing {len(spans)} spans")
log(f"Assembled {trace_num} traces")
```
Monitoring Best Practices
- Health Monitoring: Regular agent health checks
- Performance Monitoring: Track processing metrics
- Error Monitoring: Monitor error rates and patterns
- Resource Monitoring: Track system resource usage
- Log Analysis: Regular log analysis for issues
Integration Points
External Systems
- Elasticsearch: Primary data storage and retrieval
- Git Repository: Source code management and distribution
- SSH Infrastructure: Remote agent management
- Monitoring Systems: Integration with external monitoring
API Interfaces
- Agent APIs: Communication with agent REST APIs
- Elasticsearch APIs: Direct Elasticsearch integration
- Management APIs: Server management and control interfaces
- Monitoring APIs: Health and status reporting interfaces
This server architecture provides a comprehensive foundation for distributed tracing management, offering scalable agent management, efficient data processing, and robust operational capabilities.