Server Architecture
The DeepTrace Server is the Python-based control and processing component of the DeepTrace distributed tracing system. It manages agents, retrieves correlated spans from Elasticsearch, assembles them into complete traces, and exposes management interfaces. This document describes the server's architecture, components, and operational principles as implemented.
Overview
The DeepTrace Server operates as a centralized control and processing system that:
- Manages Agent Lifecycle: Deploys, configures, and monitors distributed agents
- Processes Correlated Span Data: Retrieves correlated spans from Elasticsearch for assembly
- Performs Trace Assembly: Assembles correlated spans into complete distributed traces
- Provides Management Interface: Offers APIs and tools for system administration
Architecture Diagram
```mermaid
graph TB
    subgraph "DeepTrace Server"
        subgraph "Agent Management"
            AGENT_MGR[Agent Manager]
            SSH_CLIENT[SSH Client]
            DEPLOY[Deployment Controller]
        end
        subgraph "Data Processing"
            SPAN_POLLER[Span Poller]
            ASSEMBLER[Trace Assembler]
        end
        subgraph "Storage Interface"
            ES_CLIENT[Elasticsearch Client]
            DB_UTILS[Database Utils]
        end
        subgraph "Configuration"
            CONFIG_PARSER[Config Parser]
            TOML_CONFIG[TOML Configuration]
        end
    end
    subgraph "External Systems"
        AGENTS[Remote Agents]
        ES[(Elasticsearch)]
        SSH[SSH Hosts]
    end
    AGENT_MGR --> SSH_CLIENT
    SSH_CLIENT --> SSH
    SSH --> AGENTS
    DEPLOY --> AGENTS
    SPAN_POLLER --> ES_CLIENT
    ES_CLIENT --> ES
    SPAN_POLLER --> ASSEMBLER
    ASSEMBLER --> ES_CLIENT
    CONFIG_PARSER --> TOML_CONFIG
    CONFIG_PARSER --> AGENT_MGR
```
Core Components
1. Agent Management System
The server provides comprehensive agent lifecycle management:
Agent Class
- Purpose: Represents and manages individual agent instances
- Key Features:
- SSH-based remote command execution
- Configuration synchronization
- Code deployment and installation
- Process management (start/stop/restart)
- Health monitoring and status tracking
Agent Operations
```python
class Agent:
    def __init__(self, agent_config, elastic_config, server_config):
        # SSH connection management
        self.ssh_client = None
        self.host_ip = agent_config['agent_info']['host_ip']
        self.ssh_port = agent_config['agent_info']['ssh_port']
        self.user_name = agent_config['agent_info']['user_name']
        self.host_password = agent_config['agent_info']['host_password']

    def clone_code(self):
        # Git clone from repository
        repo_url = 'https://gitee.com/gytlll/DeepTrace.git'

    def install(self):
        # Run installation script
        command = "bash scripts/install_agent.sh"

    def sync_config(self):
        # Generate and deploy TOML configuration
        ...

    def run(self):
        # Start agent process
        command = "bash scripts/run_agent.sh"

    def stop(self):
        # Stop agent process
        command = "bash scripts/stop_agent.sh"
```
Configuration Management
- Dynamic Configuration: Generates agent-specific TOML configurations
- Hot Reload: Supports runtime configuration updates via API
- Template System: Uses server configuration to generate agent configs (see the sketch after this list)
- Validation: Ensures configuration consistency across agents
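A minimal sketch of that template-driven generation, assuming the third-party `toml` package and the field names shown later in the Configuration System section; the document layout produced here is illustrative, not the project's actual agent config format.

```python
import toml  # third-party 'toml' package (assumed); tomlkit would work similarly

def render_agent_config(agent_config, elastic_config, server_config):
    """Hypothetical sketch: merge server-side settings into a per-agent TOML document."""
    doc = {
        'agent': {
            'agent_name': agent_config['agent_info']['agent_name'],
        },
        'elastic': {
            # Agents report spans to the same Elasticsearch cluster as the server
            'elastic_password': elastic_config['elastic_password'],
        },
        'server': {
            'ip': server_config['ip'],
        },
    }
    return toml.dumps(doc)

# The rendered text could then be written to the agent host as part of sync_config().
```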
2. Data Processing Pipeline
The server implements a multi-stage data processing pipeline:
Span Polling
- Purpose: Continuously retrieves new spans from Elasticsearch
- Implementation: `poll_agents_new_spans()` function (see the sketch after this list)
- Features:
- Multi-agent span collection
- Configurable polling intervals
- Queue-based processing
- Error handling and retry logic
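A minimal sketch of such a polling loop, assuming a shared `queue.Queue` hand-off to the assembler. The function name, the `fetch_spans` callable, and the `timestamp` field are illustrative assumptions, not the real internals of `poll_agents_new_spans()`.

```python
import time

def poll_new_spans_sketch(agents, span_queue, interval, fetch_spans):
    """Illustrative polling loop mirroring poll_agents_new_spans(agents, queue, interval).

    `fetch_spans(agent, since_ts)` is a hypothetical callable that queries
    Elasticsearch for correlated spans newer than `since_ts`.
    """
    last_seen = {agent: 0 for agent in agents}
    while True:
        for agent in agents:
            try:
                spans = fetch_spans(agent, last_seen[agent])
                if spans:
                    last_seen[agent] = max(s['timestamp'] for s in spans)
                    span_queue.put(spans)  # hand off to the trace assembler
            except Exception as exc:
                # Log and retry on the next cycle instead of crashing the poller
                print(f"polling {agent.host_ip} failed: {exc}")
        time.sleep(interval)
```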
Trace Assembly Engine
```python
def span2trace(correlated_spans):
    # Step 1: Process correlated spans
    spans = process_correlated_spans(correlated_spans)
    # Step 2: Span merging
    span_list = span_merge(spans)
    # Step 3: Trace assembly
    trace_num = assemble_trace_from_spans(span_list, 'traces')
```
Processing Components
- Span Processing: Processes correlated spans from agents
- Span Merge: Consolidates related spans
- Trace Assembler: Builds complete trace structures from correlated spans (sketched below)
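Conceptually, assembly groups correlated spans by their trace identifier. A minimal sketch of that grouping, assuming each span dictionary carries `trace_id` and `start_time` fields (both assumed names, not taken from the implementation):

```python
from collections import defaultdict

def assemble_traces_sketch(span_list):
    """Illustrative grouping of spans into traces; field names are assumptions."""
    traces = defaultdict(list)
    for span in span_list:
        traces[span['trace_id']].append(span)
    # Order each trace's spans by start time so parent/child spans line up
    for spans in traces.values():
        spans.sort(key=lambda s: s.get('start_time', 0))
    return dict(traces)
```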
3. Storage Interface
The server provides comprehensive Elasticsearch integration:
Database Utilities
- Connection Management: Elasticsearch client initialization
- Index Management: Automatic index creation and management
- Bulk Operations: Efficient batch data operations
- Query Interface: Structured query building and execution
Key Functions
```python
def es_write_agent_config(agent_config, elastic_config, server_config):
    # Store agent configuration in Elasticsearch
    ...

def poll_agents_new_spans(agents, queue, interval):
    # Retrieve new spans from multiple agents
    ...

def check_db():
    # Verify database connectivity and health
    ...
```
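The bulk-operation utilities listed above batch writes to Elasticsearch. A minimal sketch using the official Python client's `helpers.bulk` API; the document shape and the use of `traces` as an index name are assumptions based on the assembly example above.

```python
from elasticsearch import Elasticsearch, helpers

def bulk_write_traces(es_client, traces, index='traces'):
    """Illustrative bulk write of assembled traces; the document shape is assumed."""
    actions = (
        {'_index': index, '_source': trace}
        for trace in traces
    )
    # helpers.bulk returns the success count plus a list of errors when raise_on_error=False
    success, errors = helpers.bulk(es_client, actions, raise_on_error=False)
    return success, errors

# Usage (connection details are illustrative):
# es = Elasticsearch("http://localhost:9200")
# bulk_write_traces(es, assembled_traces)
```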
4. Configuration System
The server uses a TOML-based configuration system:
Configuration Structure
```toml
[elastic]
elastic_password = "password"   # Elasticsearch authentication

[server]
ip = "server_ip"                # Server external IP

[[agents]]
[agents.agent_info]
agent_name = "agent1"           # Unique agent identifier
user_name = "username"          # SSH username
host_ip = "agent_ip"            # Agent host IP
ssh_port = 22                   # SSH port
host_password = "password"      # SSH password
```
Configuration Features
- Multi-Agent Support: Array-based agent configuration
- Environment Specific: Separate configs for different environments
- Validation: Schema validation and error handling (a loading and validation sketch follows this list)
- Dynamic Loading: Runtime configuration reloading
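A minimal sketch of loading and sanity-checking such a file with the standard-library `tomllib` module (Python 3.11+); the file path and the required-key check are illustrative, not the project's actual schema validation.

```python
import tomllib  # standard library in Python 3.11+; use the 'tomli' package on older versions

REQUIRED_AGENT_KEYS = {'agent_name', 'user_name', 'host_ip', 'ssh_port', 'host_password'}

def load_server_config(path='config.toml'):  # path is illustrative
    """Load the TOML configuration and do a basic sanity check on each agent entry."""
    with open(path, 'rb') as f:
        config = tomllib.load(f)
    for agent in config.get('agents', []):
        missing = REQUIRED_AGENT_KEYS - set(agent.get('agent_info', {}))
        if missing:
            raise ValueError(f"agent entry missing keys: {sorted(missing)}")
    return config
```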
Data Flow Architecture
1. Agent Management Flow
Configuration → Agent Creation → SSH Connection → Remote Operations
2. Span Processing Flow
Elasticsearch → Span Polling → Queue → Assembly → Storage
3. Deployment Flow
Config Parsing → Agent Initialization → Code Clone → Installation → Configuration Sync → Agent Start
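The deployment flow maps directly onto the `Agent` methods shown earlier. A minimal orchestration sketch (sequential, with error handling omitted; the loop itself is an assumption about orchestration, not documented behavior):

```python
def deploy_agents(agents):
    """Illustrative end-to-end deployment following the documented flow."""
    for agent in agents:
        agent.clone_code()   # Code Clone
        agent.install()      # Installation
        agent.sync_config()  # Configuration Sync
        agent.run()          # Agent Start
```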
Operational Modes
The server supports different operational modes:
Automatic Mode
- Default Operation: Continuous correlated span processing
- Background Processing: Automated trace assembly
- Health Monitoring: Continuous agent health checks
Manual Mode
- Interactive Control: Manual agent management
- Debug Mode: Enhanced logging and debugging
- Maintenance Mode: System maintenance operations
Management Interface
Command Line Interface
The server provides various management utilities:
Agent Management
```python
def install_agents(agents):
    # Parallel agent installation
    ...

def start_agents(agents):
    # Start all configured agents
    ...

def stop_agents(agents):
    # Stop all running agents
    ...

def update_agent_config(agents):
    # Hot reload agent configurations
    ...

def test_agents(agents):
    # Test agent connectivity and health
    ...
```
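Since `install_agents()` is described as parallel, the fan-out might look like the following sketch using `concurrent.futures`; the thread-pool approach and worker count are assumptions, not the documented mechanism.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_on_all_agents(agents, operation, max_workers=8):
    """Illustrative fan-out: apply an Agent method (e.g. Agent.install) to every agent."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(operation, agent): agent for agent in agents}
        for future in as_completed(futures):
            agent = futures[future]
            try:
                results[agent.host_ip] = future.result()
            except Exception as exc:
                results[agent.host_ip] = exc  # record the failure, keep going
    return results

# Usage sketch: run_on_all_agents(agents, Agent.install)
```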
Monitoring Functions
- Health Checks: Agent connectivity and status monitoring
- Performance Metrics: Processing statistics and performance data
- Error Tracking: Comprehensive error logging and tracking
- Resource Monitoring: System resource usage tracking
Deployment Architecture
Server Requirements
- Python Runtime: Python 3.x with required dependencies
- Network Access: SSH access to agent hosts
- Elasticsearch: Connection to Elasticsearch cluster
- Configuration: Proper TOML configuration files
Agent Deployment Process
- Code Distribution: Git clone from central repository
- Installation: Automated installation via scripts
- Configuration: Dynamic configuration generation and deployment
- Service Management: Systemd or process-based service management
- Health Monitoring: Continuous health and status monitoring
Security Considerations
Authentication and Authorization
- SSH Key Management: Secure SSH key-based authentication
- Elasticsearch Security: Secure Elasticsearch connections
- Configuration Security: Encrypted configuration storage
- Network Security: Secure network communications
Data Protection
- Encryption in Transit: TLS/SSL for all network communications
- Access Control: Role-based access control for server operations
- Audit Logging: Comprehensive audit trails for all operations
- Credential Management: Secure credential storage and rotation
Performance Characteristics
Processing Capacity
- Span Throughput: Processes thousands of correlated spans per minute
- Assembly Performance: Efficient trace assembly algorithms
- Storage Performance: Optimized Elasticsearch operations
- Agent Management: Concurrent agent operations
Scalability Features
- Horizontal Scaling: Multiple server instances for load distribution
- Agent Scaling: Support for hundreds of distributed agents
- Storage Scaling: Elasticsearch cluster scaling support
- Processing Scaling: Parallel processing capabilities
Troubleshooting and Monitoring
Common Issues
Agent Connectivity
```bash
# Test SSH connectivity
ssh user@agent_host

# Check agent status
sudo systemctl status deeptrace-agent

# View agent logs
sudo journalctl -u deeptrace-agent -f
```
Processing Issues
```python
# Check Elasticsearch connectivity
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust host, port, and credentials for your cluster
print(es.ping())

# Monitor span processing
log(f"Processing {len(spans)} spans")
log(f"Assembled {trace_num} traces")
```
Monitoring Best Practices
- Health Monitoring: Regular agent health checks
- Performance Monitoring: Track processing metrics
- Error Monitoring: Monitor error rates and patterns
- Resource Monitoring: Track system resource usage
- Log Analysis: Regular log analysis for issues
Integration Points
External Systems
- Elasticsearch: Primary data storage and retrieval
- Git Repository: Source code management and distribution
- SSH Infrastructure: Remote agent management
- Monitoring Systems: Integration with external monitoring
API Interfaces
- Agent APIs: Communication with agent REST APIs
- Elasticsearch APIs: Direct Elasticsearch integration
- Management APIs: Server management and control interfaces
- Monitoring APIs: Health and status reporting interfaces
This server architecture provides a comprehensive foundation for distributed tracing management, offering scalable agent management, efficient data processing, and robust operational capabilities.