# Enterprise ERP Specification - Topology, Security, & DR Architecture

This document defines the high-level system topology, microservice interaction patterns, security infrastructure, backup procedures, and disaster recovery strategy for the Company Operating System (ERP) designed to scale from 10 to 10,000+ employees.

---

## 1. Microservice Topology

The system is designed on **Domain-Driven Design (DDD)** principles and deployed as decoupled microservices using Docker containers orchestrated by Kubernetes (EKS/GKE).

```mermaid
graph TD
    Client[Web/Mobile Clients] -->|HTTPS/WSS| WAF[Web Application Firewall]
    WAF -->|Filter Traffic| CDN[CloudFront / Cloudflare CDN]
    CDN -->|Forward Requests| Gateway[Kong API Gateway]
    
    subgraph Auth_Security
        Gateway -->|Validate JWT / Auth| IDP[Keycloak Identity Provider]
    end

    subgraph Service_Registry
        Gateway -->|Discover Services| Consul[Consul Service Registry]
    end

    subgraph Core_Microservices
        Gateway -->|Route /employee/*| EmpService[Employee Management Service]
        Gateway -->|Route /finance/*| FinService[Finance & Payroll Service]
        Gateway -->|Route /projects/*| PMOService[PMO & Task Service]
        Gateway -->|Route /crm/*| CRMService[CRM & Client Service]
        Gateway -->|Route /marketing/*| MktService[Marketing Automation Service]
        Gateway -->|Route /operations/*| OpsService[Operations & Assets Service]
        Gateway -->|Route /analytics/*| Analytics[Analytics & Reporting Service]
    end

    subgraph Event_Bus
        EmpService -->|Publish Events| Kafka[Apache Kafka Message Broker]
        FinService -->|Publish/Subscribe| Kafka
        PMOService -->|Publish/Subscribe| Kafka
        CRMService -->|Publish/Subscribe| Kafka
        OpsService -->|Publish/Subscribe| Kafka
    end

    subgraph Persistence_Layer
        EmpService -->|Write/Read| DB_Postgres[(PostgreSQL Master)]
        FinService -->|Write/Read| DB_Postgres
        CRMService -->|Write/Read| DB_Postgres
        
        PMOService -->|Write/Read| DB_Mongo[(MongoDB Clustered)]
        OpsService -->|Write/Read| DB_Mongo
        
        EmpService -->|Cache Sessions| DB_Redis[(Redis Cache Cluster)]
        PMOService -->|Cache Timers| DB_Redis
        
        Analytics -->|OLAP Sync| DB_Clickhouse[(ClickHouse Analytics DB)]
    end
```

### Microservice Directory & Tech Stack

| Microservice | Boundary Domain | Primary Database | Key Technologies |
| :--- | :--- | :--- | :--- |
| **Identity Service** | Auth, RBAC, User Profiles | PostgreSQL | Keycloak OAuth2, OIDC |
| **Employee Service** | HRMS, Leaves, Attendance, OKRs | PostgreSQL | Java Spring Boot, Hibernate |
| **Finance Service** | Invoices, Expenses, Payroll, Cash Flow | PostgreSQL (ACID compliant) | Java Spring Boot, Spring Batch |
| **PMO Service** | Projects, Sprints, Kanban, Task Timers | MongoDB / Redis | Node.js NestJS, Mongoose |
| **CRM Service** | Leads, Quotes, Proposals, Clients | PostgreSQL | Python FastAPI, SQLAlchemy |
| **Marketing Service** | Email Automation, Social Media, Campaigns | PostgreSQL | Python FastAPI, Celery |
| **Ops & Asset Service** | Assets, SOPs, Procurement, Helpdesk | MongoDB | Node.js NestJS |
| **Analytics Service** | Reports, Productivity, Profitability | ClickHouse (OLAP) | Python Pandas, PySpark, ClickHouse |

---

## 2. API Gateway & Interaction Flow

The **Kong API Gateway** serves as the single entry point for all clients.

### API Gateway Responsibilities
1. **SSL/TLS Termination**: Enforce TLS 1.3 for all incoming connections.
2. **Reverse Proxying**: Route traffic to backend services based on URI path prefixes (e.g., `/employee/*` routes to the Employee Management Service).
3. **Rate Limiting**: Protect downstream microservices from request floods and abusive clients (volumetric DDoS traffic is filtered upstream by the WAF/CDN) using a token bucket algorithm:
   - *Public Routes* (Login): Max 30 requests/minute per IP.
   - *Standard Authenticated Routes*: Max 500 requests/minute per User ID.
   - *Heavy Analytics Exports*: Max 5 requests/minute per User ID.
4. **CORS Enforcement**: Strictly whitelist corporate dashboard domains.
5. **JWT Extraction & Validation**: Intercept the `Authorization` header and verify each token's signature against the public keys published at the Identity Provider's (IdP) JSON Web Key Set (JWKS) endpoint.
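As a minimal sketch of the token bucket algorithm behind these limits, the Python below keeps one in-memory bucket per (route class, client key) pair, where the key is the IP address for public routes and the user ID for authenticated ones. The class name `TokenBucket`, the route-class labels, and the `check` helper are hypothetical; in this architecture the limits would be configured in Kong rather than hand-written.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Holds up to `capacity` tokens; each request spends one; tokens refill continuously."""
    capacity: float            # burst size, in requests
    refill_per_sec: float      # sustained rate, in requests per second
    tokens: float = field(init=False)
    last: float | None = field(default=None, init=False)

    def __post_init__(self) -> None:
        self.tokens = self.capacity  # a new client starts with a full bucket

    def allow(self, now: float | None = None) -> bool:
        """Spend one token if available; False means reject (e.g. HTTP 429)."""
        now = time.monotonic() if now is None else now
        if self.last is not None:
            # Refill in proportion to elapsed time, never above capacity.
            elapsed = max(0.0, now - self.last)
            self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Per-minute limits from the spec as (capacity, tokens per second); labels are hypothetical.
LIMITS = {
    "public_login": (30, 30 / 60),
    "authenticated": (500, 500 / 60),
    "analytics_export": (5, 5 / 60),
}

_buckets: dict[tuple[str, str], TokenBucket] = {}


def check(route_class: str, client_key: str, now: float | None = None) -> bool:
    """Find or create the bucket for this route class and client, then try to spend a token."""
    key = (route_class, client_key)
    if key not in _buckets:
        capacity, rate = LIMITS[route_class]
        _buckets[key] = TokenBucket(capacity, rate)
    return _buckets[key].allow(now)
```

Setting the capacity equal to the per-minute limit lets a client spend a full minute's allowance in one burst; a smaller capacity with the same refill rate smooths traffic further. The buckets here live in a single process, so a multi-node gateway would need shared state (for example in Redis) to enforce one global limit.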

---

## 3. Security Architecture

### Authentication & Authorization Framework
- **OAuth2 + OpenID Connect (OIDC)**: Delegated authentication managed via integrated Keycloak/Okta.
- **Multi-Factor Authentication (MFA)**: Mandatory for all roles (TOTP via Google Authenticator/Duo, or WebAuthn hardware keys).
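- **JWT (JSON Web Token) Structure**: Tokens are signed with `RS256` (RSA signature with SHA-256) using the IdP's asymmetric key pair. The payload carries scope and role claims; it is signed but **not** encrypted (only Base64url-encoded), so it must never contain secrets: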
  ```json
  {
    "sub": "usr_992a76f2c3004e8d",
    "name": "Rahul Verma",
    "email": "rahul@company.com",
    "tenant_id": "tnt_001",
    "role": "Employee",
    "departments": ["Engineering"],
    "permissions": ["task:start", "task:log", "attendance:checkin", "leave:apply"],
    "exp": 1779998400
  }
  ```
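A signed JWT (JWS) is not encrypted: its payload is only Base64url-encoded, so anyone holding the token can read the claims shown above. The standard-library Python sketch below illustrates that, then performs a claim-based permission check. It assumes the RS256 signature has already been verified by the gateway or a JWT library; `decode_payload_unverified` and `is_authorized` are hypothetical helper names, and the unverified decoder must never feed an authorization decision.

```python
from __future__ import annotations

import base64
import json
import time


def decode_payload_unverified(token: str) -> dict:
    """Read a JWT's claims WITHOUT checking the signature (inspection only)."""
    _header, payload, _signature = token.split(".")
    padded = payload + "=" * (-len(payload) % 4)  # JWTs strip Base64 padding
    return json.loads(base64.urlsafe_b64decode(padded))


def is_authorized(claims: dict, required_permission: str, now: float | None = None) -> bool:
    """Authorize using claims that a JWT library has ALREADY signature-verified."""
    now = time.time() if now is None else now
    if claims.get("exp", 0) <= now:  # expired, or no exp claim at all
        return False
    return required_permission in claims.get("permissions", [])
```

In a real service, the `claims` passed to `is_authorized` would come from a verifying call such as PyJWT's `jwt.decode(token, key, algorithms=["RS256"])`, which checks the signature and `exp` before returning the payload.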

### Data Security (At Rest & In Transit)
- **Encryption in Transit**: All internal microservice-to-microservice calls run inside a service mesh (Istio) using **Mutual TLS (mTLS)** with automatically rotated certificates.
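- **Encryption at Rest**: Databases use Transparent Data Encryption (TDE) with AES-256 envelope encryption: data keys encrypt the stored data, and those data keys are in turn encrypted by master keys held in a cloud key management service backed by hardware security modules (AWS KMS / GCP Cloud KMS).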
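- **Sensitive Field Encryption**: HR payroll amounts, SSN and PAN card numbers, and bank account numbers are encrypted at the application level before they are written to the database (field-level encryption with an authenticated cipher such as AES-256-GCM). HMAC-SHA256 is a keyed one-way hash, not encryption; it is used only to store a searchable blind index of a field alongside its ciphertext when exact-match lookups are needed. Passwords are never encrypted: the Identity Provider stores them as salted hashes produced by a slow password-hashing algorithm (e.g., PBKDF2, bcrypt, or Argon2).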
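HMAC-SHA256 is a keyed one-way function, not a cipher: it cannot be reversed to recover a payroll amount or account number. Next to field-level encryption its usual role is a blind index, a deterministic keyed digest stored beside the ciphertext so a column can be matched by exact value without decrypting it. Passwords, by contrast, get a salted, deliberately slow hash. The Python sketch below shows both with the standard library only. It is illustrative: the helper names are hypothetical, the HMAC key would in practice be fetched from KMS, the normalization rule is an assumption suited to identifiers such as PAN numbers, the PBKDF2 iteration count is an illustrative choice, and the field encryption itself (which needs a crypto library) is omitted.

```python
from __future__ import annotations

import hashlib
import hmac
import os


def blind_index(value: str, index_key: bytes) -> str:
    """Deterministic keyed digest for exact-match lookups on an encrypted column."""
    normalized = value.strip().upper()  # assumption: case/whitespace-insensitive IDs
    return hmac.new(index_key, normalized.encode(), hashlib.sha256).hexdigest()


def hash_password(password: str, iterations: int = 600_000) -> tuple[bytes, bytes]:
    """Salted PBKDF2-HMAC-SHA256; returns (salt, derived_key)."""
    salt = os.urandom(16)
    derived = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return salt, derived


def verify_password(password: str, salt: bytes, expected: bytes,
                    iterations: int = 600_000) -> bool:
    """Recompute the hash with the stored salt and compare in constant time."""
    derived = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    return hmac.compare_digest(derived, expected)
```

The blind index is deterministic by design (equal inputs give equal digests), which is what makes lookups possible; it also means low-entropy fields remain guessable by anyone who obtains the index key, so that key needs the same protection as the encryption keys.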

---

## 4. Backup & Disaster Recovery (DR) Architecture

To keep operations available at MNC scale (10,000+ employees), the platform uses a hybrid-cloud deployment that spans multiple Availability Zones (Multi-AZ) within a region and multiple regions (Multi-Region).

### RTO (Recovery Time Objective) and RPO (Recovery Point Objective)

| Tier | Services | Target RTO | Target RPO |
| :--- | :--- | :--- | :--- |
| **Tier 1 (Critical)** | Identity, PMO (Timers), HRMS (Attendance) | < 15 minutes | < 1 minute |
| **Tier 2 (High)** | CRM, Client Invoices, Expense, Procurement | < 1 hour | < 5 minutes |
| **Tier 3 (Medium)** | Marketing Automations, Helpdesk, Asset lists | < 4 hours | < 1 hour |
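The tier targets can be captured as data so that disaster recovery drill results are checked mechanically rather than by eye. In this minimal Python sketch, `TIER_TARGETS` mirrors the table above, while the function name and the idea of feeding it measured drill results are hypothetical; both targets are treated as strict upper bounds, matching the table's "<" notation.

```python
from datetime import timedelta

# (target RTO, target RPO) per tier, mirroring the table above.
TIER_TARGETS = {
    1: (timedelta(minutes=15), timedelta(minutes=1)),
    2: (timedelta(hours=1), timedelta(minutes=5)),
    3: (timedelta(hours=4), timedelta(hours=1)),
}


def meets_targets(tier: int, downtime: timedelta, data_loss_window: timedelta) -> bool:
    """True if a measured outage stayed strictly under the tier's RTO and RPO."""
    rto, rpo = TIER_TARGETS[tier]
    return downtime < rto and data_loss_window < rpo
```

As a worked budget: the automated failover described under Disaster Recovery Configuration spends about 30 seconds on detection alone (3 checks at 10-second intervals), leaving roughly 14.5 minutes of the Tier 1 RTO for the DNS shift, replica promotion, and scaling pods up from zero.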

### Backup Architecture Spec
1. **PostgreSQL**: 
   - Write-Ahead Log (WAL) segments archived to Amazon S3 at least every 60 seconds, enabling Point-in-Time Recovery (PITR).
   - Daily full logical backup (`pg_dump`) retained for 7 years in cold storage (Amazon S3 Glacier) for compliance audits.
2. **MongoDB Clustered Data**:
   - Continuous MongoDB oplog tailing backups.
   - Weekly snapshots retained for 6 months.
3. **Object Storage (SOPs, Resumes, Contracts)**:
   - S3 Cross-Region Replication (CRR) to the DR region, with versioning enabled on both the source and destination buckets (a prerequisite for CRR).
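The retention rules above can be expressed as a small policy function, for example to drive a scheduled cleanup job. This Python sketch is one possible reading of the spec: the backup-kind labels are hypothetical, "7 years" and "6 months" are interpreted as calendar periods counted from each backup's creation date, and WAL archives, oplog streams, and S3 object versions are outside its scope.

```python
import calendar
from datetime import date

# Retention periods from the backup spec, in calendar months (labels are hypothetical).
RETENTION_MONTHS = {
    "postgres_daily_pg_dump": 7 * 12,  # 7 years in cold storage
    "mongodb_weekly_snapshot": 6,      # 6 months
}


def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping the day to the target month's length."""
    total = d.month - 1 + months
    year, month = d.year + total // 12, total % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def is_expired(kind: str, created: date, today: date) -> bool:
    """True once a backup of this kind has outlived its retention period."""
    return today >= add_months(created, RETENTION_MONTHS[kind])
```

Clamping the day keeps month arithmetic well defined: a snapshot taken on 31 August expires at the end of February rather than raising an error for a nonexistent date.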

### Disaster Recovery Configuration (Active-Passive Warm Standby)

```mermaid
graph TD
    Route53[AWS Route 53 DNS Latency/Failover] -->|Primary Route| RegionA[AWS Region: Primary AP-South-1]
    Route53 -->|Failover Trigger| RegionB[AWS Region: Disaster Recovery AP-Southeast-1]
    
    subgraph RegionA_AP_South_1
        RegionA --> LB_A[Elastic Load Balancer]
        LB_A --> EKS_A[Kubernetes Pods - Running]
        EKS_A --> DB_A[(PostgreSQL Master)]
        DB_A -->|WAL Archiving every 60 s| S3_WAL_A[S3 WAL Archiving]
    end

    subgraph RegionB_AP_Southeast_1
        RegionB --> LB_B[Elastic Load Balancer]
        LB_B --> EKS_B[Kubernetes Pods - Auto-scaled to 0]
        EKS_B --> DB_B[(PostgreSQL Replica - Read Only)]
    end

    DB_A -->|Cross-Region Streaming Replication| DB_B
```

- **Failover Automation**: Route 53 health checks probe the API Gateway heartbeat every 10 seconds. If Region A fails 3 consecutive checks (about 30 seconds):
  1. DNS automatically shifts 100% of traffic to Region B.
  2. A Lambda function promotes the PostgreSQL replica in Region B to Master.
  3. The Kubernetes cluster in Region B scales services up from 0 to their target replica counts.
  4. Alerts are dispatched to the SRE/Infra on-call rotation.
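The detection rule (10-second checks, failover after 3 consecutive failures) can be sketched as a small state machine. The Python below is illustrative only: Route 53 evaluates its health checks itself, the step names are hypothetical placeholders for the four actions listed above, and ignoring results after failover is a simplifying assumption of the sketch.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Failover steps in the order listed above (names are hypothetical placeholders).
FAILOVER_STEPS = (
    "shift_dns_to_region_b",
    "promote_region_b_postgres_replica",
    "scale_region_b_deployments_to_target",
    "notify_sre_on_call",
)


@dataclass
class FailoverDetector:
    check_interval_s: int = 10      # health-check interval from the spec
    failure_threshold: int = 3      # consecutive failures that trigger failover
    consecutive_failures: int = 0
    failed_over: bool = False
    executed: list[str] = field(default_factory=list)

    def record(self, healthy: bool) -> None:
        """Feed one health-check result; run the failover steps once at the threshold."""
        if self.failed_over:
            return                          # this sketch ignores results after failover
        if healthy:
            self.consecutive_failures = 0   # only consecutive failures count
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.failed_over = True
            self.executed.extend(FAILOVER_STEPS)  # stand-in for the real actions

    @property
    def detection_time_s(self) -> int:
        """Nominal detection window quoted in the spec (interval x threshold)."""
        return self.check_interval_s * self.failure_threshold
```

Requiring consecutive failures, rather than reacting to any single failed check, avoids failing over on a transient network blip; the cost is the roughly 30-second detection delay, which counts against the Tier 1 RTO.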
