System Architecture

Overview

The IBDGC Integrated Data Hub (IDhub) is a microservices-based data integration platform designed to centralize biobank and clinical data from multiple sources while maintaining data quality, provenance, and subject identity consistency.

Architecture Principles

1. Separation of Concerns

Each service has a single, well-defined responsibility:

GSID Service: Subject identity management
REDCap Pipeline: Data extraction and transformation
Fragment Validator: Data quality validation
Table Loader: Database persistence
Nginx: Routing and SSL termination

2. Staged Data Pipeline

Data flows through distinct stages with validation gates:

Source → Extract → Stage → Validate → Queue → Load → Database

A failure in one stage does not affect the others, so a failed stage can be retried or recovered on its own.
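
As a minimal sketch of how stage independence allows per-stage retries, assume each stage is a plain function that takes the previous stage's output. The names `run_stage`, `run_pipeline`, `StageError`, `max_attempts` and `delay_s` are hypothetical and do not come from the IDhub code base.

```python
from __future__ import annotations

import time
from typing import Any, Callable


class StageError(Exception):
    """Raised when a stage still fails after all retry attempts."""


def run_stage(name: str, fn: Callable[[Any], Any], payload: Any,
              max_attempts: int = 3, delay_s: float = 0.0) -> Any:
    """Run one pipeline stage, retrying only that stage on failure."""
    last_error: Exception | None = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(payload)
        except Exception as exc:  # sketch only: real code would catch narrower errors
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay_s)
    raise StageError(f"{name} failed after {max_attempts} attempts") from last_error


def run_pipeline(stages: list[tuple[str, Callable[[Any], Any]]], payload: Any) -> Any:
    """Feed the output of each stage into the next one."""
    for name, fn in stages:
        payload = run_stage(name, fn, payload)
    return payload
```

Because retries wrap a single stage, a transient failure in one stage does not force earlier stages to run again.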

3. Immutable Staging

Data fragments in S3 are immutable once created, providing:

Complete audit trail
Ability to replay pipelines
Source of truth for debugging
Disaster recovery capability
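
The write-once behaviour can be illustrated with a small in-memory stand-in for the bucket. This is a sketch under stated assumptions: `ImmutableFragmentStore` is a hypothetical class, it does not call S3, and returning a SHA-256 digest for audit purposes is an illustrative choice rather than documented IDhub behaviour.

```python
from __future__ import annotations

import hashlib
import json


class ImmutableFragmentStore:
    """In-memory stand-in for a write-once bucket (not a real S3 client)."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, fragment: dict) -> str:
        """Store a fragment once and refuse to overwrite an existing key."""
        if key in self._objects:
            raise FileExistsError(f"fragment already exists: {key}")
        body = json.dumps(fragment, sort_keys=True).encode("utf-8")
        self._objects[key] = body
        # The digest can be recorded in an audit log to detect tampering later.
        return hashlib.sha256(body).hexdigest()

    def get(self, key: str) -> dict:
        """Read a fragment back, for example when replaying a pipeline."""
        return json.loads(self._objects[key])
```

A corrected fragment would be written under a new key (for instance, in a new batch) instead of replacing the old one, which keeps earlier versions available for replay and debugging.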

4. Natural Key Strategy

Records are identified by business keys (natural keys) rather than database IDs, enabling:

Idempotent operations
Cross-system reconciliation
Intelligent upserts
Data deduplication
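
A minimal sketch of natural-key deduplication follows. The helper names are hypothetical, and the key fields used in the usage note (`global_subject_id`, `niddk_no`) are an assumed example, not the configured natural key of any IDhub table.

```python
from __future__ import annotations


def natural_key(record: dict, key_fields: tuple[str, ...]) -> tuple:
    """Build a hashable business key from the configured fields."""
    return tuple(record[field] for field in key_fields)


def deduplicate(records: list[dict], key_fields: tuple[str, ...]) -> list[dict]:
    """Keep the last record seen for each natural key, in first-seen key order."""
    latest: dict[tuple, dict] = {}
    for record in records:
        latest[natural_key(record, key_fields)] = record
    return list(latest.values())
```

Calling `deduplicate(records, ("global_subject_id", "niddk_no"))` twice on the same input yields the same result, which is the property that makes repeated loads idempotent.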

High-Level Architecture

graph TB
    subgraph "External Systems"
        RC[REDCap Projects]
        LK[LabKey]
        MU[Manual Uploads]
    end

    subgraph "Ingestion Services"
        RCP[REDCap Pipeline<br/>Python Service]
        FV[Fragment Validator<br/>Python Service]
    end

    subgraph "Storage Layer"
        S3[(S3 Bucket<br/>Curated Fragments)]
        VQ[(PostgreSQL<br/>Validation Queue)]
    end

    subgraph "Loading Services"
        TL[Table Loader<br/>Python Service]
        GS[GSID Service<br/>FastAPI]
    end

    subgraph "Data Layer"
        DB[(PostgreSQL<br/>Main Database)]
    end

    subgraph "Access Layer"
        NX[Nginx<br/>Reverse Proxy]
        NC[NocoDB<br/>Web UI]
        API[REST API]
    end

    subgraph "Orchestration"
        GHA[GitHub Actions<br/>Workflows]
    end

    RC -->|API| RCP
    LK -->|Export| FV
    MU -->|Upload| FV

    RCP -->|Upload| S3
    FV -->|Upload| S3

    S3 -->|Read| FV
    FV -->|Insert| VQ

    VQ -->|Read| TL
    TL -->|Upsert| DB

    GS <-->|Query/Create| DB
    FV -->|Resolve GSID| GS
    TL -->|Resolve GSID| GS

    DB -->|Query| NC
    DB -->|Query| API

    NX -->|Proxy| NC
    NX -->|Proxy| GS
    NX -->|Proxy| API

    GHA -.->|Trigger| RCP
    GHA -.->|Trigger| FV
    GHA -.->|Trigger| TL

    style S3 fill:#FF9800
    style VQ fill:#2196F3
    style DB fill:#4CAF50
    style GS fill:#9C27B0

Component Architecture

GSID Service

Purpose: Centralized global subject ID management

Technology: FastAPI (Python), PostgreSQL

Key Features:

GSID generation (Custom format)
Local ID to GSID resolution
Fuzzy matching for subject identification
RESTful API with authentication

graph LR
    A[Client Request] --> B[FastAPI Router]
    B --> C{Endpoint}
    C -->|/generate| D[Generate GSID]
    C -->|/resolve| E[Resolve Local ID]
    C -->|/batch| F[Batch Operations]

    D --> G[Database]
    E --> G
    F --> G

    G --> H[Return Response]

Database Tables:

subjects: Core subject records with GSID
local_subject_ids: Mapping of local IDs to GSIDs
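
The lookup-or-create behaviour over these two tables can be sketched as below. Several things here are assumptions: SQLite stands in for PostgreSQL so the example is self-contained, the `GSID-` prefix is a placeholder because the real GSID format is not specified here, the table columns are reduced to the ones needed, and fuzzy matching is left out entirely.

```python
from __future__ import annotations

import sqlite3
import uuid


def init_db() -> sqlite3.Connection:
    """Create reduced versions of both tables in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE subjects (id TEXT PRIMARY KEY, gsid TEXT UNIQUE NOT NULL);
        CREATE TABLE local_subject_ids (
            subject_id TEXT NOT NULL REFERENCES subjects(id),
            center_id INTEGER NOT NULL,
            local_subject_id TEXT NOT NULL,
            UNIQUE (center_id, local_subject_id)
        );
        """
    )
    return conn


def resolve_or_create(conn: sqlite3.Connection, center_id: int,
                      local_id: str) -> tuple[str, bool]:
    """Return (gsid, created) for a center-scoped local ID."""
    row = conn.execute(
        """
        SELECT s.gsid
        FROM local_subject_ids AS l
        JOIN subjects AS s ON s.id = l.subject_id
        WHERE l.center_id = ? AND l.local_subject_id = ?
        """,
        (center_id, local_id),
    ).fetchone()
    if row is not None:
        return row[0], False

    subject_id = str(uuid.uuid4())
    gsid = "GSID-" + uuid.uuid4().hex[:12].upper()  # placeholder, not the real format
    with conn:  # commit both inserts together, or roll back on error
        conn.execute("INSERT INTO subjects (id, gsid) VALUES (?, ?)", (subject_id, gsid))
        conn.execute(
            "INSERT INTO local_subject_ids (subject_id, center_id, local_subject_id) "
            "VALUES (?, ?, ?)",
            (subject_id, center_id, local_id),
        )
    return gsid, True
```

Resolving the same `(center_id, local_id)` pair a second time returns the existing GSID with `created` set to `False`.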

Detailed documentation →

REDCap Pipeline

Purpose: Extract and transform data from REDCap projects

Technology: Python, REDCap API, S3

Key Features:

Multi-project support
Incremental extraction
Field mapping and transformation
Fragment generation

graph TB
    A[REDCap API] --> B[Extract Records]
    B --> C[Apply Field Mappings]
    C --> D[Transform Data]
    D --> E[Generate Fragments]
    E --> F[Upload to S3]
    F --> G[Update Metadata]

Configuration:

config/projects.json: Project definitions
config/*_field_mappings.json: Field mapping rules
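
To illustrate what a field mapping step might do, here is a sketch with an assumed mapping shape. The JSON layout (`table` plus a `fields` object), the REDCap field names, and the choice to drop empty strings are all hypothetical; the actual format of the mapping files may differ.

```python
from __future__ import annotations

import json

# Hypothetical mapping: source REDCap field name -> target column name.
EXAMPLE_MAPPING = json.loads(
    """
    {
      "table": "lcl",
      "fields": {
        "subject_id": "local_subject_id",
        "lcl_knumber": "knumber",
        "lcl_passage": "passage_number"
      }
    }
    """
)


def apply_field_mapping(record: dict, mapping: dict) -> dict:
    """Rename mapped fields and drop unmapped or empty ones."""
    data = {
        target: record[source]
        for source, target in mapping["fields"].items()
        if record.get(source) not in (None, "")
    }
    return {"table": mapping["table"], "data": data}
```

Fields that are not listed in the mapping, such as REDCap event metadata, are simply left out of the resulting fragment.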

Detailed documentation →

Fragment Validator

Purpose: Validate data quality before database loading

Technology: Python, S3, PostgreSQL

Key Features:

Schema validation
GSID resolution
Business rule validation
Duplicate detection

graph TB
    A[S3 Fragment] --> B[Load Fragment]
    B --> C[Schema Validation]
    C --> D{Valid?}
    D -->|No| E[Reject]
    D -->|Yes| F[Resolve GSID]
    F --> G{GSID Found?}
    G -->|No| H[Create Subject]
    G -->|Yes| I[Continue]
    H --> I
    I --> J[Business Rules]
    J --> K{Valid?}
    K -->|No| E
    K -->|Yes| L[Queue for Loading]

    E --> M[Log Error]
    L --> N[Validation Queue]

Validation Steps:

Schema Validation: Field types, required fields
GSID Resolution: Map local IDs to GSIDs
Business Rules: Domain-specific validation
Duplicate Detection: Check for existing records
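
The first three steps can be sketched as a single function. This is an illustration under assumptions: the schema shape (field name mapped to an expected type and a required flag), the `resolve_gsid` callable, and the rule signature are hypothetical, the schema is expected to mark `local_subject_id` as required, and duplicate detection is omitted.

```python
from __future__ import annotations

from typing import Callable, Optional


def check_schema(data: dict, schema: dict) -> list[str]:
    """Check required fields and field types; return a list of error messages."""
    errors = []
    for field, (expected_type, required) in schema.items():
        value = data.get(field)
        if value is None:
            if required:
                errors.append(f"missing required field: {field}")
        elif not isinstance(value, expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    return errors


def validate_fragment(
    data: dict,
    schema: dict,
    resolve_gsid: Callable[[str], str],
    rules: list[Callable[[dict], Optional[str]]],
) -> dict:
    """Run schema validation, GSID resolution, then business rules."""
    errors = check_schema(data, schema)
    if errors:
        return {"status": "rejected", "errors": errors}

    # Attach the resolved GSID so later steps work with the global identifier.
    enriched = {**data, "global_subject_id": resolve_gsid(data["local_subject_id"])}

    errors = [msg for rule in rules if (msg := rule(enriched)) is not None]
    if errors:
        return {"status": "rejected", "errors": errors}
    return {"status": "queued", "data": enriched}
```

Schema errors short-circuit the function, so the GSID Service is never called for a fragment that is already known to be malformed.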

Detailed documentation →

Table Loader

Purpose: Load validated data into the database using natural-key upserts

Technology: Python, PostgreSQL

Key Features:

Natural key-based upserts
Immutable field protection
Batch processing
Transaction management

graph TB
    A[Validation Queue] --> B[Read Batch]
    B --> C[Group by Table]
    C --> D[For Each Record]
    D --> E{Natural Key Exists?}
    E -->|No| F[INSERT]
    E -->|Yes| G{Immutable Changed?}
    G -->|Yes| H[Reject]
    G -->|No| I{Data Changed?}
    I -->|No| J[Skip]
    I -->|Yes| K[UPDATE]

    F --> L[Commit]
    K --> L
    H --> M[Log Error]
    J --> N[Log Skip]
    L --> O[Mark as Loaded]

Configuration:

config/table_configs.json: Natural keys, immutable fields
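
The per-record decision in the flowchart above can be sketched as a pure function. The function name and argument shapes are hypothetical; `existing` stands for the row already found under the record's natural key, or `None` if there is none.

```python
from __future__ import annotations


def decide_action(existing: dict | None, incoming: dict,
                  immutable_fields: set[str]) -> str:
    """Return INSERT, REJECT, SKIP, or UPDATE for one incoming record."""
    if existing is None:
        return "INSERT"  # natural key not present yet

    # Any attempt to change an immutable field rejects the whole record.
    for field in immutable_fields:
        if field in incoming and incoming[field] != existing.get(field):
            return "REJECT"

    # Only write when at least one incoming value differs from the stored row.
    changed = any(existing.get(key) != value for key, value in incoming.items())
    return "UPDATE" if changed else "SKIP"
```

Keeping the decision free of database calls makes it straightforward to unit test separately from transaction handling.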

Detailed documentation →

Nginx Proxy

Purpose: Reverse proxy, SSL termination, routing

Technology: Nginx

Key Features:

SSL/TLS termination
Request routing
Rate limiting
Static file serving

graph LR
    A[Client] -->|HTTPS| B[Nginx]
    B -->|/| C[NocoDB]
    B -->|/api/gsid| D[GSID Service]
    B -->|/api/data| E[Data API]

    style B fill:#4CAF50

Detailed documentation →

Data Flow Architecture

End-to-End Data Flow

sequenceDiagram
    participant SRC as Data Source
    participant EXT as Extractor
    participant S3 as S3 Staging
    participant VAL as Validator
    participant GSID as GSID Service
    participant QUEUE as Validation Queue
    participant LOAD as Loader
    participant DB as Database

    SRC->>EXT: 1. Extract data
    EXT->>EXT: 2. Transform & map
    EXT->>S3: 3. Upload fragment

    Note over S3: Fragment stored immutably

    S3->>VAL: 4. Process fragment
    VAL->>VAL: 5. Schema validation

    VAL->>GSID: 6. Resolve GSID
    GSID->>GSID: 7. Lookup/create
    GSID-->>VAL: 8. Return GSID

    VAL->>VAL: 9. Business rules
    VAL->>QUEUE: 10. Queue validated data

    Note over QUEUE: Awaiting batch load

    QUEUE->>LOAD: 11. Read batch
    LOAD->>LOAD: 12. Apply update strategy
    LOAD->>DB: 13. Upsert records
    LOAD->>QUEUE: 14. Mark as loaded

Detailed data flow →

Storage Architecture

S3 Structure

s3://idhub-curated-fragments/
├── redcap/
│   ├── gap/
│   │   ├── batch_20240115_100000/
│   │   │   ├── lcl/
│   │   │   │   ├── fragment_001.json
│   │   │   │   ├── fragment_002.json
│   │   │   │   └── ...
│   │   │   ├── genotype/
│   │   │   ├── sequence/
│   │   │   └── metadata.json
│   │   └── batch_20240116_100000/
│   └── uc_demarc/
├── labkey/
│   └── export_20240115/
└── manual/
    └── upload_20240115_143000/

Key Characteristics:

Organized by source and project
Batch-based organization
Immutable once written
Metadata files for tracking
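
As a sketch, object keys following the `redcap/` layout shown above could be built like this. The function names are hypothetical, and the `labkey/` and `manual/` prefixes use different batch naming that this example does not cover.

```python
from datetime import datetime


def batch_prefix(source: str, project: str, batch_time: datetime) -> str:
    """Build the batch-level prefix, e.g. redcap/gap/batch_20240115_100000."""
    return f"{source}/{project}/{batch_time.strftime('batch_%Y%m%d_%H%M%S')}"


def fragment_key(source: str, project: str, batch_time: datetime,
                 table: str, index: int) -> str:
    """Build the key of one fragment inside a batch."""
    prefix = batch_prefix(source, project, batch_time)
    return f"{prefix}/{table}/fragment_{index:03d}.json"


def metadata_key(source: str, project: str, batch_time: datetime) -> str:
    """Build the key of the batch metadata file."""
    return f"{batch_prefix(source, project, batch_time)}/metadata.json"
```

For example, `fragment_key("redcap", "gap", datetime(2024, 1, 15, 10, 0, 0), "lcl", 1)` returns `redcap/gap/batch_20240115_100000/lcl/fragment_001.json`, matching the first fragment in the tree.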

Database Schema

erDiagram
    subjects ||--o{ local_subject_ids : has
    subjects ||--o{ lcl : has
    subjects ||--o{ genotype : "has"
    subjects ||--o{ sequence : "has"
    subjects ||--o{ specimen : has

    subjects {
        uuid id PK
        string gsid UK
        string sex
        string diagnosis
        timestamp created_at
    }

    local_subject_ids {
        uuid id PK
        uuid subject_id FK
        int center_id
        string local_subject_id
        string identifier_type
        timestamp created_at
    }

    lcl {
        uuid id PK
        uuid subject_id FK
        string global_subject_id
        string niddk_no
        string knumber
        int passage_number
        string cell_line_status
        timestamp created_at
    }

    genotype {
        uuid id PK
        uuid subject_id FK
        string global_subject_id
        string genotype_id
        string genotyping_project
        string genotyping_barcode
        timestamp created_at
    }

    sequence {
        uuid id PK
        uuid subject_id FK
        string global_subject_id
        string sample_id
        string sample_type
        string vcf_sample_id
        timestamp created_at
    }

    specimen {
        uuid id PK
        string sample_id UK
        string sample_type
        string storage_location
        timestamp collection_date
        timestamp created_at
    }

Detailed schema documentation →

Security Architecture

Authentication & Authorization

graph TB
    A[Client Request] --> B{Has API Key?}
    B -->|No| C[401 Unauthorized]
    B -->|Yes| D{Valid Key?}
    D -->|No| C
    D -->|Yes| E{Has Permission?}
    E -->|No| F[403 Forbidden]
    E -->|Yes| G[Process Request]
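
The decision flow above maps onto a small function that returns an HTTP status code. This is a sketch only: the `API_KEYS` registry, the permission names, and the `authorize` signature are hypothetical, and a real service would load keys from environment-based secrets instead of hard-coding them.

```python
from __future__ import annotations

# Hypothetical key registry: API key -> set of permitted actions.
API_KEYS: dict[str, set[str]] = {"demo-key": {"gsid:resolve"}}


def authorize(api_key: str | None, permission: str,
              keys: dict[str, set[str]] = API_KEYS) -> int:
    """Return the HTTP status the flowchart prescribes for this request."""
    if not api_key or api_key not in keys:
        return 401  # missing or unknown key
    if permission not in keys[api_key]:
        return 403  # valid key, but not allowed to perform this action
    return 200  # request may be processed
```

Missing and invalid keys both produce 401, so a caller cannot tell from the response whether a guessed key exists.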

Security Layers

Network Security:

SSL/TLS encryption (Let's Encrypt)
Nginx reverse proxy
Firewall rules

Application Security:

API key authentication
Environment-based secrets
Input validation

Database Security:

Connection pooling
Prepared statements
Role-based access

Data Security:

Encrypted at rest (S3, RDS)
Encrypted in transit (HTTPS)
Audit logging

Detailed security documentation →

Deployment Architecture

Environment Structure

graph TB
    subgraph "Production"
        P_APP[Application Services]
        P_DB[(Production DB)]
        P_S3[(Production S3)]
    end

    subgraph "QA"
        Q_APP[Application Services]
        Q_DB[(QA DB)]
        Q_S3[(QA S3)]
    end

    subgraph "Development"
        D_APP[Application Services]
        D_DB[(Local DB)]
        D_S3[(Local S3/MinIO)]
    end

    GH[GitHub Actions] -.->|Deploy| P_APP
    GH -.->|Deploy| Q_APP

    DEV[Developers] -->|Test| D_APP
    DEV -->|PR| GH

Environments:

Environment	Purpose	Database	S3 Bucket
Development	Local development	Local PostgreSQL	Local MinIO
QA	Testing & validation	QA RDS	`idhub-curated-fragments-qa`
Production	Live system	Production RDS	`idhub-curated-fragments`

Deployment Process

graph LR
    A[Code Push] --> B[GitHub Actions]
    B --> C{Branch?}
    C -->|main| D[Deploy to QA]
    C -->|release| E[Deploy to Prod]
    D --> F[Run Tests]
    F --> G{Tests Pass?}
    G -->|Yes| H[Deploy Services]
    G -->|No| I[Rollback]
    E --> J[Manual Approval]
    J --> H

Detailed deployment documentation →

Scalability Considerations

Current Scale

Subjects: ~50,000
LCL Lines: ~30,000
Genotypes: ~40,000
Sequences: ~20,000
Daily Ingestion: ~1,000 records

Scaling Strategies

Horizontal Scaling:

Multiple validator instances
Multiple loader instances
Load balancing via Nginx

Vertical Scaling:

Database connection pooling
Batch processing optimization
Query optimization

Data Partitioning:

S3 partitioning by date/source
Database table partitioning (future)
Archive old validation queue records
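
Archiving old queue records could look like the sketch below. Everything about the schema is assumed: the table names `validation_queue` and `validation_queue_archive`, the `status` and `created_at` columns, and the `loaded` status value are hypothetical, and SQLite stands in for PostgreSQL to keep the example self-contained.

```python
import sqlite3


def make_queue_db() -> sqlite3.Connection:
    """Create a hypothetical queue table and a matching archive table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE validation_queue (
            id INTEGER PRIMARY KEY, status TEXT, created_at TEXT
        );
        CREATE TABLE validation_queue_archive (
            id INTEGER PRIMARY KEY, status TEXT, created_at TEXT
        );
        """
    )
    return conn


def archive_loaded(conn: sqlite3.Connection, cutoff_iso: str) -> int:
    """Move loaded rows created before the cutoff into the archive table."""
    condition = "status = 'loaded' AND created_at < ?"
    with conn:  # copy and delete in one transaction
        conn.execute(
            "INSERT INTO validation_queue_archive "
            f"SELECT * FROM validation_queue WHERE {condition}",
            (cutoff_iso,),
        )
        cursor = conn.execute(
            f"DELETE FROM validation_queue WHERE {condition}", (cutoff_iso,)
        )
    return cursor.rowcount
```

Timestamps are stored as ISO 8601 strings here, so plain string comparison orders them correctly. Rows that have not been loaded yet stay in the queue regardless of age.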

Monitoring & Observability

Metrics

graph TB
    A[Application Metrics] --> D[Monitoring Dashboard]
    B[Database Metrics] --> D
    C[Infrastructure Metrics] --> D

    A --> A1[Request Rate]
    A --> A2[Error Rate]
    A --> A3[Processing Time]

    B --> B1[Query Performance]
    B --> B2[Connection Pool]
    B --> B3[Table Sizes]

    C --> C1[CPU Usage]
    C --> C2[Memory Usage]
    C --> C3[Disk I/O]

Key Metrics

Pipeline success/failure rates
GSID resolution performance
Database load times
Validation queue depth
API response times

Technology Stack

Languages & Frameworks

Component	Technology	Version
GSID Service	Python, FastAPI	3.11, 0.104+
REDCap Pipeline	Python	3.11
Fragment Validator	Python	3.11
Table Loader	Python	3.11
Database	PostgreSQL	15+
Web UI	NocoDB	Latest
Proxy	Nginx	1.24+

Key Libraries

Database: asyncpg, psycopg2
API: fastapi, uvicorn, pydantic
AWS: boto3
Testing: pytest, pytest-asyncio
Validation: jsonschema, pydantic
ETL: pandas, openpyxl
