Commit 8f06af7f authored by Dimitrios Amaxilatis

add wip ideas

parent c856bf1b
+199 −0
# DataLab-OpenOP — Federated DataOps for ETSI OpenOP

**Status:** Concept / Design  
**Date:** 2026-05-26  
**Origin:** 6G-DALI project (SNS-JU)  
**Scope:** A reusable OpenOP capability for federated DataOps across operator nodes

---

## 1. Vision

Each telecom operator running an ETSI OpenOP instance hosts a **local mini data catalogue and data lake**. Datasets generated by that operator's 5G/6G testbeds are registered locally. A central **"datalab-openop"** catalogue federates all operator catalogues, giving users a unified view of all available data assets across the network.

Each operator node is fully self-contained: users discover datasets and services, compose pipelines, and monitor execution entirely through **the node's own DataOps UI**. The node's local catalogue registers everything produced on that node.

The central **"datalab-openop"** piveau-hub is a pure federation layer — machine-to-machine only. It harvests metadata from all operator nodes and exposes it to external systems (other data spaces, GAIA-X, SLICES-RI, auditors, consortium reporting). It is not user-facing for DataOps workflows. There is no central orchestrator. Raw data never leaves the operator's domain.

---

## 2. Architecture Overview

```
         ◄─── User interacts here only ───►

  Operator A (OpenOP node)                  Operator B (OpenOP node)
  ┌───────────────────────────────────┐     ┌───────────────────────────────────┐
  │  ┌─ Local DataOps ──────────────┐ │     │  ┌─ Local DataOps ──────────────┐ │
  │  │  DataOps UI  ◄── user        │ │     │  │  DataOps UI  ◄── user        │ │
  │  │  DataOps Orchestrator        │ │     │  │  DataOps Orchestrator        │ │
  │  │  Apache Airflow              │ │     │  │  Apache Airflow              │ │
  │  │  Task library                │ │     │  │  Task library                │ │
  │  └──────────────────────────────┘ │     │  └──────────────────────────────┘ │
  │  ┌─ Local Data Space ───────────┐ │     │  ┌─ Local Data Space ───────────┐ │
  │  │  piveau-hub                  │ │     │  │  piveau-hub                  │ │
  │  │  (datalab-operator-a)        │ │     │  │  (datalab-operator-b)        │ │
  │  └──────────────────────────────┘ │     │  └──────────────────────────────┘ │
  │  ┌─ Local Data Lake ────────────┐ │     │  ┌─ Local Data Lake ────────────┐ │
  │  │  MinIO / S3                  │ │     │  │  MinIO / S3                  │ │
  │  └──────────────────────────────┘ │     │  └──────────────────────────────┘ │
  │  ┌─ Local Data Connector ───────┐ │     │  ┌─ Local Data Connector ───────┐ │
  │  │  EDC (DSP)                   │ │     │  │  EDC (DSP)                   │ │
  │  │  policy + contract + transfer│ │     │  │  policy + contract + transfer│ │
  │  └──────────────┬───────────────┘ │     │  └──────────────┬───────────────┘ │
  └─────────────────┼─────────────────┘     └─────────────────┼─────────────────┘
                    │   EDC-to-EDC (DSP)                       │
                    │   contract negotiation + data transfer    │
                    └──────────────────┬───────────────────────┘
                                       │ (cross-node only)
                    DCAT-AP harvest    │
  ┌────────────────────────────────────▼──────────────────────────────────────┐
  │   Central "datalab-openop" piveau-hub                                      │
  │   Machine-to-machine only:                                                 │
  │   · GAIA-X federation  · SLICES-RI  · Consortium reporting                │
  │   · External data space queries  · Audit / compliance                     │
  └────────────────────────────────────────────────────────────────────────────┘

         No user DataOps workflow touches the central catalogue.
```

### Core Principles

> **Each node is fully autonomous. The central catalogue is federation infrastructure, not a user surface.**

- **Users** interact exclusively with their operator node's DataOps UI — for discovery, pipeline composition, execution, and monitoring
- **Datasets** stay in the operator's local lake at all times
- **Dataset metadata** (DCAT-AP RDF) travels from local catalogues to the central catalogue via harvesting
- **Data service metadata** (`dcat:DataService`) is published by each node and harvested centrally — the central catalogue records what services exist, but plays no role in running them
- **Derived dataset metadata** is registered in the local catalogue after pipeline execution, then harvested centrally
- The central catalogue is **machine-to-machine only**: GAIA-X federation, SLICES-RI cross-registration, external data space interoperability, consortium reporting

---

## 3. Components per OpenOP Node

Each operator node is made up of four logical blocks:

### Local Data Space
The metadata catalogue for the node. Registers and exposes all datasets, derived datasets, and data services produced on this node. Feeds the central federation layer via DCAT-AP harvesting.

| Component | Role |
|---|---|
| **piveau-hub** (node instance) | DCAT-AP catalogue; stores dataset and service metadata; exposes REST + SPARQL; publishes to central via harvest |

### Local Data Lake
The storage layer. All raw and derived data lives here. It is never copied to the central layer, and any cross-node access goes through the Local Data Connector.

| Component | Role |
|---|---|
| **Object store** (MinIO / S3-compatible) | Stores raw testbed datasets and derived pipeline outputs |

### Local DataOps
The execution and user-facing layer. Users discover assets, compose pipelines, trigger execution, and monitor jobs entirely here.

| Component | Role |
|---|---|
| **DataOps Orchestrator** (FastAPI) | REST API; bridges the UI and Airflow; manages datasets, services, DAG creation, and pipeline triggers |
| **DataOps UI** (React) | User interface for dataset discovery, service browsing, pipeline composition, and job monitoring |
| **Apache Airflow** | Executes DataOps pipelines locally against the local data lake |
| **Local task library** | DataOps service implementations available as Airflow `PythonOperator` tasks; authored locally or adopted from the consortium service library |
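
As a rough illustration of what a task-library entry might look like, the sketch below implements a hypothetical `quality_check` service as a plain Python callable. The function name, its parameters, and the report fields are assumptions made for this example; none of them come from the actual task library. A callable of this shape could be handed to Airflow's `PythonOperator` through its `python_callable` argument.

```python
import csv
import io
from typing import Dict, List


def quality_check(csv_text: str, required_columns: List[str]) -> Dict[str, object]:
    """Hypothetical service: count rows with an empty value in a required column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    missing_columns = [c for c in required_columns if c not in header]

    total_rows = 0
    incomplete_rows = 0
    for row in reader:
        total_rows += 1
        # A column absent from the row yields None, treated like an empty value.
        if any(not (row.get(c) or "").strip() for c in required_columns):
            incomplete_rows += 1

    return {
        "total_rows": total_rows,
        "incomplete_rows": incomplete_rows,
        "missing_columns": missing_columns,
        "passed": not missing_columns and incomplete_rows == 0,
    }


# Illustrative input only; the column names are made up for this example.
sample = "timestamp,rsrp\n1,-90\n2,\n"
report = quality_check(sample, ["timestamp", "rsrp"])
print(report)
```

Keeping the service logic free of Airflow imports means the same function can be unit-tested on its own and wrapped in an operator only at DAG-definition time.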

### Local Data Connector
The policy enforcement and data transfer layer. Sits in front of the local data lake and governs all data access — both inbound (other nodes requesting data from this node) and outbound (this node's pipelines accessing data from another node in a cross-operator scenario).

| Component | Role |
|---|---|
| **Eclipse Dataspace Components (EDC) connector** | Implements the IDSA Dataspace Protocol (DSP); negotiates data contracts based on ODRL policies; proxies data transfers so raw lake credentials are never shared; produces a transfer audit trail |

#### Why the connector matters

Without a data connector, the data lake is just a raw S3 bucket, and the GAIA-X / ODRL policies in the metadata are descriptive only: nothing enforces them at transfer time. The EDC is what turns those policies into enforced access rules:

- `dcat:accessURL` in each dataset distribution points to the **connector endpoint**, not the raw lake URL
- Any cross-node data access goes through an EDC-to-EDC contract negotiation before a byte is transferred
- Usage policies (`gax:policy`, ODRL) attached to datasets in piveau-hub are evaluated by the connector at negotiation time
- Every transfer is logged and can be registered back in piveau-hub as a `prov:Activity` record for provenance and audit
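
The first point above can be sketched as a JSON-LD `dcat:Distribution` whose `dcat:accessURL` targets the connector rather than the lake. This is a minimal illustration, not the project's metadata profile: the connector URL, its path layout, and the helper names `build_distribution` and `exposes_raw_lake` are all assumptions.

```python
import json
from typing import Dict

DCAT_CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}

# Hypothetical connector endpoint for Operator A (assumption, not from the design).
CONNECTOR_BASE = "https://edc.operator-a.example/assets/"


def build_distribution(dataset_id: str, connector_base: str) -> Dict[str, object]:
    """Build a distribution record that points at the connector, never the lake."""
    return {
        "@context": DCAT_CONTEXT,
        "@type": "dcat:Distribution",
        "dct:identifier": dataset_id,
        "dcat:accessURL": {"@id": f"{connector_base.rstrip('/')}/{dataset_id}"},
    }


def exposes_raw_lake(distribution: Dict[str, object]) -> bool:
    """Return True if the access URL leaks a raw object-store location."""
    url = distribution["dcat:accessURL"]["@id"]
    return url.startswith("s3://")


distribution = build_distribution("D", CONNECTOR_BASE)
print(json.dumps(distribution, indent=2))
```

A check such as `exposes_raw_lake` could run before a record is registered, so that a distribution pointing straight at an `s3://` location is rejected early.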

---

## 4. Central "datalab-openop" Components

| Component | Role |
|---|---|
| **Central piveau-hub** | Harvests DCAT-AP metadata (datasets + services) from all operator nodes; exposes it to external systems and data spaces |
| **Node index** | Registry of all participating OpenOP nodes and their catalogue URLs; used by the harvester to know where to pull from |
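
To make the relationship between the node index and the harvester concrete, here is a minimal sketch of one harvest cycle. It does not reproduce piveau's own harvesting machinery: the `Node` type, the `harvest` function, the record shape, and the injected `fetch` callable are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Node:
    """One entry in the node index (hypothetical shape)."""
    node_id: str
    catalogue_url: str


def harvest(nodes: Iterable[Node],
            fetch: Callable[[str], List[dict]]) -> Dict[str, dict]:
    """Pull metadata records from every node and merge them into one index."""
    central: Dict[str, dict] = {}
    for node in nodes:
        try:
            records = fetch(node.catalogue_url)
        except OSError:
            # An unreachable node is skipped; the other nodes are still harvested.
            continue
        for record in records:
            # Prefix with the node id so identical local ids cannot collide.
            key = f"{node.node_id}:{record['id']}"
            central[key] = {**record, "source_node": node.node_id}
    return central


def fake_fetch(url: str) -> List[dict]:
    """Stand-in for a real catalogue request; URLs are illustrative only."""
    if "operator-b" in url:
        raise OSError("node unreachable")
    return [{"id": "D", "type": "dcat:Dataset"}]


node_index = [
    Node("operator-a", "https://catalogue.operator-a.example"),
    Node("operator-b", "https://catalogue.operator-b.example"),
]
central_index = harvest(node_index, fake_fetch)
print(sorted(central_index))
```

Only metadata records pass through this loop; the data itself stays in each node's lake.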

The central layer has **no user-facing DataOps role** and **no orchestrator**. It exists for federation, interoperability, and reporting only.

### Who uses the central catalogue

| Consumer | How |
|---|---|
| GAIA-X Federated Catalogue | Harvests self-descriptions for cross-data-space interoperability |
| SLICES-RI MRS | Cross-registration of datasets via the existing AF-MRS integration |
| Consortium / SNS-JU reporting | Queries the federated catalogue for project-level statistics |
| External researchers / auditors | Browse or query what datasets and services exist across the federation |
| Other EU data spaces | DCAT-AP harvest or SPARQL federation |

---

## 5. Data Flow

### 5.1 Dataset Registration (Operator → Central)

```
Testbed generates data
        │
        ▼
Uploaded to local data lake
        │
        ▼
Metadata registered in local piveau-hub
(DCAT-AP dataset record with CMT fields, testbed context,
 GAIA-X compliance, provenance, etc.)
        │
        ▼  (periodic DCAT-AP harvest)
Central piveau-hub indexes the dataset
→ Dataset is now visible to external systems and other data spaces
```

### 5.2 Pipeline Execution (entirely within the node)

```
User opens Node A's DataOps UI
        │
        ├─ browses local datasets (from local piveau-hub / Airflow dataset registry)
        ├─ browses available services (from local task library)
        │
        ▼
User composes pipeline:
  - selects dataset D
  - selects one or more services (e.g. quality check → feature extraction)
  - submits
        │
        ▼
Node A's Orchestrator triggers local Airflow DAG
  conf: {
    "dataset_id": "D",
    "dataset_local_url": "s3://lake-a/datasets/D/data.csv",
    "output_dataset_id": "derived-D-xyz",
    "pipeline_params": { ... }
  }
        │
        ▼
Node A's Airflow executes pipeline locally
  - Reads from local lake
  - Runs service tasks against local data
  - Writes derived dataset to local lake
  - Registers derived dataset in local piveau-hub
    (prov:wasDerivedFrom links back to input dataset D)
        │
        ▼  (next harvest cycle — background, transparent to user)
Central piveau-hub harvests the derived dataset record
→ visible to external systems and other data spaces
```
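
The trigger and registration steps in the flow above could look roughly like the following sketch. It assumes the Orchestrator talks to Airflow through the Airflow 2.x stable REST API (`POST /api/v1/dags/{dag_id}/dagRuns`). The Airflow base URL, the DAG id `dataops_pipeline`, and both helper functions are hypothetical. Authentication is omitted and the request is only built, not sent.

```python
import json
import urllib.request
from typing import Dict


def build_dag_run_request(airflow_base: str, dag_id: str,
                          conf: Dict[str, object]) -> urllib.request.Request:
    """Build (but do not send) the request that triggers a DAG run with conf."""
    url = f"{airflow_base.rstrip('/')}/api/v1/dags/{dag_id}/dagRuns"
    body = json.dumps({"conf": conf}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def derived_dataset_record(conf: Dict[str, object]) -> Dict[str, object]:
    """Build a minimal JSON-LD record linking the output back to its input."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
            "prov": "http://www.w3.org/ns/prov#",
        },
        "@type": "dcat:Dataset",
        "dct:identifier": conf["output_dataset_id"],
        # Bare ids are used for brevity; a real record would use full IRIs.
        "prov:wasDerivedFrom": {"@id": conf["dataset_id"]},
    }


conf = {
    "dataset_id": "D",
    "dataset_local_url": "s3://lake-a/datasets/D/data.csv",
    "output_dataset_id": "derived-D-xyz",
    "pipeline_params": {},  # left empty here; the design does not specify them
}

# Hypothetical Airflow URL and DAG id, used only for illustration.
request = build_dag_run_request("http://airflow.node-a.example:8080",
                                "dataops_pipeline", conf)
record = derived_dataset_record(conf)
print(request.get_method(), request.full_url)
print(json.dumps(record, indent=2))
```

Deriving the catalogue record from the same `conf` that triggered the run keeps the `prov:wasDerivedFrom` link consistent with what was actually executed.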

### 5.3 Status Tracking

The user monitors job status in **Node A's DataOps UI**. The central catalogue is not involved in execution or monitoring.
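
One way the node-local status view could be backed is a simple poll of the DAG run state, sketched below. The state names follow Airflow 2.x DAG runs (`queued`, `running`, `success`, `failed`). The function `wait_for_dag_run`, its parameters, and the injected `get_state` callable are assumptions for illustration, not the Orchestrator's actual interface.

```python
from typing import Callable

TERMINAL_STATES = {"success", "failed"}


def wait_for_dag_run(get_state: Callable[[], str],
                     max_polls: int,
                     sleep: Callable[[float], None],
                     poll_seconds: float = 5.0) -> str:
    """Poll until the run reaches a terminal state or the poll budget runs out."""
    state = "queued"
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    # Budget exhausted: report the last observed, non-terminal state.
    return state


# Simulated sequence of states; a real get_state would query the local Airflow.
states = iter(["queued", "running", "success"])
final_state = wait_for_dag_run(lambda: next(states), max_polls=10,
                               sleep=lambda seconds: None)
print(final_state)
```

Injecting `get_state` and `sleep` keeps the loop testable without a running Airflow instance, and every call stays inside the node, consistent with the central catalogue playing no part in monitoring.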