Commit 1c03dcb7 authored by JaeSeung Song

Merge branch 'etsi-dq-lib' into 'main'

Updated quality metrics validation tool

See merge request !1
parents 56cf682c 6461358f
+4 −31

# CSV test data (can be generated by create_sample_data.py)
*.csv

# Great Expectations uncommitted (local execution results)
uncommitted/
gx/uncommitted/

# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
*.egg-info/
dist/
build/

# Jupyter Notebook
.ipynb_checkpoints

# Environment
*.pyc
.env
.venv
venv/

# IDE
.vscode/
.idea/
*.swp
*.swo

# macOS
*.xlsx~
~$*
.DS_Store
data/
+16 −233
# ETSI Data Quality Assessment Project
# ETSI Data Quality Assessment System

Implementation of data quality metrics based on the ETSI TR 104 180 standard, using the Great Expectations framework.
Web-based data quality validation tool based on the ETSI data quality framework.

## Overview
## Structure

This project implements three data quality metrics defined in ETSI TR 104 180 (completeness, accuracy, and consistency), pairing Great Expectations validation with direct calculation of the ETSI formulas.

## Implemented Metrics

### 1. Completeness - ETSI 5.1
- **Formula**: `C_D = (Non-null cells / Total cells) × 100`
- Measures the proportion of non-missing values in the dataset

### 2. Accuracy - ETSI 5.4
- **Formula**: `Accuracy = (Correct values / Total values) × 100`
- Validates data against specified ranges, formats, or reference values

### 3. Consistency - ETSI 5.6
- **Formula**: `Consistency = (Consistent values / Total values) × 100`
- Checks adherence to predefined value sets or business rules
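
The three formulas above can be sketched in plain Python. This is a minimal illustration rather than the project's code: the names `completeness` and `ratio_score`, the sample table, and the 0 to 100 range are all hypothetical, and `None` stands in for a missing cell. Accuracy and consistency share one helper because their formulas have the same shape (valid values over total values).

```python
from typing import Callable, Iterable, Optional, Sequence


def completeness(rows: Sequence[Sequence[Optional[object]]]) -> float:
    """C_D = (non-null cells / total cells) x 100 for a table given as rows."""
    cells = [cell for row in rows for cell in row]
    if not cells:
        return 0.0  # assumption: an empty table scores 0 instead of raising
    non_null = sum(cell is not None for cell in cells)
    return 100.0 * non_null / len(cells)


def ratio_score(values: Iterable[Optional[object]],
                is_valid: Callable[[object], bool]) -> float:
    """(valid values / total values) x 100; nulls stay in the denominator."""
    items = list(values)
    if not items:
        return 0.0
    valid = sum(v is not None and is_valid(v) for v in items)
    return 100.0 * valid / len(items)


# Hypothetical 4 x 5 table with two missing cells
table = [
    ["t1", "P1", 20.5, 1.2, 101.0],
    ["t2", "P1", None, 1.3, 100.5],
    ["t3", "P2", 21.0, None, 99.8],
    ["t4", "P2", 20.8, 1.1, 100.2],
]
completeness_score = completeness(table)

# Accuracy: each value must fall inside a specified range (hypothetical 0-100)
accuracy_score = ratio_score([12.0, 250.0, None, 47.5], lambda v: 0 <= v <= 100)

# Consistency: each value must belong to a predefined value set
allowed = {"Electronics", "Furniture", "Clothing"}
consistency_score = ratio_score(
    ["Electronics", "Toys", "Clothing", "Furniture"], lambda v: v in allowed
)

print(completeness_score)  # 90.0
print(accuracy_score)      # 50.0
print(consistency_score)   # 75.0
```

Multiplying by 100 before dividing keeps these small integer ratios exact in floating point.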

## Key Features

- **Dual Validation Approach**: Combines Great Expectations validation with ETSI formula calculations
- **Comprehensive Reporting**: Detailed output showing both GX validation results and ETSI metric scores
- **Null-Inclusive Calculation**: Follows the ETSI standard by including null values in the total count
- **Tolerance Support**: Accuracy checks support configurable tolerance levels for numerical comparisons

## Installation

### Prerequisites
- Python 3.9 or higher
- Conda or virtualenv

### Setup
```bash
# Create virtual environment
conda create -n ge_test python=3.9
conda activate ge_test

# Install required packages
pip install great-expectations pandas
```

## Usage

### 1. Generate Test Data
```bash
python create_sample_data.py
```

This creates three CSV files for testing:
- `pump_data.csv` - Completeness test (with missing values)
- `products.csv` - Accuracy test (weight comparison)
- `product_categories.csv` - Consistency test (category validation)

### 2. Run Data Quality Assessment
```bash
python data_quality_assessment.py
```

## Project Structure
```
gx/
├── data_quality_assessment.py  # Main assessment script
├── create_sample_data.py       # Test data generator
├── init_gx_project.py          # GX project initialization
├── gx/                         # Great Expectations configuration
│   ├── great_expectations.yml
│   ├── checkpoints/
│   ├── expectations/
│   └── plugins/
├── README.md
└── .gitignore
```

## Example Output
```
Test 1: Completeness (Pump Performance Data)
============================================================
Dataset: pump_data.csv
Size: 4 rows × 5 columns

=== Completeness ===
[GX Validation]
  Timestamp           : Pass
  PumpID              : Pass
  Temperature_C       : Fail
  Vibration_mm_s      : Fail
  Pressure_kPa        : Pass

[ETSI Calculation]
  Total cells: 20 (4 rows × 5 cols)
  Non-null values: 18
  Missing values: 2
  Completeness score: 90.00%
  Formula: C_D = (Non-null / Total) × 100
============================================================

Test 2: Accuracy (Product Weight Data)
============================================================
=== Accuracy: ListedWeight_kg vs TrueWeight_kg ===

[ETSI Calculation]
  Total products: 4
  Accurate weight: 3
  Inaccurate weight: 1
  Tolerance: ±0.1 kg
  Accuracy score: 75.00%

  [Inaccurate Products Details]
    PROD-B: Listed=10.2kg, True=10.5kg, Diff=0.30kg
============================================================

Test 3: Consistency (Product Category Data)
============================================================
=== Consistency: Category ===

[GX Validation]
  Result: Fail
  Allowed values: ['Electronics', 'Furniture', 'Clothing']

[ETSI Calculation]
  Total values: 4
  Consistent values: 3
  Inconsistent values: 1
  Consistency score: 75.00%
============================================================
```
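
The accuracy figures in the example output above can be reproduced with a tolerance comparison along these lines. This is a sketch under stated assumptions: `weight_accuracy` is a hypothetical name, only the PROD-B weights come from the output shown, and the other three products are invented so that the totals match (4 products, 1 inaccurate). It also assumes no weight is missing.

```python
from typing import Dict, List, Tuple


def weight_accuracy(
    products: Dict[str, Tuple[float, float]], tolerance: float
) -> Tuple[float, List[Tuple[str, float]]]:
    """Return (score, inaccurate), where inaccurate lists (id, abs difference)."""
    inaccurate = []
    for product_id, (listed, true) in products.items():
        diff = abs(listed - true)
        if diff > tolerance:  # outside the +/- tolerance band
            inaccurate.append((product_id, round(diff, 2)))
    total = len(products)
    score = 100.0 * (total - len(inaccurate)) / total if total else 0.0
    return score, inaccurate


products = {
    "PROD-A": (5.0, 5.0),    # hypothetical values
    "PROD-B": (10.2, 10.5),  # listed/true weights from the example output
    "PROD-C": (2.5, 2.45),   # hypothetical values
    "PROD-D": (7.8, 7.8),    # hypothetical values
}
score, inaccurate = weight_accuracy(products, tolerance=0.1)
print(score)       # 75.0
print(inaccurate)  # [('PROD-B', 0.3)]
```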

## Implementation Details

### GX vs ETSI Calculation

| Aspect | GX Approach | ETSI Approach | Implementation |
|:-------|:-----------|:-------------|:--------------|
| Denominator | Non-null values only | All values (including nulls) | ETSI formula applied |
| Null Handling | Excluded from calculation | Counted as missing/incorrect | Nulls included |
| Result Format | Pass/Fail + count | Percentage score | Both provided |
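
The denominator difference in the table can be made concrete with a small comparison. The sketch below does not call Great Expectations; it only contrasts the two counting rules the table describes. The function names and the sample column are hypothetical.

```python
from typing import Collection, List, Optional


def score_excluding_nulls(values: List[Optional[str]],
                          allowed: Collection[str]) -> float:
    """Percentage of valid values among non-null entries only."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    return 100.0 * sum(v in allowed for v in non_null) / len(non_null)


def score_including_nulls(values: List[Optional[str]],
                          allowed: Collection[str]) -> float:
    """Percentage of valid values among all entries; a null counts as invalid."""
    if not values:
        return 0.0
    return 100.0 * sum(v in allowed for v in values) / len(values)


allowed = {"Electronics", "Furniture", "Clothing"}
categories = ["Electronics", "Furniture", None, "Toys"]  # hypothetical column

print(round(score_excluding_nulls(categories, allowed), 2))  # 66.67
print(round(score_including_nulls(categories, allowed), 2))  # 50.0
```

The same column scores lower under the null-inclusive rule because the missing value stays in the denominator.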

### Code Architecture
```python
# Each metric function follows this pattern:

def etsi_metric(validator, df, ...):
    # 1. GX validation for Pass/Fail
    result = validator.expect_column_values_to_...()
    
    # 2. ETSI formula calculation
    total_values = len(df[column])  # Include nulls
    correct_values = ...
    etsi_score = (correct_values / total_values) * 100
    
    # 3. Return both results
    return {
        "etsi_score": round(etsi_score, 2),
        "gx_validation": result.success
    }
```

## Technical References

- **ETSI TR 104 180**: Data Quality Metrics and Scoring Methods
  - Section 5.1: Completeness
  - Section 5.4: Accuracy
  - Section 5.6: Consistency
- **Great Expectations**: [Official Documentation](https://docs.greatexpectations.io/)

## Configuration

### Customize Metrics

Edit `data_quality_assessment.py` to modify assessment parameters:
```python
# Completeness
pump_config = {
    'required_columns': ['Timestamp', 'PumpID', 'Temperature_C']
}

# Accuracy
products_config = {
    'accuracy_weight_comparison_check': {
        'listed_col': 'ListedWeight_kg',
        'true_col': 'TrueWeight_kg',
        'tolerance': 0.1  # Adjust tolerance
    }
}

# Consistency
category_config = {
    'consistency_checks': {
        'Category': ['Electronics', 'Furniture', 'Clothing']
    }
}
```
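
As a rough illustration of how a dictionary shaped like `category_config` might drive a check, the sketch below loops over the `consistency_checks` entries and scores each listed column. How `data_quality_assessment.py` actually consumes these dictionaries is not shown here, so `run_consistency_checks`, the `ProductID` column, and the sample records are assumptions.

```python
from typing import Dict, List, Optional


def run_consistency_checks(
    records: List[Dict[str, Optional[str]]],
    config: Dict[str, Dict[str, List[str]]],
) -> Dict[str, float]:
    """Return {column: score} for each column under 'consistency_checks'."""
    scores = {}
    for column, allowed in config.get("consistency_checks", {}).items():
        values = [record.get(column) for record in records]
        consistent = sum(v in allowed for v in values)  # a missing value counts as inconsistent
        scores[column] = round(100.0 * consistent / len(values), 2) if values else 0.0
    return scores


category_config = {
    "consistency_checks": {
        "Category": ["Electronics", "Furniture", "Clothing"]
    }
}

# Hypothetical records; "Toys" is outside the allowed set
records = [
    {"ProductID": "P1", "Category": "Electronics"},
    {"ProductID": "P2", "Category": "Furniture"},
    {"ProductID": "P3", "Category": "Toys"},
    {"ProductID": "P4", "Category": "Clothing"},
]
print(run_consistency_checks(records, category_config))  # {'Category': 75.0}
```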

## Testing

The project includes automated test data generation:
```bash
# Generate fresh test data
python create_sample_data.py

# Run assessments
python data_quality_assessment.py
```

## Data Quality Metrics Summary

| Metric | ETSI Formula | GX Support | Custom Implementation |
|:-------|:------------|:-----------|:---------------------|
| Completeness | `C = Non-null / Total × 100` | Partial | Full |
| Accuracy | `Acc = Correct / Total × 100` | Partial | Full |
| Consistency | `CS = Consistent / Total × 100` | Partial | Full |

**Legend**:
- Full: Complete ETSI formula implementation
- Partial: GX provides validation, custom code adds ETSI calculation

## Contributing

Contributions are welcome. Please submit a Pull Request with a detailed description of your changes.

## Authors

Jayeon Pyo, JaeSeung Song

## Acknowledgments

- ETSI for defining comprehensive data quality standards
- Great Expectations team for the data validation framework

---

**Note**: CSV files are excluded from version control. Generate test data using `create_sample_data.py` before running assessments.
- `etsi_dq/` — Core analysis library (6 metrics)
- `api/` — FastAPI backend (14 REST API endpoints)
- `dashboard.py` — Streamlit frontend

## Run

- Backend: `uvicorn api.main:app --reload --port 8000`
- Frontend: `streamlit run dashboard.py`

## Tech Stack

- Frontend: Streamlit
- Backend: FastAPI + Uvicorn
- Analysis: Python, pandas, numpy
- API Docs: http://localhost:8000/docs
+0 −0

File moved.

api/main.py

0 → 100644
+86 −0
"""
ETSI DQ - FastAPI Backend
=========================

Project structure:

etsi_dq/                  ← existing code (kept as-is)
├── metrics/
│   ├── completeness.py
│   ├── accuracy.py
│   ├── consistency.py
│   └── ...
├── pipeline.py
├── schemas.py
├── profiler.py
├── io.py
└── ...

api/                      ← newly added backend layer
├── __init__.py
├── main.py               ← FastAPI app entry point (this file)
├── routers/
│   ├── __init__.py
│   ├── datasets.py       ← dataset upload/retrieval API
│   ├── metrics.py        ← metric configuration/pre-check API
│   ├── analysis.py       ← analysis execution API
│   └── results.py        ← result retrieval API
├── schemas/
│   ├── __init__.py
│   ├── datasets.py       ← dataset request/response models
│   ├── metrics.py        ← metric configuration models
│   ├── analysis.py       ← analysis request/response models
│   └── results.py        ← result response models
└── services/
    ├── __init__.py
    ├── dataset_service.py  ← service wrapping the existing io.py functions
    ├── metric_service.py   ← service wrapping the existing precheck/configuration logic
    └── analysis_service.py ← service wrapping run_analysis from the existing pipeline.py

How to run:
    uvicorn api.main:app --reload --port 8000

Auto-generated API docs:
    http://localhost:8000/docs  (Swagger UI)
    http://localhost:8000/redoc (ReDoc)
"""

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

from api.routers import datasets, metrics, analysis, results


app = FastAPI(
    title="ETSI Data Quality API",
    description="Backend API for the ETSI data quality assessment system",
    version="0.1.0",
)

# CORS settings: allow access from Streamlit (8501) or React (3000)
app.add_middleware(
    CORSMiddleware,
    allow_origins=[
        "http://localhost:8501",   # Streamlit default port
        "http://localhost:3000",   # React dev server port (for later)
    ],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Register routers
app.include_router(datasets.router, prefix="/api/datasets", tags=["datasets"])
app.include_router(metrics.router,  prefix="/api/metrics",  tags=["metrics"])
app.include_router(analysis.router, prefix="/api/analysis", tags=["analysis"])
app.include_router(results.router,  prefix="/api/results",  tags=["results"])


@app.get("/")
def root():
    return {"message": "ETSI Data Quality API", "version": "0.1.0"}


@app.get("/health")
def health_check():
    return {"status": "ok"}
+0 −0

Empty file added.
