Commit 6461358f authored by o

Clean up branch: keep only etsi_dq and api files

parent c775bcf2
+4 −31

# CSV test data (can be generated by create_sample_data.py)
*.csv

# Great Expectations uncommitted (local execution results)
uncommitted/
gx/uncommitted/

# Python
__pycache__/
__pycache__/
*.py[cod]
*.pyc
*$py.class
*.so
.Python
*.egg-info/
dist/
build/

# Jupyter Notebook
.ipynb_checkpoints

# Environment
.env
.env
.venv
*.xlsx~
venv/
~$*

# IDE
.vscode/
.idea/
*.swp
*.swo

# macOS
.DS_Store
.DS_Store
data/
+16 −233
# ETSI Data Quality Assessment Project
# ETSI Data Quality Assessment System


Implementation of data quality metrics based on the ETSI TR 104 180 standard, using the Great Expectations framework.
Web-based data quality validation tool based on the ETSI data quality framework.


## Overview
## Structure


This project provides a comprehensive implementation of data quality metrics defined in ETSI TR 104 180, integrating Great Expectations validation capabilities with precise ETSI formula calculations.
- `etsi_dq/` — Core analysis library (6 metrics)

- `api/` — FastAPI backend (14 REST API endpoints)
## Implemented Metrics
- `dashboard.py` — Streamlit frontend

### 1. Completeness - ETSI 5.1
- **Formula**: `C_D = (Non-null cells / Total cells) × 100`
- Measures the proportion of non-missing values in the dataset
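
The completeness formula above can be sketched in plain pandas. This is a minimal illustration, not the project's actual implementation: the function name `completeness_score` and the sample frame are hypothetical, and treating an empty dataset as a score of 0 is an assumption.

```python
import pandas as pd


def completeness_score(df: pd.DataFrame) -> float:
    """Return C_D = (non-null cells / total cells) x 100 for the whole frame."""
    total_cells = df.shape[0] * df.shape[1]
    if total_cells == 0:
        return 0.0  # assumption: an empty dataset scores 0 instead of raising
    non_null_cells = int(df.notna().sum().sum())
    return round(non_null_cells / total_cells * 100, 2)


# Hypothetical 4 x 5 frame with two missing cells
sample = pd.DataFrame({
    "Timestamp": ["10:00", "10:05", "10:10", "10:15"],
    "PumpID": ["P001", "P001", "P001", "P001"],
    "Temperature_C": [45.2, 45.8, 46.1, None],
    "Vibration_mm_s": [1.5, 1.7, None, 1.6],
    "Pressure_kPa": [500, 505, 510, 508],
})
score = completeness_score(sample)
print(f"Completeness score: {score:.2f}%")
```

With 18 non-null cells out of 20, the sketch yields 90%.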

### 2. Accuracy - ETSI 5.4
- **Formula**: `Accuracy = (Correct values / Total values) × 100`
- Validates data against specified ranges, formats, or reference values
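
For a numeric comparison with a tolerance, the accuracy formula could look like the sketch below. The function name `accuracy_score`, the sample rows, and the small epsilon added to absorb floating-point rounding at the tolerance boundary are all assumptions made for illustration.

```python
import pandas as pd


def accuracy_score(df: pd.DataFrame, listed_col: str, true_col: str,
                   tolerance: float = 0.1) -> float:
    """Return Accuracy = (correct values / total values) x 100."""
    total = len(df)  # nulls stay in the denominator
    if total == 0:
        return 0.0  # assumption: an empty dataset scores 0
    diff = (df[listed_col] - df[true_col]).abs()
    # NaN <= x evaluates to False, so rows with a missing weight count as incorrect.
    # The 1e-9 epsilon is an assumption to tolerate float rounding at the boundary.
    correct = int((diff <= tolerance + 1e-9).sum())
    return round(correct / total * 100, 2)


# Hypothetical product rows; only PROD-B exceeds the 0.1 kg tolerance
products = pd.DataFrame({
    "ProductID": ["PROD-A", "PROD-B", "PROD-C", "PROD-D"],
    "ListedWeight_kg": [1.5, 10.2, 0.5, 8.7],
    "TrueWeight_kg": [1.5, 10.5, 0.55, 8.7],
})
acc = accuracy_score(products, "ListedWeight_kg", "TrueWeight_kg", tolerance=0.1)
print(f"Accuracy score: {acc:.2f}%")
```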

### 3. Consistency - ETSI 5.6
- **Formula**: `Consistency = (Consistent values / Total values) × 100`
- Checks adherence to predefined value sets or business rules
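
A value-set check is one way to read the consistency formula. The following sketch assumes a hypothetical helper `consistency_score` and sample values; matching is case-sensitive here, which is an assumption about the intended rule.

```python
import pandas as pd


def consistency_score(series: pd.Series, allowed: list) -> float:
    """Return Consistency = (consistent values / total values) x 100."""
    total = len(series)  # nulls stay in the denominator
    if total == 0:
        return 0.0  # assumption: an empty column scores 0
    # isin() is case-sensitive and returns False for missing values
    consistent = int(series.isin(allowed).sum())
    return round(consistent / total * 100, 2)


allowed_categories = ["Electronics", "Furniture", "Clothing"]
# Hypothetical column: "ELECTRONICS" fails the case-sensitive match
categories = pd.Series(["Electronics", "Furniture", "ELECTRONICS", "Clothing"])
cons = consistency_score(categories, allowed_categories)
print(f"Consistency score: {cons:.2f}%")
```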

## Key Features

- **Dual Validation Approach**: Combines Great Expectations validation with ETSI formula calculations
- **Comprehensive Reporting**: Detailed output showing both GX validation results and ETSI metric scores
- **Null-Inclusive Calculation**: Follows ETSI standard by including null values in total count
- **Tolerance Support**: Accuracy checks support configurable tolerance levels for numerical comparisons

## Installation

### Prerequisites
- Python 3.9 or higher
- Conda or virtualenv

### Setup
```bash
# Create virtual environment
conda create -n ge_test python=3.9
conda activate ge_test

# Install required packages
pip install great-expectations pandas
```

## Usage

### 1. Generate Test Data
```bash
python create_sample_data.py
```

This creates three CSV files for testing:
- `pump_data.csv` - Completeness test (with missing values)
- `products.csv` - Accuracy test (weight comparison)
- `product_categories.csv` - Consistency test (category validation)

### 2. Run Data Quality Assessment
```bash
python data_quality_assessment.py
```

## Project Structure
```
gx/
├── data_quality_assessment.py  # Main assessment script
├── create_sample_data.py       # Test data generator
├── init_gx_project.py          # GX project initialization
├── gx/                         # Great Expectations configuration
│   ├── great_expectations.yml
│   ├── checkpoints/
│   ├── expectations/
│   └── plugins/
├── README.md
└── .gitignore
```

## Example Output
```
Test 1: Completeness (Pump Performance Data)
============================================================
Dataset: pump_data.csv
Size: 4 rows × 5 columns

=== Completeness ===
[GX Validation]
  Timestamp           : Pass
  PumpID              : Pass
  Temperature_C       : Fail
  Vibration_mm_s      : Fail
  Pressure_kPa        : Pass

[ETSI Calculation]
  Total cells: 20 (4 rows × 5 cols)
  Non-null values: 18
  Missing values: 2
  Completeness score: 90.00%
  Formula: C_D = (Non-null / Total) × 100
============================================================

Test 2: Accuracy (Product Weight Data)
============================================================
=== Accuracy: ListedWeight_kg vs TrueWeight_kg ===

[ETSI Calculation]
  Total products: 4
  Accurate weight: 3
  Inaccurate weight: 1
  Tolerance: ±0.1 kg
  Accuracy score: 75.00%

  [Inaccurate Products Details]
    PROD-B: Listed=10.2kg, True=10.5kg, Diff=0.30kg
============================================================

Test 3: Consistency (Product Category Data)
============================================================
=== Consistency: Category ===

[GX Validation]
  Result: Fail
  Allowed values: ['Electronics', 'Furniture', 'Clothing']

[ETSI Calculation]
  Total values: 4
  Consistent values: 3
  Inconsistent values: 1
  Consistency score: 75.00%
============================================================
```

## Implementation Details

### GX vs ETSI Calculation

| Aspect | GX Approach | ETSI Approach | Implementation |
|:-------|:-----------|:-------------|:--------------|
| Denominator | Non-null values only | All values (including nulls) | ETSI formula applied |
| Null Handling | Excluded from calculation | Counted as missing/incorrect | Nulls included |
| Result Format | Pass/Fail + count | Percentage score | Both provided |
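
The denominator difference in the table can be shown with a small pandas example. It does not call Great Expectations; it only mimics the two counting rules described above on a hypothetical column with a 0 to 10 range check.

```python
import pandas as pd

# Hypothetical readings: one null, one out-of-range value
values = pd.Series([5.0, None, 7.0, 50.0])

in_range = values.between(0, 10)   # a missing value evaluates to False
non_null = values.notna()

# Non-null denominator (the "GX Approach" column): 2 of 3 non-null values pass
non_null_pct = in_range[non_null].mean() * 100

# ETSI denominator (all values, nulls included): 2 of 4 values pass
etsi_pct = in_range.sum() / len(values) * 100

print(f"Non-null denominator: {non_null_pct:.2f}%")
print(f"ETSI denominator:     {etsi_pct:.2f}%")
```

The same column scores higher when nulls are dropped from the denominator, which is why the ETSI score is computed separately from the Pass/Fail validation.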

### Code Architecture
```python
# Each metric function follows this pattern (completeness shown as an example):

def etsi_completeness(validator, df, column):
    # 1. GX validation for Pass/Fail
    result = validator.expect_column_values_to_not_be_null(column=column)

    # 2. ETSI formula calculation
    total_values = len(df[column])  # Include nulls in the denominator
    correct_values = df[column].notna().sum()
    # Guard against an empty column to avoid division by zero
    etsi_score = (correct_values / total_values) * 100 if total_values else 0.0

    # 3. Return both results
    return {
        "etsi_score": round(float(etsi_score), 2),
        "gx_validation": result.success
    }
```

## Technical References

- **ETSI TR 104 180**: Data Quality Metrics and Scoring Methods
  - Section 5.1: Completeness
  - Section 5.4: Accuracy
  - Section 5.6: Consistency
- **Great Expectations**: [Official Documentation](https://docs.greatexpectations.io/)

## Configuration

### Customize Metrics

Edit `data_quality_assessment.py` to modify assessment parameters:
```python
# Completeness
pump_config = {
    'required_columns': ['Timestamp', 'PumpID', 'Temperature_C']
}

# Accuracy
products_config = {
    'accuracy_weight_comparison_check': {
        'listed_col': 'ListedWeight_kg',
        'true_col': 'TrueWeight_kg',
        'tolerance': 0.1  # Adjust tolerance
    }
}

# Consistency
category_config = {
    'consistency_checks': {
        'Category': ['Electronics', 'Furniture', 'Clothing']
    }
}
```

## Testing

The project includes automated test data generation:
```bash
# Generate fresh test data
python create_sample_data.py

# Run assessments
python data_quality_assessment.py
```

## Data Quality Metrics Summary

| Metric | ETSI Formula | GX Support | Custom Implementation |
|:-------|:------------|:-----------|:---------------------|
| Completeness | `C = Non-null / Total × 100` | Partial | Full |
| Accuracy | `Acc = Correct / Total × 100` | Partial | Full |
| Consistency | `CS = Consistent / Total × 100` | Partial | Full |

**Legend**:
- Full: Complete ETSI formula implementation
- Partial: GX provides validation, custom code adds ETSI calculation

## Contributing

Contributions are welcome. Please submit a Pull Request with a detailed description of the changes.

## Author

Jayeon Pyo, JaeSeung Song

## Acknowledgments

- ETSI for defining comprehensive data quality standards
- Great Expectations team for the data validation framework

---

**Note**: CSV files are excluded from version control. Generate test data using `create_sample_data.py` before running assessments.


## Run

- Backend: `uvicorn api.main:app --reload --port 8000`
- Frontend: `streamlit run dashboard.py`

## Tech Stack


- Frontend: Streamlit
- Backend: FastAPI + Uvicorn
- Analysis: Python, pandas, numpy
- API Docs: http://localhost:8000/docs
+0 −24
name: customers_checkpoint
config_version: 1.0
template_name:
module_name: great_expectations.checkpoint
class_name: SimpleCheckpoint
run_name_template: '%Y%m%d-%H%M%S-customers'
expectation_suite_name:
batch_request: {}
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: store_evaluation_params
    action:
      class_name: StoreEvaluationParametersAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
evaluation_parameters: {}
runtime_configuration: {}
validations: []
profilers: []
ge_cloud_id:
expectation_suite_ge_cloud_id:
+0 −24
name: products_checkpoint
config_version: 1.0
template_name:
module_name: great_expectations.checkpoint
class_name: SimpleCheckpoint
run_name_template: '%Y%m%d-%H%M%S-products'
expectation_suite_name:
batch_request: {}
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: store_evaluation_params
    action:
      class_name: StoreEvaluationParametersAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
evaluation_parameters: {}
runtime_configuration: {}
validations: []
profilers: []
ge_cloud_id:
expectation_suite_ge_cloud_id:

create_sample_data.py

deleted 100644 → 0
+0 −111
# create_sample_data.py
# Generates sample data based on examples from ETSI TR 104 180.

import pandas as pd

print("Generating sample data for ETSI Data Quality Metrics...\n")

# ============================================
# 1. Completeness Example (ETSI 5.1.3)
# ============================================
print("1. Completeness - Generating pump performance data...")

# Pump operation data (with some missing values)
pump_data = [
    [1, "2024-06-25 10:00:00", "P001", 45.2, 1.5, 500],
    [2, "2024-06-25 10:05:00", "P001", 45.8, 1.7, 505],
    [3, "2024-06-25 10:10:00", "P001", 46.1, None, 510],  # Missing Vibration
    [4, "2024-06-25 10:15:00", "P001", None, 1.6, 508],   # Missing Temperature
    [5, "2024-06-25 10:20:00", "P001", 47.0, 1.8, None],  # Missing Pressure
]

df_pump = pd.DataFrame(pump_data,
                       columns=['Index', 'Timestamp', 'PumpID',
                                'Temperature_C', 'Vibration_mm_s', 'Pressure_kPa'])
df_pump.to_csv('pump_data.csv', index=False)
print("    ✓ Created pump_data.csv (5 rows, 3 missing values)")

# ============================================
# 2. Accuracy Example (ETSI 5.4.3)
# ============================================
print("\n2. Accuracy - Generating product weight data...")

# Product weight accuracy data
product_data = [
    ["P-1001", "Laptop ABC", 1.5, 1.5, True],
    ["P-1002", "Mouse XYZ", 2.3, 2.1, False],      # Weight difference
    ["P-1003", "Keyboard Pro", 0.5, 0.5, True],
    ["P-1004", "Monitor Ultra", 8.7, 8.7, True],
    ["P-1005", "Tablet Max", 1.2, 1.0, False],     # Weight difference
    ["P-1006", "Speaker Set", 15.0, 15.5, False],  # Weight difference
    ["P-1007", "Webcam HD", 3.4, 3.4, True],
    ["P-1008", "Headset Pro", 0.8, 0.8, True],
    ["P-1009", "Router Fast", 5.5, 5.0, False],    # Weight difference
    ["P-1010", "Charger Quick", 2.0, 2.0, True]
]

df_products = pd.DataFrame(product_data,
                           columns=['ProductID', 'ProductName', 'ListedWeight_kg',
                                    'TrueWeight_kg', 'IsAccurate'])
df_products.to_csv('products.csv', index=False)
print("    ✓ Created products.csv (10 rows, 4 inaccurate values)")

# ============================================
# 3. Consistency Example (ETSI 5.6.3)
# ============================================
print("\n3. Consistency - Generating product category data...")

# Product category data (with format inconsistencies)
category_data = [
    ["P-001", "Laptop ABC", "Electronics", 999.00, 899.00, 50],
    ["P-002", "Mouse XYZ", "Electronics", 25.00, 30.00, 100],      # DiscountPrice > Price
    ["P-003", "Keyboard Pro", "Electronics", 75.00, 65.00, 75],
    ["P-004", "Monitor Ultra", "ELECTRONICS", 300.00, 280.00, 25], # Uppercase error
    ["P-005", "Desk Chair", "Furniture", 250.00, 225.00, 30],
    ["P-006", "Office Desk", "furniture", 450.00, 400.00, 15],     # Lowercase error
    ["P-007", "T-Shirt", "Clothing", 29.99, 24.99, 200],
    ["P-008", "Jeans", "Apparel", 79.99, 69.99, 150],              # Incorrect category
    ["P-009", "Sneakers", "Clothing", 89.99, 79.99, 100],
    ["P-010", "Backpack", "Accessories", 59.99, 49.99, 80]         # Incorrect category
]

df_categories = pd.DataFrame(category_data,
                             columns=['ProductID', 'ProductName', 'Category',
                                      'Price', 'DiscountPrice', 'StockQuantity'])
df_categories.to_csv('product_categories.csv', index=False)
print("    ✓ Created product_categories.csv (10 rows, 4 inconsistent values)")

# ============================================
# 4. Comprehensive Test Data (All issues)
# ============================================
print("\n4. Comprehensive - Generating customer data (with all issues)...")

customer_data = [
    [1, "John Doe", 35, "john@example.com", "2023-05-15", "Active"],
    [2, "Jane Smith", None, "jane@example.com", "2023-06-20", "Active"],      # Missing Age
    [3, "Bob Johnson", 28, "bob@invalid", "2023-07-10", "Active"],           # Email format error
    [4, "Alice Lee", 45, "alice@example.com", None, "Inactive"],              # Missing PurchaseDate
    [5, "John Doe", 35, "john@example.com", "2023-05-15", "Active"],          # Duplicate row
    [6, "Eve Adams", 150, "eve@example.com", "2023-09-12", "Active"],         # Age out of range
    [7, "Mike Brown", 22, "mike@example.com", "2023-10-01", "Pending"],
    [8, "Sara White", None, None, "2023-11-18", "Active"],                    # Multiple missing values
    [9, "Tom Green", 31, "tom@example.com", "2023-12-22", "active"],          # Status inconsistency
    [10, "Lily Black", 40, "lily@invalidcom", "invalid-date", "Unknown"]      # Multiple errors
]

df_customers = pd.DataFrame(customer_data,
                            columns=['CustomerID', 'Name', 'Age', 'Email',
                                     'PurchaseDate', 'Status'])
df_customers.to_csv('customers.csv', index=False)
print("    ✓ Created customers.csv (10 rows, various quality issues)")

print("\n" + "="*60)
print("All datasets created successfully!")
print("="*60)
print("\nGenerated files:")
print("  1. pump_data.csv          - For Completeness test")
print("  2. products.csv           - For Accuracy test")
print("  3. product_categories.csv - For Consistency test")
print("  4. customers.csv          - For Comprehensive test")
print("\nRun the test with the following command:")
print("  python data_quality_assessment.py")
 No newline at end of file