Clean up branch: keep only etsi_dq and api files (6461358f) · Commits · TC DATA - Data Solutions / Proofs of concept / Data Quality Assessment

.gitignore

+4 −31

Original line number	Original line	Diff line number	Diff line

	# CSV test data (can be generated by create_sample_data.py)
	*.csv

	# Great Expectations uncommitted (local execution results)
	uncommitted/
	gx/uncommitted/

	# Python
	__pycache__/		__pycache__/
	*.py[cod]		*.pyc
	*$py.class
	*.so
	.Python
	*.egg-info/
	dist/
	build/

	# Jupyter Notebook
	.ipynb_checkpoints

	# Environment
	.env		.env
	.venv		*.xlsx~
	venv/		~$*

	# IDE
	.vscode/
	.idea/
	*.swp
	*.swo

	# macOS
	.DS_Store		.DS_Store
			data/

README.md

+16 −233

	# ETSI Data Quality Assessment Project		# ETSI Data Quality Assessment System

	Implementation of data quality metrics based on ETSI TR 104 180 standard using Great Expectations framework.		Web-based data quality validation tool based on the ETSI data quality framework.

	## Overview		## Structure

	This project provides a comprehensive implementation of data quality metrics defined in ETSI TR 104 180, integrating Great Expectations validation capabilities with precise ETSI formula calculations.		- `etsi_dq/` — Core analysis library (6 metrics)
			- `api/` — FastAPI backend (14 REST API endpoints)
	## Implemented Metrics		- `dashboard.py` — Streamlit frontend

	### 1. Completeness - ETSI 5.1
	- Formula: `C_D = (Non-null cells / Total cells) × 100`
	- Measures the proportion of non-missing values in the dataset
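A minimal pandas sketch of the completeness formula above. The function name `completeness_score` and the sample values are illustrative assumptions, not code taken from this project.

```python
import pandas as pd


def completeness_score(df: pd.DataFrame) -> float:
    """Return C_D = (non-null cells / total cells) * 100 for the whole frame."""
    total_cells = df.size  # rows * columns, nulls included
    if total_cells == 0:
        raise ValueError("completeness is undefined for an empty DataFrame")
    non_null_cells = int(df.notna().sum().sum())
    return non_null_cells / total_cells * 100


# Hypothetical 4 x 5 sample with two missing readings (18 of 20 cells filled).
pump = pd.DataFrame({
    "Timestamp": ["10:00", "10:05", "10:10", "10:15"],
    "PumpID": ["P001", "P001", "P001", "P001"],
    "Temperature_C": [45.2, 45.8, 46.1, None],
    "Vibration_mm_s": [1.5, 1.7, None, 1.6],
    "Pressure_kPa": [500, 505, 510, 508],
})

print(f"Completeness score: {completeness_score(pump):.2f}%")
```

Counting cells across the whole frame, rather than per column, matches the `Total cells` wording of the formula.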

	### 2. Accuracy - ETSI 5.4
	- Formula: `Accuracy = (Correct values / Total values) × 100`
	- Validates data against specified ranges, formats, or reference values
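The accuracy formula can be sketched for the case of comparing a listed value against a reference value. The function name `accuracy_score`, the default tolerance, and the sample rows are assumptions made for illustration; the project's own implementation may differ.

```python
import pandas as pd


def accuracy_score(df: pd.DataFrame, listed_col: str, true_col: str,
                   tolerance: float = 0.1) -> float:
    """Return (correct values / total values) * 100, with nulls in the total."""
    total = len(df)
    if total == 0:
        raise ValueError("accuracy is undefined for an empty DataFrame")
    diff = (df[listed_col] - df[true_col]).abs()
    # NaN <= tolerance evaluates to False, so a missing value counts as incorrect.
    correct = int((diff <= tolerance).sum())
    return correct / total * 100


# Hypothetical sample: one of four listed weights is off by 0.3 kg.
products = pd.DataFrame({
    "ProductID": ["PROD-A", "PROD-B", "PROD-C", "PROD-D"],
    "ListedWeight_kg": [5.0, 10.2, 2.5, 7.5],
    "TrueWeight_kg": [5.0, 10.5, 2.5, 7.5],
})

score = accuracy_score(products, "ListedWeight_kg", "TrueWeight_kg", tolerance=0.1)
print(f"Accuracy score: {score:.2f}%")
```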

	### 3. Consistency - ETSI 5.6
	- Formula: `Consistency = (Consistent values / Total values) × 100`
	- Checks adherence to predefined value sets or business rules
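A short sketch of the consistency formula for the value-set case. The name `consistency_score` and the sample categories are hypothetical and only illustrate the calculation.

```python
import pandas as pd


def consistency_score(values: pd.Series, allowed: list) -> float:
    """Return (consistent values / total values) * 100, with nulls in the total."""
    total = len(values)
    if total == 0:
        raise ValueError("consistency is undefined for an empty Series")
    # isin() is case-sensitive and returns False for nulls,
    # so both count as inconsistent.
    consistent = int(values.isin(allowed).sum())
    return consistent / total * 100


allowed = ["Electronics", "Furniture", "Clothing"]
categories = pd.Series(["Electronics", "Furniture", "Clothing", "Apparel"])

print(f"Consistency score: {consistency_score(categories, allowed):.2f}%")
```

Because the match is exact, a value such as `ELECTRONICS` would also count as inconsistent unless it is normalized first.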

	## Key Features

	- Dual Validation Approach: Combines Great Expectations validation with ETSI formula calculations
	- Comprehensive Reporting: Detailed output showing both GX validation results and ETSI metric scores
	- Null-Inclusive Calculation: Follows ETSI standard by including null values in total count
	- Tolerance Support: Accuracy checks support configurable tolerance levels for numerical comparisons

	## Installation

	### Prerequisites
	- Python 3.9 or higher
	- Conda or virtualenv

	### Setup
	```bash
	# Create virtual environment
	conda create -n ge_test python=3.9
	conda activate ge_test

	# Install required packages
	pip install great-expectations pandas
	```

	## Usage

	### 1. Generate Test Data
	```bash
	python create_sample_data.py
	```

	This creates three CSV files for testing:
	- `pump_data.csv` - Completeness test (with missing values)
	- `products.csv` - Accuracy test (weight comparison)
	- `product_categories.csv` - Consistency test (category validation)

	### 2. Run Data Quality Assessment
	```bash
	python data_quality_assessment.py
	```

	## Project Structure
	```
	gx/
	├── data_quality_assessment.py # Main assessment script
	├── create_sample_data.py # Test data generator
	├── init_gx_project.py # GX project initialization
	├── gx/ # Great Expectations configuration
	│ ├── great_expectations.yml
	│ ├── checkpoints/
	│ ├── expectations/
	│ └── plugins/
	├── README.md
	└── .gitignore
	```

	## Example Output
	```
	Test 1: Completeness (Pump Performance Data)
	============================================================
	Dataset: pump_data.csv
	Size: 4 rows × 5 columns

	=== Completeness ===
	[GX Validation]
	Timestamp : Pass
	PumpID : Pass
	Temperature_C : Fail
	Vibration_mm_s : Fail
	Pressure_kPa : Pass

	[ETSI Calculation]
	Total cells: 20 (4 rows × 5 cols)
	Non-null values: 18
	Missing values: 2
	Completeness score: 90.00%
	Formula: C_D = (Non-null / Total) × 100
	============================================================

	Test 2: Accuracy (Product Weight Data)
	============================================================
	=== Accuracy: ListedWeight_kg vs TrueWeight_kg ===

	[ETSI Calculation]
	Total products: 4
	Accurate weight: 3
	Inaccurate weight: 1
	Tolerance: ±0.1 kg
	Accuracy score: 75.00%

	[Inaccurate Products Details]
	PROD-B: Listed=10.2kg, True=10.5kg, Diff=0.30kg
	============================================================

	Test 3: Consistency (Product Category Data)
	============================================================
	=== Consistency: Category ===

	[GX Validation]
	Result: Fail
	Allowed values: ['Electronics', 'Furniture', 'Clothing']

	[ETSI Calculation]
	Total values: 4
	Consistent values: 3
	Inconsistent values: 1
	Consistency score: 75.00%
	============================================================
	```

	## Implementation Details

	### GX vs ETSI Calculation

	| Aspect | GX Approach | ETSI Approach | Implementation |
	|:-------|:-----------|:-------------|:--------------|
	| Denominator | Non-null values only | All values (including nulls) | ETSI formula applied |
	| Null Handling | Excluded from calculation | Counted as missing/incorrect | Nulls included |
	| Result Format | Pass/Fail + count | Percentage score | Both provided |
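The effect of the two denominators in the table can be illustrated with plain pandas. This sketch does not call Great Expectations; the function name `in_range_scores`, the range bounds, and the readings are assumptions chosen only to show how the two percentages diverge when nulls are present.

```python
import pandas as pd


def in_range_scores(values: pd.Series, low: float, high: float) -> dict:
    """Compute a range-check percentage with both denominators."""
    in_range = values.between(low, high)  # nulls evaluate to False
    correct = int(in_range.sum())
    non_null_total = int(values.notna().sum())
    overall_total = len(values)
    if non_null_total == 0:
        raise ValueError("no non-null values to score")
    return {
        # Denominator excludes nulls (the "GX Approach" column of the table).
        "non_null_denominator": correct / non_null_total * 100,
        # Denominator includes nulls (the "ETSI Approach" column of the table).
        "etsi_denominator": correct / overall_total * 100,
    }


# Hypothetical readings: two valid, one missing, one out of range.
readings = pd.Series([45.2, 45.8, None, 150.0])
scores = in_range_scores(readings, 0, 100)

for name, value in scores.items():
    print(f"{name}: {value:.2f}%")
```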

	### Code Architecture
	```python
	# Each metric function follows this pattern:

	def etsi_metric(validator, df, ...):
	    # 1. GX validation for Pass/Fail
	    result = validator.expect_column_values_to_...()

	    # 2. ETSI formula calculation
	    total_values = len(df[column])  # Include nulls
	    correct_values = ...
	    etsi_score = (correct_values / total_values) * 100

	    # 3. Return both results
	    return {
	        "etsi_score": round(etsi_score, 2),
	        "gx_validation": result.success
	    }
	```

	## Technical References

	- ETSI TR 104 180: Data Quality Metrics and Scoring Methods
	- Section 5.1: Completeness
	- Section 5.4: Accuracy
	- Section 5.6: Consistency
	- Great Expectations: [Official Documentation](https://docs.greatexpectations.io/)

	## Configuration

	### Customize Metrics

	Edit `data_quality_assessment.py` to modify assessment parameters:
	```python
	# Completeness
	pump_config = {
	    'required_columns': ['Timestamp', 'PumpID', 'Temperature_C']
	}

	# Accuracy
	products_config = {
	    'accuracy_weight_comparison_check': {
	        'listed_col': 'ListedWeight_kg',
	        'true_col': 'TrueWeight_kg',
	        'tolerance': 0.1  # Adjust tolerance
	    }
	}

	# Consistency
	category_config = {
	    'consistency_checks': {
	        'Category': ['Electronics', 'Furniture', 'Clothing']
	    }
	}
	```

	## Testing

	The project includes automated test data generation:
	```bash
	# Generate fresh test data
	python create_sample_data.py

	# Run assessments
	python data_quality_assessment.py
	```

	## Data Quality Metrics Summary

	| Metric | ETSI Formula | GX Support | Custom Implementation |
	|:-------|:------------|:-----------|:---------------------|
	| Completeness | `C = Non-null / Total × 100` | Partial | Full |
	| Accuracy | `Acc = Correct / Total × 100` | Partial | Full |
	| Consistency | `CS = Consistent / Total × 100` | Partial | Full |

	Legend:
	- Full: Complete ETSI formula implementation
	- Partial: GX provides validation, custom code adds ETSI calculation

	## Contributing

	Contributions are welcome. Please submit a Pull Request with a detailed description of your changes.

	## Author

	Jayeon Pyo, JaeSeung Song

	## Acknowledgments

	- ETSI for defining comprehensive data quality standards
	- Great Expectations team for the data validation framework

	---

	Note: CSV files are excluded from version control. Generate test data using `create_sample_data.py` before running assessments.

			## Run
			Backend
			uvicorn api.main:app --reload --port 8000
			Frontend
			streamlit run dashboard.py
			## Tech Stack

			- Frontend: Streamlit
			- Backend: FastAPI + Uvicorn
			- Analysis: Python, pandas, numpy
			- API Docs: http://localhost:8000/docs

checkpoints/customers_checkpoint.yml

deleted 100644 → 0

+0 −24

	name: customers_checkpoint
	config_version: 1.0
	template_name:
	module_name: great_expectations.checkpoint
	class_name: SimpleCheckpoint
	run_name_template: '%Y%m%d-%H%M%S-customers'
	expectation_suite_name:
	batch_request: {}
	action_list:
	  - name: store_validation_result
	    action:
	      class_name: StoreValidationResultAction
	  - name: store_evaluation_params
	    action:
	      class_name: StoreEvaluationParametersAction
	  - name: update_data_docs
	    action:
	      class_name: UpdateDataDocsAction
	evaluation_parameters: {}
	runtime_configuration: {}
	validations: []
	profilers: []
	ge_cloud_id:
	expectation_suite_ge_cloud_id:

checkpoints/products_checkpoint.yml

deleted 100644 → 0

+0 −24

	name: products_checkpoint
	config_version: 1.0
	template_name:
	module_name: great_expectations.checkpoint
	class_name: SimpleCheckpoint
	run_name_template: '%Y%m%d-%H%M%S-products'
	expectation_suite_name:
	batch_request: {}
	action_list:
	  - name: store_validation_result
	    action:
	      class_name: StoreValidationResultAction
	  - name: store_evaluation_params
	    action:
	      class_name: StoreEvaluationParametersAction
	  - name: update_data_docs
	    action:
	      class_name: UpdateDataDocsAction
	evaluation_parameters: {}
	runtime_configuration: {}
	validations: []
	profilers: []
	ge_cloud_id:
	expectation_suite_ge_cloud_id:

create_sample_data.py

deleted 100644 → 0

+0 −111

	# create_sample_data.py
	# Generates sample data based on examples from ETSI TR 104 180.

	import pandas as pd

	print("Generating sample data for ETSI Data Quality Metrics...\n")

	# ============================================
	# 1. Completeness Example (ETSI 5.1.3)
	# ============================================
	print("1. Completeness - Generating pump performance data...")

	# Pump operation data (with some missing values)
	pump_data = [
	[1, "2024-06-25 10:00:00", "P001", 45.2, 1.5, 500],
	[2, "2024-06-25 10:05:00", "P001", 45.8, 1.7, 505],
	[3, "2024-06-25 10:10:00", "P001", 46.1, None, 510], # Missing Vibration
	[4, "2024-06-25 10:15:00", "P001", None, 1.6, 508], # Missing Temperature
	[5, "2024-06-25 10:20:00", "P001", 47.0, 1.8, None], # Missing Pressure
	]

	df_pump = pd.DataFrame(pump_data,
	columns=['Index', 'Timestamp', 'PumpID',
	'Temperature_C', 'Vibration_mm_s', 'Pressure_kPa'])
	df_pump.to_csv('pump_data.csv', index=False)
	print(" ✓ Created pump_data.csv (5 rows, 3 missing values)")

	# ============================================
	# 2. Accuracy Example (ETSI 5.4.3)
	# ============================================
	print("\n2. Accuracy - Generating product weight data...")

	# Product weight accuracy data
	product_data = [
	["P-1001", "Laptop ABC", 1.5, 1.5, True],
	["P-1002", "Mouse XYZ", 2.3, 2.1, False], # Weight difference
	["P-1003", "Keyboard Pro", 0.5, 0.5, True],
	["P-1004", "Monitor Ultra", 8.7, 8.7, True],
	["P-1005", "Tablet Max", 1.2, 1.0, False], # Weight difference
	["P-1006", "Speaker Set", 15.0, 15.5, False], # Weight difference
	["P-1007", "Webcam HD", 3.4, 3.4, True],
	["P-1008", "Headset Pro", 0.8, 0.8, True],
	["P-1009", "Router Fast", 5.5, 5.0, False], # Weight difference
	["P-1010", "Charger Quick", 2.0, 2.0, True]
	]

	df_products = pd.DataFrame(product_data,
	columns=['ProductID', 'ProductName', 'ListedWeight_kg',
	'TrueWeight_kg', 'IsAccurate'])
	df_products.to_csv('products.csv', index=False)
	print(" ✓ Created products.csv (10 rows, 4 inaccurate values)")

	# ============================================
	# 3. Consistency Example (ETSI 5.6.3)
	# ============================================
	print("\n3. Consistency - Generating product category data...")

	# Product category data (with format inconsistencies)
	category_data = [
	["P-001", "Laptop ABC", "Electronics", 999.00, 899.00, 50],
	["P-002", "Mouse XYZ", "Electronics", 25.00, 30.00, 100], # DiscountPrice > Price
	["P-003", "Keyboard Pro", "Electronics", 75.00, 65.00, 75],
	["P-004", "Monitor Ultra", "ELECTRONICS", 300.00, 280.00, 25], # Uppercase error
	["P-005", "Desk Chair", "Furniture", 250.00, 225.00, 30],
	["P-006", "Office Desk", "furniture", 450.00, 400.00, 15], # Lowercase error
	["P-007", "T-Shirt", "Clothing", 29.99, 24.99, 200],
	["P-008", "Jeans", "Apparel", 79.99, 69.99, 150], # Incorrect category
	["P-009", "Sneakers", "Clothing", 89.99, 79.99, 100],
	["P-010", "Backpack", "Accessories", 59.99, 49.99, 80] # Incorrect category
	]

	df_categories = pd.DataFrame(category_data,
	columns=['ProductID', 'ProductName', 'Category',
	'Price', 'DiscountPrice', 'StockQuantity'])
	df_categories.to_csv('product_categories.csv', index=False)
	print(" ✓ Created product_categories.csv (10 rows, 4 inconsistent values)")

	# ============================================
	# 4. Comprehensive Test Data (All issues)
	# ============================================
	print("\n4. Comprehensive - Generating customer data (with all issues)...")

	customer_data = [
	[1, "John Doe", 35, "john@example.com", "2023-05-15", "Active"],
	[2, "Jane Smith", None, "jane@example.com", "2023-06-20", "Active"], # Missing Age
	[3, "Bob Johnson", 28, "bob@invalid", "2023-07-10", "Active"], # Email format error
	[4, "Alice Lee", 45, "alice@example.com", None, "Inactive"], # Missing PurchaseDate
	[5, "John Doe", 35, "john@example.com", "2023-05-15", "Active"], # Duplicate row
	[6, "Eve Adams", 150, "eve@example.com", "2023-09-12", "Active"], # Age out of range
	[7, "Mike Brown", 22, "mike@example.com", "2023-10-01", "Pending"],
	[8, "Sara White", None, None, "2023-11-18", "Active"], # Multiple missing values
	[9, "Tom Green", 31, "tom@example.com", "2023-12-22", "active"], # Status inconsistency
	[10, "Lily Black", 40, "lily@invalidcom", "invalid-date", "Unknown"] # Multiple errors
	]

	df_customers = pd.DataFrame(customer_data,
	columns=['CustomerID', 'Name', 'Age', 'Email',
	'PurchaseDate', 'Status'])
	df_customers.to_csv('customers.csv', index=False)
	print(" ✓ Created customers.csv (10 rows, various quality issues)")

	print("\n" + "="*60)
	print("All datasets created successfully!")
	print("="*60)
	print("\nGenerated files:")
	print(" 1. pump_data.csv - For Completeness test")
	print(" 2. products.csv - For Accuracy test")
	print(" 3. product_categories.csv - For Consistency test")
	print(" 4. customers.csv - For Comprehensive test")
	print("\nRun the test with the following command:")
	print(" python data_quality_assessment.py")
	No newline at end of file