Some SBI drivers using polling lose telemetry samples
Reporters
- Mika Silvola (INF)
- Lluis Gifre (CTTC)
Description
Some SBI drivers using APScheduler and a SamplesCache for polling-based telemetry might lose samples due to a cache eviction mis-synchronization. When SAMPLES_EVICTION_SEC == SAMPLING_PERIOD, jitter in the APScheduler timers means that samples may be evicted either before or after the next sample collection.
If they are evicted before, there is no problem: the samples are refreshed. Otherwise, the last cached sample is reported again and the poll against the device is skipped.
See example logs:
[2024-02-22 14:30:27,236] DEBUG:monitoring.client.MonitoringClient:IncludeKpi: {"kpi_id": {"kpi_id": {"uuid": "1"}}, "kpi_value": {"floatVal": 0.0}, "timestamp": {"timestamp": 1708612217.564579}}
[2024-02-22 14:30:37,241] DEBUG:monitoring.client.MonitoringClient:IncludeKpi: {"kpi_id": {"kpi_id": {"uuid": "1"}}, "kpi_value": {"floatVal": 0.0}, "timestamp": {"timestamp": 1708612237.232386}}
[2024-02-22 14:30:47,248] DEBUG:monitoring.client.MonitoringClient:IncludeKpi: {"kpi_id": {"kpi_id": {"uuid": "1"}}, "kpi_value": {"floatVal": -1.2507533}, "timestamp": {"timestamp": 1708612247.234192}}
[2024-02-22 14:30:57,233] DEBUG:monitoring.client.MonitoringClient:IncludeKpi: {"kpi_id": {"kpi_id": {"uuid": "1"}}, "kpi_value": {"floatVal": -1.2507533}, "timestamp": {"timestamp": 1708612247.234192}}
[2024-02-22 14:31:07,233] DEBUG:monitoring.client.MonitoringClient:IncludeKpi: {"kpi_id": {"kpi_id": {"uuid": "1"}}, "kpi_value": {"floatVal": -10.006026}, "timestamp": {"timestamp": 1708612257.321009}}
[2024-02-22 14:31:17,235] DEBUG:monitoring.client.MonitoringClient:IncludeKpi: {"kpi_id": {"kpi_id": {"uuid": "1"}}, "kpi_value": {"floatVal": -10.006026}, "timestamp": {"timestamp": 1708612267.321142}}
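The repeated sample timestamp in the logs above (1708612247.234192 is reported at both 14:30:47 and 14:30:57) can be reproduced with a small deterministic sketch. This is not the TeraFlowSDN code: `CachedSampler`, `run`, and the jittered tick times are hypothetical and only mirror the freshness check described in this report.

```python
from typing import Callable, List, Optional, Tuple


class CachedSampler:
    """Hypothetical stand-in for a samples cache (not the TeraFlowSDN class)."""

    def __init__(self, eviction_sec: float, poll: Callable[[float], float]) -> None:
        self.eviction_sec = eviction_sec
        self.poll = poll
        self.timestamp: Optional[float] = None
        self.value: Optional[float] = None

    def get(self, now: float) -> Tuple[Optional[float], Optional[float]]:
        # Same guard as in the report: the device is polled only when the
        # cached sample is missing or at least eviction_sec old.
        if self.timestamp is None or (now - self.timestamp) >= self.eviction_sec:
            self.value = self.poll(now)
            self.timestamp = now
        return self.timestamp, self.value


def run(eviction_sec: float, tick_times: List[float]) -> List[Optional[float]]:
    """Return the sample timestamp reported at each scheduler tick."""
    sampler = CachedSampler(eviction_sec, poll=lambda now: now)
    return [sampler.get(t)[0] for t in tick_times]


# Scheduler ticks nominally every 10 s with a few ms of jitter (made-up values).
TICKS = [0.000, 10.005, 20.001, 30.004, 40.002]

stale = run(10.0, TICKS)  # eviction == sampling period: some polls are skipped
fixed = run(8.0, TICKS)   # eviction at 80% of the period: every tick polls

print(stale)
print(fixed)
```

With the eviction time equal to the period, any tick that fires slightly less than 10 s after the previous poll finds the cached sample still "fresh" and reports it again.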
A solution to this bug is to set SAMPLES_EVICTION_SEC lower than SAMPLING_PERIOD in those cases, for instance to 80-90% of it, so that samples are always evicted BEFORE the next telemetry collection request.
For instance, in:
    def _refresh_samples(self) -> None:
        with self.__lock:
            try:
                now = datetime.timestamp(datetime.utcnow())
                if self.__timestamp is not None and (now - self.__timestamp) < SAMPLE_EVICTION_SECONDS: return
                #str_filter = get_filter(SAMPLE_RESOURCE_KEY)
change the condition
    if self.__timestamp is not None and (now - self.__timestamp) < SAMPLE_EVICTION_SECONDS: return
so that it uses SAMPLE_EVICTION_SECONDS * 0.8 when SAMPLE_EVICTION_SECONDS == sample_interval.
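A minimal sketch of the proposed change, kept separate from the driver code. The helper names `effective_eviction_seconds` and `is_sample_fresh` and the constant `EVICTION_MARGIN` are hypothetical. The 0.8 factor comes from the suggestion above. Applying it when the eviction time exceeds the sampling interval, and not only when the two are equal, is an assumption beyond what this report states.

```python
from typing import Optional

# Assumed safety factor, taken from the 80-90% range suggested in this report.
EVICTION_MARGIN = 0.8


def effective_eviction_seconds(
    eviction_sec: float, sample_interval: float, margin: float = EVICTION_MARGIN
) -> float:
    """Return an eviction threshold that stays below the sampling interval.

    The report covers eviction_sec == sample_interval; handling larger
    values the same way is an assumption of this sketch.
    """
    if eviction_sec >= sample_interval:
        return sample_interval * margin
    return eviction_sec


def is_sample_fresh(
    last_timestamp: Optional[float],
    now: float,
    eviction_sec: float,
    sample_interval: float,
) -> bool:
    """True if the cached sample may be reused instead of polling the device."""
    if last_timestamp is None:
        return False
    threshold = effective_eviction_seconds(eviction_sec, sample_interval)
    return (now - last_timestamp) < threshold
```

With a 10 s interval and a 10 s eviction time, the threshold becomes 8 s, so a tick arriving 9.996 s after the previous poll no longer reuses the cached sample.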
Affects (at least):
Deployment environment
- Operating System (include version): any
- MicroK8s (include version and add-ons): any
- TeraFlowSDN (include release/branch-name/commit-id): develop
TFS deployment settings
- List of components deployed: Device
- Particular configurations you applied: devices using NetConf and SamplesCache polling with APScheduler
Sequence of actions that resulted in the bug
See Description
Document the explicit error
See Description
Expected behaviour
See Description
References
None