P4 driver does not correctly retrieve resources

added bugresolving comp-device p4-driver labels

created branch feat/139-ubi-p4-driver-does-not-correctly-retrieve-resources to address this issue

mentioned in merge request !217 (closed)

A fix is available in the branch feat/139-ubi-p4-driver-does-not-correctly-retrieve-resources. The relevant MR (!217 (closed)) is marked as a draft for now, so that we can make sure there are no further errors. For the same reason, this issue is kept open.

I will check once again (either in a clean hackfest3 VM or a "normal" TFS VM). Maybe there was a merge error on my side, because even though I added these changes, the bootstrap functional test still fails (see attachment).

2024-03-18.txt

Here is the output from bootstrap functional test (TFS VM, adjusted to pyenv with Python 3.9.18): 2024-03-19.txt

@famelis how exactly did you get the P4 devices onboarded to TFS and 01_bootstrap to pass? A log from your execution after the patch would be helpful. The part I'm not sure about is whether you used the fabric_v1model.p4 files or the default main.p4 that ships with TFS.

The onboarding was done using the main.p4 that is available in the TFS source tree. Please repeat using this P4 program and let us know.

The logs already available use the fabric-int-v1model compilation files, the ones with which I started this thread. The common issue in these logs is this part, once again in the AddDevice function:

>           response = device_client.AddDevice(Device(**device_p4_with_connect_rules))

src/tests/hackfest3/tests/test_functional_bootstrap.py:93:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/common/tools/client/RetryDecorator.py:75: in wrapper
    return func(self, *args, **kwargs)
src/device/client/DeviceClient.py:52: in AddDevice
    response = self.stub.AddDevice(request)
../.pyenv/versions/3.9.18/envs/tfs/lib/python3.9/site-packages/grpc/_channel.py:946: in __call__
    return _end_unary_response_blocking(state, call, False, None)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

state = <grpc._channel._RPCState object at 0x7f4a0ddee220>, call = <grpc._cython.cygrpc.SegregatedCall object at 0x7f4a0dde8f00>
with_call = False, deadline = None

    def _end_unary_response_blocking(state, call, with_call, deadline):
        if state.code is grpc.StatusCode.OK:
            if with_call:
                rendezvous = _MultiThreadedRendezvous(state, call, None, deadline)
                return state.response, rendezvous
            else:
                return state.response
        else:
>           raise _InactiveRpcError(state)
E           grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
E               status = StatusCode.INTERNAL
E               details = "'NoneType' object has no attribute 'p4_objects'"
E               debug_error_string = "{"created":"@1710776401.893978939","description":"Error received from peer ipv4:10.152.183.244:2020","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"'NoneType' object has no attribute 'p4_objects'","grpc_status":13}"
E           >

../.pyenv/versions/3.9.18/envs/tfs/lib/python3.9/site-packages/grpc/_channel.py:849: _InactiveRpcError

My findings so far: this error message, i.e. "A object has no attribute B" (a Python AttributeError), pertains to the __getattr__ function implemented in p4_manager, which is defined for the _MeterConfig and _IdleTimeout classes. From the highlighted screenshot I cannot identify which class actually caused it, so I need to look closer.

For the latter log I applied the pytest flag --full-trace to get more verbose output, but it only gives slightly more insight into how pytest is instantiated during test execution, and little more about the root cause itself.

I can see that the P4Manager class defines a p4_objects field, whose name is self-explanatory.

failing gRPC method for reference: link

__getattr__ is supposed to return getattr(self._msg, name). If this object is seen as 'NoneType', there is a chance that self._msg = p4runtime_pb2.MeterConfig() is of type None (?)
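To make the hypothesis above testable, here is a minimal sketch. `MessageWrapper` is a hypothetical stand-in, not the p4_manager code. It shows that Python produces the same message whether `.p4_objects` is read directly from a `None` value or forwarded by a `__getattr__` wrapper whose `_msg` is `None`. If that holds for the real classes too, the message text alone cannot tell the two cases apart, and the server-side traceback is needed to see which object was actually `None`.

```python
class MessageWrapper:
    """Hypothetical wrapper that forwards unknown attributes to self._msg."""

    def __init__(self, msg):
        self._msg = msg

    def __getattr__(self, name):
        # Called only when normal attribute lookup on the wrapper fails.
        return getattr(self._msg, name)


def attribute_error_text(func):
    """Run func and return the AttributeError message, or None if it succeeds."""
    try:
        func()
    except AttributeError as exc:
        return str(exc)
    return None


# Case 1: the object that should hold p4_objects is itself None.
manager = None
direct = attribute_error_text(lambda: manager.p4_objects)

# Case 2: a wrapper exists, but the message it delegates to is None.
delegated = attribute_error_text(lambda: MessageWrapper(None).p4_objects)

print(direct)
print(delegated)
```

Both cases raise on a `None` receiver, so the client-side gRPC `details` string is not enough to locate the failing line.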

This might be irrelevant, but I also noticed a behaviour of the bootstrap test on its first execution after a VM reboot / reinstantiation: test progress stops at 50%, as if it were waiting for something. A rerun shows the error described above.

@katsikasg @famelis Do you have any suggestions for what might be a root cause at the moment?

We need a way to reproduce this bug on our side in order to determine what the problem could be.

Did you build TFS based on branch feat/139-ubi-p4-driver-does-not-correctly-retrieve-resources? This is the branch you should be working on.
What P4 program did you use? Was it the main.p4 in the TFS tree or some other P4 program?
Did you manage to bootstrap the devices (test_functional_bootstrap) or does the error occur prior to test_01 being successfully finished?

Yes, I am working on the branch you specified and I have deployed TFS using this branch.
I have used another P4 program, i.e.:

The program linked in the issue description: https://github.com/stratum/fabric-tna/blob/main/p4src/v1model/fabric_v1model.p4
I compiled it locally for the v1model architecture using make fabric-int-v1model (Makefile).
The compilation files (p4info.txt, bmv2.json) have been transferred into TFS and are used instead of the default main.p4 that's in the repo.

The error occurs during test_functional_bootstrap. So yes, test_01 is not successfully finished.

The left pane shows the functional test execution (with verbose pytest output); the right pane shows the kubectl logs for the Device pod. During execution of the test_devices_bootstraping method, where 'NoneType' object has no attribute 'p4_objects' occurs, the device pod reports an AddDevice exception, i.e.

File "/var/teraflow/device/service/drivers/p4/p4_driver.py", line 197, in GetConfig
    obj_name for obj_name, _ in self.__manager.p4_objects.items()
AttributeError: 'NoneType' object has no attribute 'p4_objects'

which is specifically given here.

tfs-vm-logs.log

That means that self.__manager is None, i.e. the P4Manager was not started. In the logs, instead of "Connected via P4Runtime" there is:

[2024-04-03 22:29:12,102] INFO:device.service.drivers.p4.p4_driver:Connecting to P4 device 10.0.2.4:50001 ...
[2024-04-03 22:29:14,112] CRITICAL:root:Failed to establish session with server
[2024-04-03 22:29:15,111] CRITICAL:root:StreamChannel error, closing stream
[2024-04-03 22:29:15,112] CRITICAL:root:P4Runtime RPC error (CANCELLED): Received RST_STREAM with error code 8
[2024-04-03 22:32:32,723] INFO:device.service.drivers.p4.p4_driver:Getting configuration from P4 device 10.0.2.4:50001 ...
[2024-04-03 22:32:32,724] INFO:device.service.drivers.p4.p4_driver:Getting configuration from P4 device 10.0.2.4:50001 ...
[2024-04-03 22:32:32,724] WARNING:device.service.drivers.p4.p4_driver:GetConfig with no resource keys implies getting all resource keys!
[2024-04-03 22:32:32,726] ERROR:device.service.DeviceServiceServicerImpl:AddDevice exception

p4client handles session establishment error, as defined here, as well as StreamChannel error right here.

It's a little odd that even though the stream was closed, p4_driver tried GetConfig anyway.
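One way to make this failure mode explicit is to refuse configuration reads when no manager exists. The following is a minimal sketch under stated assumptions, not the TFS driver: `SketchDriver`, `DriverNotConnectedError`, and `manager_factory` are hypothetical names, and only the `p4_objects` attribute is taken from the traceback above.

```python
class DriverNotConnectedError(RuntimeError):
    """Hypothetical error raised when the driver has no active manager."""


class SketchDriver:
    """Hypothetical driver illustrating a guard before reading p4_objects."""

    def __init__(self):
        self._manager = None

    def connect(self, manager_factory):
        # manager_factory is a hypothetical callable that returns a connected
        # manager or raises ConnectionError if the session cannot be set up.
        try:
            self._manager = manager_factory()
        except ConnectionError:
            self._manager = None
            return False
        return True

    def get_config(self):
        # Fail with a descriptive error instead of an AttributeError on None.
        if self._manager is None:
            raise DriverNotConnectedError(
                "no P4Runtime session; connect() must succeed before get_config()"
            )
        return sorted(self._manager.p4_objects)


class FakeManager:
    """Stand-in manager exposing only the p4_objects mapping."""

    def __init__(self):
        self.p4_objects = {"table_b": object(), "table_a": object()}


def failing_factory():
    raise ConnectionError("Failed to establish session with server")
```

With a guard like this, the AddDevice failure would report the missing session directly rather than a `'NoneType'` attribute error, which could shorten the diagnosis.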

I forgot to add this comment a couple of days ago. I still need to verify it, but most likely this is the case: the error above comes from a "vanilla" TFS VM, i.e. one with no adjustments made specifically for the hackfest and, importantly for devices, no mininet environment. The TFS VM does not include it out of the box; there is even an instruction from the previous hackfest event on how to set it up.

So there is a chance that simply not having mininet installed and running in the background (with the proper topology) causes this AddDevice exception. After all, it would be naive to assume that a P4 device could be registered out of thin air, without an actual environment to run it. My bad for using a different VM than the one described in the issue.
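A quick way to rule this out before running the bootstrap test is to check that something is listening on the device's P4Runtime port. This is a minimal sketch using only the standard library; the function name is hypothetical. A successful TCP connection does not prove that a P4Runtime session can be established, only that the port is open.

```python
import socket


def p4runtime_port_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For the device in the logs above, the call would be `p4runtime_port_reachable("10.0.2.4", 50001)`. A `False` result would be consistent with mininet not running.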

I know I'm not supposed to be working on a different branch, but out of curiosity I tried to recreate this behavior on the feat/hackfest3 branch with the p4driver/p4manager changes applied by hand.

Going down this road, I found that the files I want to work with (the fabric_v1model compilation files) pass the tests just fine. I attach the logs to back it up: 2024-04-03-tsh.log

I am still troubleshooting the above-mentioned bug, for which I will attach a separate log.

unassigned @famelis

changed the description

Hi @katsikasg, have you or Alex/Pantelis tried to recreate this setup and the described problem? After a short break I will resume my activities, but let me know how it has been going on your end so far.

Hi @jakub.gorczynski.stud, we haven't tried yet but my colleagues @pmalekas and @avalantasis will let you know when we have an outcome.

To recreate this open issue, it is also important to note that there is a drift between the state of the hackfest3 VM as delivered for the event and the current state of the feat/hackfest3 branch in the repo. This matters mainly for pytest behavior, because the latest version of the hackfest3 branch contains changes in the P4 service handler that are relevant for the last part of the demo, namely the telemetry toy case.

hackfest3-demo-feat_139_branch.log

Therefore, the attached log shows that with the main.p4 program (the very basic one presented at the beginning), the functional test that handles the service tries to look up INT-related tables, because the P4 service handler contains simple JSON-based functions that operate on them. These tables are not included in main.p4.

This does not block the mininet ping, though; the functional test just will not be marked as PASSED, and in the web UI the service will have status SERVICESTATUS_PLANNED.

I am mentioning this because it then propagates to the relevant feat/139-ubi-p4-driver-does-not-correctly-retrieve-resources branch.
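The mismatch between the tables a service handler expects and the tables a compiled program provides can be checked up front. Below is a rough sketch that extracts table names from a p4info text file with a regular expression and reports which required ones are missing. It assumes the usual `tables { preamble { ... name: "..." } }` layout of the text-format p4info; the function names and the table names in the sample are hypothetical and are not taken from main.p4 or the service handler.

```python
import re

# Matches the preamble name of each top-level "tables" entry in p4info text.
_TABLE_NAME = re.compile(
    r'tables\s*\{\s*preamble\s*\{[^}]*?\bname:\s*"([^"]+)"'
)


def table_names(p4info_text):
    """Return the set of table names declared in a p4info text dump."""
    return set(_TABLE_NAME.findall(p4info_text))


def missing_tables(p4info_text, required):
    """Return the required table names that the p4info does not declare."""
    return sorted(set(required) - table_names(p4info_text))


# Hypothetical p4info fragment with one table and one action.
SAMPLE_P4INFO = '''
tables {
  preamble {
    id: 1
    name: "Example.l2_table"
    alias: "l2_table"
  }
}
actions {
  preamble {
    id: 2
    name: "Example.drop"
  }
}
'''

print(missing_tables(SAMPLE_P4INFO, ["Example.l2_table", "Example.int_table"]))
```

A real check would more robustly parse the file with the protobuf text-format parser; the regular expression is only meant to keep the sketch dependency-free.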

Also in the log you will see that for the device bootstrap test I had some minor issues with experimental/irrelevant features, which I simply commented out in the DeviceClient:

    def connect(self):
        self.channel = grpc.insecure_channel(self.endpoint)
        self.stub = DeviceServiceStub(self.channel)
        # self.openconfig_stub=OpenConfigServiceStub(self.channel)
...
    # def ConfigureOpticalDevice(self, request : OpticalConfig) -> OpticalConfigId:
    #     LOGGER.debug('ConfigureOpticalDevice request: {:s}'.format(grpc_message_to_json_string(request)))
    #     response = self.openconfig_stub.ConfigureOpticalDevice(request)
    #     LOGGER.debug('ConfigureOpticalDevice result: {:s}'.format(grpc_message_to_json_string(response)))
    #     return response

Device:

# DEVICE_IETF_ACTN_TYPE    = DeviceTypeEnum.OPEN_LINE_SYSTEM.value
# DEVICE_IETF_ACTN_DRIVERS = [DeviceDriverEnum.DEVICEDRIVER_IETF_ACTN]
...
# def json_device_ietf_actn_disabled(
#         device_uuid : str, name : Optional[str] = None, endpoints : List[Dict] = [], config_rules : List[Dict] = [],
#         drivers : List[Dict] = DEVICE_IETF_ACTN_DRIVERS
#     ):
#     return json_device(
#         device_uuid, DEVICE_IETF_ACTN_TYPE, DEVICE_DISABLED, name=name, endpoints=endpoints, config_rules=config_rules,
#         drivers=drivers)

and ContextClient:

from common.proto.context_pb2 import (
    ...
    # , OpticalConfig, OpticalConfigId, OpticalConfigList
)
...
#//////////////// Experimental //////////////////

    # @RETRY_DECORATOR
    # def SetOpticalConfig(self, request : OpticalConfig) -> OpticalConfigId:
    #     LOGGER.debug('SetOpticalConfig request: {:s}'.format(grpc_message_to_json_string(request)))
    #     response = self.stub.SetOpticalConfig(request)
    #     LOGGER.debug('SetOpticalConfig result: {:s}'.format(grpc_message_to_json_string(response)))
    #     return response

    # @RETRY_DECORATOR
    # def GetOpticalConfig(self, request : Empty) -> OpticalConfigList:
    #     LOGGER.debug('GetOpticalConfig request: {:s}'.format(grpc_message_to_json_string(request)))
    #     response = self.stub.GetOpticalConfig(request)
    #     LOGGER.debug('GetOpticalConfig result: {:s}'.format(grpc_message_to_json_string(response)))
    #     return response

    # @RETRY_DECORATOR
    # def SelectOpticalConfig(self,request : OpticalConfigId) -> OpticalConfigList:
    #     LOGGER.debug('SelectOpticalConfig request: {:s}'.format(grpc_message_to_json_string(request)))
    #     response = self.stub.SelectOpticalConfig(request)
    #     LOGGER.debug('SelectOpticalConfig result: {:s}'.format(grpc_message_to_json_string(response)))
    #     return response

As for fabric_v1model.p4: the device bootstrap test passes, but there is a new problem to be solved on the service side (see attachment).

hackfest3-fabric-feat_139_branch.log

_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
src/common/tools/client/RetryDecorator.py:75: in wrapper
    return func(self, *args, **kwargs)
src/service/client/ServiceClient.py:58: in UpdateService
    response = self.stub.UpdateService(request)
../.pyenv/versions/3.9.18/envs/tfs/lib/python3.9/site-packages/grpc/_channel.py:946: in __call__
    return _end_unary_response_blocking(state, call, False, None)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

state = <grpc._channel._RPCState object at 0x7f4f72866400>, call = <grpc._cython.cygrpc.SegregatedCall object at 0x7f4f72864fc0>, with_call = False, deadline = None

    def _end_unary_response_blocking(state, call, with_call, deadline):
        if state.code is grpc.StatusCode.OK:
            if with_call:
                rendezvous = _MultiThreadedRendezvous(state, call, None, deadline)
                return state.response, rendezvous
            else:
                return state.response
        else:
>           raise _InactiveRpcError(state)
E           grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
E               status = StatusCode.RESOURCE_EXHAUSTED
E               details = "received initial metadata size exceeds limit"
E               debug_error_string = "{"created":"@1713134005.798714726","description":"Error received from peer ipv4:10.152.183.248:3030","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"received initial metadata size exceeds limit","grpc_status":8}"
E           >

There was no relevant info in the service or sbi pods, though.
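For reference, gRPC exposes a channel argument, `grpc.max_metadata_size`, that bounds the metadata a channel will accept. Whether raising it on the test client is appropriate here is an assumption, since the thread has not established why the metadata is so large. The sketch below only shows how such an option could be passed; the helper names and the 1 MiB value are hypothetical, and `grpc` is imported lazily so the option builder can be used without it.

```python
def build_channel_options(max_metadata_bytes):
    """Build gRPC channel options that raise the accepted metadata size."""
    return [("grpc.max_metadata_size", int(max_metadata_bytes))]


def open_insecure_channel(endpoint, max_metadata_bytes=1024 * 1024):
    """Open an insecure channel with a larger metadata limit (hypothetical helper)."""
    import grpc  # requires the grpcio package

    return grpc.insecure_channel(
        endpoint, options=build_channel_options(max_metadata_bytes)
    )
```

Raising the limit would at most let the client receive the oversized metadata; it would not explain what the service is putting there, so the server side still needs to be inspected.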

Here is a more verbose log, with the DEBUG log level and the --full-trace flag enabled.

2024-04-16_fabric-service-fault_verbose.log

@avalantasis @pmalekas have you encountered similar output? Let me know.

Today we managed to replicate your issue, so we are ready to start on fixes tomorrow (likely after our call).

Hi @katsikasg, what's the status of fixes for P4 service handler? cc @avalantasis @pmalekas

added to epic &5 (closed)
