Wednesday, August 2, 2017

Check Point Firewall: Modifying The FWKERN.CONF File To Overcome Dropped Packets From The Queue Buffer

Here recently, a server guy came to me and told me he needed some network help to get an issue of his resolved.  Long story short, his NetApp replication from one site to another was failing, and he couldn't find anything wrong in his configuration to explain it.  After troubleshooting the firewall and network from my perspective, I didn't see anything wrong either.  This, needless to say, did not help him out any.
However, after further review, I found that the reason I didn't see anything in my firewall logs was that the traffic never made it to the Check Point application itself.  There actually were dropped packets, just at the OS level.  This took some time to troubleshoot, but what we found was that the kernel input queue buffer was getting more traffic than it could hold and was dropping packets.
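If you want to confirm this kind of drop for yourself, one rough way (a sketch, assuming a reasonably recent Gaia build; the exact messages vary by version) is to watch the kernel drop debug and the per-instance statistics while the problem is happening.  The "fully utilized" drop messages quoted in the comments below are the kind of thing to look for:

[Expert@CheckPoint:0]# fw ctl multik stat
[Expert@CheckPoint:0]# fw ctl zdebug drop | grep -i "fully utilized"

Just be careful running zdebug on a busy gateway, since it adds load of its own.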
So, what did we do?  Well, the queue limit is set to 2048 by default (in Gaia on the Check Point appliances).  We wanted to up that limit to 8196, since we had plenty of memory to do so (don't do this unless you know for sure you have plenty of resources, as this may not resolve your issue).  In this case, one of my CPU cores (CPU #1) was consistently hitting 100% utilization.  So, time to edit the fwkern.conf file.
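Before bumping the value, do a quick sanity check that you actually have the headroom.  Gaia is Linux underneath, so the usual tools work; inside top, press 1 to break CPU usage out per core:

[Expert@CheckPoint:0]# free -m
[Expert@CheckPoint:0]# top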
After logging into the Check Point CLI and going into expert mode, I went to the /var/opt/fw.boot/modules directory, where the fwkern.conf file resides.  I opened it in vi and put in the following line (note that fwkern.conf entries must not contain spaces around the equals sign):
fwmultik_input_queue_len=8196
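For reference, the whole edit looks roughly like this (if fwkern.conf doesn't exist yet on your gateway, just create it in that same directory):

[Expert@CheckPoint:0]# cd /var/opt/fw.boot/modules
[Expert@CheckPoint:0]# vi fwkern.conf
[Expert@CheckPoint:0]# cat fwkern.conf
fwmultik_input_queue_len=8196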

After coming out of vi and rebooting the HA cluster, everything worked well and his NetApp issue was resolved.  No more dropped packets from the buffer, and CPU was down to 10%.  To check what your setting is currently at, run the following (the output below shows the default value, which is what you will see before the change has taken effect):
[Expert@CheckPoint:0]# fw ctl get int fwmultik_input_queue_len
fwmultik_input_queue_len = 2048
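As a side note, fw ctl set int can change many kernel parameters on the fly, but some only take effect at boot time, which may be why the reboot was needed here; either way, the fwkern.conf entry is what makes the change survive a reboot.  After the reboot, the same get command should show the new value:

[Expert@CheckPoint:0]# fw ctl set int fwmultik_input_queue_len 8196
[Expert@CheckPoint:0]# fw ctl get int fwmultik_input_queue_len
fwmultik_input_queue_len = 8196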

3 comments:

  1. That is some EXCELLENT troubleshooting, Shane!!! A great example of digging until you find the answer!!

  2. Thanks for sharing this troubleshooting experience. We actually had the same issue recently, and during the incident we could see one specific firewall kernel instance fully utilized at 99% CPU, and the zdebug output gave us packet drop logs similar to "...dropped by fwkdrv_enqueue_packet_user_ex Reason: Instance is currently fully utilized;" or "...dropped by cphwd_pslglue_handle_packet_cb Reason: F2P: Instance is currently fully utilized;". Unluckily we did not find this article at the time and decided to manually fail over the cluster, and the issue went away. Later on, Check Point support told us to change this kernel parameter, fwmultik_input_queue_len, from the default 2048 to 8196. So far we have not seen the same issue again, though we could not conclude that the queue buffer limit was the root cause. Wondering how you identified this queue buffer limit? Is there any metric we can check to verify whether the queue buffer is full?

  3. As I see at the end, the value fwmultik_input_queue_len hasn't been changed and stayed at 2048... why is that?

