1. Background

I have a Doris cluster running on K8S that consists of one FE and three BE nodes.

After making some configuration changes, I restarted the FE node. However, the FE failed to start and kept printing the following log messages:

RuntimeLogger 2025-02-06 08:49:51,614 INFO (stateListener|92) [Env$5.runOneCycle():2690] begin to transfer FE type from INIT to UNKNOWN
RuntimeLogger 2025-02-06 08:49:51,615 INFO (stateListener|92) [Env$5.runOneCycle():2777] finished to transfer FE type to UNKNOWN
RuntimeLogger 2025-02-06 08:49:51,715 INFO (UNKNOWN fe_f4b9da77_6540_4377_90b8_3cf5a974c428(-1)|1) [Env.waitForReady():1082] wait catalog to be ready. feType:UNKNOWN isReady:false, counter:1 reason:
RuntimeLogger 2025-02-06 08:50:01,727 INFO (UNKNOWN fe_f4b9da77_6540_4377_90b8_3cf5a974c428(-1)|1) [Env.waitForReady():1082] wait catalog to be ready. feType:UNKNOWN isReady:false, counter:101 reason:
RuntimeLogger 2025-02-06 08:50:11,746 INFO (UNKNOWN fe_f4b9da77_6540_4377_90b8_3cf5a974c428(-1)|1) [Env.waitForReady():1082] wait catalog to be ready. feType:UNKNOWN isReady:false, counter:201 reason:
RuntimeLogger 2025-02-06 08:50:21,766 INFO (UNKNOWN fe_f4b9da77_6540_4377_90b8_3cf5a974c428(-1)|1) [Env.waitForReady():1082] wait catalog to be ready. feType:UNKNOWN isReady:false, counter:301 reason:
RuntimeLogger 2025-02-06 08:50:31,782 INFO (UNKNOWN fe_f4b9da77_6540_4377_90b8_3cf5a974c428(-1)|1) [Env.waitForReady():1082] wait catalog to be ready. feType:UNKNOWN isReady:false, counter:401 reason:
RuntimeLogger 2025-02-06 08:50:41,789 INFO (UNKNOWN fe_f4b9da77_6540_4377_90b8_3cf5a974c428(-1)|1) [Env.waitForReady():1082] wait catalog to be ready. feType:UNKNOWN isReady:false, counter:501 reason:
RuntimeLogger 2025-02-06 08:50:51,798 INFO (UNKNOWN fe_f4b9da77_6540_4377_90b8_3cf5a974c428(-1)|1) [Env.waitForReady():1082] wait catalog to be ready. feType:UNKNOWN isReady:false, counter:601 reason:
RuntimeLogger 2025-02-06 08:51:01,806 INFO (UNKNOWN fe_f4b9da77_6540_4377_90b8_3cf5a974c428(-1)|1) [Env.waitForReady():1082] wait catalog to be ready. feType:UNKNOWN isReady:false, counter:701 reason:

The message wait catalog to be ready. feType:UNKNOWN isReady:false keeps repeating while the counter climbs, so the FE never leaves the UNKNOWN state. This appears to be a metadata corruption issue. Doris documents metadata recovery at https://doris.apache.org/docs/admin-manual/trouble-shooting/metadata-operation, but that guide doesn’t specifically cover how to perform the recovery in a K8S environment.
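
To check for this state without reading the log by eye, the wait lines can be summarized with a small shell helper. This is a minimal sketch that assumes the log format shown above; the function name last_wait_status is made up for illustration and is not part of Doris.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize the most recent "wait catalog to be ready" line.
# Assumes the FE log format shown above (feType:..., isReady:..., counter:...).
last_wait_status() {
    grep 'wait catalog to be ready' \
        | tail -n 1 \
        | grep -o -E 'feType:[A-Za-z_]+|isReady:[a-z]+|counter:[0-9]+' \
        | tr '\n' ' ' \
        | sed 's/ $//'
}
```

One way to use it is to pipe the pod log into the function, for example kubectl logs doris-fe-0 -n your-namespace | last_wait_status. If the counter keeps growing between two calls while isReady stays false, the FE is still stuck.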

2. Solution

In a K8S environment, the FE container is started by fe_entrypoint.sh, which contains the following code:

local recovery=`grep "\<selectdb.com.doris/recovery\>" $ANNOTATION_PATH | grep -v '^\s*#' | sed 's|^\s*'$confkey'\s*=\s*\(.*\)\s*$|\1|g'`
if [[ "x$recovery" != "x" ]]; then
    opts=${opts}" --metadata_failure_recovery"
fi
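
To see what this check does, the logic can be tried outside the container against a sample annotations file. The sketch below is a simplified stand-in, not the actual script: it drops the sed step, uses a temporary file in place of the real $ANNOTATION_PATH, and the function name recovery_opts and the example.com/owner annotation are made up for illustration. It assumes the file holds one key="value" pair per line and that GNU grep is available.

```shell
#!/usr/bin/env bash
# Simplified stand-in for the annotation check in fe_entrypoint.sh.
# recovery_opts is a hypothetical name; the real script sets $opts inline.
recovery_opts() {
    local file="$1"
    local opts=""
    local recovery
    # Keep lines that mention the annotation key and are not commented out.
    recovery=$(grep '\<selectdb.com.doris/recovery\>' "$file" | grep -v '^\s*#' || true)
    if [ -n "$recovery" ]; then
        opts="${opts} --metadata_failure_recovery"
    fi
    echo "$opts"
}

# Sample annotations file (a temp file, not the real mount path).
sample="$(mktemp)"
printf '%s\n' 'example.com/owner="data-team"' 'selectdb.com.doris/recovery="true"' > "$sample"
recovery_opts "$sample"
rm -f "$sample"
```

With the sample file the function prints the recovery flag, and with a file that lacks the annotation it prints an empty line. As far as this sketch goes, the check only tests whether a matching line exists, so the presence of the annotation seems to matter rather than its value.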

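In other words, the script passes --metadata_failure_recovery to the FE only when the pod carries the selectdb.com.doris/recovery annotation, so we need to add that annotation to the pod template of the FE StatefulSet. Here’s how to do it: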

  1. Add the recovery annotation to the FE StatefulSet:
kubectl patch sts doris-fe -n your-namespace -p '{"spec":{"template":{"metadata":{"annotations":{"selectdb.com.doris/recovery":"true"}}}}}'
  2. Delete the FE pod to trigger a restart:
kubectl delete pod doris-fe-0 -n your-namespace

The FE pod will be recreated with the recovery annotation, and the --metadata_failure_recovery parameter will be added to the startup command. This should allow the FE to recover from the metadata corruption.

  3. Monitor the FE logs to verify the recovery process:
kubectl logs -f doris-fe-0 -n your-namespace

Once the recovery is complete, you should see the FE successfully start up and transition to the MASTER state. After confirming everything is working correctly, you can remove the recovery annotation:

kubectl patch sts doris-fe -n your-namespace -p '{"spec":{"template":{"metadata":{"annotations":{"selectdb.com.doris/recovery":null}}}}}'

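Note: Make sure to replace your-namespace with your actual namespace, and doris-fe with the name of the FE StatefulSet in your cluster (the pod name doris-fe-0 changes accordingly).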
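
To avoid editing each command by hand, the steps above can be wrapped in a few shell functions. This is only a sketch: NAMESPACE, STS, KUBECTL and the function names are illustrative, not part of Doris or kubectl, and it assumes the first FE pod is named after the StatefulSet with a -0 suffix.

```shell
#!/usr/bin/env bash
# Sketch: the recovery steps as parameterized functions.
# NAMESPACE, STS and KUBECTL are illustrative variables chosen for this example.
NAMESPACE="${NAMESPACE:-your-namespace}"
STS="${STS:-doris-fe}"          # FE StatefulSet name; the first pod is "${STS}-0"
KUBECTL="${KUBECTL:-kubectl}"   # set KUBECTL=echo to preview the commands

# Build the patch body: pass "true" to set the annotation, "null" to remove it.
recovery_patch() {
    local value="$1"
    if [ "$value" != "null" ]; then
        value="\"$value\""
    fi
    printf '{"spec":{"template":{"metadata":{"annotations":{"selectdb.com.doris/recovery":%s}}}}}' "$value"
}

# Steps 1 and 2: add the annotation, then delete the pod so it restarts.
enable_recovery() {
    "$KUBECTL" patch sts "$STS" -n "$NAMESPACE" -p "$(recovery_patch true)"
    "$KUBECTL" delete pod "${STS}-0" -n "$NAMESPACE"
}

# Cleanup: remove the annotation once the FE is healthy again.
disable_recovery() {
    "$KUBECTL" patch sts "$STS" -n "$NAMESPACE" -p "$(recovery_patch null)"
}
```

After sourcing the file, running KUBECTL=echo enable_recovery prints the commands without touching the cluster, which is a cheap way to double-check the names before running enable_recovery for real.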