I used to think the SecondaryNameNode acted as a backup node that provides high availability for Hadoop.

Here are the Hadoop processes running on my machine:

 $ jps
1698 SecondaryNameNode
1449 DataNode
4825 Jps
1295 NameNode

After reading the book, I learned that this is not the case. The book says:

  • An optional HDFS master node process is the SecondaryNameNode, often simplified to 2NN.
  • The SecondaryNameNode is located on a different host from the NameNode, and periodically performs a recovery operation on behalf of the primary NameNode, going through the same sequence of operations that the primary NameNode would do under a normal startup routine. This includes taking the current fsimage file from the NameNode, applying all updates from the edits files in sequence, and then creating a new fsimage file. The resultant point-in-time filesystem snapshot (fsimage file) is then replaced on the primary NameNode, meaning subsequent recovery operations are substantially shortened. This process is called checkpointing and is scheduled to occur on the SecondaryNameNode at a configured frequency. The checkpointing process also has the side benefit of reducing the amount of disk space consumed by the journaling function (edits files).
  • The SecondaryNameNode is not a high availability (HA) solution as the SecondaryNameNode is not a hot standby for the NameNode. The SecondaryNameNode, however, does provide an alternate storage location for the on-disk representation of the NameNode’s metadata in case of a catastrophic failure on the NameNode.
  • If you are running HDFS in HA mode, which will be covered in Hour 21, “Understanding Advanced HDFS,” then the Standby NameNode assumes the same responsibilities (for checkpointing) that the SecondaryNameNode would normally do.
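To make the checkpointing idea from the quote concrete, here is a toy sketch in Python. It is not how HDFS stores or parses its files: the dictionary standing in for the fsimage, the tuple format of the edits, and the function names `apply_edit` and `checkpoint` are all assumptions made up for illustration. It only shows the general pattern of replaying a log of edits over a snapshot to get a new snapshot.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to the namespace (hypothetical toy format)."""
    op = edit[0]
    if op == "create":
        _, path, size = edit
        namespace[path] = size
    elif op == "delete":
        _, path = edit
        namespace.pop(path, None)
    elif op == "rename":
        _, src, dst = edit
        namespace[dst] = namespace.pop(src)
    else:
        raise ValueError(f"unknown op: {op}")


def checkpoint(fsimage, edits):
    """Replay the edits in order over a copy of fsimage and return the merged image."""
    merged = dict(fsimage)  # leave the old snapshot untouched
    for edit in edits:
        apply_edit(merged, edit)
    return merged


# Old snapshot: path -> file size (a stand-in for real file metadata).
fsimage = {"/data/a.txt": 128}

# Changes logged since that snapshot was taken.
edits = [
    ("create", "/data/b.txt", 64),
    ("rename", "/data/a.txt", "/data/c.txt"),
    ("delete", "/data/b.txt"),
]

new_fsimage = checkpoint(fsimage, edits)
edits = []  # the merged edits are no longer needed, so the log can start fresh

print(new_fsimage)  # {'/data/c.txt': 128}
```

After the merge, a restart only has to load the new snapshot plus whatever edits arrived later, which is why the quote says recovery gets shorter and the edits files take less disk space.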

The quote is from this book:

Sams Teach Yourself Hadoop in 24 Hours