
Setting Up HDFS Services Using Cloudera API [Part 3]

This is the second follow-up post in this series. The series so far covers:
  1. Creating a cluster.
  2. Installing the HDFS service on our cluster.
Creating the cluster and setting up parcels was covered in the earlier post; this post covers installing the HDFS service.

Install HDFS Service.

The HDFS service is installed in stages (a high-level sketch tying them together follows the list):
  1. Create an HDFS service (if it does not already exist).
  2. Update the configuration for our newly created HDFS service.
  3. Create HDFS roles (NAMENODE, SECONDARYNAMENODE, DATANODE, GATEWAY) on the cluster.
  4. Format the NameNode.
  5. Start the HDFS service.
  6. Create the temporary /tmp directory in HDFS.
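
Here is a minimal sketch tying these stages together, using the helper functions shown in the rest of this post (start_hdfs_service and create_hdfs_tmp_dir are assumed names for the last two stages, and the NameNode role name follows the role-naming pattern explained below):

    def setup_hdfs(cluster, config):
        #
        # High-level flow for the six stages above.
        #
        hdfs_service = create_service(cluster)
        service_update_configuration(hdfs_service)
        hdfs_create_cluster_services(config, hdfs_service, 'HDFS')
        #
        # 'HDFS-NAMENODE-1' assumes the '{service_name}-{group}-{role_id}'
        # naming pattern used when the roles are created below.
        #
        format_namenode(hdfs_service, 'HDFS-NAMENODE-1')
        start_hdfs_service(hdfs_service)
        create_hdfs_tmp_dir(hdfs_service)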

    Create an HDFS service.

    This is simple: we create the service if it does not already exist.
    from cm_api.api_client import ApiException

    def create_service(cluster):
        try:
            hdfs_service = cluster.get_service('HDFS')
            logging.debug("Service {0} already present on the cluster".format('HDFS'))
        except ApiException:
            #
            # Create the service if this is the first run.
            #
            hdfs_service = cluster.create_service('HDFS', 'HDFS')
            logging.info("Created New Service: HDFS")

        return hdfs_service
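
    A usage sketch for this function; the Cloudera Manager credentials and the cluster name 'AhmedCluster' are assumptions (mycmhost.ahmed.com is the host used in the YAML below).

    from cm_api.api_client import ApiResource

    # Connect to Cloudera Manager (credentials and port are assumptions).
    api = ApiResource('mycmhost.ahmed.com', username='admin', password='admin')
    cluster = api.get_cluster('AhmedCluster')
    hdfs_service = create_service(cluster)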
    

    Update configuration for HDFS.

    This information is picked up from the configuration YAML file.
    HDFS:
     config:
       dfs_replication: 3
       dfs_permissions: false
       dfs_block_local_path_access_user: impala,hbase,mapred,spark
     roles:
       - group: NAMENODE
         hosts:
           - mycmhost.ahmed.com
         config:
           dfs_name_dir_list: /data/1/dfs/nn,/data/2/dfs/nn
           dfs_namenode_handler_count: 30
       - group: SECONDARYNAMENODE
         hosts:
           - mycmhost.ahmed.com
         config:
           fs_checkpoint_dir_list: /data/1/dfs/snn,/data/2/dfs/snn
    
       - group: DATANODE
         hosts:
           - mycmhost.ahmed.com
         config:
           dfs_data_dir_list: /data/1/dfs/dn,/data/2/dfs/dn
           dfs_datanode_handler_count: 30
           #dfs_datanode_du_reserved: 1073741824
           dfs_datanode_failed_volumes_tolerated: 0
           dfs_datanode_data_dir_perm: 755
       - group: GATEWAY
         hosts:
           - mycmhost.ahmed.com
         config:
           dfs_client_use_trash: true
    
    Code snippet.
    def service_update_configuration(hdfs_service):
        """
        Update the HDFS service-wide configuration from the YAML file.
        """
        hdfs_service.update_config(config['services']['HDFS']['config'])
        logging.info("Service Configuration Updated.")
    

    Create HDFS roles (NAMENODE, SECONDARYNAMENODE, DATANODE, GATEWAY) on the Cluster.

    To create all the roles:
  • Each role needs to be unique on each host.
  • We create a unique role_name for each node.
    Each role name is built from the following set of strings: (service_name, group, role_id)
    role_name = '{0}-{1}-{2}'.format(service_name, group, role_id)
    
    Here is the code snippet.
    def hdfs_create_cluster_services(config, service, service_name):
      """
          Creating Cluster services
      :return:
      """
    
      #
      # Get the role config for the group
      # Update group configuration and create roles.
      #
      for role in config['services'][service_name]['roles']:
          role_group = service.get_role_config_group("{0}-{1}-BASE".format(service_name, role['group']))
          #
          # Update the group's configuration.
          # [https://cloudera.github.io/cm_api/epydoc/5.10.0/cm_api.endpoints.role_config_groups.ApiRoleConfigGroup-class.html#update_config]
          #
          role_group.update_config(role.get('config', {}))
          #
          # Create roles now.
          #
          hdfs_create_roles(service, service_name, role, role['group'])
    
     def hdfs_create_roles(service, service_name, role, group):
       """
       Create individual roles for all the hosts under a specific role group.

       :param service: HDFS service object
       :param service_name: Name of the service (e.g. HDFS)
       :param role: Role configuration from yaml
       :param group: Role group name
       """
      role_id = 0
      for host in role.get('hosts', []):
          role_id += 1
          role_name = '{0}-{1}-{2}'.format(service_name, group, role_id)
          logging.info("Creating Role name as: " + str(role_name))
          try:
              service.get_role(role_name)
          except ApiException:
              service.create_role(role_name, group, host)
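
     A usage sketch for this step, with a quick check of the roles that were created (get_all_roles is part of ApiService):

     hdfs_create_cluster_services(config, hdfs_service, 'HDFS')

     # List the roles that now exist on the service.
     for role in hdfs_service.get_all_roles():
         logging.info("Role: {0} ({1}) on host {2}".format(
             role.name, role.type, role.hostRef.hostId))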
    

Format Namenode.

The first time we create an HDFS environment we need to format the NameNode; this initializes the HDFS cluster.
The format_hdfs method returns an ApiCommand, which we can use to track the progress of the execution.
 def format_namenode(hdfs_service, namenode):
     try:
         #
         # Formatting HDFS - this will have no effect the second time it runs.
         # Format NameNode instances of an HDFS service.
         #
         # https://cloudera.github.io/cm_api/epydoc/5.10.0/cm_api.endpoints.services.ApiService-class.html#format_hdfs
         #

         cmd = hdfs_service.format_hdfs(namenode)[0]
         logging.debug("Command Response: " + str(cmd))
         if not cmd.wait(300).success:
             print "WARNING: Failed to format HDFS, attempting to continue with the setup"
     except ApiException:
         logging.info("HDFS cannot be formatted. It may already be in use.")

Start HDFS service

We do this using the service.start() method. This method returns an ApiCommand, which lets us track progress and wait for the service to start using cmd.wait().success.
More details about the ApiService class are in the API documentation linked above.
Once this command completes, our service should be up and running.
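
A minimal sketch of this step (the helper name start_hdfs_service and the 300-second timeout are assumptions):

 def start_hdfs_service(hdfs_service):
     #
     # service.start() returns an ApiCommand that we can wait on.
     #
     cmd = hdfs_service.start()
     if not cmd.wait(300).success:
         logging.error("Failed to start the HDFS service.")
     else:
         logging.info("HDFS service is up and running.")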

Finally, creating the /tmp directory on HDFS.

When we create an HDFS cluster we also create a /tmp directory. The HDFS /tmp directory is used as temporary storage during MapReduce operations: MapReduce artifacts and intermediate data are kept under this directory.
If we delete the /tmp contents, any MR jobs currently running will lose their intermediate data; any MR jobs run after /tmp is cleared will still work without issues.
Creating /tmp is done using the create_hdfs_tmp method, which returns an ApiCommand response.
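
A minimal sketch of this step (the helper name create_hdfs_tmp_dir and the 60-second timeout are assumptions):

 def create_hdfs_tmp_dir(hdfs_service):
     #
     # create_hdfs_tmp() creates /tmp on HDFS and returns an ApiCommand.
     #
     cmd = hdfs_service.create_hdfs_tmp()
     if not cmd.wait(60).success:
         logging.error("Failed to create /tmp directory on HDFS.")
     else:
         logging.info("/tmp directory created on HDFS.")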

Code Location
