This is the second follow-up post. In the earlier posts, Setting Up Cloudera Manager Services Using Cloudera API [Part 1] and Setting Up Zookeeper Services Using Cloudera API [Part 2], we installed the Cloudera Management Services and then the ZooKeeper service on the cluster. Now we will install the HDFS service on our cluster.
- Create a cluster.
- Install the HDFS service on our cluster.

Creating a cluster and setting up parcels is covered in the earlier post.
Install HDFS Service.
The HDFS service is installed in stages.
- Create an HDFS service (if it does not exist).
- Update the configuration for our newly created HDFS service.
- Create the HDFS roles (NAMENODE, SECONDARYNAMENODE, DATANODE, GATEWAY) on the cluster.
- Format the NameNode.
- Start the HDFS service.
- Create the temporary /tmp directory in HDFS.
Create an HDFS service.
This is simple: create the service if it does not already exist.

    import logging
    from cm_api.api_client import ApiException

    def create_service(cluster):
        try:
            hdfs_service = cluster.get_service('HDFS')
            logging.debug("Service {0} already present on the cluster".format('HDFS'))
        except ApiException:
            #
            # Create the service if this is the first time.
            #
            hdfs_service = cluster.create_service('HDFS', 'HDFS')
            logging.info("Created New Service: HDFS")
        return hdfs_service
Update configuration for HDFS.
This information is picked up from the configuration YAML file.

YAML file:

    HDFS:
      config:
        dfs_replication: 3
        dfs_permissions: false
        dfs_block_local_path_access_user: impala,hbase,mapred,spark
      roles:
        - group: NAMENODE
          hosts:
            - mycmhost.ahmed.com
          config:
            dfs_name_dir_list: /data/1/dfs/nn,/data/2/dfs/nn
            dfs_namenode_handler_count: 30
        - group: SECONDARYNAMENODE
          hosts:
            - mycmhost.ahmed.com
          config:
            fs_checkpoint_dir_list: /data/1/dfs/snn,/data/2/dfs/snn
        - group: DATANODE
          hosts:
            - mycmhost.ahmed.com
          config:
            dfs_data_dir_list: /data/1/dfs/dn,/data/2/dfs/dn
            dfs_datanode_handler_count: 30
            #dfs_datanode_du_reserved: 1073741824
            dfs_datanode_failed_volumes_tolerated: 0
            dfs_datanode_data_dir_perm: 755
        - group: GATEWAY
          hosts:
            - mycmhost.ahmed.com
          config:
            dfs_client_use_trash: true
Code snippet.

    def service_update_configuration(hdfs_service):
        """
        Update service configurations
        :return:
        """
        # config is the dictionary loaded from the YAML configuration file.
        hdfs_service.update_config(config['services']['HDFS']['config'])
        logging.info("Service Configuration Updated.")
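The functions in this post read the settings as config['services']['HDFS'], so the YAML fragment above presumably sits under a top-level services key in the full file. As a minimal sketch under that assumption, the check below catches incomplete role entries before any API call is made. validate_roles and sample_config are hypothetical names, and the dict only mirrors part of the YAML so the example runs without a YAML parser.

```python
# Hypothetical helper: sanity-check the role entries before calling the API.
REQUIRED_ROLE_KEYS = ('group', 'hosts')


def validate_roles(config, service_name):
    """Return a list of problems found in the service's 'roles' entries."""
    problems = []
    roles = config['services'][service_name].get('roles', [])
    for index, role in enumerate(roles):
        for key in REQUIRED_ROLE_KEYS:
            # A missing key and an empty value are both treated as a problem.
            if not role.get(key):
                problems.append('role #{0}: missing "{1}"'.format(index, key))
    return problems


# Dict equivalent of part of the YAML (assumed to live under "services").
# The DATANODE entry is deliberately left without hosts to trigger the check.
sample_config = {
    'services': {
        'HDFS': {
            'config': {'dfs_replication': 3},
            'roles': [
                {'group': 'NAMENODE', 'hosts': ['mycmhost.ahmed.com']},
                {'group': 'DATANODE', 'hosts': []},
            ],
        }
    }
}

print(validate_roles(sample_config, 'HDFS'))
```

An empty list means every role entry names a group and at least one host.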
Create HDFS roles (NAMENODE, SECONDARYNAMENODE, DATANODE, GATEWAY) on the cluster.
To create all the roles:
- Each role needs to be unique on each host.
- We create a unique role_name for each node.

Each role name is built from the following set of strings: (service_name, group, role_id).

    role_name = '{0}-{1}-{2}'.format(service_name, group, role_id)
Here is the code snippet.

    def hdfs_create_cluster_services(config, service, service_name):
        """
        Creating Cluster services
        :return:
        """
        #
        # Get the role config for the group
        # Update group configuration and create roles.
        #
        for role in config['services'][service_name]['roles']:
            role_group = service.get_role_config_group("{0}-{1}-BASE".format(service_name, role['group']))
            #
            # Update the group's configuration.
            # [https://cloudera.github.io/cm_api/epydoc/5.10.0/cm_api.endpoints.role_config_groups.ApiRoleConfigGroup-class.html#update_config]
            #
            role_group.update_config(role.get('config', {}))
            #
            # Create roles now.
            #
            hdfs_create_roles(service, service_name, role, role['group'])


    def hdfs_create_roles(service, service_name, role, group):
        """
        Create individual roles for all the hosts under a specific role group
        :param role: Role configuration from yaml
        :param group: Role group name
        """
        role_id = 0
        for host in role.get('hosts', []):
            role_id += 1
            role_name = '{0}-{1}-{2}'.format(service_name, group, role_id)
            logging.info("Creating Role name as: " + str(role_name))
            try:
                service.get_role(role_name)
            except ApiException:
                service.create_role(role_name, group, host)
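To see what the naming scheme produces, here is a small sketch that runs without a cluster. role_names and role_config_group_name are hypothetical helpers and the host names are placeholders; the -BASE suffix follows the role config group lookup in the snippet above.

```python
def role_names(service_name, group, hosts):
    """Yield (role_name, host) pairs using the service-group-index scheme."""
    # role_id starts at 1, matching the counter in hdfs_create_roles.
    for role_id, host in enumerate(hosts, start=1):
        yield '{0}-{1}-{2}'.format(service_name, group, role_id), host


def role_config_group_name(service_name, group):
    """Name of the base role config group looked up for a role type."""
    return '{0}-{1}-BASE'.format(service_name, group)


hosts = ['node1.example.com', 'node2.example.com']  # placeholder hosts
for name, host in role_names('HDFS', 'DATANODE', hosts):
    print('{0} -> {1}'.format(name, host))
print(role_config_group_name('HDFS', 'DATANODE'))
```

Because the index restarts for every group, the same host can carry one role per group without a name clash.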
Format Namenode.
The first time we create an HDFS environment we need to format the NameNode; this initializes the HDFS cluster. The format_hdfs method returns ApiCommand objects (one per NameNode), which we can use to track the progress of execution.

    def format_namenode(hdfs_service, namenode):
        try:
            #
            # Formatting HDFS - this will have no effect the second time it runs.
            # Format NameNode instances of an HDFS service.
            #
            # https://cloudera.github.io/cm_api/epydoc/5.10.0/cm_api.endpoints.services.ApiService-class.html#format_hdfs
            #
            cmd = hdfs_service.format_hdfs(namenode)[0]
            logging.debug("Command Response: " + str(cmd))
            if not cmd.wait(300).success:
                print("WARNING: Failed to format HDFS, attempting to continue with the setup")
        except ApiException:
            logging.info("HDFS cannot be formatted. May be already in use.")
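format_namenode expects the name of a NameNode role. One way to obtain it, sketched here, is to ask the service for its NAMENODE roles with get_roles_by_type instead of rebuilding the name by hand. namenode_role_names is a hypothetical helper, and the sketch assumes the roles were already created as described earlier.

```python
def namenode_role_names(hdfs_service):
    """Return the names of all NAMENODE roles of the given HDFS service."""
    # get_roles_by_type returns role objects; each one has a .name attribute.
    return [role.name for role in hdfs_service.get_roles_by_type('NAMENODE')]


# Usage against a live service (not executed here):
#   for name in namenode_role_names(hdfs_service):
#       format_namenode(hdfs_service, name)
```

With the naming scheme from this post, a single-NameNode cluster would be expected to yield one name of the form HDFS-NAMENODE-1.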
Start HDFS service.
We do this using the service.start() method. This method returns an ApiCommand, which we can use to track progress and wait for the service to start using cmd.wait().success. More details about the API are in the cm_api documentation for ApiService.
Our service should now be up and running.
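The start step can be sketched as follows. This is a minimal example, not the code from the original setup script: start_and_wait is a hypothetical name and the 600 second timeout is an arbitrary assumption. It relies on wait() returning the last known command state, which may still be active if the timeout expires first.

```python
import logging


def start_and_wait(service, timeout=600):
    """Start a service and block until the command finishes or times out."""
    cmd = service.start().wait(timeout)
    if cmd.active:
        # wait() gave up before the command finished; it may complete later.
        logging.warning("Start command still running after %s seconds", timeout)
        return False
    if not cmd.success:
        logging.error("Service failed to start: %s", cmd.resultMessage)
        return False
    logging.info("Service started.")
    return True
```

Returning a boolean lets the calling script decide whether to continue with the remaining stages or stop.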
Finally, creating the /tmp directory on HDFS.
When we create an HDFS cluster we also create a /tmp directory. The HDFS /tmp directory is used as temporary storage during MapReduce operations: MapReduce artifacts and intermediate data are kept under this directory. If we delete the /tmp contents, any MR jobs currently running will lose their intermediate data. Any MR job run after /tmp is cleared will still work without any issues.
Creating /tmp is done using the create_hdfs_tmp method, which returns an ApiCommand response.
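A short sketch of this last step, written in the same style as format_namenode. create_tmp_dir is a hypothetical wrapper and the 120 second timeout is an assumption; in the stage order of this post it runs after the HDFS service has been started.

```python
import logging


def create_tmp_dir(hdfs_service, timeout=120):
    """Create /tmp on HDFS and report whether the command succeeded."""
    cmd = hdfs_service.create_hdfs_tmp()
    logging.debug("Command Response: " + str(cmd))
    if not cmd.wait(timeout).success:
        logging.warning("Failed to create /tmp on HDFS")
        return False
    logging.info("Created /tmp on HDFS.")
    return True
```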