
Getting Data from RestFB and Creating a Sequence File in Hadoop

Here is some quick code to get data from Facebook using the RestFB API, write it to a sequence file, and dump the data into a Hadoop cluster.

Requirements:
  • Hadoop 1.0.3 installed as standalone or multi-node. 
  • Eclipse IDE for development. 
  • Hadoop and Apache Commons jars. 
  • RestFB API jar. 

Steps to Create the Eclipse Project.
  • Create a new Java project. 
  • Add the jars to the project (Apache Commons and hadoop-core-1.0.3.jar), plus the RestFB jar. 
  • You will find all the Commons and Hadoop jars under the Hadoop installation directory. 

Sequence File Content Format.
  • Key – <facebook_id, facebook_name, timestamp> 
  • Value – <batch_me, batch_me_friends, batch_me_likes>
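
For illustration, a record written by the code below might look roughly like this (the ID, name, and timestamp are made up; the value is simply the toString() of the List<BatchResponse> returned by RestFB, i.e. the raw JSON bodies plus HTTP status metadata):

Key   : {"facebookId":"100000000000001","facebookName":"John Doe","timestamp":"2013_01_15_10_30_45"}
Value : the three BatchResponse objects for me, me/friends, and me/likes, each carrying an HTTP code and a JSON body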


Add the code below to get data from Facebook and generate the sequence file. Before you start, you need to update the AccessToken in the code with your own access token from Facebook. Take a look here before you proceed.
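
Optionally, you can sanity-check that the token works before running the whole job. The small sketch below is not part of the main listing; it just reuses the same DefaultFacebookClient and fetchObject("me", User.class) call that appear further down:

import com.restfb.DefaultFacebookClient;
import com.restfb.FacebookClient;
import com.restfb.types.User;

public class TokenCheck {
    public static void main(String[] args) {
        // Replace with your own access token before running
        FacebookClient client = new DefaultFacebookClient("ACCESS_TOKEN_STRING");
        User me = client.fetchObject("me", User.class);
        System.out.println("Token is valid for: " + me.getName());
    }
}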

/*
* Sequence File to HDFS Filesystem
*
* We retrieve information from Facebook as a batch
* and then store it in a sequence file on the Hadoop Distributed File System.
*
* */


import java.io.IOException;
import java.net.URI;

import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

import com.restfb.DefaultFacebookClient;
import com.restfb.Parameter;
import com.restfb.batch.BatchRequest;
import com.restfb.batch.BatchRequest.BatchRequestBuilder;
import com.restfb.batch.BatchResponse;
import com.restfb.json.JsonObject;
import com.restfb.types.User;

public class SequenceFileToHDFS {
    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws IOException
    {
        String accessToken = "ACCESS_TOKEN_STRING";
       
        //get current date time with Date()
        DateFormat dateFormat = new SimpleDateFormat("yyyy_MM_dd_HH_mm_ss");
        Date date = new Date();

        String uri = "sequence_file_" + dateFormat.format(date) + ".seq";
        Configuration conf = new Configuration();
       
        /*
         * Uncomment the two lines below to load the configuration from the XML files.
         * Make sure the paths are correct for your installation; with that configuration
         * the sequence file will be written into Hadoop (HDFS).
        */
        //conf.addResource(new Path("/usr/local/hadoop/conf/core-site.xml"));
        //conf.addResource(new Path("/usr/local/hadoop/conf/hdfs-site.xml"));

        /* Comment the line below (and uncomment the two lines above plus the FileSystem.get() line further down) to write the data into Hadoop */
        LocalFileSystem fs = FileSystem.getLocal(conf); // local file system
        /* Local sequence file setup ends here */

        /* Uncomment the line below to use the HDFS configuration files added above */
        //FileSystem fs = FileSystem.get(URI.create(uri), conf);
       
        Path path = new Path(uri);

        /*
         * Starting Facebook Retrieval
         */
       
        DefaultFacebookClient facebookClient = new DefaultFacebookClient(accessToken);
        User user = facebookClient.fetchObject("me", User.class);

        /*
         * Building Batch Request to send to Facebook
         */
               
        BatchRequest meRequest = new BatchRequestBuilder("me").build();
        BatchRequest meFriendRequest = new BatchRequestBuilder("me/friends").build();
        BatchRequest meLikeRequest = new BatchRequestBuilder("me/likes").parameters(Parameter.with("limit", 5)).build();

        /* Executing BATCH Request */
        /* This will be our Sequence Value*/
        List<BatchResponse> batchResponses =
            facebookClient.executeBatch(meRequest, meFriendRequest, meLikeRequest);

        /*
         * Based on the Response from Facebook
         * We create Sequence File.
         *
         */
       
        if(batchResponses.get(0).getCode() == 200)
        {
            /* Creating Sequence Key */
            JsonObject sequencekeyMapUser = new JsonObject();
            sequencekeyMapUser.put("facebookId", user.getId());
            sequencekeyMapUser.put("facebookName",user.getName());
            sequencekeyMapUser.put("timestamp", dateFormat.format(date));

            Text key = new Text();
            Text value = new Text();
            SequenceFile.Writer writer = null;
            try
            {
                writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
                key.set(sequencekeyMapUser.toString());
                value.set(batchResponses.toString());
                System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
                writer.append(key, value);

            }
            finally
            {
                IOUtils.closeStream(writer);
            }
        }
        else
        {
            System.out.printf("Batch request failed with HTTP code %d - the access token may have expired\n", batchResponses.get(0).getCode());
        }
    }
}
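
To verify the output, here is a minimal sketch for reading the generated sequence file back. This is not part of the original listing; it assumes the same Hadoop 1.0.3 SequenceFile.Reader API and reads from the local file system (swap in FileSystem.get(URI.create(path), conf) if the file was written to HDFS). Pass the path to the generated sequence_file_<timestamp>.seq file as the first argument.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileReaderExample {
    @SuppressWarnings("deprecation")
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);   // local file system, matching the writer above
        Path path = new Path(args[0]);               // path to the .seq file

        Text key = new Text();
        Text value = new Text();
        SequenceFile.Reader reader = null;
        try {
            reader = new SequenceFile.Reader(fs, path, conf); // Hadoop 1.x reader constructor
            while (reader.next(key, value)) {                 // iterate over the key/value records
                System.out.printf("%s\t%s\n", key, value);
            }
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}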
