org.archive.io.hbase
Class HBaseWriter

java.lang.Object
  extended by org.archive.io.hbase.HBaseWriter

public class HBaseWriter
extends Object

Writer that stores content crawled by Heritrix in an HBase table.


Constructor Summary
HBaseWriter(String zkQuorum, int zkClientPort, String tableName, HBaseParameters parameters)
          Instantiates a new HBaseWriter for the WriterPool to use in Heritrix.
 
Method Summary
protected  byte[] getByteArrayFromInputStream(org.archive.io.ReplayInputStream replayInputStream, int streamSize)
          Reads the ReplayInputStream and returns its contents as a byte array.
 org.apache.hadoop.hbase.client.HTable getClient()
          Gets the HTable client.
 HBaseParameters getHbaseOptions()
          Gets the HBaseParameters for this writer.
protected  void initializeCrawlTable(org.apache.hadoop.conf.Configuration hbaseConfiguration, String hbaseTableName)
          Creates the crawl table in HBase.
protected  void processContent(org.apache.hadoop.hbase.client.Put put, org.archive.io.ReplayInputStream replayInputStream, int streamSize)
          This is a stub method and is here to allow extension/overriding for custom content parsing, data manipulation and to populate new columns.
 void write(org.archive.modules.CrawlURI curi, String ip, org.archive.io.RecordingOutputStream recordingOutputStream, org.archive.io.RecordingInputStream recordingInputStream)
          Write the crawled output to the configured HBase table.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HBaseWriter

public HBaseWriter(String zkQuorum,
                   int zkClientPort,
                   String tableName,
                   HBaseParameters parameters)
            throws IOException
Instantiates a new HBaseWriter for the WriterPool to use in Heritrix.

Parameters:
zkQuorum - the ZooKeeper quorum: the comma-separated list of hosts that make up your ZooKeeper quorum, e.g. zkHost1,zkHost2,zkHost3
zkClientPort - the ZooKeeper client port that clients should connect to on the servers in the quorum. This value is analogous to the hbase-site.xml config parameter hbase.zookeeper.property.clientPort
tableName - the HBase table to write to, e.g. webtable
parameters - an HBaseParameters object holding the writer's parameters
Throws:
IOException - Signals that an I/O exception has occurred.
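The following is a minimal sketch, not the class's actual implementation, of how the two ZooKeeper arguments could map onto HBase client settings. The property name hbase.zookeeper.property.clientPort comes from the parameter description above, and hbase.zookeeper.quorum is the standard HBase quorum property. The class ZkSettingsSketch, its method, and the use of a plain Map in place of a Hadoop Configuration are assumptions made for illustration, as is the port 2181 (the usual ZooKeeper default).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: a plain Map stands in for org.apache.hadoop.conf.Configuration.
final class ZkSettingsSketch {
    static final String QUORUM_KEY = "hbase.zookeeper.quorum";
    static final String CLIENT_PORT_KEY = "hbase.zookeeper.property.clientPort";

    // Validates the two constructor-style arguments and returns them as HBase properties.
    static Map<String, String> toHBaseProperties(String zkQuorum, int zkClientPort) {
        if (zkQuorum == null || zkQuorum.trim().isEmpty()) {
            throw new IllegalArgumentException("zkQuorum must list at least one host");
        }
        if (zkClientPort < 1 || zkClientPort > 65535) {
            throw new IllegalArgumentException("zkClientPort out of range: " + zkClientPort);
        }
        Map<String, String> props = new LinkedHashMap<>();
        props.put(QUORUM_KEY, zkQuorum.trim());
        props.put(CLIENT_PORT_KEY, Integer.toString(zkClientPort));
        return props;
    }

    public static void main(String[] args) {
        // Example values only; substitute the hosts and port of your own quorum.
        System.out.println(toHBaseProperties("zkHost1,zkHost2,zkHost3", 2181));
    }
}
```

In a real deployment the same two keys would be set on the Hadoop Configuration object handed to the HBase client.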
Method Detail

getHbaseOptions

public HBaseParameters getHbaseOptions()
See Also:
HBaseParameters

getClient

public org.apache.hadoop.hbase.client.HTable getClient()
Gets the HTable client.

Returns:
the client

initializeCrawlTable

protected void initializeCrawlTable(org.apache.hadoop.conf.Configuration hbaseConfiguration,
                                    String hbaseTableName)
                             throws IOException
Creates the crawl table in HBase.

Parameters:
hbaseConfiguration - the HBase configuration
hbaseTableName - the name of the crawl table
Throws:
IOException - Signals that an I/O exception has occurred.

getByteArrayFromInputStream

protected byte[] getByteArrayFromInputStream(org.archive.io.ReplayInputStream replayInputStream,
                                             int streamSize)
                                      throws IOException
Reads the ReplayInputStream and returns its contents as a byte array.

Parameters:
replayInputStream - the cell data as a replay input stream
streamSize - the size of the stream
Returns:
the byte array from input stream
Throws:
IOException - Signals that an I/O exception has occurred.
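As an illustration of the general technique, the sketch below reads a stream of known size into a byte array using only java.io. It is not the code of getByteArrayFromInputStream: the class StreamToBytesSketch and the method readFully are hypothetical, a plain InputStream stands in for ReplayInputStream, and failing with EOFException on a short stream is an assumed policy.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper showing one way to drain a sized stream into memory.
final class StreamToBytesSketch {

    // Reads exactly streamSize bytes from in, looping because read() may return fewer
    // bytes than requested on any single call.
    static byte[] readFully(InputStream in, int streamSize) throws IOException {
        if (streamSize < 0) {
            throw new IllegalArgumentException("streamSize must be non-negative");
        }
        byte[] buffer = new byte[streamSize];
        int offset = 0;
        while (offset < streamSize) {
            int read = in.read(buffer, offset, streamSize - offset);
            if (read < 0) {
                // Assumed policy: a stream shorter than advertised is treated as an error.
                throw new EOFException(
                        "Stream ended after " + offset + " of " + streamSize + " bytes");
            }
            offset += read;
        }
        return buffer;
    }
}
```

Because the whole body is held in memory, very large responses may call for a size cap or a streaming approach instead.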

processContent

protected void processContent(org.apache.hadoop.hbase.client.Put put,
                              org.archive.io.ReplayInputStream replayInputStream,
                              int streamSize)
                       throws IOException
This is a stub method that exists to allow extension/overriding for custom content parsing, data manipulation, and populating new columns. For example: HTML parsing, text extraction, analysis, and transformation, with the results stored in new column families/columns using the put object, or even saved to other custom HBase tables or other remote data sources (in other words, anything you want).

Parameters:
put - the stateful put object containing all the row data to be written.
replayInputStream - the replay input stream containing the raw content fetched by the Heritrix crawler.
streamSize - the stream size
Throws:
IOException - Signals that an I/O exception has occurred.

write

public void write(org.archive.modules.CrawlURI curi,
                  String ip,
                  org.archive.io.RecordingOutputStream recordingOutputStream,
                  org.archive.io.RecordingInputStream recordingInputStream)
           throws IOException
Writes the crawled output to the configured HBase table. Each row key is the URL with its domain reversed; the content is optionally processed before the row is written.

Parameters:
curi - URI of crawled document
ip - IP of remote machine.
recordingOutputStream - recording output stream that captured the GET request
recordingInputStream - recording input stream that captured the response
Throws:
IOException - Signals that an I/O exception has occurred.
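To make the "URL with reverse domain" row key concrete, here is a minimal sketch of one possible keying scheme. The exact key format that write produces is not specified on this page, so the output format, the class ReverseDomainKeySketch, and its methods are assumptions for illustration only.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical keying helper; not the keying code used by HBaseWriter.
final class ReverseDomainKeySketch {

    // Reverses the dot-separated labels of a host name: www.example.com -> com.example.www
    static String reverseHost(String host) {
        String[] labels = host.split("\\.");
        StringBuilder reversed = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            reversed.append(labels[i]);
            if (i > 0) {
                reversed.append('.');
            }
        }
        return reversed.toString();
    }

    // Builds a key from the reversed host followed by the raw path and query (assumed format).
    static String toRowKey(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        String path = uri.getRawPath() == null ? "" : uri.getRawPath();
        String query = uri.getRawQuery() == null ? "" : "?" + uri.getRawQuery();
        return reverseHost(host.toLowerCase(Locale.ROOT)) + path + query;
    }

    public static void main(String[] args) {
        System.out.println(toRowKey("http://www.example.com/index.html"));
    }
}
```

Reversing the domain places pages from the same site, and from its subdomains, next to each other in HBase's lexicographically sorted row order, which keeps per-site scans contiguous.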


Copyright © 2007-2011. All Rights Reserved.