HBaseWriter (HBase Writer 0.90.4-SNAPSHOT API)

org.archive.io.hbase
Class HBaseWriter

java.lang.Object
  org.archive.io.hbase.HBaseWriter

public class HBaseWriter
extends Object

HBase implementation of a writer, for use by the Heritrix WriterPool.

Constructor Summary
`HBaseWriter(String zkQuorum, int zkClientPort, String tableName, HBaseParameters parameters)` Instantiates a new HBaseWriter for the WriterPool to use in Heritrix.

Method Summary
`protected byte[]`	`getByteArrayFromInputStream(org.archive.io.ReplayInputStream replayInputStream, int streamSize)` Reads the ReplayInputStream and returns its content as a byte array.
`org.apache.hadoop.hbase.client.HTable`	`getClient()` Gets the HTable client.
`HBaseParameters`	`getHbaseOptions()`
`protected void`	`initializeCrawlTable(org.apache.hadoop.conf.Configuration hbaseConfiguration, String hbaseTableName)` Creates the crawl table in HBase.
`protected void`	`processContent(org.apache.hadoop.hbase.client.Put put, org.archive.io.ReplayInputStream replayInputStream, int streamSize)` This is a stub method and is here to allow extension/overriding for custom content parsing, data manipulation and to populate new columns.
`void`	`write(org.archive.modules.CrawlURI curi, String ip, org.archive.io.RecordingOutputStream recordingOutputStream, org.archive.io.RecordingInputStream recordingInputStream)` Write the crawled output to the configured HBase table.

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

HBaseWriter

public HBaseWriter(String zkQuorum,
                   int zkClientPort,
                   String tableName,
                   HBaseParameters parameters)
            throws IOException

Instantiates a new HBaseWriter for the WriterPool to use in Heritrix.

Parameters:
    zkQuorum - the ZooKeeper quorum: a comma-separated list of the hosts that make up your ZooKeeper quorum, e.g. zkHost1,zkHost2,zkHost3
    zkClientPort - the ZooKeeper client port that clients should connect to on the servers in the quorum. This value is analogous to the hbase-site.xml configuration parameter hbase.zookeeper.property.clientPort
    tableName - the HBase table to write to, e.g. webtable
    parameters - an HBaseParameters object holding the list of parameters
Throws:
    IOException - Signals that an I/O exception has occurred.
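
The relationship between the two ZooKeeper arguments and the HBase client configuration can be sketched as follows. This is only an illustration, assuming the values end up under the standard HBase property names `hbase.zookeeper.quorum` and `hbase.zookeeper.property.clientPort`; how HBaseWriter applies them internally is not stated on this page. The class `ZkSettingsSketch` and its method are hypothetical and not part of this API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of the HBase Writer API. */
final class ZkSettingsSketch {
    private ZkSettingsSketch() {}

    /**
     * Validates the two ZooKeeper arguments and maps them onto the
     * HBase client property names they are analogous to.
     */
    static Map<String, String> toHBaseProperties(String zkQuorum, int zkClientPort) {
        if (zkQuorum == null || zkQuorum.trim().isEmpty()) {
            throw new IllegalArgumentException("zkQuorum must list at least one host");
        }
        if (zkClientPort < 1 || zkClientPort > 65535) {
            throw new IllegalArgumentException("zkClientPort out of range: " + zkClientPort);
        }
        Map<String, String> props = new LinkedHashMap<>();
        // Comma-separated host list, e.g. "zkHost1,zkHost2,zkHost3".
        props.put("hbase.zookeeper.quorum", zkQuorum.trim());
        // Port that clients connect to on each quorum host.
        props.put("hbase.zookeeper.property.clientPort", Integer.toString(zkClientPort));
        return props;
    }
}
```

For example, `ZkSettingsSketch.toHBaseProperties("zkHost1,zkHost2,zkHost3", 2181)` yields a map with both properties set; 2181 is the usual ZooKeeper default port, not a value taken from this page.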

Method Detail

getHbaseOptions

public HBaseParameters getHbaseOptions()

See Also:
    HBaseParameters

getClient

public org.apache.hadoop.hbase.client.HTable getClient()

Gets the HTable client.

Returns:
    the HTable client

initializeCrawlTable

protected void initializeCrawlTable(org.apache.hadoop.conf.Configuration hbaseConfiguration,
                                    String hbaseTableName)
                             throws IOException

Creates the crawl table in HBase.

Parameters:
    hbaseConfiguration - the HBase configuration
    hbaseTableName - the name of the table to create
Throws:
    IOException - Signals that an I/O exception has occurred.

getByteArrayFromInputStream

protected byte[] getByteArrayFromInputStream(org.archive.io.ReplayInputStream replayInputStream,
                                             int streamSize)
                                      throws IOException

Reads the ReplayInputStream and returns its content as a byte array.

Parameters:
    replayInputStream - the cell data as a replay input stream
    streamSize - the size of the stream
Returns:
    the byte array read from the input stream
Throws:
    IOException - Signals that an I/O exception has occurred.
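
Draining a stream into a byte array can be sketched with the JDK alone. The sketch below is not the method's actual implementation; it assumes the replay stream can be treated as a plain `java.io.InputStream` and uses the size only as a capacity hint. The class `StreamBytesSketch` and the method `readFully` are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper for illustration; not part of the HBase Writer API. */
final class StreamBytesSketch {
    private StreamBytesSketch() {}

    /**
     * Reads the stream to its end and returns everything that was read.
     * sizeHint only pre-sizes the buffer; the result length is whatever
     * the stream actually delivers.
     */
    static byte[] readFully(InputStream in, int sizeHint) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream(Math.max(sizeHint, 32));
        byte[] buffer = new byte[4096];
        int n;
        // read() returns -1 at end of stream.
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

Treating the size as a hint rather than a hard limit keeps the helper correct even when the reported size and the actual stream length disagree.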

processContent

protected void processContent(org.apache.hadoop.hbase.client.Put put,
                              org.archive.io.ReplayInputStream replayInputStream,
                              int streamSize)
                       throws IOException

This is a stub method that exists to allow extension or overriding for custom content parsing, data manipulation, and populating new columns. Examples: HTML parsing, text extraction, analysis and transformation, with the results stored in new column families or columns using the put object. The results can also be saved to other custom HBase tables or to other remote data sources.

Parameters:
    put - the stateful put object containing all the row data to be written
    replayInputStream - the replay input stream containing the raw content fetched by the Heritrix crawler
    streamSize - the stream size
Throws:
    IOException - Signals that an I/O exception has occurred.
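
As an illustration of the kind of content parsing an override might perform, the sketch below extracts an HTML title from raw bytes using only the JDK. It is a hypothetical helper, not part of this API: the class `TitleExtractorSketch`, the UTF-8 assumption, and the regular expression are all assumptions made for the example, and storing the result in a column via the put object is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of the HBase Writer API. */
final class TitleExtractorSketch {
    // Matches the first <title>...</title> element, ignoring case and line breaks.
    private static final Pattern TITLE =
            Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private TitleExtractorSketch() {}

    /**
     * Returns the page title with whitespace collapsed, or an empty
     * string when no title element is found. Assumes UTF-8 content.
     */
    static String extractTitle(byte[] rawContent) {
        String html = new String(rawContent, StandardCharsets.UTF_8);
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim().replaceAll("\\s+", " ") : "";
    }
}
```

A subclass could call such a helper on the bytes it reads from the replay input stream and add the result to the row as an extra column; a real HTML parser would be more robust than a regular expression.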

write

public void write(org.archive.modules.CrawlURI curi,
                  String ip,
                  org.archive.io.RecordingOutputStream recordingOutputStream,
                  org.archive.io.RecordingInputStream recordingInputStream)
           throws IOException

Writes the crawled output to the configured HBase table. Each row key is the URL with its domain reversed; any content is optionally processed.

Parameters:
    curi - URI of the crawled document
    ip - IP of the remote machine
    recordingOutputStream - recording output stream that captured the GET request
    recordingInputStream - recording input stream that captured the response
Throws:
    IOException - Signals that an I/O exception has occurred.
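
A reversed-domain row key can be sketched as below. The exact key format used by `write` is not specified on this page, so this shows one plausible scheme only: the host labels are reversed and the path and query are appended unchanged. The class `RowKeySketch` and the method `reverseDomainKey` are hypothetical names.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper for illustration; not part of the HBase Writer API. */
final class RowKeySketch {
    private RowKeySketch() {}

    /**
     * Builds a key such as "com.example.www/a/b?x=1" from
     * "http://www.example.com/a/b?x=1". The scheme and port are dropped
     * in this sketch; numeric IP hosts are not treated specially.
     */
    static String reverseDomainKey(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        StringBuilder key = new StringBuilder();
        // Append host labels from last to first: www.example.com -> com.example.www
        for (int i = labels.length - 1; i >= 0; i--) {
            key.append(labels[i]);
            if (i > 0) {
                key.append('.');
            }
        }
        String path = uri.getRawPath();
        key.append(path == null || path.isEmpty() ? "/" : path);
        if (uri.getRawQuery() != null) {
            key.append('?').append(uri.getRawQuery());
        }
        return key.toString();
    }
}
```

Reversing the domain places pages from the same site and its subdomains next to each other in HBase's sorted key space, which makes per-site scans cheap.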