com.powerset.heritrix.writer
Class HBaseWriter

java.lang.Object
  extended by org.archive.io.WriterPoolMember
      extended by com.powerset.heritrix.writer.HBaseWriter
All Implemented Interfaces:
org.archive.io.ArchiveFileConstants

public class HBaseWriter
extends org.archive.io.WriterPoolMember

Write crawled content as records to an HBase table. Puts content into the 'content:raw_data' column and all else into the 'curi:' column family. Makes a row key of an url transformation. Creates table if it does not exist. The following is a complete list of columns that get written to by default: content:raw_data curi:ip curi:path-from-seed curi:is-seed curi:via curi:url curi:request

Limitations: Hard-coded table schema.


Field Summary
static String CONTENT_COLUMN_FAMILY
          The Constant CONTENT_COLUMN_FAMILY.
static String CONTENT_COLUMN_NAME
          The Constant CONTENT_COLUMN.
static String CURI_COLUMN_FAMILY
          The Constant CURI_COLUMN_FAMILY.
 
Fields inherited from class org.archive.io.WriterPoolMember
DEFAULT_PREFIX, DEFAULT_SUFFIX, HOSTNAME_VARIABLE, UTF8
 
Fields inherited from interface org.archive.io.ArchiveFileConstants
ABSOLUTE_OFFSET_KEY, CDX, CDX_FILE, CDX_LINE_BUFFER_SIZE, COMPRESSED_FILE_EXTENSION, CRLF, DATE_FIELD_KEY, DEFAULT_DIGEST_METHOD, DOT_COMPRESSED_FILE_EXTENSION, DUMP, GZIP_DUMP, HEADER, INVALID_SUFFIX, LENGTH_FIELD_KEY, MIMETYPE_FIELD_KEY, NOHEAD, OCCUPIED_SUFFIX, READER_IDENTIFIER_FIELD_KEY, RECORD_IDENTIFIER_FIELD_KEY, SINGLE_SPACE, TYPE_FIELD_KEY, URL_FIELD_KEY, VERSION_FIELD_KEY
 
Constructor Summary
HBaseWriter(String zkQuorum, int zkClientPort, String tableName)
          Instantiates a new HBaseWriter for the WriterPool to use in heritrix2.
 
Method Summary
protected  void createCrawlTable(org.apache.hadoop.hbase.HBaseConfiguration hbaseConfiguration, String hbaseTableName)
          Creates the crawl table in HBase.
protected  byte[] getByteArrayFromInputStream(org.archive.io.ReplayInputStream replayInputStream, int streamSize)
          Read the ReplayInputStream and write it to the given BatchUpdate with the given column.
 org.apache.hadoop.hbase.client.HTable getClient()
          Gets the HTable client.
protected  void processContent(org.apache.hadoop.hbase.client.Put put, org.archive.io.ReplayInputStream replayInputStream, int streamSize)
          This is a stub method and is here to allow extension/overriding for custom content parsing, data manipulation and to populate new columns.
 void write(org.archive.modules.ProcessorURI curi, String ip, org.archive.io.RecordingOutputStream recordingOutputStream, org.archive.io.RecordingInputStream recordingInputStream)
          Write the crawled output to the configured HBase table.
 
Methods inherited from class org.archive.io.WriterPoolMember
checkSize, checkWriteable, close, copyFrom, createFile, createFile, flush, getBaseFilename, getCreateTimestamp, getFile, getNextDirectory, getOutputStream, getPosition, getTimestampSerialNo, getTimestampSerialNo, isCompressed, postWriteRecordTasks, preWriteRecordTasks, readFullyFrom, readToLimitFrom, write, write, write
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

CONTENT_COLUMN_FAMILY

public static final String CONTENT_COLUMN_FAMILY
The Constant CONTENT_COLUMN_FAMILY.

See Also:
Constant Field Values

CONTENT_COLUMN_NAME

public static final String CONTENT_COLUMN_NAME
The Constant CONTENT_COLUMN.

See Also:
Constant Field Values

CURI_COLUMN_FAMILY

public static final String CURI_COLUMN_FAMILY
The Constant CURI_COLUMN_FAMILY.

See Also:
Constant Field Values
Constructor Detail

HBaseWriter

public HBaseWriter(String zkQuorum,
                   int zkClientPort,
                   String tableName)
            throws IOException
Instantiates a new HBaseWriter for the WriterPool to use in heritrix2.

Parameters:
zkQuorum - the zookeeper quorum. The list of hosts that make up you zookeeper quorum. i.e.: zkHost1,zkHost2,zkHost3
zkClientPort - the zookeeper client port that clients should try to connect on for servers in the zk quorum. This value is analgous to the hase-site.xml config parameter: hbase.zookeeper.property.clientPort
tableName - the table in hbase to write to. i.e. : webtable
Throws:
IOException - Signals that an I/O exception has occurred.
Method Detail

getClient

public org.apache.hadoop.hbase.client.HTable getClient()
Gets the HTable client.

Returns:
the client

createCrawlTable

protected void createCrawlTable(org.apache.hadoop.hbase.HBaseConfiguration hbaseConfiguration,
                                String hbaseTableName)
                         throws IOException
Creates the crawl table in HBase.

Parameters:
hbaseConfiguration - the c
hbaseTableName - the table
Throws:
IOException - Signals that an I/O exception has occurred.

write

public void write(org.archive.modules.ProcessorURI curi,
                  String ip,
                  org.archive.io.RecordingOutputStream recordingOutputStream,
                  org.archive.io.RecordingInputStream recordingInputStream)
           throws IOException
Write the crawled output to the configured HBase table. Write each row key as the url with reverse domain and optionally process any content.

Parameters:
curi - URI of crawled document
ip - IP of remote machine.
recordingOutputStream - recording input stream that captured the response
recordingInputStream - recording output stream that captured the GET request
Throws:
IOException - Signals that an I/O exception has occurred.

getByteArrayFromInputStream

protected byte[] getByteArrayFromInputStream(org.archive.io.ReplayInputStream replayInputStream,
                                             int streamSize)
                                      throws IOException
Read the ReplayInputStream and write it to the given BatchUpdate with the given column.

Parameters:
replayInputStream - the ris the cell data as a replay input stream
streamSize - the size
Returns:
the byte array from input stream
Throws:
IOException - Signals that an I/O exception has occurred.

processContent

protected void processContent(org.apache.hadoop.hbase.client.Put put,
                              org.archive.io.ReplayInputStream replayInputStream,
                              int streamSize)
                       throws IOException
This is a stub method and is here to allow extension/overriding for custom content parsing, data manipulation and to populate new columns. For Example : html parsing, text extraction, analysis and transformation and storing the results in new column families/columns using the batch update object. Or even saving the values in other custom hbase tables or data sources.

Parameters:
put - the stateful put object containg all the row data to be written.
replayInputStream - the replay input stream containing the raw content gotten by heritrix crawler.
streamSize - the stream size
Throws:
IOException - Signals that an I/O exception has occurred.


Copyright © 2007-2009. All Rights Reserved.