org.archive.io.hbase
Class HBaseWriter

java.lang.Object
  extended by org.archive.io.WriterPoolMember
      extended by org.archive.io.hbase.HBaseWriter
All Implemented Interfaces:
org.archive.io.ArchiveFileConstants, Serializer

public class HBaseWriter
extends org.archive.io.WriterPoolMember
implements Serializer

A WriterPoolMember implementation that writes crawled output to an HBase table.


Field Summary
 
Fields inherited from class org.archive.io.WriterPoolMember
countOut, currentBasename, currentTimestamp, DEFAULT_PREFIX, DEFAULT_TEMPLATE, f, out, rebuf, roundRobinIndex, scratchbuffer, serialNoFormatter, settings, UTF8
 
Fields inherited from interface org.archive.io.ArchiveFileConstants
ABSOLUTE_OFFSET_KEY, CDX, CDX_FILE, CDX_LINE_BUFFER_SIZE, CRLF, DATE_FIELD_KEY, DEFAULT_DIGEST_METHOD, DOT_COMPRESSED_FILE_EXTENSION, DUMP, GZIP_DUMP, HEADER, INVALID_SUFFIX, LENGTH_FIELD_KEY, MIMETYPE_FIELD_KEY, NOHEAD, OCCUPIED_SUFFIX, READER_IDENTIFIER_FIELD_KEY, RECORD_IDENTIFIER_FIELD_KEY, SINGLE_SPACE, TYPE_FIELD_KEY, URL_FIELD_KEY, VERSION_FIELD_KEY
 
Constructor Summary
HBaseWriter(AtomicInteger serialNo, org.archive.io.WriterPoolSettings settings, HBaseParameters parameters)
          Instantiates a new HBaseWriter.
 
Method Summary
protected  byte[] getByteArrayFromInputStream(org.archive.io.ReplayInputStream replayInputStream, int streamSize)
          Reads the given ReplayInputStream into a byte array.
 org.apache.hadoop.hbase.client.HTable getClient()
          Gets the HTable client.
 HBaseParameters getHbaseOptions()
          Gets the HBase options.
protected  void initializeCrawlTable(org.apache.hadoop.conf.Configuration hbaseConfiguration, String hbaseTableName)
          Creates the crawl table in HBase.
protected  void processContent(org.apache.hadoop.hbase.client.Put put, org.archive.io.ReplayInputStream replayInputStream, int streamSize)
          Stub method that subclasses can override for custom content parsing, data manipulation, and populating new columns.
 byte[] serialize(byte[] bytes)
          Implement if you want to serialize bytes in a custom manner.
 void write(org.archive.modules.CrawlURI curi, String ip, org.archive.io.RecordingOutputStream recordingOutputStream, org.archive.io.RecordingInputStream recordingInputStream)
          Write the crawled output to the configured HBase table.
 
Methods inherited from class org.archive.io.WriterPoolMember
checkSize, checkWriteable, close, copyFrom, createFile, createFile, flush, generateNewBasename, getBaseFilename, getFile, getNextDirectory, getOutputStream, getPosition, isCompressed, isOversize, postWriteRecordTasks, preWriteRecordTasks, write, write, write
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HBaseWriter

public HBaseWriter(AtomicInteger serialNo,
                   org.archive.io.WriterPoolSettings settings,
                   HBaseParameters parameters)
            throws IOException
Instantiates a new HBaseWriter.

Parameters:
serialNo - the serial number
settings - the writer pool settings
parameters - the HBase parameters
Throws:
IOException - Signals that an I/O exception has occurred.
Method Detail

getHbaseOptions

public HBaseParameters getHbaseOptions()
Gets the HBase options.

Returns:
the HBase options
See Also:
HBaseParameters

getClient

public org.apache.hadoop.hbase.client.HTable getClient()
Gets the HTable client.

Returns:
the client

initializeCrawlTable

protected void initializeCrawlTable(org.apache.hadoop.conf.Configuration hbaseConfiguration,
                                    String hbaseTableName)
                             throws IOException
Creates the crawl table in HBase.

Parameters:
hbaseConfiguration - the HBase configuration
hbaseTableName - the name of the crawl table
Throws:
IOException - Signals that an I/O exception has occurred.

getByteArrayFromInputStream

protected byte[] getByteArrayFromInputStream(org.archive.io.ReplayInputStream replayInputStream,
                                             int streamSize)
                                      throws IOException
Reads the given ReplayInputStream into a byte array.

Parameters:
replayInputStream - the cell data as a replay input stream
streamSize - the size of the stream
Returns:
the contents of the input stream as a byte array
Throws:
IOException - Signals that an I/O exception has occurred.
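
As a rough illustration of what reading a stream of known size into a byte array involves, here is a minimal sketch. It is not the HBaseWriter implementation: it uses a plain java.io.InputStream as a stand-in for ReplayInputStream, and the class and method names (StreamBytesSketch, readFully) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; a plain InputStream stands in for ReplayInputStream.
class StreamBytesSketch {

    /**
     * Reads exactly streamSize bytes from the stream into a new array.
     * InputStream.read may return fewer bytes than requested, so the
     * loop keeps reading until the buffer is full.
     */
    static byte[] readFully(InputStream in, int streamSize) throws IOException {
        byte[] buffer = new byte[streamSize];
        int offset = 0;
        while (offset < streamSize) {
            int read = in.read(buffer, offset, streamSize - offset);
            if (read < 0) {
                // The stream ended before the declared size was reached.
                throw new EOFException(
                        "Stream ended after " + offset + " of " + streamSize + " bytes");
            }
            offset += read;
        }
        return buffer;
    }

    public static void main(String[] args) throws IOException {
        byte[] source = "crawled content".getBytes(StandardCharsets.UTF_8);
        byte[] copy = readFully(new ByteArrayInputStream(source), source.length);
        System.out.println(new String(copy, StandardCharsets.UTF_8));
    }
}
```

The explicit loop matters because a single read call is allowed to return before the buffer is full.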

processContent

protected void processContent(org.apache.hadoop.hbase.client.Put put,
                              org.archive.io.ReplayInputStream replayInputStream,
                              int streamSize)
                       throws IOException
This stub method exists so that subclasses can override it for custom content parsing, data manipulation, and populating new columns. Examples include HTML parsing, text extraction, analysis, and transformation, with the results stored in new column families or columns through the put object. The values can also be saved to other custom HBase tables or to other remote data sources.

Parameters:
put - the stateful put object containing all the row data to be written.
replayInputStream - the replay input stream containing the raw content fetched by the Heritrix crawler.
streamSize - the stream size
Throws:
IOException - Signals that an I/O exception has occurred.
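
The kind of content parsing an override might perform can be sketched with the standard library alone. The example below is an assumption-laden illustration, not part of HBaseWriter: the class TitleExtractorSketch and its extractTitle method are hypothetical, the content is assumed to be UTF-8 HTML, and the regular expression is a simplification rather than a real HTML parser. An actual override would obtain the bytes from the replay input stream and store the result in a new column on the put object.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper showing one possible "text extraction" step.
class TitleExtractorSketch {

    // Simplified pattern: first <title>...</title> pair, case-insensitive,
    // allowing the title text to span several lines.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the page title with whitespace collapsed, if one is present. */
    static Optional<String> extractTitle(byte[] content) {
        // Assumption: the content is UTF-8 encoded HTML.
        String html = new String(content, StandardCharsets.UTF_8);
        Matcher matcher = TITLE.matcher(html);
        if (!matcher.find()) {
            return Optional.empty();
        }
        String title = matcher.group(1).trim().replaceAll("\\s+", " ");
        return title.isEmpty() ? Optional.empty() : Optional.of(title);
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><title>Example Page</title></head></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(extractTitle(page).orElse("(no title)"));
    }
}
```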

write

public void write(org.archive.modules.CrawlURI curi,
                  String ip,
                  org.archive.io.RecordingOutputStream recordingOutputStream,
                  org.archive.io.RecordingInputStream recordingInputStream)
           throws IOException
Writes the crawled output to the configured HBase table. Each row key is the URL with its domain reversed, and the content is optionally processed.

Parameters:
curi - URI of the crawled document
ip - IP address of the remote machine
recordingOutputStream - recording output stream that captured the GET request
recordingInputStream - recording input stream that captured the response
Throws:
IOException - Signals that an I/O exception has occurred.
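
The description above says only that the row key is the URL with its domain reversed; the exact key layout is not documented here. The following sketch shows one plausible layout, purely as an assumption: the reversed host followed by the raw path and query. The class ReverseDomainKeySketch and its methods are hypothetical and are not part of HBaseWriter.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical illustration of a reversed-domain row key.
class ReverseDomainKeySketch {

    /** Reverses dot-separated host labels: www.example.com becomes com.example.www. */
    static String reverseHost(String host) {
        String[] labels = host.split("\\.");
        StringBuilder reversed = new StringBuilder();
        for (int i = labels.length - 1; i >= 0; i--) {
            reversed.append(labels[i]);
            if (i > 0) {
                reversed.append('.');
            }
        }
        return reversed.toString();
    }

    /** Assumed layout: reversed host, then the raw path and query. */
    static String toRowKey(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        StringBuilder key = new StringBuilder(reverseHost(host.toLowerCase(Locale.ROOT)));
        if (uri.getRawPath() != null) {
            key.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            key.append('?').append(uri.getRawQuery());
        }
        return key.toString();
    }

    public static void main(String[] args) {
        System.out.println(toRowKey("http://www.example.com/a/b?x=1"));
    }
}
```

Because HBase stores rows sorted by key, a reversed host tends to place pages from the same domain next to each other.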

serialize

public byte[] serialize(byte[] bytes)
Description copied from interface: Serializer
Implement if you want to serialize bytes in a custom manner.

Specified by:
serialize in interface Serializer
Parameters:
bytes - the bytes
Returns:
serialized bytes
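
As one example of serializing bytes "in a custom manner", the sketch below gzip-compresses the input. It is only an illustration under stated assumptions: ByteSerializer is a local stand-in that mirrors the single-method shape of Serializer shown above, and GzipSerializerSketch, GzipSerializer, and gunzip are hypothetical names. Nothing here describes what HBaseWriter.serialize actually does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration of a custom byte serializer.
class GzipSerializerSketch {

    /** Local stand-in with the same method shape as Serializer. */
    interface ByteSerializer {
        byte[] serialize(byte[] bytes);
    }

    /** Serializes by gzip-compressing the input bytes. */
    static final class GzipSerializer implements ByteSerializer {
        @Override
        public byte[] serialize(byte[] bytes) {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            // Closing the GZIPOutputStream writes the gzip trailer,
            // so the array is read only after the try block ends.
            try (GZIPOutputStream gzip = new GZIPOutputStream(sink)) {
                gzip.write(bytes);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return sink.toByteArray();
        }
    }

    /** Inverse operation, included only to check the round trip. */
    static byte[] gunzip(byte[] compressed) {
        try (GZIPInputStream gzip =
                     new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gzip.readAllBytes(); // requires Java 9 or later
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] original = new byte[1024]; // all zeros, compresses well
        byte[] packed = new GzipSerializer().serialize(original);
        System.out.println("round trip ok: " + Arrays.equals(original, gunzip(packed)));
    }
}
```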


Copyright © 2007-2012. All Rights Reserved.