com.powerset.heritrix.writer
Class HBaseWriter

java.lang.Object
  extended by org.archive.io.WriterPoolMember
      extended by com.powerset.heritrix.writer.HBaseWriter
All Implemented Interfaces:
org.archive.io.ArchiveFileConstants

public class HBaseWriter
extends org.archive.io.WriterPoolMember
implements org.archive.io.ArchiveFileConstants

Writes crawl output to HBase. Puts the content into the 'content:' column and everything else into the 'curi:' column family. Builds the row key from a transformation of the URL. Creates the table if it does not exist.
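
This page does not say which URL transformation the class applies. One common convention for crawl tables is to reverse the host labels so that pages from the same domain sort next to each other. The following is a minimal sketch under that assumption; `RowKeySketch` and `rowKey` are hypothetical names and are not part of `HBaseWriter`.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RowKeySketch {

    /**
     * Hypothetical transformation (an assumption, not the documented behavior):
     * reverse the dot-separated host labels, then append the raw path and query.
     */
    static String rowKey(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            // No host to reverse; fall back to the URL unchanged.
            return url;
        }
        List<String> labels = new ArrayList<>(Arrays.asList(host.split("\\.")));
        Collections.reverse(labels);
        String path = uri.getRawPath() == null ? "" : uri.getRawPath();
        String query = uri.getRawQuery() == null ? "" : "?" + uri.getRawQuery();
        return String.join(".", labels) + path + query;
    }

    public static void main(String[] args) {
        // Prints the key derived from a sample URL.
        System.out.println(rowKey("http://www.example.com/a/b.html"));
    }
}
```

With keys of this shape, a range scan over a reversed-host prefix would cover a whole domain; whether `HBaseWriter` produces keys like this should be checked against its source.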

Limitations: Hard-coded table schema.


Field Summary
static java.lang.String CONTENT_COLUMN
           
static java.lang.String CONTENT_COLUMN_FAMILY
           
static java.lang.String CURI_COLUMN_FAMILY
           
 
Fields inherited from class org.archive.io.WriterPoolMember
DEFAULT_PREFIX, DEFAULT_SUFFIX, HOSTNAME_VARIABLE, UTF8
 
Fields inherited from interface org.archive.io.ArchiveFileConstants
ABSOLUTE_OFFSET_KEY, CDX, CDX_FILE, CDX_LINE_BUFFER_SIZE, COMPRESSED_FILE_EXTENSION, CRLF, DATE_FIELD_KEY, DEFAULT_DIGEST_METHOD, DOT_COMPRESSED_FILE_EXTENSION, DUMP, GZIP_DUMP, HEADER, INVALID_SUFFIX, LENGTH_FIELD_KEY, MIMETYPE_FIELD_KEY, NOHEAD, OCCUPIED_SUFFIX, READER_IDENTIFIER_FIELD_KEY, RECORD_IDENTIFIER_FIELD_KEY, SINGLE_SPACE, TYPE_FIELD_KEY, URL_FIELD_KEY, VERSION_FIELD_KEY
 
Constructor Summary
HBaseWriter(java.lang.String master, java.lang.String table)
           
 
Method Summary
protected  void createCrawlTable(org.apache.hadoop.hbase.HBaseConfiguration c, java.lang.String table)
           
 org.apache.hadoop.hbase.client.HTable getClient()
           
protected  void processContent(org.apache.hadoop.hbase.io.BatchUpdate bu)
          Stub method that subclasses can override to add custom content parsing or data manipulation and to populate new columns.
 void write(org.archive.modules.ProcessorURI curi, java.lang.String ip, org.archive.io.RecordingOutputStream ros, org.archive.io.RecordingInputStream ris)
           
 
Methods inherited from class org.archive.io.WriterPoolMember
checkSize, checkWriteable, close, copyFrom, createFile, createFile, flush, getBaseFilename, getCreateTimestamp, getFile, getNextDirectory, getOutputStream, getPosition, getTimestampSerialNo, getTimestampSerialNo, isCompressed, postWriteRecordTasks, preWriteRecordTasks, readFullyFrom, readToLimitFrom, write, write, write
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

CONTENT_COLUMN_FAMILY

public static final java.lang.String CONTENT_COLUMN_FAMILY
See Also:
Constant Field Values

CONTENT_COLUMN

public static final java.lang.String CONTENT_COLUMN
See Also:
Constant Field Values

CURI_COLUMN_FAMILY

public static final java.lang.String CURI_COLUMN_FAMILY
See Also:
Constant Field Values
Constructor Detail

HBaseWriter

public HBaseWriter(java.lang.String master,
                   java.lang.String table)
            throws java.io.IOException
Throws:
java.io.IOException
Method Detail

getClient

public org.apache.hadoop.hbase.client.HTable getClient()

createCrawlTable

protected void createCrawlTable(org.apache.hadoop.hbase.HBaseConfiguration c,
                                java.lang.String table)
                         throws java.io.IOException
Throws:
java.io.IOException

write

public void write(org.archive.modules.ProcessorURI curi,
                  java.lang.String ip,
                  org.archive.io.RecordingOutputStream ros,
                  org.archive.io.RecordingInputStream ris)
           throws java.io.IOException
Parameters:
curi - URI of crawled document
ip - IP of remote machine.
ros - recording output stream that captured the GET request
ris - recording input stream that captured the response
Throws:
java.io.IOException

processContent

protected void processContent(org.apache.hadoop.hbase.io.BatchUpdate bu)
Stub method that subclasses can override to add custom content parsing or data manipulation and to populate new columns. For example: HTML parsing, text extraction, analysis, and transformation, with the results stored in new column families or columns through the batch update object.

Parameters:
bu - the batch update object to which new columns can be added
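
As an illustration of the kind of content parsing an override of `processContent` might delegate to, here is a minimal, self-contained sketch that pulls the `<title>` text out of fetched HTML. `TitleExtractorSketch` and `extractTitle` are hypothetical names, and the sketch assumes the content bytes are UTF-8. The calls needed to read from or write to the `BatchUpdate` are deliberately left out because this page does not document them.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractorSketch {

    // Matches the first <title>...</title> element, ignoring case and line breaks.
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /**
     * Returns the trimmed title text, or an empty string if no title is found.
     * Assumption: the content is UTF-8 encoded HTML.
     */
    static String extractTitle(byte[] content) {
        String html = new String(content, StandardCharsets.UTF_8);
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><title> Example page </title></head></html>"
                .getBytes(StandardCharsets.UTF_8);
        // A subclass of HBaseWriter could call extractTitle from its
        // processContent override and store the result in a new column.
        System.out.println(extractTitle(page));
    }
}
```

A regular expression is enough for a sketch like this; a real override would more likely use a proper HTML parser.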


Copyright © 2007-2009. All Rights Reserved.