com.powerset.heritrix.writer
Class HBaseWriterProcessor

java.lang.Object
  extended by org.archive.modules.Processor
      extended by com.powerset.heritrix.writer.HBaseWriterProcessor
All Implemented Interfaces:
java.io.Closeable, java.io.Serializable, org.archive.state.Initializable, org.archive.state.Module

public class HBaseWriterProcessor
extends org.archive.modules.Processor
implements org.archive.state.Initializable, java.io.Closeable

A Heritrix 2 processor that writes to Hadoop HBase.

Author:
stack
See Also:
Serialized Form

Field Summary
static org.archive.state.Key<java.lang.String> MASTER
          Location of the HBase master.
static org.archive.state.Key<java.lang.Integer> POOL_MAX_ACTIVE
          Maximum active files in pool.
static org.archive.state.Key<java.lang.Integer> POOL_MAX_WAIT
          Maximum time to wait on pool element (milliseconds).
static org.archive.state.Key<org.archive.modules.net.ServerCache> SERVER_CACHE
           
static org.archive.state.Key<java.lang.String> TABLE
          HBase table to crawl into.
static org.archive.state.Key<java.lang.Long> TOTAL_BYTES_TO_WRITE
          Total file bytes to write to disk.
 
Fields inherited from class org.archive.modules.Processor
DECIDE_RULES, ENABLED
 
Constructor Summary
HBaseWriterProcessor()
           
 
Method Summary
protected  org.archive.modules.ProcessResult checkBytesWritten(org.archive.state.StateProvider context)
           
 void close()
           
protected  java.lang.String getHostAddress(org.archive.modules.ProcessorURI curi)
          Return IP address of given URI suitable for recording (as in a classic ARC 5-field header line).
protected  java.lang.String getMaster()
           
protected  int getMaxActive()
           
protected  int getMaxWait()
           
protected  org.archive.io.WriterPool getPool()
           
protected  java.lang.String getTable()
           
protected  long getTotalBytesWritten()
           
 void initialTasks(org.archive.state.StateProvider context)
           
protected  void innerProcess(org.archive.modules.ProcessorURI puri)
           
protected  org.archive.modules.ProcessResult innerProcessResult(org.archive.modules.ProcessorURI puri)
           
protected  void setPool(org.archive.io.WriterPool pool)
           
protected  void setTotalBytesWritten(long b)
           
protected  void setupPool()
           
protected  boolean shouldProcess(org.archive.modules.ProcessorURI uri)
           
protected  boolean shouldWrite(org.archive.modules.ProcessorURI curi)
          Whether the given ProcessorURI should be written to archive files.
protected  org.archive.modules.ProcessResult write(org.archive.modules.ProcessorURI curi, long recordLength, java.io.InputStream in, java.lang.String ip)
           
 
Methods inherited from class org.archive.modules.Processor
flattenVia, getRecordedSize, getURICount, hasRfc2617CredentialAvatar, innerRejectProcess, isSuccess, process, report
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MASTER

@Immutable
public static final org.archive.state.Key<java.lang.String> MASTER
Location of the HBase master.


TABLE

@Immutable
public static final org.archive.state.Key<java.lang.String> TABLE
HBase table to crawl into.


POOL_MAX_ACTIVE

@Immutable
public static final org.archive.state.Key<java.lang.Integer> POOL_MAX_ACTIVE
Maximum active files in pool. This setting cannot be varied over the life of a crawl.


POOL_MAX_WAIT

@Immutable
public static final org.archive.state.Key<java.lang.Integer> POOL_MAX_WAIT
Maximum time to wait on pool element (milliseconds). This setting cannot be varied over the life of a crawl.
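
The interplay of POOL_MAX_ACTIVE and POOL_MAX_WAIT can be illustrated with a minimal sketch. It is not the org.archive.io.WriterPool implementation; the class BoundedPoolSketch and its methods borrow and giveBack are hypothetical names, and a java.util.concurrent.Semaphore stands in for the real pool. It only shows the general idea: at most maxActive elements are out at once, and a caller waits at most maxWaitMillis for one to become free.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not the Heritrix WriterPool.
public class BoundedPoolSketch {
    private final Semaphore permits;   // one permit per pool element
    private final long maxWaitMillis;  // plays the role of POOL_MAX_WAIT

    public BoundedPoolSketch(int maxActive, long maxWaitMillis) {
        this.permits = new Semaphore(maxActive); // plays the role of POOL_MAX_ACTIVE
        this.maxWaitMillis = maxWaitMillis;
    }

    // Try to take an element, waiting at most maxWaitMillis.
    // Returns false if none became free in time or the thread was interrupted.
    public boolean borrow() {
        try {
            return permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return false;
        }
    }

    // Return an element to the pool.
    public void giveBack() {
        permits.release();
    }

    public static void main(String[] args) {
        BoundedPoolSketch pool = new BoundedPoolSketch(1, 50);
        System.out.println(pool.borrow()); // true: the single element is free
        System.out.println(pool.borrow()); // false: times out after about 50 ms
        pool.giveBack();
        System.out.println(pool.borrow()); // true: the element was returned
    }
}
```

Both values are fixed at construction time in this sketch, which mirrors the note that these settings cannot be varied over the life of a crawl.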


SERVER_CACHE

@Immutable
public static final org.archive.state.Key<org.archive.modules.net.ServerCache> SERVER_CACHE

TOTAL_BYTES_TO_WRITE

@Immutable
@Expert
public static final org.archive.state.Key<java.lang.Long> TOTAL_BYTES_TO_WRITE
Total file bytes to write to disk. Once the size of all files on disk has exceeded this limit, this processor will stop the crawler. A value of zero means no upper limit.
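
The limit semantics described above can be sketched as follows. This is an assumption-laden illustration, not the code of checkBytesWritten: the class ByteLimitSketch and the method limitExceeded are hypothetical names, and the handling of the exact boundary (a total equal to the limit) and of negative values is assumed rather than taken from the source.

```java
// Hypothetical illustration only; not the actual checkBytesWritten logic.
public class ByteLimitSketch {

    // Returns true when the crawl should stop.
    // maxBytes == 0 means "no upper limit"; treating negative values
    // the same way is an assumption of this sketch.
    public static boolean limitExceeded(long totalBytesWritten, long maxBytes) {
        return maxBytes > 0 && totalBytesWritten > maxBytes;
    }

    public static void main(String[] args) {
        System.out.println(limitExceeded(500L, 0L));     // false: zero disables the limit
        System.out.println(limitExceeded(500L, 1000L));  // false: still under the limit
        System.out.println(limitExceeded(1001L, 1000L)); // true: limit exceeded
    }
}
```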

Constructor Detail

HBaseWriterProcessor

public HBaseWriterProcessor()

Method Detail

initialTasks

public void initialTasks(org.archive.state.StateProvider context)
Specified by:
initialTasks in interface org.archive.state.Initializable

getMaster

protected java.lang.String getMaster()

getTable

protected java.lang.String getTable()

setupPool

protected void setupPool()

getMaxActive

protected int getMaxActive()

getMaxWait

protected int getMaxWait()

setPool

protected void setPool(org.archive.io.WriterPool pool)

getPool

protected org.archive.io.WriterPool getPool()

getTotalBytesWritten

protected long getTotalBytesWritten()

setTotalBytesWritten

protected void setTotalBytesWritten(long b)

innerProcessResult

protected org.archive.modules.ProcessResult innerProcessResult(org.archive.modules.ProcessorURI puri)
Overrides:
innerProcessResult in class org.archive.modules.Processor

getHostAddress

protected java.lang.String getHostAddress(org.archive.modules.ProcessorURI curi)
Return IP address of given URI suitable for recording (as in a classic ARC 5-field header line).

Parameters:
curi - ProcessorURI
Returns:
String of IP address
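
For context, the classic ARC version 1 URL-record header is a single line of five space-separated fields: URL, IP address, archive date, content type, and archive length. The sketch below shows where the string returned by a method like getHostAddress would sit in such a line. The class ArcHeaderSketch, the method headerLine, and all sample values are hypothetical; this processor writes to HBase, and nothing here is taken from its implementation.

```java
// Hypothetical illustration of an ARC v1 five-field header line.
public class ArcHeaderSketch {

    // Joins the five fields with single spaces, in ARC v1 order:
    // URL, IP address, archive date (YYYYMMDDhhmmss), content type, length.
    public static String headerLine(String url, String ip, String date14,
                                    String contentType, long length) {
        return url + " " + ip + " " + date14 + " " + contentType + " " + length;
    }

    public static void main(String[] args) {
        // Sample values are made up; 192.0.2.1 is a documentation-only address.
        System.out.println(headerLine("http://example.com/", "192.0.2.1",
                "20080101000000", "text/html", 1234L));
    }
}
```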

shouldWrite

protected boolean shouldWrite(org.archive.modules.ProcessorURI curi)
Whether the given ProcessorURI should be written to archive files. Annotates ProcessorURI with a reason for any negative answer.

Parameters:
curi - ProcessorURI
Returns:
true if URI should be written; false otherwise

write

protected org.archive.modules.ProcessResult write(org.archive.modules.ProcessorURI curi,
                                                  long recordLength,
                                                  java.io.InputStream in,
                                                  java.lang.String ip)
                                           throws java.io.IOException
Throws:
java.io.IOException

checkBytesWritten

protected org.archive.modules.ProcessResult checkBytesWritten(org.archive.state.StateProvider context)

innerProcess

protected void innerProcess(org.archive.modules.ProcessorURI puri)
Specified by:
innerProcess in class org.archive.modules.Processor

close

public void close()
Specified by:
close in interface java.io.Closeable

shouldProcess

protected boolean shouldProcess(org.archive.modules.ProcessorURI uri)
Specified by:
shouldProcess in class org.archive.modules.Processor


Copyright © 2007-2008. All Rights Reserved.