com.powerset.heritrix.writer
Class HBaseWriterProcessor

java.lang.Object
  extended by org.archive.modules.Processor
      extended by com.powerset.heritrix.writer.HBaseWriterProcessor
All Implemented Interfaces:
Closeable, Serializable, org.archive.state.Initializable, org.archive.state.Module

public class HBaseWriterProcessor
extends org.archive.modules.Processor
implements org.archive.state.Initializable, Closeable

A Heritrix 2 processor that writes to Hadoop HBase.

See Also:
Serialized Form

Field Summary
static org.archive.state.Key<Integer> CONTENT_MAX_SIZE
          Maximum allowable content size.
static org.archive.state.Key<Integer> POOL_MAX_ACTIVE
          Maximum active files in pool.
static org.archive.state.Key<Integer> POOL_MAX_WAIT
          Maximum time to wait on pool element (milliseconds).
static org.archive.state.Key<Boolean> PROCESS_ONLY_NEW_RECORDS
          If set to true, process only URLs that are new rowkey records.
static org.archive.state.Key<org.archive.modules.net.ServerCache> SERVER_CACHE
          The Constant SERVER_CACHE.
static org.archive.state.Key<String> TABLE
          Name of the HBase table to crawl into.
static org.archive.state.Key<Long> TOTAL_BYTES_TO_WRITE
          Total file bytes to write to disk.
static org.archive.state.Key<Boolean> WRITE_ONLY_NEW_RECORDS
          If set to true, write only URLs that are new rowkey records.
static org.archive.state.Key<Integer> ZKCLIENTPORT
          The port on which clients connect to the ZooKeeper quorum hosts.
static org.archive.state.Key<String> ZKQUORUM
          Comma-separated list of hostnames in the ZooKeeper quorum.
 
Fields inherited from class org.archive.modules.Processor
DECIDE_RULES, ENABLED
 
Constructor Summary
HBaseWriterProcessor()
          Instantiates a new HBaseWriterProcessor.
 
Method Summary
protected  org.archive.modules.ProcessResult checkBytesWritten(org.archive.state.StateProvider context)
          Check bytes written.
 void close()
           
protected  String getHostAddress(org.archive.modules.ProcessorURI curi)
          Return IP address of given URI suitable for recording (as in a classic ARC 5-field header line).
protected  int getMaxActive()
          Gets the max active.
protected  int getMaxWait()
          Gets the max wait.
protected  org.archive.io.WriterPool getPool()
          Gets the pool.
protected  String getTable()
          Gets the table.
protected  long getTotalBytesWritten()
          Gets the total bytes written.
protected  int getZKClientPort()
          Gets the zookeeper client port.
protected  String getZKQuorum()
          Gets the zookeeper quorum.
 void initialTasks(org.archive.state.StateProvider context)
           
protected  void innerProcess(org.archive.modules.ProcessorURI puri)
           
protected  org.archive.modules.ProcessResult innerProcessResult(org.archive.modules.ProcessorURI puri)
           
protected  void setPool(org.archive.io.WriterPool pool)
          Sets the pool.
protected  void setTotalBytesWritten(long b)
          Sets the total bytes written.
protected  void setupPool()
          Setup pool.
protected  boolean shouldProcess(org.archive.modules.ProcessorURI uri)
           
protected  boolean shouldWrite(org.archive.modules.ProcessorURI curi)
          Whether the given ProcessorURI should be written to archive files.
protected  org.archive.modules.ProcessResult write(org.archive.modules.ProcessorURI curi, long recordLength, InputStream in, String ip)
          Write.
 
Methods inherited from class org.archive.modules.Processor
flattenVia, getRecordedSize, getURICount, hasRfc2617CredentialAvatar, innerRejectProcess, isSuccess, process, report
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

ZKQUORUM

@Immutable
public static final org.archive.state.Key<String> ZKQUORUM
Comma-separated list of hostnames in the ZooKeeper quorum.


ZKCLIENTPORT

@Immutable
public static final org.archive.state.Key<Integer> ZKCLIENTPORT
The port on which clients connect to the ZooKeeper quorum hosts.
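
The following sketch illustrates how the ZKQUORUM and ZKCLIENTPORT values relate to each other: one hostname list, one shared client port. The class name, method name, and example hostnames are hypothetical and are not part of HBaseWriterProcessor; 2181 is ZooKeeper's conventional default client port.

```java
import java.util.ArrayList;
import java.util.List;

public class ZkQuorumExample {

    /**
     * Splits a comma-separated quorum string and pairs each hostname
     * with the shared client port. Blank entries are ignored.
     */
    public static List<String> toHostPorts(String quorum, int clientPort) {
        List<String> result = new ArrayList<>();
        for (String host : quorum.split(",")) {
            String trimmed = host.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed + ":" + clientPort);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical hostnames, used only for illustration.
        System.out.println(toHostPorts("zk1.example.org, zk2.example.org", 2181));
    }
}
```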


TABLE

@Immutable
public static final org.archive.state.Key<String> TABLE
Name of the HBase table to crawl into.


WRITE_ONLY_NEW_RECORDS

@Immutable
public static final org.archive.state.Key<Boolean> WRITE_ONLY_NEW_RECORDS
If set to true, write only URLs that are new rowkey records. The default is false, which writes all URLs to the HBase table. Heritrix is good at not hitting the same URL twice, so this option exists to let you run multiple sessions of the same crawl configuration without writing the same URL more than once to the same HBase table. For example, you may want to recrawl a site to see which URLs have been added over time, or continue where you left off after a terminated crawl. Heritrix itself supports this functionality through "Checkpoints" taken during a crawl session, so this option may not be necessary.
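
A minimal sketch of the write-once behavior described above, under the assumption that a row key identifies a URL. The in-memory set stands in for the HBase table, and the class and its method are hypothetical; the actual processor would consult HBase instead.

```java
import java.util.HashSet;
import java.util.Set;

public class WriteOnlyNewRecordsExample {

    // Stand-in for the row keys already present in the HBase table.
    private final Set<String> existingRowKeys = new HashSet<>();
    private final boolean writeOnlyNewRecords;

    public WriteOnlyNewRecordsExample(boolean writeOnlyNewRecords) {
        this.writeOnlyNewRecords = writeOnlyNewRecords;
    }

    /** Returns true if the record was written, false if it was skipped. */
    public boolean write(String rowKey) {
        if (writeOnlyNewRecords && existingRowKeys.contains(rowKey)) {
            return false; // row key already exists, so skip the write
        }
        existingRowKeys.add(rowKey);
        return true;
    }

    public static void main(String[] args) {
        WriteOnlyNewRecordsExample writer = new WriteOnlyNewRecordsExample(true);
        System.out.println(writer.write("http://example.org/")); // prints true
        System.out.println(writer.write("http://example.org/")); // prints false
    }
}
```

With the flag set to false, both calls return true, which matches the default of writing every URL.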


PROCESS_ONLY_NEW_RECORDS

@Immutable
public static final org.archive.state.Key<Boolean> PROCESS_ONLY_NEW_RECORDS
If set to true, process only URLs that are new rowkey records. The default is false, which processes all URLs. When set to true, Heritrix won't even fetch and parse the content served at a URL if that URL already exists as a rowkey in the HBase table.
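
To illustrate the effect of this flag on fetching, the sketch below counts how many URLs would be fetched in one pass. It assumes the URL itself serves as the row key and uses an in-memory set in place of the HBase table; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ProcessOnlyNewRecordsExample {

    /** Counts the URLs that would be fetched under the given flag value. */
    public static int countFetches(List<String> urls,
                                   Set<String> existingRowKeys,
                                   boolean processOnlyNewRecords) {
        int fetches = 0;
        for (String url : urls) {
            // Assumption: the URL is used directly as the row key.
            if (processOnlyNewRecords && existingRowKeys.contains(url)) {
                continue; // skipped before any fetch or parse
            }
            fetches++;
        }
        return fetches;
    }

    public static void main(String[] args) {
        Set<String> existing = new HashSet<>(Arrays.asList("http://example.org/a"));
        List<String> urls = Arrays.asList("http://example.org/a", "http://example.org/b");
        System.out.println(countFetches(urls, existing, true));  // prints 1
        System.out.println(countFetches(urls, existing, false)); // prints 2
    }
}
```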


POOL_MAX_ACTIVE

@Immutable
public static final org.archive.state.Key<Integer> POOL_MAX_ACTIVE
Maximum active files in pool. This setting cannot be varied over the life of a crawl.


POOL_MAX_WAIT

@Immutable
public static final org.archive.state.Key<Integer> POOL_MAX_WAIT
Maximum time to wait on pool element (milliseconds). This setting cannot be varied over the life of a crawl.
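
The interplay of a maximum active count and a maximum wait time can be sketched with a counting semaphore. This is an illustration of the semantics only, not the implementation of org.archive.io.WriterPool; the class and method names are hypothetical.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class BoundedPoolExample {

    private final Semaphore permits;
    private final long maxWaitMillis;

    public BoundedPoolExample(int maxActive, long maxWaitMillis) {
        this.permits = new Semaphore(maxActive);
        this.maxWaitMillis = maxWaitMillis;
    }

    /**
     * Tries to borrow a pool element, waiting at most maxWaitMillis.
     * Returns false if no element became free in time.
     */
    public boolean borrow() {
        try {
            return permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            return false;
        }
    }

    /** Returns a previously borrowed element to the pool. */
    public void giveBack() {
        permits.release();
    }
}
```

With maxActive set to 1, a second borrow() call returns false after the wait elapses unless giveBack() is called first.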


SERVER_CACHE

@Immutable
public static final org.archive.state.Key<org.archive.modules.net.ServerCache> SERVER_CACHE
The Constant SERVER_CACHE.


CONTENT_MAX_SIZE

@Immutable
public static final org.archive.state.Key<Integer> CONTENT_MAX_SIZE
Maximum allowable content size.


TOTAL_BYTES_TO_WRITE

@Immutable
@Expert
public static final org.archive.state.Key<Long> TOTAL_BYTES_TO_WRITE
Total file bytes to write to disk. Once the size of all files on disk has exceeded this limit, this processor will stop the crawler. A value of zero means no upper limit.
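
The limit rule above, including the special meaning of zero, can be expressed as a single predicate. This is a sketch with hypothetical names; how checkBytesWritten actually evaluates the limit is not shown in this documentation.

```java
public class ByteLimitExample {

    /**
     * Returns true when the bytes written so far exceed the configured limit.
     * A limit of zero means there is no upper limit.
     */
    public static boolean limitExceeded(long totalBytesWritten, long totalBytesToWrite) {
        return totalBytesToWrite > 0 && totalBytesWritten > totalBytesToWrite;
    }

    public static void main(String[] args) {
        System.out.println(limitExceeded(2048, 1024)); // prints true
        System.out.println(limitExceeded(1024, 1024)); // prints false
        System.out.println(limitExceeded(2048, 0));    // prints false
    }
}
```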

Constructor Detail

HBaseWriterProcessor

public HBaseWriterProcessor()
Instantiates a new HBaseWriterProcessor.

Method Detail

initialTasks

public void initialTasks(org.archive.state.StateProvider context)
Specified by:
initialTasks in interface org.archive.state.Initializable

innerProcessResult

protected org.archive.modules.ProcessResult innerProcessResult(org.archive.modules.ProcessorURI puri)
Overrides:
innerProcessResult in class org.archive.modules.Processor

getHostAddress

protected String getHostAddress(org.archive.modules.ProcessorURI curi)
Return IP address of given URI suitable for recording (as in a classic ARC 5-field header line).

Parameters:
curi - ProcessorURI
Returns:
String of IP address
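
For context, a classic ARC version 1 URL record line consists of five space-separated fields: URL, IP address, archive date, content type, and length. The sketch below shows where the IP address sits in such a line. The class and method are hypothetical, and whether HBaseWriterProcessor stores the address in this form is not stated here.

```java
public class ArcHeaderLineExample {

    /** Joins the five ARC v1 header fields with single spaces. */
    public static String headerLine(String url, String ip, String archiveDate,
                                    String contentType, long length) {
        return String.join(" ", url, ip, archiveDate, contentType, Long.toString(length));
    }

    public static void main(String[] args) {
        // 192.0.2.1 is a documentation-only address; all values are illustrative.
        System.out.println(headerLine(
                "http://example.org/", "192.0.2.1", "20090101000000", "text/html", 1234));
    }
}
```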

shouldProcess

protected boolean shouldProcess(org.archive.modules.ProcessorURI uri)
Specified by:
shouldProcess in class org.archive.modules.Processor

shouldWrite

protected boolean shouldWrite(org.archive.modules.ProcessorURI curi)
Whether the given ProcessorURI should be written to archive files. Annotates ProcessorURI with a reason for any negative answer.

Parameters:
curi - ProcessorURI
Returns:
true if URI should be written; false otherwise

write

protected org.archive.modules.ProcessResult write(org.archive.modules.ProcessorURI curi,
                                                  long recordLength,
                                                  InputStream in,
                                                  String ip)
                                           throws IOException
Write.

Parameters:
curi - the ProcessorURI to write
recordLength - the record length
in - the input stream to read the record from
ip - the IP address
Returns:
the process result
Throws:
IOException - Signals that an I/O exception has occurred.

checkBytesWritten

protected org.archive.modules.ProcessResult checkBytesWritten(org.archive.state.StateProvider context)
Check bytes written.

Parameters:
context - the context
Returns:
the process result

setupPool

protected void setupPool()
Setup pool.


getZKQuorum

protected String getZKQuorum()
Gets the zookeeper quorum.

Returns:
the zkQuorum

getZKClientPort

protected int getZKClientPort()
Gets the zookeeper client port.

Returns:
the zkClientPort

getTable

protected String getTable()
Gets the table.

Returns:
the table

getMaxActive

protected int getMaxActive()
Gets the max active.

Returns:
the max active

getMaxWait

protected int getMaxWait()
Gets the max wait.

Returns:
the max wait

setPool

protected void setPool(org.archive.io.WriterPool pool)
Sets the pool.

Parameters:
pool - the new pool

getPool

protected org.archive.io.WriterPool getPool()
Gets the pool.

Returns:
the pool

getTotalBytesWritten

protected long getTotalBytesWritten()
Gets the total bytes written.

Returns:
the total bytes written

setTotalBytesWritten

protected void setTotalBytesWritten(long b)
Sets the total bytes written.

Parameters:
b - the new total bytes written

innerProcess

protected void innerProcess(org.archive.modules.ProcessorURI puri)
Specified by:
innerProcess in class org.archive.modules.Processor

close

public void close()
Specified by:
close in interface Closeable


Copyright © 2007-2009. All Rights Reserved.