com.powerset.heritrix.writer
Class HBaseWriterProcessor

java.lang.Object
  extended by org.archive.modules.Processor
      extended by com.powerset.heritrix.writer.HBaseWriterProcessor
All Implemented Interfaces:
java.io.Closeable, java.io.Serializable, org.archive.state.Initializable, org.archive.state.Module

public class HBaseWriterProcessor
extends org.archive.modules.Processor
implements org.archive.state.Initializable, java.io.Closeable

A Heritrix 2 processor that writes crawled content to Hadoop HBase.

See Also:
Serialized Form

Field Summary
static org.archive.state.Key<java.lang.Integer> CONTENT_MAX_SIZE
          Maximum allowable content size.
static org.archive.state.Key<java.lang.String> MASTER
          Location of the HBase master.
static org.archive.state.Key<java.lang.Integer> POOL_MAX_ACTIVE
          Maximum active files in pool.
static org.archive.state.Key<java.lang.Integer> POOL_MAX_WAIT
          Maximum time to wait on pool element (milliseconds).
static org.archive.state.Key<java.lang.Boolean> PROCESS_ONLY_NEW_RECORDS
          If set to true, only fetch and process URLs that are new row key records.
static org.archive.state.Key<org.archive.modules.net.ServerCache> SERVER_CACHE
          The Constant SERVER_CACHE.
static org.archive.state.Key<java.lang.String> TABLE
          Name of the HBase table to crawl into.
static org.archive.state.Key<java.lang.Long> TOTAL_BYTES_TO_WRITE
          Total file bytes to write to disk.
static org.archive.state.Key<java.lang.Boolean> WRITE_ONLY_NEW_RECORDS
          If set to true, only write URLs that are new row key records.
 
Fields inherited from class org.archive.modules.Processor
DECIDE_RULES, ENABLED
 
Constructor Summary
HBaseWriterProcessor()
          Instantiates a new HBaseWriterProcessor.
 
Method Summary
protected  org.archive.modules.ProcessResult checkBytesWritten(org.archive.state.StateProvider context)
          Check bytes written.
 void close()
           
protected  java.lang.String getHostAddress(org.archive.modules.ProcessorURI curi)
          Return IP address of given URI suitable for recording (as in a classic ARC 5-field header line).
protected  java.lang.String getMaster()
          Gets the master.
protected  int getMaxActive()
          Gets the max active.
protected  int getMaxWait()
          Gets the max wait.
protected  org.archive.io.WriterPool getPool()
          Gets the pool.
protected  java.lang.String getTable()
          Gets the table.
protected  long getTotalBytesWritten()
          Gets the total bytes written.
 void initialTasks(org.archive.state.StateProvider context)
           
protected  void innerProcess(org.archive.modules.ProcessorURI puri)
           
protected  org.archive.modules.ProcessResult innerProcessResult(org.archive.modules.ProcessorURI puri)
           
protected  void setPool(org.archive.io.WriterPool pool)
          Sets the pool.
protected  void setTotalBytesWritten(long b)
          Sets the total bytes written.
protected  void setupPool()
          Setup pool.
protected  boolean shouldProcess(org.archive.modules.ProcessorURI uri)
           
protected  boolean shouldWrite(org.archive.modules.ProcessorURI curi)
          Whether the given ProcessorURI should be written to archive files.
protected  org.archive.modules.ProcessResult write(org.archive.modules.ProcessorURI curi, long recordLength, java.io.InputStream in, java.lang.String ip)
          Write.
 
Methods inherited from class org.archive.modules.Processor
flattenVia, getRecordedSize, getURICount, hasRfc2617CredentialAvatar, innerRejectProcess, isSuccess, process, report
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

MASTER

@Immutable
public static final org.archive.state.Key<java.lang.String> MASTER
Location of the HBase master.


TABLE

@Immutable
public static final org.archive.state.Key<java.lang.String> TABLE
Name of the HBase table to crawl into.


WRITE_ONLY_NEW_RECORDS

@Immutable
public static final org.archive.state.Key<java.lang.Boolean> WRITE_ONLY_NEW_RECORDS
If set to true, only write URLs that are new row key records. Default is false, which writes all URLs to the HBase table. Heritrix already avoids fetching the same URL twice within a single session, so this setting exists so that you can run multiple sessions of the same crawl configuration without writing the same URL to the given HBase table more than once.
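
The effect of this setting can be illustrated with a minimal sketch. It is not the processor's actual implementation: the in-memory `existingRowKeys` set stands in for the HBase table, the URL itself is assumed to serve as the row key, and the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: an in-memory set stands in for the row keys
// already stored in the HBase table.
final class WriteOnlyNewRecordsSketch {
    private final Set<String> existingRowKeys = new HashSet<>();
    private final boolean writeOnlyNewRecords;

    WriteOnlyNewRecordsSketch(boolean writeOnlyNewRecords) {
        this.writeOnlyNewRecords = writeOnlyNewRecords;
    }

    // Returns true if the URL was written, false if it was skipped.
    boolean write(String url) {
        if (writeOnlyNewRecords && existingRowKeys.contains(url)) {
            return false; // row key already present, e.g. from an earlier session
        }
        existingRowKeys.add(url); // assumption: the URL is used as the row key
        return true;
    }

    public static void main(String[] args) {
        WriteOnlyNewRecordsSketch table = new WriteOnlyNewRecordsSketch(true);
        System.out.println(table.write("http://example.com/")); // first session
        System.out.println(table.write("http://example.com/")); // later session
    }
}
```

With the flag set to false, both calls in `main` would report a write, which corresponds to the default behavior described above.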


PROCESS_ONLY_NEW_RECORDS

@Immutable
public static final org.archive.state.Key<java.lang.Boolean> PROCESS_ONLY_NEW_RECORDS
If set to true, only fetch and process URLs that are new row key records. Default is false, which processes all URLs regardless of whether they already exist in the HBase table. When set to true, Heritrix won't even download and traverse a URL that already exists in the HBase table.
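
The difference between this setting and WRITE_ONLY_NEW_RECORDS is where an existing URL is dropped: before the fetch, or after the fetch but before the write. The following sketch models that distinction under stated assumptions. The `Outcome` enum, the `decide` method, and the use of a plain `Set` in place of the HBase table are all hypothetical and are not part of this class.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical sketch of how the two "only new records" settings could interact.
final class NewRecordPolicySketch {
    enum Outcome { SKIPPED_BEFORE_FETCH, FETCHED_NOT_WRITTEN, FETCHED_AND_WRITTEN }

    static Outcome decide(String url, Set<String> existingRowKeys,
                          boolean processOnlyNew, boolean writeOnlyNew) {
        boolean exists = existingRowKeys.contains(url);
        if (processOnlyNew && exists) {
            return Outcome.SKIPPED_BEFORE_FETCH; // never downloaded or traversed
        }
        if (writeOnlyNew && exists) {
            return Outcome.FETCHED_NOT_WRITTEN;  // downloaded and traversed, not stored again
        }
        return Outcome.FETCHED_AND_WRITTEN;
    }

    public static void main(String[] args) {
        Set<String> table = Collections.singleton("http://example.com/");
        System.out.println(decide("http://example.com/", table, true, false));
        System.out.println(decide("http://example.com/", table, false, true));
        System.out.println(decide("http://example.com/new", table, true, true));
    }
}
```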


POOL_MAX_ACTIVE

@Immutable
public static final org.archive.state.Key<java.lang.Integer> POOL_MAX_ACTIVE
Maximum active files in pool. This setting cannot be varied over the life of a crawl.


POOL_MAX_WAIT

@Immutable
public static final org.archive.state.Key<java.lang.Integer> POOL_MAX_WAIT
Maximum time to wait on pool element (milliseconds). This setting cannot be varied over the life of a crawl.
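
As a rough illustration of how a maximum-active limit and a maximum-wait timeout work together, the sketch below bounds concurrent borrowers with a `java.util.concurrent.Semaphore`. It is an assumption-laden model, not the `org.archive.io.WriterPool` implementation: `BoundedPoolSketch`, `borrow`, and `giveBack` are hypothetical names.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: at most maxActive borrowers at once, each waiting
// no longer than maxWaitMillis for a free slot.
final class BoundedPoolSketch {
    private final Semaphore permits;
    private final long maxWaitMillis;

    BoundedPoolSketch(int maxActive, long maxWaitMillis) {
        this.permits = new Semaphore(maxActive);
        this.maxWaitMillis = maxWaitMillis;
    }

    // Returns true if a slot was obtained within the wait limit.
    boolean borrow() {
        try {
            return permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            return false;
        }
    }

    // Returns a previously borrowed slot to the pool.
    void giveBack() {
        permits.release();
    }

    public static void main(String[] args) {
        BoundedPoolSketch pool = new BoundedPoolSketch(1, 50);
        System.out.println(pool.borrow()); // slot available
        System.out.println(pool.borrow()); // pool exhausted, gives up after 50 ms
        pool.giveBack();
        System.out.println(pool.borrow()); // slot available again
    }
}
```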


SERVER_CACHE

@Immutable
public static final org.archive.state.Key<org.archive.modules.net.ServerCache> SERVER_CACHE
The Constant SERVER_CACHE.


CONTENT_MAX_SIZE

@Immutable
public static final org.archive.state.Key<java.lang.Integer> CONTENT_MAX_SIZE
Maximum allowable content size.


TOTAL_BYTES_TO_WRITE

@Immutable
@Expert
public static final org.archive.state.Key<java.lang.Long> TOTAL_BYTES_TO_WRITE
Total file bytes to write to disk. Once the size of all files on disk has exceeded this limit, this processor will stop the crawler. A value of zero means no upper limit.
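
The limit semantics described above (zero disables the limit, otherwise the crawl stops once the total is exceeded) can be sketched as follows. This is an illustration only: `limitExceeded` is a hypothetical helper, and treating negative limits as "no limit" and requiring the total to strictly exceed the limit are assumptions that the description does not settle.

```java
// Hypothetical sketch of a total-bytes limit check.
final class ByteLimitSketch {
    // Returns true when writing should stop.
    static boolean limitExceeded(long totalBytesWritten, long totalBytesToWrite) {
        if (totalBytesToWrite <= 0) {
            return false; // zero (and, by assumption, negative) means no upper limit
        }
        return totalBytesWritten > totalBytesToWrite; // assumption: strictly exceeded
    }

    public static void main(String[] args) {
        System.out.println(limitExceeded(5_000L, 0L));     // no limit configured
        System.out.println(limitExceeded(5_000L, 4_096L)); // limit passed
        System.out.println(limitExceeded(1_000L, 4_096L)); // still under the limit
    }
}
```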

Constructor Detail

HBaseWriterProcessor

public HBaseWriterProcessor()
Instantiates a new HBaseWriterProcessor.

Method Detail

initialTasks

public void initialTasks(org.archive.state.StateProvider context)
Specified by:
initialTasks in interface org.archive.state.Initializable

getMaster

protected java.lang.String getMaster()
Gets the master.

Returns:
the master

getTable

protected java.lang.String getTable()
Gets the table.

Returns:
the table

setupPool

protected void setupPool()
Setup pool.


getMaxActive

protected int getMaxActive()
Gets the max active.

Returns:
the max active

getMaxWait

protected int getMaxWait()
Gets the max wait.

Returns:
the max wait

setPool

protected void setPool(org.archive.io.WriterPool pool)
Sets the pool.

Parameters:
pool - the new pool

getPool

protected org.archive.io.WriterPool getPool()
Gets the pool.

Returns:
the pool

getTotalBytesWritten

protected long getTotalBytesWritten()
Gets the total bytes written.

Returns:
the total bytes written

setTotalBytesWritten

protected void setTotalBytesWritten(long b)
Sets the total bytes written.

Parameters:
b - the new total bytes written

innerProcessResult

protected org.archive.modules.ProcessResult innerProcessResult(org.archive.modules.ProcessorURI puri)
Overrides:
innerProcessResult in class org.archive.modules.Processor

getHostAddress

protected java.lang.String getHostAddress(org.archive.modules.ProcessorURI curi)
Return IP address of given URI suitable for recording (as in a classic ARC 5-field header line).

Parameters:
curi - ProcessorURI
Returns:
String of IP address

shouldWrite

protected boolean shouldWrite(org.archive.modules.ProcessorURI curi)
Whether the given ProcessorURI should be written to archive files. Annotates ProcessorURI with a reason for any negative answer.

Parameters:
curi - ProcessorURI
Returns:
true if URI should be written; false otherwise
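
The "decide, and annotate on a negative answer" pattern can be sketched as below. The criteria this method actually applies are not documented here, so the rule used (a non-positive fetch status means nothing worth writing), the annotation text, and the `FetchedUri` stand-in for ProcessorURI are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a shouldWrite-style check that records why it said no.
final class ShouldWriteSketch {
    // Minimal stand-in for ProcessorURI: a fetch status plus free-form annotations.
    static final class FetchedUri {
        final int fetchStatus;
        final List<String> annotations = new ArrayList<>();

        FetchedUri(int fetchStatus) {
            this.fetchStatus = fetchStatus;
        }
    }

    static boolean shouldWrite(FetchedUri uri) {
        if (uri.fetchStatus <= 0) {                  // assumed rule, for illustration only
            uri.annotations.add("unwritten:status"); // assumed annotation text
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        FetchedUri failed = new FetchedUri(-1);
        System.out.println(shouldWrite(failed) + " " + failed.annotations);
    }
}
```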

write

protected org.archive.modules.ProcessResult write(org.archive.modules.ProcessorURI curi,
                                                  long recordLength,
                                                  java.io.InputStream in,
                                                  java.lang.String ip)
                                           throws java.io.IOException
Write.

Parameters:
curi - the ProcessorURI being written
recordLength - the record length
in - the input stream to read the record from
ip - the IP address to record for the URI
Returns:
the process result
Throws:
java.io.IOException - Signals that an I/O exception has occurred.

checkBytesWritten

protected org.archive.modules.ProcessResult checkBytesWritten(org.archive.state.StateProvider context)
Check bytes written.

Parameters:
context - the context
Returns:
the process result

innerProcess

protected void innerProcess(org.archive.modules.ProcessorURI puri)
Specified by:
innerProcess in class org.archive.modules.Processor

close

public void close()
Specified by:
close in interface java.io.Closeable

shouldProcess

protected boolean shouldProcess(org.archive.modules.ProcessorURI uri)
Specified by:
shouldProcess in class org.archive.modules.Processor


Copyright © 2007-2009. All Rights Reserved.