java.lang.Object
  org.archive.modules.Processor
    org.archive.modules.writer.WriterPoolProcessor
      org.archive.modules.writer.HBaseWriterProcessor

public class HBaseWriterProcessor
A Heritrix 3 processor that writes to Hadoop HBase. The following example shows how to configure it in the crawl job configuration.
<bean id="hbaseParameterSettings" class="org.archive.io.hbase.HBaseParameters">
  <!-- These settings are required -->
  <property name="zkQuorum" value="localhost" />
  <property name="hbaseTableName" value="crawl" />
  <!-- This should reflect your installation, but 2181 is the default -->
  <property name="zkPort" value="2181" />
  <!-- All other settings are optional -->
  <property name="onlyProcessNewRecords" value="false" />
  <property name="onlyWriteNewRecords" value="false" />
  <property name="contentColumnFamily" value="newcontent" />
  <property name="defaultMaxFileSizeInBytes" value="26214400" />
  <!-- 25 * 1024 * 1024 = 26214400 bytes -->
  <!-- Overwrite more options here -->
</bean>
<bean id="hbaseWriterProcessor" class="org.archive.modules.writer.HBaseWriterProcessor">
  <property name="hbaseParameters">
    <ref bean="hbaseParameterSettings"/>
  </property>
</bean>
<bean id="dispositionProcessors" class="org.archive.modules.DispositionChain">
  <property name="processors">
    <list>
      <ref bean="hbaseWriterProcessor"/>
      <!-- other references -->
    </list>
  </property>
</bean>
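
As a rough pre-flight check, the following sketch parses a bean definition like the one above with the JDK's DOM parser and reports which of the settings marked as required are missing. It is an illustration only: the class `HBaseBeanCheck`, the method `missingRequired`, the `<beans>` wrapper element, and the reading of `zkQuorum` and `hbaseTableName` as the required set (taken from the comments in the example) are assumptions, not part of Heritrix or of the HBase writer.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Heritrix.
final class HBaseBeanCheck {
    // Assumption: these are the settings the example's comments mark as required.
    static final List<String> REQUIRED = List.of("zkQuorum", "hbaseTableName");

    static final String PARAMS_CLASS = "org.archive.io.hbase.HBaseParameters";

    // Trimmed copy of the example configuration, wrapped in a root element.
    static final String SAMPLE =
          "<beans>"
        + "<bean id=\"hbaseParameterSettings\" class=\"" + PARAMS_CLASS + "\">"
        + "<property name=\"zkQuorum\" value=\"localhost\" />"
        + "<property name=\"hbaseTableName\" value=\"crawl\" />"
        + "<property name=\"zkPort\" value=\"2181\" />"
        + "</bean>"
        + "</beans>";

    // Returns the required property names that no HBaseParameters bean defines.
    static List<String> missingRequired(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        Set<String> present = new HashSet<>();
        NodeList beans = doc.getElementsByTagName("bean");
        for (int i = 0; i < beans.getLength(); i++) {
            Element bean = (Element) beans.item(i);
            if (!PARAMS_CLASS.equals(bean.getAttribute("class"))) {
                continue; // only inspect the parameter bean
            }
            NodeList props = bean.getElementsByTagName("property");
            for (int j = 0; j < props.getLength(); j++) {
                present.add(((Element) props.item(j)).getAttribute("name"));
            }
        }

        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            if (!present.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws Exception {
        System.out.println("missing required properties: " + missingRequired(SAMPLE));
    }
}
```

A check like this only looks at property names in the XML; it does not verify that the ZooKeeper quorum is reachable or that the table exists.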
See org.archive.io.hbase.HBaseParameters for defining hbaseParameters.

| Field Summary |
|---|

| Fields inherited from class org.archive.modules.writer.WriterPoolProcessor |
|---|
| ANNOTATION_UNWRITTEN, directory, frequentFlushes, serverCache, writeBufferSize |

| Fields inherited from class org.archive.modules.Processor |
|---|
| kp, recoveryCheckpoint, uriCount |

| Constructor Summary |
|---|
| HBaseWriterProcessor() |

| Method Summary | |
|---|---|
| HBaseParameters | getHbaseParameters() - Gets the hbase parameters. |
| List<String> | getMetadata() |
| org.archive.uid.RecordIDGenerator | getRecordIDGenerator() |
| protected org.archive.modules.ProcessResult | innerProcessResult(org.archive.modules.CrawlURI uri) |
| void | setHbaseParameters(HBaseParameters options) - Sets the hbase parameters. |
| protected void | setupPool(AtomicInteger serial) |
| protected boolean | shouldProcess(org.archive.modules.CrawlURI curi) |
| protected boolean | shouldWrite(org.archive.modules.CrawlURI curi) - Whether the given CrawlURI should be written to archive files. |
| protected org.archive.modules.ProcessResult | write(org.archive.modules.CrawlURI curi, long recordLength, InputStream in) - Write to HBase. |

| Methods inherited from class org.archive.modules.writer.WriterPoolProcessor |
|---|
| calcOutputDirs, checkBytesWritten, copyForwardWriteTagIfDupe, doCheckpoint, fromCheckpointJson, getCompress, getDirectory, getFrequentFlushes, getHostAddress, getMaxFileSizeBytes, getMaxTotalBytesToWrite, getMaxWaitForIdleMs, getMetadataProvider, getPool, getPoolMaxActive, getPrefix, getSerialNo, getServerCache, getSkipIdenticalDigests, getStorePaths, getTemplate, getTotalBytesWritten, getWriteBufferSize, innerProcess, innerRejectProcess, setCompress, setDirectory, setFrequentFlushes, setMaxFileSizeBytes, setMaxTotalBytesToWrite, setMaxWaitForIdleMs, setMetadataProvider, setPool, setPoolMaxActive, setPrefix, setServerCache, setSkipIdenticalDigests, setStorePaths, setTemplate, setTotalBytesWritten, setWriteBufferSize, start, stop, toCheckpointJson |

| Methods inherited from class org.archive.modules.Processor |
|---|
| finishCheckpoint, flattenVia, getBeanName, getEnabled, getKeyedProperties, getRecordedSize, getShouldProcessRule, getURICount, hasHttpAuthenticationCredential, isRunning, isSuccess, process, report, setBeanName, setEnabled, setRecoveryCheckpoint, setShouldProcessRule, startCheckpoint |

| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |

| Methods inherited from interface org.archive.io.WriterPoolSettings |
|---|
| calcOutputDirs, getCompress, getFrequentFlushes, getMaxFileSizeBytes, getPrefix, getTemplate, getWriteBufferSize |

| Methods inherited from interface org.springframework.context.Lifecycle |
|---|
| isRunning |

| Methods inherited from interface org.archive.checkpointing.Checkpointable |
|---|
| finishCheckpoint, setRecoveryCheckpoint, startCheckpoint |

| Constructor Detail |
|---|

public HBaseWriterProcessor()

| Method Detail |
|---|

public HBaseParameters getHbaseParameters()

public void setHbaseParameters(HBaseParameters options)
  options - the new hbase parameters

protected void setupPool(AtomicInteger serial)
  setupPool in class org.archive.modules.writer.WriterPoolProcessor

protected org.archive.modules.ProcessResult innerProcessResult(org.archive.modules.CrawlURI uri)
  innerProcessResult in class org.archive.modules.writer.WriterPoolProcessor

protected boolean shouldProcess(org.archive.modules.CrawlURI curi)
  shouldProcess in class org.archive.modules.writer.WriterPoolProcessor

protected boolean shouldWrite(org.archive.modules.CrawlURI curi)
  shouldWrite in class org.archive.modules.writer.WriterPoolProcessor
  curi - CrawlURI

protected org.archive.modules.ProcessResult write(org.archive.modules.CrawlURI curi,
                                                  long recordLength,
                                                  InputStream in)
                                           throws IOException
  curi - the curi
  recordLength - the record length
  in - the in
  IOException - Signals that an I/O exception has occurred.

public List<String> getMetadata()
  getMetadata in interface org.archive.io.WriterPoolSettings
  getMetadata in class org.archive.modules.writer.WriterPoolProcessor

public org.archive.uid.RecordIDGenerator getRecordIDGenerator()
  getRecordIDGenerator in interface org.archive.io.warc.WARCWriterPoolSettings
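
The `hbaseParameters` property used in the bean configuration corresponds to the `getHbaseParameters()` / `setHbaseParameters(HBaseParameters)` pair listed above, because Spring resolves `<property name="...">` through the standard JavaBeans naming rules. The sketch below shows that mapping using only the JDK; the class `BeanPropertyNames` and its method `propertyNameForSetter` are hypothetical helpers written for this illustration and are not part of Heritrix or Spring.

```java
import java.beans.Introspector;

// Hypothetical helper illustrating the JavaBeans setter-to-property naming rule.
final class BeanPropertyNames {

    // Derives the bean property name from a setter name, e.g. "setFoo" -> "foo".
    static String propertyNameForSetter(String setterName) {
        if (setterName == null || !setterName.startsWith("set") || setterName.length() <= 3) {
            throw new IllegalArgumentException("not a setter name: " + setterName);
        }
        // Introspector.decapitalize applies the JavaBeans rule for the leading character(s).
        return Introspector.decapitalize(setterName.substring(3));
    }

    public static void main(String[] args) {
        // The setter documented for HBaseWriterProcessor.
        System.out.println(propertyNameForSetter("setHbaseParameters"));
    }
}
```

Under the same rule, each `<property>` name in the `HBaseParameters` bean would be expected to have a matching setter on that class; this is inferred from the convention rather than stated on this page.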