com.powerset.heritrix.writer
Class HBaseWriter
java.lang.Object
org.archive.io.WriterPoolMember
com.powerset.heritrix.writer.HBaseWriter
- All Implemented Interfaces:
- org.archive.io.ArchiveFileConstants
public class HBaseWriter
- extends org.archive.io.WriterPoolMember
- implements org.archive.io.ArchiveFileConstants
Writes crawled documents to HBase. Puts the document content into the
'content:' column family and everything else into the 'curi:' column family.
The row key is derived from a transformation of the URL. Creates the table
if it does not exist.
Limitations: Hard-coded table schema.
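This page does not say which URL transformation HBaseWriter applies when it builds the row key. As a rough illustration of the idea only, the sketch below reverses the host labels so that pages from the same domain sort next to each other. The class name RowKeySketch and the method toRowKey are hypothetical and are not part of this API; the actual transformation may differ.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper for illustration; NOT part of HBaseWriter. */
final class RowKeySketch {

    /**
     * Assumed transformation: reverse the host labels and append the path
     * and query, e.g. http://www.example.com/a -> com.example.www/a
     */
    static String toRowKey(String url) {
        URI uri = URI.create(url); // throws IllegalArgumentException on bad syntax
        String host = uri.getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        StringBuilder key = new StringBuilder();
        // Walk the labels from right to left to reverse the domain.
        for (int i = labels.length - 1; i >= 0; i--) {
            key.append(labels[i]);
            if (i > 0) {
                key.append('.');
            }
        }
        String path = uri.getRawPath();
        key.append(path == null || path.isEmpty() ? "/" : path);
        if (uri.getRawQuery() != null) {
            key.append('?').append(uri.getRawQuery());
        }
        return key.toString();
    }
}
```

A key of this shape keeps all rows for one site in a contiguous range, which is one common reason to transform the URL rather than use it verbatim.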
| Fields inherited from class org.archive.io.WriterPoolMember |
DEFAULT_PREFIX, DEFAULT_SUFFIX, HOSTNAME_VARIABLE, UTF8 |
| Fields inherited from interface org.archive.io.ArchiveFileConstants |
ABSOLUTE_OFFSET_KEY, CDX, CDX_FILE, CDX_LINE_BUFFER_SIZE, COMPRESSED_FILE_EXTENSION, CRLF, DATE_FIELD_KEY, DEFAULT_DIGEST_METHOD, DOT_COMPRESSED_FILE_EXTENSION, DUMP, GZIP_DUMP, HEADER, INVALID_SUFFIX, LENGTH_FIELD_KEY, MIMETYPE_FIELD_KEY, NOHEAD, OCCUPIED_SUFFIX, READER_IDENTIFIER_FIELD_KEY, RECORD_IDENTIFIER_FIELD_KEY, SINGLE_SPACE, TYPE_FIELD_KEY, URL_FIELD_KEY, VERSION_FIELD_KEY |
| Constructor Summary |
HBaseWriter(java.lang.String master, java.lang.String table) |
| Method Summary |
protected void | createCrawlTable(org.apache.hadoop.hbase.HBaseConfiguration c, java.lang.String table) |
org.apache.hadoop.hbase.client.HTable | getClient() |
protected void | processContent(org.apache.hadoop.hbase.io.BatchUpdate bu)
This is a stub method and is here to allow extension/overriding for
custom content parsing, data manipulation and to populate new columns. |
void | write(org.archive.modules.ProcessorURI curi, java.lang.String ip, org.archive.io.RecordingOutputStream ros, org.archive.io.RecordingInputStream ris) |
| Methods inherited from class org.archive.io.WriterPoolMember |
checkSize, checkWriteable, close, copyFrom, createFile, createFile, flush, getBaseFilename, getCreateTimestamp, getFile, getNextDirectory, getOutputStream, getPosition, getTimestampSerialNo, getTimestampSerialNo, isCompressed, postWriteRecordTasks, preWriteRecordTasks, readFullyFrom, readToLimitFrom, write, write, write |
| Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
CONTENT_COLUMN_FAMILY
public static final java.lang.String CONTENT_COLUMN_FAMILY
- See Also:
- Constant Field Values
CONTENT_COLUMN
public static final java.lang.String CONTENT_COLUMN
- See Also:
- Constant Field Values
CURI_COLUMN_FAMILY
public static final java.lang.String CURI_COLUMN_FAMILY
- See Also:
- Constant Field Values
HBaseWriter
public HBaseWriter(java.lang.String master,
java.lang.String table)
throws java.io.IOException
- Throws:
java.io.IOException
getClient
public org.apache.hadoop.hbase.client.HTable getClient()
createCrawlTable
protected void createCrawlTable(org.apache.hadoop.hbase.HBaseConfiguration c,
java.lang.String table)
throws java.io.IOException
- Throws:
java.io.IOException
write
public void write(org.archive.modules.ProcessorURI curi,
java.lang.String ip,
org.archive.io.RecordingOutputStream ros,
org.archive.io.RecordingInputStream ris)
throws java.io.IOException
- Parameters:
curi - URI of the crawled document.
ip - IP of the remote machine.
ros - recording output stream that captured the GET request.
ris - recording input stream that captured the response.
- Throws:
java.io.IOException
processContent
protected void processContent(org.apache.hadoop.hbase.io.BatchUpdate bu)
- This is a stub method that exists to allow extension/overriding for
custom content parsing, data manipulation, and populating new columns.
For example: HTML parsing, text extraction, analysis, and transformation,
with the results stored in new column families/columns using the batch
update object.
- Parameters:
bu - the batch update object to which new columns can be added
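To make the extension point more concrete, here is a minimal sketch of the kind of work an override of processContent might do: extract plain text from stored HTML and write it to a new column. So that it runs without the HBase and Heritrix jars, it uses a plain Map as a stand-in for the BatchUpdate object. The class name ProcessContentSketch, the column names "content:raw" and "content:text", and the crude regex-based tag stripping are all assumptions made for illustration; they are not taken from HBaseWriter.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical illustration; NOT the HBaseWriter implementation. */
final class ProcessContentSketch {

    // Assumed column names, chosen only for this example.
    static final String RAW_COLUMN = "content:raw";
    static final String TEXT_COLUMN = "content:text";

    /** Crude tag stripper; a real crawl would use a proper HTML parser. */
    static String extractText(String html) {
        // Drop script and style elements together with their bodies.
        String noScripts = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        // Replace every remaining tag with a space.
        String noTags = noScripts.replaceAll("(?s)<[^>]*>", " ");
        // Collapse whitespace runs into single spaces.
        return noTags.replaceAll("\\s+", " ").trim();
    }

    /**
     * Stand-in for processContent(BatchUpdate bu): the map plays the role
     * of the batch update, keyed by column name.
     */
    static void processContent(Map<String, byte[]> row) {
        byte[] raw = row.get(RAW_COLUMN);
        if (raw == null) {
            return; // nothing to derive from
        }
        String text = extractText(new String(raw, StandardCharsets.UTF_8));
        row.put(TEXT_COLUMN, text.getBytes(StandardCharsets.UTF_8));
    }
}
```

In a real subclass the same logic would sit inside an overridden processContent and would read from and write to the BatchUpdate passed in, rather than a Map.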
Copyright © 2007-2009. All Rights Reserved.