java.lang.Object
  org.archive.io.WriterPoolMember
    com.powerset.heritrix.writer.HBaseWriter
public class HBaseWriter
Writes crawled content as records to an HBase table. Content goes into the 'content:raw_data' column and everything else into the 'curi:' column family. The row key is derived from a transformation of the URL. The table is created if it does not exist. By default, the following columns are written: content:raw_data, curi:ip, curi:path-from-seed, curi:is-seed, curi:via, curi:url, curi:request.
Limitations: Hard-coded table schema.
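The description above does not say which URL transformation produces the row key. As an illustration only, the sketch below assumes a reversed-host scheme, a common choice for crawl tables because pages from the same domain then sort next to each other. The class `RowKeySketch` and the method `toRowKey` are hypothetical names and are not part of HBaseWriter; note that this sketch also drops the URL scheme.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper, not part of HBaseWriter: one possible URL-to-row-key transformation. */
class RowKeySketch {

    /** Reverses the host labels of a URL and appends port, path, and query unchanged. */
    static String toRowKey(String url) {
        URI uri = URI.create(url);
        String host = uri.getHost();
        if (host == null) {
            return url; // nothing to transform; fall back to the raw URL
        }
        List<String> labels = new ArrayList<>(Arrays.asList(host.split("\\.")));
        Collections.reverse(labels);
        StringBuilder key = new StringBuilder(String.join(".", labels));
        if (uri.getPort() != -1) {
            key.append(':').append(uri.getPort());
        }
        String path = uri.getRawPath();
        key.append(path == null || path.isEmpty() ? "/" : path);
        if (uri.getRawQuery() != null) {
            key.append('?').append(uri.getRawQuery());
        }
        return key.toString();
    }

    public static void main(String[] args) {
        // prints com.example.www/index.html
        System.out.println(toRowKey("http://www.example.com/index.html"));
    }
}
```

Whatever transformation the writer really applies, readers of the table need to apply the same one when looking up a row by URL.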
| Field Summary | |
|---|---|
| static String | CONTENT_COLUMN_FAMILY - The Constant CONTENT_COLUMN_FAMILY. |
| static String | CONTENT_COLUMN_NAME - The Constant CONTENT_COLUMN_NAME. |
| static String | CURI_COLUMN_FAMILY - The Constant CURI_COLUMN_FAMILY. |
| Fields inherited from class org.archive.io.WriterPoolMember |
|---|
DEFAULT_PREFIX, DEFAULT_SUFFIX, HOSTNAME_VARIABLE, UTF8 |
| Fields inherited from interface org.archive.io.ArchiveFileConstants |
|---|
ABSOLUTE_OFFSET_KEY, CDX, CDX_FILE, CDX_LINE_BUFFER_SIZE, COMPRESSED_FILE_EXTENSION, CRLF, DATE_FIELD_KEY, DEFAULT_DIGEST_METHOD, DOT_COMPRESSED_FILE_EXTENSION, DUMP, GZIP_DUMP, HEADER, INVALID_SUFFIX, LENGTH_FIELD_KEY, MIMETYPE_FIELD_KEY, NOHEAD, OCCUPIED_SUFFIX, READER_IDENTIFIER_FIELD_KEY, RECORD_IDENTIFIER_FIELD_KEY, SINGLE_SPACE, TYPE_FIELD_KEY, URL_FIELD_KEY, VERSION_FIELD_KEY |
| Constructor Summary | |
|---|---|
| HBaseWriter(String zkQuorum, int zkClientPort, String tableName) | Instantiates a new HBaseWriter for the WriterPool to use in Heritrix 2. |
| Method Summary | |
|---|---|
| protected void | createCrawlTable(org.apache.hadoop.hbase.HBaseConfiguration hbaseConfiguration, String hbaseTableName) - Creates the crawl table in HBase. |
| protected byte[] | getByteArrayFromInputStream(org.archive.io.ReplayInputStream replayInputStream, int streamSize) - Reads the ReplayInputStream into a byte array. |
| org.apache.hadoop.hbase.client.HTable | getClient() - Gets the HTable client. |
| protected void | processContent(org.apache.hadoop.hbase.client.Put put, org.archive.io.ReplayInputStream replayInputStream, int streamSize) - Stub method that subclasses can override for custom content parsing, data manipulation, and populating new columns. |
| void | write(org.archive.modules.ProcessorURI curi, String ip, org.archive.io.RecordingOutputStream ros, org.archive.io.RecordingInputStream ris) - Writes the crawled output to the configured HBase table. |
| Methods inherited from class org.archive.io.WriterPoolMember |
|---|
checkSize, checkWriteable, close, copyFrom, createFile, createFile, flush, getBaseFilename, getCreateTimestamp, getFile, getNextDirectory, getOutputStream, getPosition, getTimestampSerialNo, getTimestampSerialNo, isCompressed, postWriteRecordTasks, preWriteRecordTasks, readFullyFrom, readToLimitFrom, write, write, write |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Field Detail |
|---|
public static final String CONTENT_COLUMN_FAMILY
public static final String CONTENT_COLUMN_NAME
public static final String CURI_COLUMN_FAMILY
| Constructor Detail |
|---|
public HBaseWriter(String zkQuorum,
int zkClientPort,
String tableName)
throws IOException
zkQuorum - the ZooKeeper quorum: the list of hosts that make up your ZooKeeper quorum, e.g. zkHost1,zkHost2,zkHost3
zkClientPort - the ZooKeeper client port
tableName - the HBase table to write to, e.g. webtable
IOException - Signals that an I/O exception has occurred.
| Method Detail |
|---|
public org.apache.hadoop.hbase.client.HTable getClient()
protected void createCrawlTable(org.apache.hadoop.hbase.HBaseConfiguration hbaseConfiguration,
String hbaseTableName)
throws IOException
hbaseConfiguration - the HBase configuration
hbaseTableName - the name of the crawl table to create
IOException - Signals that an I/O exception has occurred.
public void write(org.archive.modules.ProcessorURI curi,
String ip,
org.archive.io.RecordingOutputStream ros,
org.archive.io.RecordingInputStream ris)
throws IOException
curi - URI of the crawled document
ip - IP of the remote machine
ros - recording output stream that captured the GET request
ris - recording input stream that captured the response
IOException - Signals that an I/O exception has occurred.
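To make the default column layout concrete, the following minimal sketch collects one crawl result into a map keyed by `family:qualifier`, using the column names listed in the class description. It is not the writer's implementation: `CrawlRowSketch`, `toColumns`, the UTF-8 encoding of text values, and the choice to skip null values are all assumptions made for illustration, and a plain `Map` stands in for the HBase row object.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the row that write() populates; a Map replaces the HBase Put. */
class CrawlRowSketch {

    /** Maps crawl metadata and captured bytes onto the documented default columns. */
    static Map<String, byte[]> toColumns(String url, String ip, String via,
            String pathFromSeed, boolean isSeed, byte[] request, byte[] rawData) {
        Map<String, byte[]> columns = new LinkedHashMap<>();
        columns.put("content:raw_data", rawData);        // response body
        putText(columns, "curi:ip", ip);
        putText(columns, "curi:path-from-seed", pathFromSeed);
        putText(columns, "curi:is-seed", Boolean.toString(isSeed));
        putText(columns, "curi:via", via);                // assumed absent when null
        putText(columns, "curi:url", url);
        columns.put("curi:request", request);             // captured request
        return columns;
    }

    /** Stores a text value as UTF-8 bytes; skipping nulls is an assumption of this sketch. */
    private static void putText(Map<String, byte[]> columns, String column, String value) {
        if (value != null) {
            columns.put(column, value.getBytes(StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) {
        byte[] request = "GET / HTTP/1.0\r\n\r\n".getBytes(StandardCharsets.UTF_8);
        byte[] body = "<html></html>".getBytes(StandardCharsets.UTF_8);
        Map<String, byte[]> row =
                toColumns("http://example.com/", "192.0.2.1", null, "", true, request, body);
        System.out.println(row.keySet());
    }
}
```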
protected byte[] getByteArrayFromInputStream(org.archive.io.ReplayInputStream replayInputStream,
int streamSize)
throws IOException
replayInputStream - the cell data as a replay input stream
streamSize - the size of the stream
IOException - Signals that an I/O exception has occurred.
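Reading a stream of known size into a byte array can be sketched as below. This is a generic illustration, not the method's actual body: a plain `java.io.InputStream` stands in for `ReplayInputStream`, `StreamBytesSketch` and `readBytes` are hypothetical names, and truncating the result when the stream ends early is an assumption of the sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical helper: reads up to streamSize bytes from a stream into an array. */
class StreamBytesSketch {

    static byte[] readBytes(InputStream in, int streamSize) throws IOException {
        byte[] buffer = new byte[streamSize];
        int offset = 0;
        // A single read() may return fewer bytes than requested, so loop until full or EOF.
        while (offset < streamSize) {
            int read = in.read(buffer, offset, streamSize - offset);
            if (read == -1) {
                break; // stream ended before streamSize bytes arrived
            }
            offset += read;
        }
        // Trim the array if the stream was shorter than announced.
        return offset == streamSize ? buffer : Arrays.copyOf(buffer, offset);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] copy = readBytes(new ByteArrayInputStream(data), data.length);
        System.out.println(copy.length); // prints 5
    }
}
```

Looping matters because `InputStream.read(byte[], int, int)` is allowed to return after a partial read even when more data will follow.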
protected void processContent(org.apache.hadoop.hbase.client.Put put,
org.archive.io.ReplayInputStream replayInputStream,
int streamSize)
throws IOException
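put - the HBase row object whose state can be manipulated before the object is written
replayInputStream - the content as a replay input stream
streamSize - the size of the stream
IOException - Signals that an I/O exception has occurred.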
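The extension point can be pictured with a small stand-alone sketch of the override pattern. Everything here is hypothetical: `WriterSketch` stands in for HBaseWriter, a `Map` stands in for the `Put`, a plain `InputStream` stands in for `ReplayInputStream`, and `content:title` is an invented column. The sketch also assumes the hook is called after the default columns are set, which the documentation above does not confirm.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the writer: builds a row, then calls the overridable hook. */
class WriterSketch {

    Map<String, byte[]> buildRow(byte[] content) throws IOException {
        Map<String, byte[]> row = new LinkedHashMap<>();
        row.put("content:raw_data", content);
        processContent(row, new ByteArrayInputStream(content), content.length);
        return row;
    }

    /** Stub hook: does nothing unless a subclass overrides it. */
    protected void processContent(Map<String, byte[]> row, InputStream in, int streamSize)
            throws IOException {
    }
}

/** Hypothetical subclass that populates one extra column from the page title. */
class TitleWriterSketch extends WriterSketch {

    @Override
    protected void processContent(Map<String, byte[]> row, InputStream in, int streamSize)
            throws IOException {
        byte[] bytes = new byte[streamSize];
        int offset = 0;
        while (offset < streamSize) {
            int read = in.read(bytes, offset, streamSize - offset);
            if (read == -1) {
                break;
            }
            offset += read;
        }
        String html = new String(bytes, 0, offset, StandardCharsets.UTF_8);
        int start = html.indexOf("<title>");
        int end = html.indexOf("</title>");
        if (start >= 0 && end > start) {
            String title = html.substring(start + "<title>".length(), end);
            row.put("content:title", title.getBytes(StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] page = "<html><title>Hi</title></html>".getBytes(StandardCharsets.UTF_8);
        System.out.println(new TitleWriterSketch().buildRow(page).keySet());
    }
}
```

Because the table schema is hard-coded, a real subclass that adds columns would also need the target column family to exist in the table.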