public class HBaseWriter extends org.archive.io.WriterPoolMember implements Serializer
Inherited fields: countOut, currentBasename, currentTimestamp, DEFAULT_PREFIX, DEFAULT_TEMPLATE, f, out, rebuf, roundRobinIndex, scratchbuffer, serialNoFormatter, settings, UTF8, ABSOLUTE_OFFSET_KEY, CDX, CDX_FILE, CDX_LINE_BUFFER_SIZE, CRLF, DATE_FIELD_KEY, DEFAULT_DIGEST_METHOD, DOT_COMPRESSED_FILE_EXTENSION, DUMP, GZIP_DUMP, HEADER, INVALID_SUFFIX, LENGTH_FIELD_KEY, MIMETYPE_FIELD_KEY, NOHEAD, OCCUPIED_SUFFIX, READER_IDENTIFIER_FIELD_KEY, RECORD_IDENTIFIER_FIELD_KEY, SINGLE_SPACE, TYPE_FIELD_KEY, URL_FIELD_KEY, VERSION_FIELD_KEY

| Constructor | Description |
|---|---|
| HBaseWriter(AtomicInteger serialNo, org.archive.io.WriterPoolSettings settings, HBaseParameters parameters) | Instantiates a new HBase writer. |

| Modifier and Type | Method | Description |
|---|---|---|
| static String | createRowKeyFromUrl(String url) | This is a stub method and is here to allow extension/overriding for custom content parsing, data manipulation and to populate new columns. |
| static String | createUrlFromRowKey(String rowKey) | |
| static byte[] | getByteArrayFromInputStream(org.archive.io.ReplayInputStream replayInputStream, int streamSize) | Read the ReplayInputStream and write it to the given BatchUpdate with the given column. |
| HBaseParameters | getHbaseParameters() | Gets the hbase options. |
| org.apache.hadoop.hbase.client.HTable | getHTable() | Gets the HTable client. |
| protected void | initializeCrawlTable(org.apache.hadoop.conf.Configuration hbaseConfiguration, String hbaseTableName) | Creates the crawl table in HBase. |
| byte[] | serialize(byte[] bytes) | Implement if you want to serialize bytes in a custom manner. |
| void | write(HBaseWriterProcessor hBaseWriterProcessor, org.archive.modules.CrawlURI curi, String ip, org.archive.io.RecordingOutputStream recordingOutputStream, org.archive.io.RecordingInputStream recordingInputStream, long recordedSize) | Write the crawled output to the configured HBase table. |
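
The summary lists createRowKeyFromUrl and createUrlFromRowKey as a pair, but this page does not say how the key is derived. The sketch below shows one common convention for URL-based row keys, reversing the host labels so that pages from the same domain sort next to each other in a lexicographically ordered table. It is an assumption for illustration only: the class RowKeySketch and the methods reverseLabels, swapHost, toRowKey and toUrl are hypothetical names, not part of HBaseWriter, and the real methods may use a different scheme.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of a reversible URL <-> row key mapping.
// This is NOT the HBaseWriter implementation.
class RowKeySketch {

    // Reverses dot-separated host labels: "www.example.com" -> "com.example.www".
    static String reverseLabels(String host) {
        List<String> labels = Arrays.asList(host.split("\\."));
        Collections.reverse(labels);
        return String.join(".", labels);
    }

    // Replaces the host part of an absolute URL with its reversed form.
    // Assumes the authority has no port, user-info or trailing dot.
    // Input without "://" is returned unchanged.
    static String swapHost(String value) {
        int schemeEnd = value.indexOf("://");
        if (schemeEnd < 0) {
            return value;
        }
        int hostStart = schemeEnd + 3;
        int hostEnd = value.indexOf('/', hostStart);
        if (hostEnd < 0) {
            hostEnd = value.length();
        }
        String host = value.substring(hostStart, hostEnd);
        return value.substring(0, hostStart) + reverseLabels(host) + value.substring(hostEnd);
    }

    // Label reversal is its own inverse, so both directions share one helper.
    static String toRowKey(String url) {
        return swapHost(url);
    }

    static String toUrl(String rowKey) {
        return swapHost(rowKey);
    }

    public static void main(String[] args) {
        String key = toRowKey("http://www.example.com/index.html");
        System.out.println(key);        // http://com.example.www/index.html
        System.out.println(toUrl(key)); // http://www.example.com/index.html
    }
}
```

Because the mapping is symmetric, a round trip through toRowKey and toUrl returns the original URL for inputs that satisfy the stated assumptions.
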
Inherited methods: checkSize, checkWriteable, close, copyFrom, createFile, createFile, flush, generateNewBasename, getBaseFilename, getFile, getNextDirectory, getOutputStream, getPosition, isCompressed, isOversize, postWriteRecordTasks, preWriteRecordTasks, write, write, write

public HBaseWriter(AtomicInteger serialNo, org.archive.io.WriterPoolSettings settings, HBaseParameters parameters) throws IOException
Parameters:
serialNo - the serial no
settings - the settings
parameters - the parameters
Throws:
IOException - Signals that an I/O exception has occurred.

public HBaseParameters getHbaseParameters()
Returns:
HBaseParameters

public org.apache.hadoop.hbase.client.HTable getHTable()

protected void initializeCrawlTable(org.apache.hadoop.conf.Configuration hbaseConfiguration,
String hbaseTableName)
throws IOException
Parameters:
hbaseConfiguration - the HBase configuration
hbaseTableName - the table
Throws:
IOException - Signals that an I/O exception has occurred.

public static byte[] getByteArrayFromInputStream(org.archive.io.ReplayInputStream replayInputStream,
int streamSize)
throws IOException
Parameters:
replayInputStream - the cell data as a replay input stream
streamSize - the size
Throws:
IOException - Signals that an I/O exception has occurred.

public static String createRowKeyFromUrl(String url)
Parameters:
put - the stateful put object containing all the row data to be written.
replayInputStream - the replay input stream containing the raw content fetched by the Heritrix crawler.
streamSize - the stream size
Throws:
IOException - Signals that an I/O exception has occurred.

public void write(HBaseWriterProcessor hBaseWriterProcessor, org.archive.modules.CrawlURI curi, String ip, org.archive.io.RecordingOutputStream recordingOutputStream, org.archive.io.RecordingInputStream recordingInputStream, long recordedSize) throws IOException
Parameters:
curi - URI of crawled document
ip - IP of remote machine.
recordingOutputStream - recording output stream that captured the GET request
recordingInputStream - recording input stream that captured the response
Throws:
IOException - Signals that an I/O exception has occurred.

public byte[] serialize(byte[] bytes)
Specified by:
serialize in interface Serializer
Parameters:
bytes - the bytes

Copyright © 2007–2014. All rights reserved.
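
The serialize(byte[] bytes) hook documented above is described as the place to serialize bytes in a custom manner. As a minimal sketch of what such a customization might look like, the example below GZIP-compresses the payload. The interface ByteSerializer is a hypothetical stand-in that only mirrors the documented signature so the example compiles without the library, and GzipSerializerSketch and gunzip are likewise illustrative names. Whether compression is appropriate depends on how the stored bytes are read back, which this page does not cover.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-in that mirrors the documented signature
// byte[] serialize(byte[] bytes); it is not the library's Serializer.
interface ByteSerializer {
    byte[] serialize(byte[] bytes);
}

// Illustrative custom serializer: GZIP-compresses the given bytes.
class GzipSerializerSketch implements ByteSerializer {

    @Override
    public byte[] serialize(byte[] bytes) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the GZIP stream writes the trailer, so read the buffer afterwards.
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(bytes);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Helper for the reading side: reverses serialize().
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = in.read(chunk)) != -1) {
                out.write(chunk, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] original = "<html>crawled content</html>".getBytes();
        byte[] packed = new GzipSerializerSketch().serialize(original);
        System.out.println(new String(gunzip(packed)));
    }
}
```
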