java.lang.Object
  org.archive.io.hbase.HBaseWriter
public class HBaseWriter
HBase implementation.
| Constructor Summary | |
|---|---|
HBaseWriter(String zkQuorum,
int zkClientPort,
String tableName,
HBaseParameters parameters)
Instantiates a new HBaseWriter for the WriterPool to use in Heritrix. |
|
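
The constructor takes a ZooKeeper quorum and client port. As a rough illustration of what those two values mean on the HBase side, the sketch below maps them onto the standard HBase client settings `hbase.zookeeper.quorum` and `hbase.zookeeper.property.clientPort`. The class `ZkSettingsSketch` and its method `zkSettings` are hypothetical names used only for this example, and a plain `java.util.Properties` stands in for a Hadoop `Configuration`; this is not how HBaseWriter itself is implemented.

```java
import java.util.Properties;

public class ZkSettingsSketch {

    // Hypothetical helper: collects the two ZooKeeper values under the
    // standard HBase client property names. A Properties object is used
    // here as a stand-in for org.apache.hadoop.conf.Configuration.
    static Properties zkSettings(String zkQuorum, int zkClientPort) {
        Properties settings = new Properties();
        settings.setProperty("hbase.zookeeper.quorum", zkQuorum);
        settings.setProperty("hbase.zookeeper.property.clientPort",
                Integer.toString(zkClientPort));
        return settings;
    }

    public static void main(String[] args) {
        // Example values only; 2181 is the conventional ZooKeeper client port.
        Properties settings = zkSettings("zkHost1,zkHost2,zkHost3", 2181);
        settings.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```
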
| Method Summary | |
|---|---|
protected byte[] |
getByteArrayFromInputStream(org.archive.io.ReplayInputStream replayInputStream,
int streamSize)
Read the ReplayInputStream and write it to the given BatchUpdate with the given column. |
org.apache.hadoop.hbase.client.HTable |
getClient()
Gets the HTable client. |
HBaseParameters |
getHbaseOptions()
|
protected void |
initializeCrawlTable(org.apache.hadoop.conf.Configuration hbaseConfiguration,
String hbaseTableName)
Creates the crawl table in HBase. |
protected void |
processContent(org.apache.hadoop.hbase.client.Put put,
org.archive.io.ReplayInputStream replayInputStream,
int streamSize)
Stub method provided as an extension point: override it to add custom content parsing, manipulate data, or populate new columns. |
void |
write(org.archive.modules.CrawlURI curi,
String ip,
org.archive.io.RecordingOutputStream recordingOutputStream,
org.archive.io.RecordingInputStream recordingInputStream)
Write the crawled output to the configured HBase table. |
| Methods inherited from class java.lang.Object |
|---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
| Constructor Detail |
|---|
public HBaseWriter(String zkQuorum,
int zkClientPort,
String tableName,
HBaseParameters parameters)
throws IOException
zkQuorum - the ZooKeeper quorum: the list of hosts that make up your ZooKeeper quorum, e.g. zkHost1,zkHost2,zkHost3
zkClientPort - the ZooKeeper client port that clients should connect to on the servers in the ZooKeeper quorum. This value is analogous to the hbase-site.xml config parameter hbase.zookeeper.property.clientPort
tableName - the table in HBase to write to, e.g. webtable
parameters - an HBaseParameters object holding the list of parameters
IOException - Signals that an I/O exception has occurred.
| Method Detail |
|---|
public HBaseParameters getHbaseOptions()
HBaseParameters
public org.apache.hadoop.hbase.client.HTable getClient()
protected void initializeCrawlTable(org.apache.hadoop.conf.Configuration hbaseConfiguration,
String hbaseTableName)
throws IOException
hbaseConfiguration - the HBase configuration
hbaseTableName - the table
IOException - Signals that an I/O exception has occurred.
protected byte[] getByteArrayFromInputStream(org.archive.io.ReplayInputStream replayInputStream,
int streamSize)
throws IOException
replayInputStream - the cell data as a replay input stream
streamSize - the size
IOException - Signals that an I/O exception has occurred.
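
A method with this shape (a stream plus a known size in, a byte array out) is commonly written as a read loop that keeps reading until the expected number of bytes has arrived. The sketch below shows that pattern under stated assumptions: a plain `java.io.InputStream` stands in for `ReplayInputStream`, and the names `StreamToBytesSketch` and `readFully` are hypothetical. It is not taken from the HBaseWriter source, whose actual behavior (for example on a short stream) may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamToBytesSketch {

    // Reads exactly streamSize bytes from the stream into a new array.
    // A single read() call may return fewer bytes than requested, so the
    // loop continues until the buffer is full or the stream ends early.
    static byte[] readFully(InputStream in, int streamSize) throws IOException {
        byte[] buffer = new byte[streamSize];
        int offset = 0;
        while (offset < streamSize) {
            int read = in.read(buffer, offset, streamSize - offset);
            if (read == -1) {
                throw new EOFException("stream ended after " + offset
                        + " of " + streamSize + " bytes");
            }
            offset += read;
        }
        return buffer;
    }

    public static void main(String[] args) throws IOException {
        byte[] content = "<html>example</html>".getBytes(StandardCharsets.UTF_8);
        byte[] copy = readFully(new ByteArrayInputStream(content), content.length);
        System.out.println(new String(copy, StandardCharsets.UTF_8));
    }
}
```
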
protected void processContent(org.apache.hadoop.hbase.client.Put put,
org.archive.io.ReplayInputStream replayInputStream,
int streamSize)
throws IOException
put - the stateful put object containing all the row data to be written.
replayInputStream - the replay input stream containing the raw content fetched by the Heritrix crawler.
streamSize - the stream size
IOException - Signals that an I/O exception has occurred.
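
Because processContent is described as a stub intended for overriding, a subclass can use it to add columns of its own before the row is written. The sketch below illustrates only that extension pattern and runs without HBase or Heritrix on the classpath: `BaseWriter` is a hypothetical stand-in for HBaseWriter, a `Map<String, byte[]>` stands in for the HBase `Put`, a plain `InputStream` stands in for `ReplayInputStream`, and the column names `content:raw_data` and `meta:length` are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class ProcessContentSketch {

    // Hypothetical stand-in for the writer: builds a row, then calls the hook.
    static class BaseWriter {
        Map<String, byte[]> write(byte[] content) throws IOException {
            Map<String, byte[]> row = new LinkedHashMap<>(); // stand-in for Put
            row.put("content:raw_data", content);
            processContent(row, new ByteArrayInputStream(content), content.length);
            return row;
        }

        // Stub hook: does nothing by default, subclasses may override it.
        protected void processContent(Map<String, byte[]> row, InputStream in,
                                      int streamSize) throws IOException {
        }
    }

    // Subclass that populates one extra column from the stream size.
    static class LengthColumnWriter extends BaseWriter {
        @Override
        protected void processContent(Map<String, byte[]> row, InputStream in,
                                      int streamSize) throws IOException {
            row.put("meta:length",
                    Integer.toString(streamSize).getBytes(StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] content = "abc".getBytes(StandardCharsets.UTF_8);
        Map<String, byte[]> row = new LengthColumnWriter().write(content);
        System.out.println(row.keySet());
    }
}
```

With the real class, the override would have the signature shown above (`Put`, `ReplayInputStream`, `int`) and would add columns to the `Put` instead of a map.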
public void write(org.archive.modules.CrawlURI curi,
String ip,
org.archive.io.RecordingOutputStream recordingOutputStream,
org.archive.io.RecordingInputStream recordingInputStream)
throws IOException
curi - URI of the crawled document.
ip - IP of the remote machine.
recordingOutputStream - recording output stream that captured the GET request
recordingInputStream - recording input stream that captured the response
IOException - Signals that an I/O exception has occurred.