/**
 *
		  GNU LESSER GENERAL PUBLIC LICENSE
		       Version 2.1, February 1999

 Copyright (C) 1991, 1999 Free Software Foundation, Inc.
     59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
 Everyone is permitted to copy and distribute verbatim copies
 of this license document, but changing it is not allowed.

[This is the first released version of the Lesser GPL.  It also counts
 as the successor of the GNU Library Public License, version 2, hence
 the version number 2.1.]

			    Preamble

  The licenses for most software are designed to take away your
freedom to share and change it.  By contrast, the GNU General Public
Licenses are intended to guarantee your freedom to share and change
free software--to make sure the software is free for all its users.

  This license, the Lesser General Public License, applies to some
specially designated software packages--typically libraries--of the
Free Software Foundation and other authors who decide to use it.  You
can use it too, but we suggest you first think carefully about whether
this license or the ordinary General Public License is the better
strategy to use in any particular case, based on the explanations below.

  When we speak of free software, we are referring to freedom of use,
not price.  Our General Public Licenses are designed to make sure that
you have the freedom to distribute copies of free software (and charge
for this service if you wish); that you receive source code or can get
it if you want it; that you can change the software and use pieces of
it in new free programs; and that you are informed that you can do
these things.

  To protect your rights, we need to make restrictions that forbid
distributors to deny you these rights or to ask you to surrender these
rights.  These restrictions translate to certain responsibilities for
you if you distribute copies of the library or if you modify it.

  For example, if you distribute copies of the library, whether gratis
or for a fee, you must give the recipients all the rights that we gave
you.  You must make sure that they, too, receive or can get the source
code.  If you link other code with the library, you must provide
complete object files to the recipients, so that they can relink them
with the library after making changes to the library and recompiling
it.  And you must show them these terms so they know their rights.

  We protect your rights with a two-step method: (1) we copyright the
library, and (2) we offer you this license, which gives you legal
permission to copy, distribute and/or modify the library.

  To protect each distributor, we want to make it very clear that
there is no warranty for the free library.  Also, if the library is
modified by someone else and passed on, the recipients should know
that what they have is not the original version, so that the original
author's reputation will not be affected by problems that might be
introduced by others.

  Finally, software patents pose a constant threat to the existence of
any free program.  We wish to make sure that a company cannot
effectively restrict the users of a free program by obtaining a
restrictive license from a patent holder.  Therefore, we insist that
any patent license obtained for a version of the library must be
consistent with the full freedom of use specified in this license.

  Most GNU software, including some libraries, is covered by the
ordinary GNU General Public License.  This license, the GNU Lesser
General Public License, applies to certain designated libraries, and
is quite different from the ordinary General Public License.  We use
this license for certain libraries in order to permit linking those
libraries into non-free programs.

  When a program is linked with a library, whether statically or using
a shared library, the combination of the two is legally speaking a
combined work, a derivative of the original library.  The ordinary
General Public License therefore permits such linking only if the
entire combination fits its criteria of freedom.  The Lesser General
Public License permits more lax criteria for linking other code with
the library.

  We call this license the "Lesser" General Public License because it
does Less to protect the user's freedom than the ordinary General
Public License.  It also provides other free software developers Less
of an advantage over competing non-free programs.  These disadvantages
are the reason we use the ordinary General Public License for many
libraries.  However, the Lesser license provides advantages in certain
special circumstances.

  For example, on rare occasions, there may be a special need to
encourage the widest possible use of a certain library, so that it becomes
a de-facto standard.  To achieve this, non-free programs must be
allowed to use the library.  A more frequent case is that a free
library does the same job as widely used non-free libraries.  In this
case, there is little to gain by limiting the free library to free
software only, so we use the Lesser General Public License.

  In other cases, permission to use a particular library in non-free
programs enables a greater number of people to use a large body of
free software.  For example, permission to use the GNU C Library in
non-free programs enables many more people to use the whole GNU
operating system, as well as its variant, the GNU/Linux operating
system.

  Although the Lesser General Public License is Less protective of the
users' freedom, it does ensure that the user of a program that is
linked with the Library has the freedom and the wherewithal to run
that program using a modified version of the Library.

  The precise terms and conditions for copying, distribution and
modification follow.  Pay close attention to the difference between a
"work based on the library" and a "work that uses the library".  The
former contains code derived from the library, whereas the latter must
be combined with the library in order to run.

		  GNU LESSER GENERAL PUBLIC LICENSE
   TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION

  0. This License Agreement applies to any software library or other
program which contains a notice placed by the copyright holder or
other authorized party saying it may be distributed under the terms of
this Lesser General Public License (also called "this License").
Each licensee is addressed as "you".

  A "library" means a collection of software functions and/or data
prepared so as to be conveniently linked with application programs
(which use some of those functions and data) to form executables.

  The "Library", below, refers to any such software library or work
which has been distributed under these terms.  A "work based on the
Library" means either the Library or any derivative work under
copyright law: that is to say, a work containing the Library or a
portion of it, either verbatim or with modifications and/or translated
straightforwardly into another language.  (Hereinafter, translation is
included without limitation in the term "modification".)

  "Source code" for a work means the preferred form of the work for
making modifications to it.  For a library, complete source code means
all the source code for all modules it contains, plus any associated
interface definition files, plus the scripts used to control compilation
and installation of the library.

  Activities other than copying, distribution and modification are not
covered by this License; they are outside its scope.  The act of
running a program using the Library is not restricted, and output from
such a program is covered only if its contents constitute a work based
on the Library (independent of the use of the Library in a tool for
writing it).  Whether that is true depends on what the Library does
and what the program that uses the Library does.

  1. You may copy and distribute verbatim copies of the Library's
complete source code as you receive it, in any medium, provided that
you conspicuously and appropriately publish on each copy an
appropriate copyright notice and disclaimer of warranty; keep intact
all the notices that refer to this License and to the absence of any
warranty; and distribute a copy of this License along with the
Library.

  You may charge a fee for the physical act of transferring a copy,
and you may at your option offer warranty protection in exchange for a
fee.

  2. You may modify your copy or copies of the Library or any portion
of it, thus forming a work based on the Library, and copy and
distribute such modifications or work under the terms of Section 1
above, provided that you also meet all of these conditions:

    a) The modified work must itself be a software library.

    b) You must cause the files modified to carry prominent notices
    stating that you changed the files and the date of any change.

    c) You must cause the whole of the work to be licensed at no
    charge to all third parties under the terms of this License.

    d) If a facility in the modified Library refers to a function or a
    table of data to be supplied by an application program that uses
    the facility, other than as an argument passed when the facility
    is invoked, then you must make a good faith effort to ensure that,
    in the event an application does not supply such function or
    table, the facility still operates, and performs whatever part of
    its purpose remains meaningful.

    (For example, a function in a library to compute square roots has
    a purpose that is entirely well-defined independent of the
    application.  Therefore, Subsection 2d requires that any
    application-supplied function or table used by this function must
    be optional: if the application does not supply it, the square
    root function must still compute square roots.)

These requirements apply to the modified work as a whole.  If
identifiable sections of that work are not derived from the Library,
and can be reasonably considered independent and separate works in
themselves, then this License, and its terms, do not apply to those
sections when you distribute them as separate works.  But when you
distribute the same sections as part of a whole which is a work based
on the Library, the distribution of the whole must be on the terms of
this License, whose permissions for other licensees extend to the
entire whole, and thus to each and every part regardless of who wrote
it.

Thus, it is not the intent of this section to claim rights or contest
your rights to work written entirely by you; rather, the intent is to
exercise the right to control the distribution of derivative or
collective works based on the Library.

In addition, mere aggregation of another work not based on the Library
with the Library (or with a work based on the Library) on a volume of
a storage or distribution medium does not bring the other work under
the scope of this License.

  3. You may opt to apply the terms of the ordinary GNU General Public
License instead of this License to a given copy of the Library.  To do
this, you must alter all the notices that refer to this License, so
that they refer to the ordinary GNU General Public License, version 2,
instead of to this License.  (If a newer version than version 2 of the
ordinary GNU General Public License has appeared, then you can specify
that version instead if you wish.)  Do not make any other change in
these notices.

  Once this change is made in a given copy, it is irreversible for
that copy, so the ordinary GNU General Public License applies to all
subsequent copies and derivative works made from that copy.

  This option is useful when you wish to copy part of the code of
the Library into a program that is not a library.

  4. You may copy and distribute the Library (or a portion or
derivative of it, under Section 2) in object code or executable form
under the terms of Sections 1 and 2 above provided that you accompany
it with the complete corresponding machine-readable source code, which
must be distributed under the terms of Sections 1 and 2 above on a
medium customarily used for software interchange.

  If distribution of object code is made by offering access to copy
from a designated place, then offering equivalent access to copy the
source code from the same place satisfies the requirement to
distribute the source code, even though third parties are not
compelled to copy the source along with the object code.

  5. A program that contains no derivative of any portion of the
Library, but is designed to work with the Library by being compiled or
linked with it, is called a "work that uses the Library".  Such a
work, in isolation, is not a derivative work of the Library, and
therefore falls outside the scope of this License.

  However, linking a "work that uses the Library" with the Library
creates an executable that is a derivative of the Library (because it
contains portions of the Library), rather than a "work that uses the
library".  The executable is therefore covered by this License.
Section 6 states terms for distribution of such executables.

  When a "work that uses the Library" uses material from a header file
that is part of the Library, the object code for the work may be a
derivative work of the Library even though the source code is not.
Whether this is true is especially significant if the work can be
linked without the Library, or if the work is itself a library.  The
threshold for this to be true is not precisely defined by law.

  If such an object file uses only numerical parameters, data
structure layouts and accessors, and small macros and small inline
functions (ten lines or less in length), then the use of the object
file is unrestricted, regardless of whether it is legally a derivative
work.  (Executables containing this object code plus portions of the
Library will still fall under Section 6.)

  Otherwise, if the work is a derivative of the Library, you may
distribute the object code for the work under the terms of Section 6.
Any executables containing that work also fall under Section 6,
whether or not they are linked directly with the Library itself.

  6. As an exception to the Sections above, you may also combine or
link a "work that uses the Library" with the Library to produce a
work containing portions of the Library, and distribute that work
under terms of your choice, provided that the terms permit
modification of the work for the customer's own use and reverse
engineering for debugging such modifications.

  You must give prominent notice with each copy of the work that the
Library is used in it and that the Library and its use are covered by
this License.  You must supply a copy of this License.  If the work
during execution displays copyright notices, you must include the
copyright notice for the Library among them, as well as a reference
directing the user to the copy of this License.  Also, you must do one
of these things:

    a) Accompany the work with the complete corresponding
    machine-readable source code for the Library including whatever
    changes were used in the work (which must be distributed under
    Sections 1 and 2 above); and, if the work is an executable linked
    with the Library, with the complete machine-readable "work that
    uses the Library", as object code and/or source code, so that the
    user can modify the Library and then relink to produce a modified
    executable containing the modified Library.  (It is understood
    that the user who changes the contents of definitions files in the
    Library will not necessarily be able to recompile the application
    to use the modified definitions.)

    b) Use a suitable shared library mechanism for linking with the
    Library.  A suitable mechanism is one that (1) uses at run time a
    copy of the library already present on the user's computer system,
    rather than copying library functions into the executable, and (2)
    will operate properly with a modified version of the library, if
    the user installs one, as long as the modified version is
    interface-compatible with the version that the work was made with.

    c) Accompany the work with a written offer, valid for at
    least three years, to give the same user the materials
    specified in Subsection 6a, above, for a charge no more
    than the cost of performing this distribution.

    d) If distribution of the work is made by offering access to copy
    from a designated place, offer equivalent access to copy the above
    specified materials from the same place.

    e) Verify that the user has already received a copy of these
    materials or that you have already sent this user a copy.

  For an executable, the required form of the "work that uses the
Library" must include any data and utility programs needed for
reproducing the executable from it.  However, as a special exception,
the materials to be distributed need not include anything that is
normally distributed (in either source or binary form) with the major
components (compiler, kernel, and so on) of the operating system on
which the executable runs, unless that component itself accompanies
the executable.

  It may happen that this requirement contradicts the license
restrictions of other proprietary libraries that do not normally
accompany the operating system.  Such a contradiction means you cannot
use both them and the Library together in an executable that you
distribute.

  7. You may place library facilities that are a work based on the
Library side-by-side in a single library together with other library
facilities not covered by this License, and distribute such a combined
library, provided that the separate distribution of the work based on
the Library and of the other library facilities is otherwise
permitted, and provided that you do these two things:

    a) Accompany the combined library with a copy of the same work
    based on the Library, uncombined with any other library
    facilities.  This must be distributed under the terms of the
    Sections above.

    b) Give prominent notice with the combined library of the fact
    that part of it is a work based on the Library, and explaining
    where to find the accompanying uncombined form of the same work.

  8. You may not copy, modify, sublicense, link with, or distribute
the Library except as expressly provided under this License.  Any
attempt otherwise to copy, modify, sublicense, link with, or
distribute the Library is void, and will automatically terminate your
rights under this License.  However, parties who have received copies,
or rights, from you under this License will not have their licenses
terminated so long as such parties remain in full compliance.

  9. You are not required to accept this License, since you have not
signed it.  However, nothing else grants you permission to modify or
distribute the Library or its derivative works.  These actions are
prohibited by law if you do not accept this License.  Therefore, by
modifying or distributing the Library (or any work based on the
Library), you indicate your acceptance of this License to do so, and
all its terms and conditions for copying, distributing or modifying
the Library or works based on it.

  10. Each time you redistribute the Library (or any work based on the
Library), the recipient automatically receives a license from the
original licensor to copy, distribute, link with or modify the Library
subject to these terms and conditions.  You may not impose any further
restrictions on the recipients' exercise of the rights granted herein.
You are not responsible for enforcing compliance by third parties with
this License.

  11. If, as a consequence of a court judgment or allegation of patent
infringement or for any other reason (not limited to patent issues),
conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License.  If you cannot
distribute so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you
may not distribute the Library at all.  For example, if a patent
license would not permit royalty-free redistribution of the Library by
all those who receive copies directly or indirectly through you, then
the only way you could satisfy both it and this License would be to
refrain entirely from distribution of the Library.

If any portion of this section is held invalid or unenforceable under any
particular circumstance, the balance of the section is intended to apply,
and the section as a whole is intended to apply in other circumstances.

It is not the purpose of this section to induce you to infringe any
patents or other property right claims or to contest validity of any
such claims; this section has the sole purpose of protecting the
integrity of the free software distribution system which is
implemented by public license practices.  Many people have made
generous contributions to the wide range of software distributed
through that system in reliance on consistent application of that
system; it is up to the author/donor to decide if he or she is willing
to distribute software through any other system and a licensee cannot
impose that choice.

This section is intended to make thoroughly clear what is believed to
be a consequence of the rest of this License.

  12. If the distribution and/or use of the Library is restricted in
certain countries either by patents or by copyrighted interfaces, the
original copyright holder who places the Library under this License may add
an explicit geographical distribution limitation excluding those countries,
so that distribution is permitted only in or among countries not thus
excluded.  In such case, this License incorporates the limitation as if
written in the body of this License.

  13. The Free Software Foundation may publish revised and/or new
versions of the Lesser General Public License from time to time.
Such new versions will be similar in spirit to the present version,
but may differ in detail to address new problems or concerns.

Each version is given a distinguishing version number.  If the Library
specifies a version number of this License which applies to it and
"any later version", you have the option of following the terms and
conditions either of that version or of any later version published by
the Free Software Foundation.  If the Library does not specify a
license version number, you may choose any version ever published by
the Free Software Foundation.

  14. If you wish to incorporate parts of the Library into other free
programs whose distribution conditions are incompatible with these,
write to the author to ask for permission.  For software which is
copyrighted by the Free Software Foundation, write to the Free
Software Foundation; we sometimes make exceptions for this.  Our
decision will be guided by the two goals of preserving the free status
of all derivatives of our free software and of promoting the sharing
and reuse of software generally.

			    NO WARRANTY

  15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO
WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW.
EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR
OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY
KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE.  THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE
LIBRARY IS WITH YOU.  SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME
THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

  16. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN
WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY
AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU
FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR
CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE
LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING
RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A
FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF
SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
DAMAGES.

		     END OF TERMS AND CONDITIONS

           How to Apply These Terms to Your New Libraries

  If you develop a new library, and you want it to be of the greatest
possible use to the public, we recommend making it free software that
everyone can redistribute and change.  You can do so by permitting
redistribution under these terms (or, alternatively, under the terms of the
ordinary General Public License).

  To apply these terms, attach the following notices to the library.  It is
safest to attach them to the start of each source file to most effectively
convey the exclusion of warranty; and each file should have at least the
"copyright" line and a pointer to where the full notice is found.

    <one line to give the library's name and a brief idea of what it does.>
    Copyright (C) <year>  <name of author>

    This library is free software; you can redistribute it and/or
    modify it under the terms of the GNU Lesser General Public
    License as published by the Free Software Foundation; either
    version 2.1 of the License, or (at your option) any later version.

    This library is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
    Lesser General Public License for more details.

    You should have received a copy of the GNU Lesser General Public
    License along with this library; if not, write to the Free Software
    Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA

Also add information on how to contact you by electronic and paper mail.

You should also get your employer (if you work as a programmer) or your
school, if any, to sign a "copyright disclaimer" for the library, if
necessary.  Here is a sample; alter the names:

  Yoyodyne, Inc., hereby disclaims all copyright interest in the
  library `Frob' (a library for tweaking knobs) written by James Random Hacker.

  <signature of Ty Coon>, 1 April 1990
  Ty Coon, President of Vice

That's all there is to it!
 */

package org.archive.modules.writer;

import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.hbase.util.Keying;
import org.apache.log4j.Logger;
import org.archive.io.ReplayInputStream;
import org.archive.io.hbase.HBaseParameters;
import org.archive.io.hbase.HBaseWriter;
import org.archive.io.hbase.HBaseWriterPool;
import org.archive.io.warc.WARCWriterPoolSettings;
import org.archive.modules.CrawlURI;
import org.archive.modules.ProcessResult;
import org.archive.spring.ConfigPath;
import org.archive.uid.RecordIDGenerator;
import org.archive.util.ArchiveUtils;

/**
 * A <a href="http://crawler.archive.org">Heritrix 3</a> processor that writes
 * to <a href="http://hbase.org">Hadoop HBase</a>.
 *
 * <p>
 * The following example shows how to wire this processor into the crawl job
 * configuration:
 *
 * <pre>
 * {@code
 * <bean id="hbaseParameterSettings" class="org.archive.io.hbase.HBaseParameters">
 * 	<!-- These settings are required -->
 * 	<property name="zkQuorum" value="localhost" />
 * 	<property name="hbaseTableName" value="crawl" />
 *
 * 	<!-- This should reflect your installation, but 2181 is the default -->
 * 	<property name="zkPort" value="2181" />
 *
 * 	<!-- All other settings are optional -->
 * 	<property name="onlyProcessNewRecords" value="false" />
 * 	<property name="onlyWriteNewRecords" value="false" />
 * 	<property name="contentColumnFamily" value="newcontent" />
 * 	<!-- 25 * 1024 * 1024 = 26214400 bytes -->
 * 	<property name="defaultMaxFileSizeInBytes" value="26214400" />
 * 	<!-- Override more options here -->
 * </bean>
 *
 * <bean id="hbaseWriterProcessor" class="org.archive.modules.writer.HBaseWriterProcessor">
 * 	<property name="hbaseParameters">
 * 		<ref bean="hbaseParameterSettings"/>
 * 	</property>
 * </bean>
 *
 * <bean id="dispositionProcessors" class="org.archive.modules.DispositionChain">
 * 	<property name="processors">
 * 		<list>
 * 			<ref bean="hbaseWriterProcessor"/>
 * 			<!-- other references -->
 * 		</list>
 * 	</property>
 * </bean>
 * }
 * </pre>
 *
 * @see org.archive.io.hbase.HBaseParameters
 */
public class HBaseWriterProcessor extends WriterPoolProcessor implements WARCWriterPoolSettings {

	/** Logger for this class. */
	private static final Logger log = Logger.getLogger(HBaseWriterProcessor.class);

	/** Serialization version identifier. */
	private static final long serialVersionUID = 7019522841438703184L;

	/** The HBase settings used by this processor; see {@link HBaseParameters}. */
	HBaseParameters hbaseParameters = null;

	/**
	 * Gets the HBase parameters, creating a default instance if none has been set.
	 *
	 * @return the HBase parameters, never {@code null}
	 */
	public synchronized HBaseParameters getHbaseParameters() {
		if (hbaseParameters == null) {
			this.hbaseParameters = new HBaseParameters();
		}
		return hbaseParameters;
	}

	/**
	 * Sets the HBase parameters.
	 *
	 * @param options the new HBase parameters
	 */
	public void setHbaseParameters(HBaseParameters options) {
		this.hbaseParameters = options;
	}

	/**
	 * Gets the default maximum file size in bytes.
	 *
	 * @return the configured maximum file size, or
	 *         {@link HBaseParameters#DEFAULT_MAX_FILE_SIZE_IN_BYTES} if no
	 *         parameters have been set
	 */
	@Override
	long getDefaultMaxFileSize() {
		if (hbaseParameters != null) {
			return hbaseParameters.getDefaultMaxFileSizeInBytes();
		}
		return HBaseParameters.DEFAULT_MAX_FILE_SIZE_IN_BYTES;
	}
622 
623 	/* (non-Javadoc)
624 	 * @see org.archive.modules.writer.WriterPoolProcessor#setupPool(java.util.concurrent.atomic.AtomicInteger)
625 	 */
626 	@Override
627 	protected void setupPool(AtomicInteger serial) {
628 		// allow the Heritrix WriterPoolProcessor framework to create new HBaseWriterPools as needed.
629 		setPool(new HBaseWriterPool(serial, this, getPoolMaxActive(), getMaxWaitForIdleMs(), getHbaseParameters()));
630 	}
631 
632 	/* (non-Javadoc)
633 	 * @see org.archive.modules.writer.WriterPoolProcessor#innerProcessResult(org.archive.modules.CrawlURI)
634 	 */
635 	@Override
636 	protected ProcessResult innerProcessResult(CrawlURI uri) {
637 		CrawlURI curi = uri;
638 		long recordLength = getRecordedSize(curi);
639 		ReplayInputStream replayInputStream = null;
640 		try {
641 			if (shouldWrite(curi)) {
642 				replayInputStream = curi.getRecorder().getRecordedInput().getReplayInputStream();
643 				return write(curi, recordLength, replayInputStream);
644 			}
645 			log.info("Does not write " + curi.toString());
646 		} catch (IOException e) {
647 			curi.getNonFatalFailures().add(e);
648 			log.error("Failed write of Record: " + curi.toString(), e);
649 		} finally {
650 			ArchiveUtils.closeQuietly(replayInputStream);
651 		}
652 		return ProcessResult.PROCEED;
653 	}
654 
655 	/*
656 	 * (non-Javadoc)
657 	 * 
658 	 * @see
659 	 * org.archive.modules.Processor#shouldProcess(org.archive.modules.ProcessorURI
660 	 * )
661 	 */
662 	@Override
663 	protected boolean shouldProcess(CrawlURI curi) {
664 		// The super method is still checked, but only continue with 
665 		// process checking if it returns true.  This way the super class
666 		// overrides our checking.
667 		if (!super.shouldProcess(curi)) {
668 			return false;
669 		}
670 
671 		// If onlyProcessNewRecords is enabled and the given rowkey has cell
672 		// data, then don't process the record.
673 		if (getHbaseParameters().isOnlyProcessNewRecords()) {
674 			try {
675 				return isRecordNew(curi);
676 			} catch (IOException e) {
677 				log.error("Failed to check whether record is new: " + curi.toString(), e);
678 			}
679 		}
680 
681 		// If we make it here, then we passed all our checks and we can assume
682 		// we should process the record.
683 		return true;
684 	}
685 
686 	/**
687 	 * Whether the given CrawlURI should be written to archive files. Annotates
688 	 * CrawlURI with a reason for any negative answer.
689 	 * 
690 	 * @param curi
691 	 *            CrawlURI
692 	 * 
693 	 * @return true if URI should be written; false otherwise
694 	 */
695 	@Override
696 	protected boolean shouldWrite(CrawlURI curi) {
697 		// The old method is still checked, but only continue with the next
698 		// checks if it returns true.
699 		if (!super.shouldWrite(curi)) {
700 			return false;
701 		}
702 
703 		// If the content exceeds the maximum file size, then don't write.
704 		if (curi.getContentSize() > getMaxFileSizeBytes()) {
705 			// content size is too large
706 			curi.getAnnotations().add(ANNOTATION_UNWRITTEN + ":size");
707 			log.warn("Content size for " + curi.getUURI() + " is too large (" + curi.getContentSize() + ") - maximum content size is: " + getMaxFileSizeBytes());
708 			return false;
709 		}
710 
711 		// If onlyWriteNewRecords is enabled and the given rowkey has cell data,
712 		// don't write the record.
713 		if (getHbaseParameters().isOnlyWriteNewRecords()) {
714 			try {
715 				return isRecordNew(curi);
716 			} catch (IOException e) {
717 				log.error("Failed to check whether record is new for rowKey: " + curi.toString() + " using pool: " + getPool().toString(), e);
718 			}
719 		}
720 		// all tests pass, return true to write the content to HBase.
721 		return true;
722 	}
723 
724 	/**
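725 	 * Determine whether the given URI is new, i.e. does not yet exist as a
726 	 * rowkey with cell data in the configured HBase table.
727 	 *
728 	 * @param curi the CrawlURI whose rowkey is looked up
729 	 * @return true if no row with cell data exists for the URI; false if one exists or the lookup fails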
730 	 * @throws IOException Signals that an I/O exception has occurred.
731 	 */
732 	private boolean isRecordNew(CrawlURI curi) throws IOException {
733 		// get the writer from the pool
734 		HBaseWriter hbaseWriter = (HBaseWriter) getPool().borrowFile();
735 		// get the client from the writer
736 		HTable hbaseTable = hbaseWriter.getClient();
737 		// Here we can generate the rowkey for this uri ...
738 		String url = curi.toString();
739 		String row = Keying.createKey(url);
740 		// Default is true since we check for conditions to determine if the row key already exists.
741 		boolean isNew = true;
742 		try {
743 			// and look it up once to see if it already has cell data...
744 			Get rowToGet = new Get(Bytes.toBytes(row));
745 			if (hbaseTable.exists(rowToGet)) {
746 				// if it exists, then it's not new
747 				if (log.isDebugEnabled()) {
748 					log.debug("Not A NEW Record - Url: " + url + " has the existing rowkey: " + row + " and has cell data.");
749 				}
750 				isNew = false;
751 			}
752 		} catch (IOException e) {
753 			log.error("Failed to determine if record: " + row + " is a new record due to an IOException. Treating the record as already existing for now.", e);
754 			isNew = false;
755 		} finally {
756 			// always return the client back to the pool no matter what
757 			try {
758 				getPool().returnFile(hbaseWriter);
759 			} catch (IOException e) {
760 				// and if it cannot be returned, log it as an error
761 				log.error("Failed to return writer to the pool after checking if a rowkey is new or existing, writerPoolMember: " + hbaseWriter, e);
762 				isNew = false;
763 			}
764 		}
765 		// Only report a new record when none of the checks above found an
766 		// existing row key or failed; otherwise isNew has been set to false.
767 		if (isNew && log.isDebugEnabled()) {
768 			log.debug("Found A NEW Record - Url: " + url + " has no existing rowkey: " + row );
769 		}
770 		return isNew;
771 	}
772 
773 	/**
774 	 * Write to HBase.
775 	 * 
776 	 * @param curi
777 	 *            the curi
778 	 * @param recordLength
779 	 *            the record length
780 	 * @param in
781 	 *            the in
782 	 * 
783 	 * @return the process result
784 	 * 
785 	 * @throws IOException
786 	 *             Signals that an I/O exception has occurred.
787 	 */
788 	protected ProcessResult write(final CrawlURI curi, long recordLength, InputStream in) throws IOException {
789 		// grab the writer from the pool
790 		HBaseWriter hbaseWriter = (HBaseWriter) getPool().borrowFile();
791 		// get the member position for logging Total Bytes Written
792 		long writerPoolMemberPosition = hbaseWriter.getPosition();
793 		try {
794 			// write the crawled data to hbase
795 			hbaseWriter.write(curi, getHostAddress(curi), curi.getRecorder().getRecordedOutput(), curi.getRecorder().getRecordedInput());
796 		} finally {
797 			// log total bytes written
798 			setTotalBytesWritten(getTotalBytesWritten() + (hbaseWriter.getPosition() - writerPoolMemberPosition));
799 			// return the hbaseWriter client back to the pool.
800 			getPool().returnFile(hbaseWriter);
801 		}
802 		// to alert heritrix what action to take next in the crawl
803 		return checkBytesWritten();
804 	}
805 
806 	/**
807 	 * Gets the default store paths.
808 	 *
809 	 * @return the default store paths
810 	 */
811 	@Override
812 	List<ConfigPath> getDefaultStorePaths() {
813 		return null;
814 	}
815 
816 	/* (non-Javadoc)
817 	 * @see org.archive.modules.writer.WriterPoolProcessor#getMetadata()
818 	 */
819 	@Override
820 	public List<String> getMetadata() {
821 		return null;
822 	}
823 
824 	/* (non-Javadoc)
825 	 * @see org.archive.io.warc.WARCWriterPoolSettings#getRecordIDGenerator()
826 	 */
827 	@Override
828 	public RecordIDGenerator getRecordIDGenerator() {
829 		return null;
830 	}
831 
832 }