Search inside Lucene in Action

Query parsed to: +index +pdf

1 - 19 of 19 results (Page 1 of 1)

2.1.1 : Conversion to text

starts on page 29 under section 2.1 (Understanding the indexing process) in chapter 2 (Indexing)

...'t always that simple. Suppose you need to index a set of manuals in PDF format. To prepare these manuals for indexing, you must first find a way to extract the textual information from the PDF ... To index data with Lucene, you must first convert it to a stream of plain-text tokens, the format that Lucene can digest. In chapter 1, we limited our examples to indexing and searching .txt files ... and Reader values. No methods would accept a PDF Java type, even if such a type existed. You face...

7.3.2 : Built-in Lucene support

starts on page 239 under section 7.3 (Indexing a PDF document) in chapter 7 (Parsing common document formats)

.../otis/PDFs /tmp/pdfindex Indexing PDF document: /home/otis/PDFs/Concurrency-j-jtp07233.pdf Indexing PDF document: /home/otis/PDFs/CoreJSTLAppendixA.pdf Indexing PDF document: /home/otis/PDFs/CoreJSTLChapter2.pdf Indexing PDF document: /home/otis/PDFs/CoreJSTLChapter5.pdf Indexing PDF document: /home/otis/PDFs/Google-Arch.pdf Indexing PDF document: /home/otis/PDFs/JavaCookbook-Chapter22-RMI.pdf Indexing PDF document: /home/otis/PDFs/JavaSockets.pdf Indexing PDF document: /home/otis/PDFs...

7.3 : Indexing a PDF document

starts on page 235 in chapter 7 (Parsing common document formats)

...Portable Document Format (PDF) is a document format invented by Adobe Sys- tems over a decade ago ... , hyperlinks, colors, and more. Today, PDF is widespread, and in some domains it's the dominant format ... declaration forms, product manuals, and so on most often come as PDF docu- ments. Even this book is available as PDF; Manning Publications sells chapters of most of its books electronically, allowing customers to buy individual chapters and immediately download them. If you've ever opened PDF documents...

1.5.4 : Document

starts on page 20 under section 1.5 (Understanding the core indexing classes) in chapter 1 (Meet Lucene)

... modified, and so on, are indexed and stored separately as fields of a document. NOTE When we refer to a document in this book, we mean a Microsoft Word, RTF, PDF, or other type of a document; we aren ... . Although various types of documents can be indexed and made searchable, processing them ... Java type. You'll learn more about handling nontext documents in chapter 7. In our Indexer, we're concerned with indexing text files. So, for each text file we find, we create a new instance...

1.2.2 : What Lucene can do for you

starts on page 7 under section 1.2 (Understanding Lucene) in chapter 1 (Meet Lucene)

...Lucene allows you to add indexing and searching capabilities to your applications (these functions are described in section 1.3). Lucene can index and make search- able any data that can be converted ... as you can convert it to text. This means you can use Lucene to index and search data stored ... , Microsoft Word documents, HTML or PDF files, or any other format from which you can extract textual information. Similarly, with Lucene's help you can index data stored in your databases, giv- ing your...

1.5.5 : Field

starts on page 20 under section 1.5 (Understanding the core indexing classes) in chapter 1 (Meet Lucene)

...Each Document in an index contains one or more named fields, embodied in a class called Field. Each field corresponds to a piece of data that is either queried against or retrieved from the index ... --Isn't analyzed, but is indexed and stored in the index verbatim. This type is suitable for fields whose ... path in Indexer (listing 1.1) as a Keyword field. UnIndexed--Is neither analyzed nor indexed, but its value is stored in the index as is. This type is suitable for fields that you need to display...
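The difference between the two field kinds named in this excerpt can be pictured with a small plain-Java model. This is only a sketch: `SketchField`, `matches`, and `stored` are hypothetical names made up for illustration and are not Lucene's `Field` API, and the path and summary values are invented sample data. It assumes nothing beyond what the excerpt says: a Keyword field is indexed and stored verbatim, and an UnIndexed field is stored but not indexed.

```java
import java.util.Arrays;
import java.util.List;

public class FieldKindsSketch {
    /** Hypothetical stand-in for a field; NOT Lucene's Field class. */
    static final class SketchField {
        final String name;
        final String value;
        final boolean indexed; // can a query match on this field?
        final boolean stored;  // can the value be read back from a hit?

        SketchField(String name, String value, boolean indexed, boolean stored) {
            this.name = name;
            this.value = value;
            this.indexed = indexed;
            this.stored = stored;
        }

        // Keyword: not analyzed, but indexed and stored verbatim.
        static SketchField keyword(String name, String value) {
            return new SketchField(name, value, true, true);
        }

        // UnIndexed: neither analyzed nor indexed, only stored.
        static SketchField unIndexed(String name, String value) {
            return new SketchField(name, value, false, true);
        }
    }

    /** Only indexed fields can match; neither kind is analyzed, so the match is verbatim. */
    static boolean matches(List<SketchField> doc, String name, String exactValue) {
        for (SketchField f : doc) {
            if (f.indexed && f.name.equals(name) && f.value.equals(exactValue)) {
                return true;
            }
        }
        return false;
    }

    /** Returns the stored value for display, or null if the field is absent or not stored. */
    static String stored(List<SketchField> doc, String name) {
        for (SketchField f : doc) {
            if (f.stored && f.name.equals(name)) {
                return f.value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<SketchField> doc = Arrays.asList(
                SketchField.keyword("path", "/tmp/manual.txt"),        // invented sample value
                SketchField.unIndexed("summary", "Product manual"));   // invented sample value

        System.out.println(matches(doc, "path", "/tmp/manual.txt"));   // true
        System.out.println(matches(doc, "summary", "Product manual")); // false
        System.out.println(stored(doc, "summary"));                    // Product manual
    }
}
```

In this model the path can be searched by its exact value, while the summary can only be displayed once a document has been found some other way.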

7.3.1 : Extracting text and indexing using PDFBox

starts on page 236 under section 7.3 (Indexing a PDF document) in chapter 7 (Parsing common document formats)

... textual content from a PDF document, as well as document meta-data, and create a Lucene Document suitable for indexing. Listing 7.5 DocumentHandler using the PDFBox library to extract text from PDF ... .pdfbox.org/. There are several free tools capable of extracting text from PDF files; we chose PDFBox ... DocumentHandlerException( "Cannot parse PDF document", e); } // decrypt the PDF document, if it is encrypted ... e) { closeCOSDocument(cosDoc); throw new DocumentHandlerException( "Cannot decrypt PDF document", e); } catch...

1.7.1 : IR libraries

starts on page 24 under section 1.7 (Review of alternate search products) in chapter 1 (Meet Lucene)

.... Egothor A full-text indexing and searching Java library, Egothor uses core algorithms that are very ... ready-to-use applications, such as a web crawler called Capek, a file indexer with a Swing GUI, and more. It also provides parsers for several rich-text document formats, such as PDF and Microsoft Word documents ... indexer and document parsers are similar to the small document parsing and indexing framework presented ... project is comparable to Lucene in most aspects. If you have yet to choose a full-text indexing...

7.0 : Parsing common document formats

starts on page 223

...This chapter covers Parsing XML using the SAX 2.0 API and Jakarta Commons Digester Parsing PDF ... a document indexing framework and application So far in this book, we have covered various aspects ... with PDF, Microsoft Word, or Excel documents. The World-Wide Web typically contains data in HTML ... rich-text documents like these? Yes, you can! Although Lucene doesn't include tools to automatically index ... to extract the textual data from rich media. Once extracted, you can index the data with Lucene...

7.9.1 : Document-management systems and services

starts on page 264 under section 7.9 (Other text-extraction tools) in chapter 7 (Parsing common document formats)

...In addition to individual libraries that you can use to implement document pars- ing and indexing ... do that--and, interestingly enough, rely on Lucene to handle document indexing: DocSearcher (http://www.brownsite.net ... and POI Apache APIs as well as the Open Source PDF Box API to provide searching capabilities for HTML, MS Word, MS Excel, RTF, PDF, Open Office (and Star Office) documents, and text documents." Docco (http://tockit.sourceforge.net/docco/index.html) is a small, personal document management system built...

10.4.1 : The system architecture

starts on page 347 under section 10.4 (Competitive intelligence with Lucene in XtraMind's XM-InformationMinder™) in chapter 10 (Case studies)

... parts is based upon the functionalities pro- vided by Lucene, with each employing its own index ... of the information that can be found in the Lucene index for two specific reasons: Failure recovery--If the index somehow becomes corrupted (for example, through disk failure), it can easily and quickly ... have to search its whole index for the document with the identifier stored in one of the document ... . 2 The agent performs the crawling process and fetches all relevant web pages, PDF, Word, Rich Text, and other...

7.9 : Other text-extraction tools

starts on page 264 in chapter 7 (Parsing common document formats)

...In this chapter, we've presented text extraction from, and indexing of, the most common document formats. We chose tools that are the most popular among developers, tools that are still being developed (or at least maintained), and tools that are easy to use. All libraries that we've presented ... Tool Where to download PDF Xpdf http://www.foolabs.com/xpdf/ JPedal http://www.jpedal.org/ Etymon PJ http://www.etymon.com/ PDF Text Stream http://snowtide.com/home/PDFTextStream Multivalent http...

7.10 : Summary

starts on page 265 in chapter 7 (Parsing common document formats)

... type of data that can be con- verted to text can be indexed and made searchable with Lucene. If you can extract textual data from sound or graphics files, you can index those, too. As a matter of fact, section 10.6 describes one interesting approach to indexing JPEG images. We used a number ... and NekoHTML for HTML, PDFBox for PDF, and POI and TextMining.org extractors for Microsoft Word documents ... frame- work capable of recursively parsing and indexing a file system. What you've learned...

7.8.5 : FileIndexer drawbacks, and how to extend the framework

starts on page 263 under section 7.8 (Creating a document-handling framework) in chapter 7 (Parsing common document formats)

... the following types of input: XML PDF HTML Microsoft Word RTF Plain text So, what do you do if you need to index and make searchable files of a type that our framework doesn't handle? You extend...

What's up with the hyphens in some of the search results?

This is an artifact of how the book content was indexed (a text version of the PDF was processed, including the words split across lines). These split words are, however, searchable! There is a fair bit of analysis trickery going on to piece this stuff back together during indexing, but the stored content still contains the hyphens.
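To give a feel for the idea, here is a minimal plain-Java sketch of rejoining split words before indexing while leaving the stored text alone. It is not the analysis code used for this search; the class name, the method name, and the regular expression are assumptions made for illustration.

```java
import java.util.regex.Pattern;

public class Dehyphenate {
    // A run of letters ending in "-", then whitespace, then a lowercase continuation.
    private static final Pattern SPLIT_WORD =
            Pattern.compile("(\\p{L}+)-\\s+(\\p{Ll}+)");

    /** Returns the text to hand to the indexer; the caller keeps the original for storage. */
    static String forIndexing(String stored) {
        return SPLIT_WORD.matcher(stored).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        String stored = "invented by Adobe Sys- tems; most often come as PDF docu- ments";
        System.out.println(forIndexing(stored));
        // prints: invented by Adobe Systems; most often come as PDF documents
    }
}
```

A rule this simple also joins genuine hyphens that happen to be followed by a space (for example "pre- and" would become "preand"), so a real implementation would need something smarter, such as checking the joined word against a dictionary.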

7.2.2 : Parsing and indexing using Digester

starts on page 230 under section 7.2 (Indexing XML) in chapter 7 (Parsing common document formats)

... the next popular format: PDF....

SearchBlox J2EE Search Component Version 2.1 released

From a lucene-user e-mail list announcement:

SearchBlox is a J2EE Search Component that delivers out-of-the-box search functionality for fast and easy implementation with your websites, applications, intranets and portals. SearchBlox uses the Lucene Search API and incorporates integrated HTTP/HTTPS and File System crawlers, support for various document formats including HTML, Word, PDF, PowerPoint and Excel, support for indexing and searching content in 18 languages and fully customizable search results, all controlled from a browser-based Admin Console.

Main features in this release:
==============================
- Support for Disk-based search index. Administrators can now choose where the index is held during operations, In-Memory or On-Disk
- Preset filter: a pre-defined search query that will be automatically added to the end-user's search query
- Indexing performance and stability improvements
- Bug fixes

SearchBlox is available as a Web Archive (WAR) and is deployable on any Servlet 2.3/JSP 1.2 compliant server. SearchBlox is also available as SearchBlox Server. The Server is an integrated application incorporating everything you need to run SearchBlox. The Server includes SearchBlox J2EE Component, the Jetty Application Server and the Java Runtime Environment (JRE) 1.4. With the SearchBlox Server, there are no additional software requirements to deploy SearchBlox.

The SearchBlox FREE Edition is available free of charge and can index up to 1000 documents. The software can be downloaded from http://www.searchblox.com

Robert Selvaraj of SearchBlox contributed a case study on SearchBlox to Lucene in Action, appearing in section 10.3.
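The preset filter mentioned in the announcement (a pre-defined query added to whatever the end user types) could be pictured roughly as below. This is a guess at the general technique, not SearchBlox's implementation: the class and method names are hypothetical, the `category:manuals` filter is an invented example, and the sketch simply assumes Lucene-style query syntax in which a leading `+` marks a required clause.

```java
public class PresetFilter {
    /** Combines the user's query with a preset filter so that both parts must match. */
    static String apply(String userQuery, String presetFilter) {
        if (presetFilter == null || presetFilter.trim().isEmpty()) {
            return userQuery; // no filter configured: pass the query through unchanged
        }
        return "+(" + userQuery + ") +(" + presetFilter.trim() + ")";
    }

    public static void main(String[] args) {
        // "category:manuals" is a made-up filter for illustration.
        System.out.println(apply("index pdf", "category:manuals"));
        // prints: +(index pdf) +(category:manuals)
    }
}
```

Wrapping each side in parentheses keeps the user's own operators from interacting with the filter's.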

index

starts on page 416

...18N. See internationalization parallelization 52-54 InformationMinder 347 index optimization 56-59 P ... ent 264 PDF 8 developers 10 See also indexing PDF documentation 388 N PDF Text Stream 264 downloading ... Lucene 391 C Almaer, Dion 371 building Sandbox 310 alternative spellings 354 indexing a fileset 284 C++ 10 analysis 103 Antiword 264 CachingWrappingFilter during indexing 105 ANTLR ... C. 26 supported platforms 314 Dutch 282 Berkeley DB, storing Unicode support 316 field types 105 index...

about this book

starts on page xxv

...'s primary competition. With- out wasting any time, we immediately build simple indexing and searching ... indexing operations. We describe the various field types and techniques for indexing numbers and dates. Tuning the indexing process, optimizing an index, and how to deal with thread-safety ... human-entered query expressions. Chapter 4 delves deep into the heart of Lucene's indexing magic ... 's built-in support for querying multiple indexes, even in parallel and remotely. Chapter 6 goes well...