Search inside Lucene in Action

Query parsed to: +microsoft +word +framework

7.8.5 : FileIndexer drawbacks, and how to extend the framework

starts on page 263 under section 7.8 (Creating a document-handling framework) in chapter 7 (Parsing common document formats)

... has a .txt file extension, and no other; that the .doc extension is reserved for Microsoft Word documents; and so on. The framework that we developed in this chapter includes parsers that can handle the following types of input: XML, PDF, HTML, Microsoft Word, RTF, and plain text. So, what do you do if you need to index and make searchable files of a type that our framework doesn't handle? You extend ... This framework has one obvious, although minor, flaw: It assumes that the file extensions don't lie...
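The extension-based dispatch this excerpt describes can be illustrated with a minimal sketch. It is not the book's code: `ExtensionRegistry`, `UnsupportedFileType`, and `read_plain_text` are hypothetical names invented for illustration, and Python is used only for brevity. The sketch assumes a handler is any function that takes a path and returns extracted text.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical handler type: takes a file path, returns the extracted text.
Handler = Callable[[Path], str]


class UnsupportedFileType(Exception):
    """Raised when no handler is registered for a file's extension."""


class ExtensionRegistry:
    """Maps lowercase file extensions to text-extraction handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Handler] = {}

    def register(self, extension: str, handler: Handler) -> None:
        # Normalize so ".TXT", "txt", and ".txt" all map to the same key.
        self._handlers[extension.lower().lstrip(".")] = handler

    def handler_for(self, path: Path) -> Handler:
        # Dispatch on the extension alone. The file's contents are never
        # inspected, so a mislabeled file is routed to the wrong handler.
        ext = path.suffix.lower().lstrip(".")
        try:
            return self._handlers[ext]
        except KeyError:
            raise UnsupportedFileType(f"no handler for extension {ext!r}") from None


def read_plain_text(path: Path) -> str:
    """Simplest possible handler: the file already is text."""
    return path.read_text(encoding="utf-8")


registry = ExtensionRegistry()
registry.register("txt", read_plain_text)

# Extending the sketch to a new file type means registering one more handler.
registry.register("log", read_plain_text)
```

Because the lookup trusts the extension, a Word document saved as `report.txt` would be handed to the plain-text handler, which is the kind of flaw the excerpt points out.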

2.1.1 : Conversion to text

starts on page 29 under section 2.1 (Understanding the indexing process) in chapter 2 (Indexing)

... the same situation if you want to index Microsoft Word documents or any document format other than ... . The details of text extraction are in chapter 7 where we build a small but complete framework...

7.0 : Parsing common document formats

starts on page 223

... documents with PDFBox Parsing HTML using JTidy and NekoHTML Parsing Microsoft Word documents ... with PDF, Microsoft Word, or Excel documents. The World-Wide Web typically contains data in HTML ... such as plain text, PDF, Microsoft Word, HTML, XML, and RTF with Lucene. Each example uses a third-party ... a document indexing framework and application ... So far in this book, we have covered various aspects ... as an abstraction to nest within a rich framework for parsing and indexing documents of any type...

7.10 : Summary

starts on page 265 in chapter 7 (Parsing common document formats)

...In this code-rich chapter, you learned how to handle several common document formats, from the omnipresent but proprietary Microsoft Word format to the omnipresent and open HTML. As you can see, any ... and NekoHTML for HTML, PDFBox for PDF, and POI and TextMining.org extractors for Microsoft Word documents ... framework capable of recursively parsing and indexing a file system. What you've learned ... framework to index web pages, files stored on remote FTP servers, files stored on remote servers on your...

1.7.1 : IR libraries

starts on page 24 under section 1.7 (Review of alternate search products) in chapter 1 (Meet Lucene)

... It also provides parsers for several rich-text document formats, such as PDF and Microsoft Word documents ... indexer and document parsers are similar to the small document parsing and indexing framework presented...

index

starts on page 416

.... See Nutch merging indexes 52 session 77 Microsoft Word documents Jakarta Commons Digester HTTP ... 14 Overture 6 with synonyms 132 Microsoft Word 8 Piccolo 264 parsing 107 P Plucene 318-320 Miller, George ... 388 alternative word CzechAnalyzer 282 Dutch 354 suggestions 128 DutchAnalyzer 282 analysis 103 D API ... 7, 372 format 393 analyzer 123 Levenshtein distance framework 225-226, 254-263 information overload ... 89 Microsoft 6, 318 OpenOffice SDK 264 position increment issue 138 Microsoft Index Server 26 optimize...