This article shows how to use GroupDocs.Parser to extract formatted text, represented as HTML or Markdown, from documents in various formats, such as emails, e-books (EPUB, FB2, CHM), Microsoft Office formats (Word: DOC, DOCX; PowerPoint: PPT, PPTX; Excel: XLS, XLSX), LibreOffice formats, and many others.
Read this article to learn how to convert Microsoft Word DOCX, DOC, and RTF documents to other formats with GroupDocs.Conversion for Java.
GroupDocs.Redaction supports Optical Character Recognition (OCR) for both kinds of image content:
image files, such as printed document scans (PNG, JPG, etc.)
embedded images within office documents (PDF, DOCX, etc.)
To use OCR, you have to implement the IOcrConnector interface and pass an instance of it to the RedactorSettings constructor.
For more details, see the OCR Usage Basics article.
OCR usage limitations
OCR in GroupDocs.Redaction for Java v21.6 has the following limitations:
textual replacements are not supported, so you have to use color box replacements to redact text in images.
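To make the idea of a color box replacement concrete, here is a minimal sketch that covers a rectangular region of an image with a solid box using only the JDK's `java.awt` classes. It does not use GroupDocs.Redaction; the class name `ColorBoxRedactionSketch`, its methods, and the region coordinates are assumptions made for illustration. In a real workflow the region would come from the text positions reported by the OCR step.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.Rectangle;
import java.awt.image.BufferedImage;

// Hypothetical illustration; not part of the GroupDocs.Redaction API.
public class ColorBoxRedactionSketch {

    /** Creates a blank white RGB image that stands in for a scanned page. */
    public static BufferedImage blankPage(int width, int height) {
        BufferedImage page = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = page.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, width, height);
        } finally {
            g.dispose();
        }
        return page;
    }

    /**
     * Returns a copy of the source image in which the given region is
     * covered by a solid box. The source image is left unchanged.
     */
    public static BufferedImage redactRegion(BufferedImage source, Rectangle region, Color boxColor) {
        BufferedImage copy = new BufferedImage(
                source.getWidth(), source.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = copy.createGraphics();
        try {
            g.drawImage(source, 0, 0, null);   // copy the original pixels
            g.setColor(boxColor);
            g.fillRect(region.x, region.y, region.width, region.height); // paint over the text area
        } finally {
            g.dispose();
        }
        return copy;
    }

    public static void main(String[] args) {
        BufferedImage scan = blankPage(200, 100);
        // Assumed location of the sensitive text on the page.
        Rectangle sensitiveArea = new Rectangle(20, 30, 80, 15);
        BufferedImage redacted = redactRegion(scan, sensitiveArea, Color.BLACK);
        System.out.printf("pixel inside box: %08X%n", redacted.getRGB(25, 35));
    }
}
```

Painting over the pixels removes the text from the image itself, which is why a box is usable where a textual replacement is not: there is no text layer in a scan to rewrite.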
Note
GroupDocs.Parser is a feature-rich document data parsing API. Here you can find a description of its most important features.
Parse Document by Template
GroupDocs.Parser lets you parse documents by user-defined templates.
It is easy to create a template with data field definitions and table definitions. Using the template is just as easy: pass the Template object to the parseByTemplate(Template) method to extract data such as prices, invoices, and tables from your typical documents.
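The general idea behind template-driven extraction can be sketched with plain JDK classes: a template is a set of named field definitions, and parsing applies each definition to the document text. The sketch below is not the GroupDocs.Parser API; the class `TemplateParsingSketch`, the methods `invoiceTemplate` and `extractFields`, and the regular expressions are hypothetical stand-ins, and it assumes the document text has already been extracted as a string.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not part of the GroupDocs.Parser API.
public class TemplateParsingSketch {

    /**
     * A toy "template": each field name maps to a regular expression
     * whose first capturing group holds the field value.
     */
    public static Map<String, Pattern> invoiceTemplate() {
        Map<String, Pattern> fields = new LinkedHashMap<>();
        fields.put("invoiceNumber", Pattern.compile("Invoice\\s+No\\.?\\s*:?\\s*(\\S+)"));
        fields.put("total", Pattern.compile("Total\\s*:?\\s*\\$?([0-9]+(?:\\.[0-9]{2})?)"));
        return fields;
    }

    /**
     * Applies every field definition to the text and collects the values
     * that were found. Fields with no match are simply omitted.
     */
    public static Map<String, String> extractFields(String text, Map<String, Pattern> template) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> field : template.entrySet()) {
            Matcher matcher = field.getValue().matcher(text);
            if (matcher.find()) {
                result.put(field.getKey(), matcher.group(1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String documentText = "Invoice No: INV-0042\nDate: 2021-06-01\nTotal: $199.90";
        Map<String, String> data = extractFields(documentText, invoiceTemplate());
        for (Map.Entry<String, String> entry : data.entrySet()) {
            System.out.println(entry.getKey() + " = " + entry.getValue());
        }
    }
}
```

Keeping the field definitions separate from the extraction loop is what makes the same template reusable across many documents of the same layout.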
GroupDocs.Parser provides the functionality to extract data from HTML documents and other markup formats.
The following table provides the list of supported formats:
Format   Description
HTML     Hypertext Markup Language File
XHTML    Extensible Hypertext Markup Language File
MHTML    MIME HTML File
MD       Markdown
XML      XML File

More resources
GitHub examples
You may easily run the code above and see the feature in action in our GitHub examples:
GroupDocs.Parser for .NET examples
GroupDocs....
GroupDocs.Conversion for Node.js via Java supports DOCX, DOCM, DOC, DOT, DOTM, XLS, XLSX, PDF, PPT, JPG, PNG, HTML, EML and many more.
DOC to TIFF document converter - convert DOC to TIFF online for free, no registration required. Secure and easy to use DOC to TIFF conversion!