GroupDocs.Parser provides the functionality to extract data from HTML documents and other markup formats.
The following table provides the list of supported formats:
Format  Description
HTML    Hypertext Markup Language File
XHTML   Extensible Hypertext Markup Language File
MHTML   MIME HTML File
MD      Markdown
XML     XML File

More resources
GitHub examples
You may easily run the code above and see the feature in action in our GitHub examples:
GroupDocs.Parser for .NET examples GroupDocs....extract data from PDF, DOC, DOCX, PPT, PPTX, XLS, XLSX, Emails...
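As a small illustration of the format table above, the following sketch checks whether a file name falls into one of the listed markup formats. It is not part of GroupDocs.Parser: the class `MarkupFormats` and the method `describe` are hypothetical names, and the sketch assumes that each entry in the Format column doubles as the file extension.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not a GroupDocs.Parser API.
final class MarkupFormats {
    // Extension -> description, copied from the format table.
    // Assumption: the Format column matches the file extension.
    private static final Map<String, String> SUPPORTED = Map.of(
            "html", "Hypertext Markup Language File",
            "xhtml", "Extensible Hypertext Markup Language File",
            "mhtml", "MIME HTML File",
            "md", "Markdown",
            "xml", "XML File");

    /** Returns the format description for a file name, or empty if the extension is not listed. */
    static Optional<String> describe(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty(); // no extension at all
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(SUPPORTED.get(extension));
    }

    public static void main(String[] args) {
        System.out.println(describe("notes.MD").orElse("unsupported"));   // prints "Markdown"
        System.out.println(describe("report.pdf").orElse("unsupported")); // prints "unsupported"
    }
}
```

A lookup like this only inspects the file name; it says nothing about whether the content is actually valid markup.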
GroupDocs.Parser provides the functionality to extract data from documents on the local disk.
The following example shows how to load the document from the local disk:
// Set the filePath
String filePath = Constants.SamplePdf;
// Create an instance of Parser class with the filePath
try (Parser parser = new Parser(filePath)) {
    // Extract a text into the reader
    try (TextReader reader = parser.getText()) {
        // Print a text from the document
        // If text extraction isn't supported, a reader is null
        System....
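Before a local path is handed to the parser, it can help to confirm that it points to a readable regular file. The sketch below uses only the JDK; `LocalDocumentCheck` and `requireReadableFile` are hypothetical names and not part of GroupDocs.Parser. How the library itself reports a missing file is not covered by the snippet above, so this is only a defensive pre-check.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical pre-flight check, independent of GroupDocs.Parser.
final class LocalDocumentCheck {
    /** Returns the absolute path if it names a readable regular file, otherwise throws. */
    static String requireReadableFile(String filePath) {
        Path path = Paths.get(filePath);
        if (!Files.isRegularFile(path)) {
            throw new IllegalArgumentException("Not a regular file: " + path);
        }
        if (!Files.isReadable(path)) {
            throw new IllegalArgumentException("File is not readable: " + path);
        }
        return path.toAbsolutePath().toString();
    }

    public static void main(String[] args) {
        // Pass a document path as the first argument.
        String checked = requireReadableFile(args.length > 0 ? args[0] : "sample.pdf");
        System.out.println("Ready to parse: " + checked);
    }
}
```

The returned string could then be passed as the `filePath` value used in the example above.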
This article explains how to extract Markdown-formatted text from a document page in Java.
Reading MOV format-specific properties
The GroupDocs.Metadata API supports extracting QuickTime atoms from a MOV video. The atom is the basic data unit in any QuickTime file. Please find more information on QuickTime atoms in the official specification: https://developer.apple.com/library/archive/documentation/QuickTime/QTFF/QTFFPreface/qtffPreface.html
The following are the steps to extract QuickTime atoms from a MOV video.
1. Load a MOV video
2. Get the root metadata package
3. Extract the native metadata package using MovRootPackage.MovPackage
4. Read the QuickTime atoms

AdvancedUsage....edit metadata of PDF, DOC, DOCX, PPT, PPTX, XLS, XLSX, emails...
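To make the notion of an atom concrete, here is a minimal sketch that walks the top-level atoms of a QuickTime byte array using only the JDK. It does not use the GroupDocs.Metadata API; `QuickTimeAtoms`, `Atom`, and `readTopLevel` are hypothetical names. It relies on the header layout from the QuickTime file format specification: a 32-bit big-endian size, a four-character type, an optional 64-bit extended size when the size field is 1, and a size of 0 meaning the atom runs to the end of the data.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the atom header layout, not a GroupDocs.Metadata API.
final class QuickTimeAtoms {
    /** Type and total size (header included) of one top-level atom. */
    static final class Atom {
        final String type;
        final long size;

        Atom(String type, long size) {
            this.type = type;
            this.size = size;
        }
    }

    static List<Atom> readTopLevel(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes); // big-endian by default
        List<Atom> atoms = new ArrayList<>();
        while (buffer.remaining() >= 8) {
            int start = buffer.position();
            long size = buffer.getInt() & 0xFFFFFFFFL; // unsigned 32-bit size
            byte[] type = new byte[4];
            buffer.get(type);
            long headerLength = 8;
            if (size == 1) { // a 64-bit extended size follows the type
                if (buffer.remaining() < 8) {
                    throw new IllegalArgumentException("Truncated extended size");
                }
                size = buffer.getLong();
                headerLength = 16;
            } else if (size == 0) { // atom extends to the end of the data
                size = bytes.length - start;
            }
            if (size < headerLength || size > bytes.length - start) {
                throw new IllegalArgumentException("Invalid atom size: " + size);
            }
            atoms.add(new Atom(new String(type, StandardCharsets.ISO_8859_1), size));
            buffer.position((int) (start + size)); // jump to the next sibling atom
        }
        return atoms;
    }

    public static void main(String[] args) {
        // Two tiny hand-made atoms: "ftyp" with 4 payload bytes, then an empty "free".
        byte[] data = {0, 0, 0, 12, 'f', 't', 'y', 'p', 1, 2, 3, 4,
                       0, 0, 0, 8, 'f', 'r', 'e', 'e'};
        for (Atom atom : readTopLevel(data)) {
            System.out.println(atom.type + " " + atom.size);
        }
    }
}
```

Container atoms such as `moov` hold child atoms in their payload, so the same loop could be applied recursively to the payload range; this sketch stops at the top level.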
For all supported image formats the GroupDocs.Metadata API allows extracting common image properties such as width and height, MIME type, byte order, etc. Please see the code snippet below for more information on the feature.
1. Load an image
2. Extract the root metadata package
3. Use the getImageType method to obtain file format information

advanced_usage.managing_metadata_for_specific_formats.image.ImageReadFileFormatProperties
// Load the image
try (Metadata metadata = new Metadata(Constants.InputPng)) {
    // Get the root metadata package
    ImageRootPackage root = metadata.getRootPackageGeneric();
    // Print the file format properties
    System.out.println(root.getImageType().getFileFormat());
    System.out.println(root.getImageType().getByteOrder());
    System.out.println(root.getImageType().getMimeType());
    System.out.println(root.getImageType().getExtension());
    System.out.println(root.getImageType().getWidth());
    System.out.println(root.getImageType().getHeight());
}
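For intuition about where properties such as width, height, and byte order come from, the sketch below reads the dimensions directly from a PNG header with plain JDK classes. It is not how GroupDocs.Metadata is implemented, and `PngHeader` and `dimensions` are hypothetical names. It assumes the standard PNG layout: an 8-byte signature followed by an IHDR chunk whose first two fields are the big-endian width and height.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of the PNG header layout, not a GroupDocs.Metadata API.
final class PngHeader {
    private static final byte[] SIGNATURE =
            {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};

    /** Returns {width, height} read from the IHDR chunk at the start of a PNG byte array. */
    static int[] dimensions(byte[] png) {
        // 8 signature bytes + 4 length + 4 type + 4 width + 4 height = 24 bytes needed.
        if (png.length < 24 || !Arrays.equals(Arrays.copyOf(png, 8), SIGNATURE)) {
            throw new IllegalArgumentException("Not a PNG stream");
        }
        ByteBuffer buffer = ByteBuffer.wrap(png, 8, 16); // big-endian by default, as PNG uses
        int chunkLength = buffer.getInt();
        byte[] chunkType = new byte[4];
        buffer.get(chunkType);
        if (chunkLength != 13 || !"IHDR".equals(new String(chunkType, StandardCharsets.US_ASCII))) {
            throw new IllegalArgumentException("First chunk is not a valid IHDR");
        }
        int width = buffer.getInt();
        int height = buffer.getInt();
        return new int[] {width, height};
    }

    public static void main(String[] args) {
        // Hand-made header describing a 512 x 300 image (pixel data omitted).
        byte[] header = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n',
                         0, 0, 0, 13, 'I', 'H', 'D', 'R',
                         0, 0, 2, 0, 0, 0, 1, 44};
        int[] size = dimensions(header);
        System.out.println(size[0] + " x " + size[1]); // prints "512 x 300"
    }
}
```

Other image formats store the same information in different places and byte orders, which is one reason a format-aware API is convenient.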
Learn how to exclude system pre-installed fonts from HTML markup to reduce rendered document size when rendering documents using GroupDocs.Viewer for Java....Extension Portable Document Format PDF Microsoft Word DOC, DOCX, DOCM...TEX Microsoft PowerPoint PPT, PPTX, PPS, PPSX OpenDocument Formats...
Learn how to get basic document information, including file type, page count, and file size, using GroupDocs.Parser for .NET. Get document properties in C#.