Found 50 packages
Classes for running Apache Tika through **TikaOnDotNet**. Just use TextExtractor.Extract() and you'll be on your way.
Simple PDF text extractor based on PDFSharp. Supports both single- and two-byte fonts, ToUnicode maps, and encodings. It does not support precise symbol positioning on the page, so text order can differ from the original.
Extracts strings from .NET solutions and projects for GetText Catalog template files (.pot).
GroupDocs.Parser for .NET is a parsing class library that lets you extract data from documents in various formats. The data extraction API supports PDF, DOC, DOCX, PPT, PPTX, XLS, XLSX and many more formats.
Library for text extraction. Supports doc, docx, xlsx, odt, pdf, rtf, html, rar, zip,
Bytescout PDF Extractor SDK for .NET, ASP.NET, ActiveX - extract data from PDF documents
Winnovative PDF Images Extractor Library for .NET (Classic) can be used in .NET Framework, .NET Core and .NET Standard applications to extract images from PDF documents. This package is compatible with .NET Framework, .NET Core and .NET Standard 2.0 on Windows platforms. For applications that need to run on both Windows and Linux platforms, you can use the Winnovative.Pdf.Next.PdfProcessor package, which allows you to extract text and images from PDF documents, search text in PDF documents and convert PDF pages to images.
The compatibility list includes the following .NET versions, platforms and application types:
* .NET Framework 4.0 and above
* .NET 10, 9, 8, 7, 6
* .NET Standard 2.0
* Windows platforms
* Azure App Service
* Azure Cloud Services and Azure Virtual Machines
* Web, Console and Desktop applications
Main Features:
* Extract images from PDF documents
* Preserve transparency information from PDF documents
* Extract images in memory or to image files in a folder
* Save the extracted images in various image formats
* Support for password-protected PDF documents
* Extract images only from a range of PDF pages
* Get the number of pages in a PDF document
* Get the PDF document title, keywords, author and description
* Does not require Adobe Reader or other third-party tools
Documentation and code samples: https://www.winnovative-software.com/winnovative-pdf-images-extractor-dotnet
Simple C# library for extracting text and metadata from .docx, .pptx, and .xlsx files
A simple C# shell wrapper for the wonderful pdfplumber library in Python to extract text from .PDF files
A C# library that extracts text from various document file formats, e.g. pdf, docx, ppt.
EVO PDF Images Extractor Library for .NET (Classic) can be used in .NET Framework, .NET Core and .NET Standard applications to extract images from PDF documents. This package is compatible with .NET Framework, .NET Core and .NET Standard 2.0 on Windows platforms. For applications that need to run on both Windows and Linux platforms, you can use the EvoPdf.Next.PdfProcessor package, which allows you to extract text and images from PDF documents, search text in PDF documents and convert PDF pages to images.
The compatibility list includes the following .NET versions, platforms and application types:
* .NET Framework 4.0 and above
* .NET 10, 9, 8, 7, 6
* .NET Standard 2.0
* Windows platforms
* Azure App Service
* Azure Cloud Services and Azure Virtual Machines
* Web, Console and Desktop applications
Main Features:
* Extract images from PDF documents
* Preserve transparency information from PDF documents
* Extract images in memory or to image files in a folder
* Save the extracted images in various image formats
* Support for password-protected PDF documents
* Extract images only from a range of PDF pages
* Get the number of pages in a PDF document
* Get the PDF document title, keywords, author and description
* Does not require Adobe Reader or other third-party tools
Documentation and code samples: https://www.evopdf.com/evopdf-pdf-images-extractor-dotnet
This is a renderer for Melville.PDF that extracts all of the text from a PDF page.
A C# library that extracts text from various document file formats, e.g. pdf, docx, ppt.
Bare-bones IKVM Java-to-.NET port of Apache Tika. You'll want to install TikaOnDotNet.TextExtractor.
A C# library that extracts text from various document file formats, e.g. pdf, docx, ppt.
A C# library that extracts text from various document file formats, e.g. pdf, docx, ppt.
A C# library that extracts text from various document file formats, e.g. pdf, docx, ppt.
A C# library that extracts text from various document file formats, e.g. pdf, docx, ppt.
The ExpertPdf Pdf to Text Converter can be used in any type of .NET application to extract the text from a PDF document. It integrates easily with existing .NET applications, and no installation is necessary to run the converter. The downloadable archive contains the assembly for .NET 2.0, .NET 4.0 and .NET Core, plus a ready-to-use sample console application. The full C# and VB.NET source code for the sample application is available in the Samples folder. The sample application can be built with any version of Visual Studio. The result of the conversion is a .NET String object that you can use, for example, in search operations or save to a file on disk.
Features:
- .NET 2.0, .NET 4.0, .NET Core development library, C# and VB.NET samples
- Extract text from a PDF stream or a PDF file
- Extract text preserving the original PDF layout
- Extract text in PDF reading order
- Specify the range of pages to be extracted
- Save the extracted text in HTML format and add description meta tags
- Add the title, keywords and author from the PDF description in HTML meta tags
- Mark the page breaks in the extracted text with a special character
- Extract text from password-protected PDF documents
- Get the number of pages in the PDF document
- Search for text in PDF documents (returns the page number and position on the page of each match)