Simple C# library for extracting text and metadata from .docx, .pptx, and .xlsx files
$ dotnet add package DocumentTextExtractor
DocumentTextExtractor provides simple methods for extracting text and metadata from .docx, .pptx, and .xlsx files.
Supported file types: docx, pptx, and xlsx.
This library has been tested on a limited set of documents. It is highly likely that documents exist from which the library, in its current state, cannot extract text.
Refer to the Test project for a full example.
using System.Collections.Generic;
using DocumentTextExtractor;

void Main(string[] args)
{
    // Extract text and metadata from a Word document.
    using (DocxTextExtractor docx = new DocxTextExtractor("./temp/", "mydocument.docx"))
    {
        string docxText = docx.ExtractText();
        Dictionary<string, string> docxMetadata = docx.ExtractMetadata();
    }

    // Extract text and metadata from a PowerPoint presentation.
    using (PptxTextExtractor pptx = new PptxTextExtractor("./temp/", "mypresentation.pptx"))
    {
        string pptxText = pptx.ExtractText();
        Dictionary<string, string> pptxMetadata = pptx.ExtractMetadata();
    }

    // Extract text and metadata from an Excel workbook.
    using (XlsxTextExtractor xlsx = new XlsxTextExtractor("./temp/", "myspreadsheet.xlsx"))
    {
        string xlsxText = xlsx.ExtractText();
        Dictionary<string, string> xlsxMetadata = xlsx.ExtractMetadata();
    }
}
For the version history, please refer to CHANGELOG.md.