Introducing Lexoid - An Efficient Document-to-Markdown Converter
Lexoid is an open-source document parsing library that efficiently converts unstructured documents into structured Markdown.
Key Features of Lexoid:
Dual Parsing Mode: Lexoid offers two parsing modes: LLM-based parsing, which uses language models from providers such as OpenAI and Google for complex documents, and static parsing, which uses traditional tools such as PDFPlumber for well-structured files.
Open-Weight and Open-Source Model Support: Open models are supported via Hugging Face, Together AI, OpenRouter, and Fireworks.
Automatic Parsing Selection (AUTO Mode): AUTO mode selects a parsing strategy for each document segment, balancing accuracy against processing cost. In benchmark tests, AUTO mode achieved accuracy similar to full LLM parsing while reducing costs by up to 50%.
Markdown Conversion: Converts parsed content into Markdown, a format easily understood by large language models, facilitating integration for downstream tasks.
Hyperlink Preservation: Maintains hyperlinks during parsing, ensuring references and external links within documents are not lost.
Recursive URL Parsing: Supports recursive parsing of websites, extracting content from multiple levels of linked pages up to a configurable depth.
Multi-Format Support: Lexoid can handle various file formats, including PDFs, Word documents, PowerPoint presentations, Excel spreadsheets, HTML pages, and CSV files.
Parallel Processing: Designed for efficiency, Lexoid supports parallel processing, enabling simultaneous parsing of multiple documents or pages.
Permissive License: Released under the Apache 2.0 license, Lexoid is free to use and modify, encouraging community contributions and integration into diverse workflows.
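The depth setting mentioned under Recursive URL Parsing can be pictured as a breadth-first walk that stops following links after a fixed number of hops. The sketch below is only an illustration of that idea, not Lexoid's implementation: the `crawl` function, the `links_of` helper, and the in-memory `SITE` map are all hypothetical and stand in for real page fetching.

```python
from collections import deque

# Hypothetical in-memory "website": each URL maps to the links found on it.
SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/a/1", "https://example.com/"],
    "https://example.com/b": [],
    "https://example.com/a/1": [],
}


def links_of(url):
    """Stand-in for fetching a page and extracting its hyperlinks."""
    return SITE.get(url, [])


def crawl(start, get_links, max_depth):
    """Visit pages breadth-first, following links at most max_depth hops from start."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)  # a real parser would convert this page to Markdown here
        if depth == max_depth:
            continue  # depth limit reached: do not follow this page's links
        for link in get_links(url):
            if link not in seen:  # skip pages already queued, which also avoids cycles
                seen.add(link)
                queue.append((link, depth + 1))
    return order


pages_depth_1 = crawl("https://example.com/", links_of, max_depth=1)
pages_depth_2 = crawl("https://example.com/", links_of, max_depth=2)
```

With a depth of 1 the walk covers the start page and the two pages it links to; raising the depth to 2 also reaches the page linked from `/a`. Tracking visited URLs keeps the back-link to the start page from causing a loop.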
The library is still in its early days but is improving continuously. It aims to be a versatile, efficient way to convert many document types into structured Markdown while balancing advanced parsing capabilities against processing cost.
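To make the cost/accuracy trade-off behind AUTO mode more concrete, here is a minimal sketch of per-segment routing. It assumes a simple hand-written rule (send a page to the LLM only when static extraction is likely to struggle); the `Page` fields, the `choose_parser` function, and the 200-character threshold are all hypothetical, and Lexoid's actual selection logic may work quite differently.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical summary of one document segment."""
    text_chars: int    # characters a static text extractor could recover
    image_count: int   # embedded images such as scans or figures
    has_tables: bool   # whether the page contains tabular layout


def choose_parser(page, min_chars=200):
    """Return 'LLM' for pages that look hard to parse statically, else 'STATIC'.

    Assumed rule for illustration only.
    """
    if page.text_chars < min_chars:
        return "LLM"  # very little extractable text: probably a scanned page
    if page.image_count > 0 or page.has_tables:
        return "LLM"  # visual or tabular content tends to need a stronger parser
    return "STATIC"


pages = [
    Page(text_chars=1800, image_count=0, has_tables=False),
    Page(text_chars=40, image_count=1, has_tables=False),
    Page(text_chars=1500, image_count=0, has_tables=True),
    Page(text_chars=2200, image_count=0, has_tables=False),
]

plan = [choose_parser(p) for p in pages]
llm_share = plan.count("LLM") / len(plan)  # fraction of pages sent to the LLM
```

In this toy document only half of the pages are routed to the LLM, which is the kind of saving a per-segment strategy is after: plain text pages take the cheap path, and the expensive parser is reserved for pages that need it.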
Early benchmarks:
Want to dive deeper?
Read the full article: Introducing Lexoid – An Efficient Document-to-Markdown Converter
Ready to explore the code?
Check out Lexoid on GitHub: github.com/oidlabs-com/Lexoid
Try Lexoid in action:
• Run it on Hugging Face Spaces – No setup needed, see how it works instantly
• Open in Google Colab – Test it interactively with your own documents