PHP Classes

Extract PDF to text and XML: I need to parse a PDF file and convert whole text into XML


by Anand Lagad - 2 years ago (2015-06-14)


I need PHP code to parse any PDF file and convert it to XML.

I do not think a PDF contains HTML-like tags that we can inspect, so I think we should first parse the whole PDF and then convert it into XML.

If the PDF document contains a table, I want the table fields as XML tags and the table data as their values.
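
To make the requested output concrete, here is a minimal sketch of the last step only: turning a table that has already been extracted into XML, with the column headings as tag names and the cells as values. It assumes the headings and rows are available as PHP arrays; getting them out of the PDF is the hard part and is not covered. The function name `tableToXml` and the `table` and `row` element names are made up for illustration.

```php
<?php
// Hypothetical helper: turn one already-extracted table into XML.
// $fields are the column headings, $rows the cell values of each row.
function tableToXml(array $fields, array $rows, string $rootName = 'table'): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;
    $root = $doc->appendChild($doc->createElement($rootName));

    foreach ($rows as $cells) {
        $rowNode = $root->appendChild($doc->createElement('row'));
        foreach ($fields as $i => $field) {
            // XML names cannot contain spaces or start with a digit, so
            // normalise the heading before using it as a tag name.
            $tag = preg_replace('/[^A-Za-z0-9_]+/', '_', trim($field));
            if ($tag === '' || !preg_match('/^[A-Za-z_]/', $tag)) {
                $tag = 'field_' . $tag;
            }
            $cell = $doc->createElement($tag);
            // createTextNode escapes &, < and > in the cell value.
            $cell->appendChild($doc->createTextNode((string) ($cells[$i] ?? '')));
            $rowNode->appendChild($cell);
        }
    }
    return $doc->saveXML();
}

echo tableToXml(['Name', 'Unit price'], [['Widget', '9.50'], ['Nut & bolt', '0.25']]);
```

Each row becomes a `row` element whose children are named after the headings, so a heading such as "Unit price" ends up as a `Unit_price` tag.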

  • 1 Clarification request
  • 1. by Manuel Lemos - 2 years ago (2015-06-15)

    I do not think that right now there is a class here that can convert an arbitrary PDF document to XML, HTML or any format that preserves the document structure.

    There are classes for converting PDF to images of the pages, but I am not sure if that would address your needs.

    There are solutions that require using external Web services or external programs like xpdf or Ghostscript. If that would do for you, maybe somebody can submit a class that wraps around those Web services or programs.
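
As a rough idea of what such a wrapper could look like, the sketch below shells out to the `pdftotext` program from xpdf and wraps each page of its output in a `page` element. It assumes `pdftotext` is installed and on the PATH, and that it separates pages with a form feed character. The function names are hypothetical, and this only yields per-page text; it does not recover tables or other document structure.

```php
<?php
// Sketch only: assumes the xpdf "pdftotext" program is installed and on PATH.

// Build the shell command. "-layout" keeps the physical layout of the page,
// "-enc UTF-8" requests UTF-8 output, and the final "-" writes to stdout.
function buildPdftotextCommand(string $pdfPath): string
{
    return 'pdftotext -layout -enc UTF-8 ' . escapeshellarg($pdfPath) . ' -';
}

// Wrap plain text in XML, one <page> element per form-feed-separated page.
function pagesToXml(string $text): string
{
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;
    $root = $doc->appendChild($doc->createElement('document'));

    $pages = explode("\f", rtrim($text, "\f\n"));
    foreach ($pages as $i => $page) {
        $node = $doc->createElement('page');
        $node->setAttribute('number', (string) ($i + 1));
        // createTextNode escapes &, < and > in the page text.
        $node->appendChild($doc->createTextNode(trim($page, "\n")));
        $root->appendChild($node);
    }
    return $doc->saveXML();
}

// Run the external program and convert its output.
function pdfToXml(string $pdfPath): string
{
    $text = shell_exec(buildPdftotextCommand($pdfPath));
    if (!is_string($text)) {
        throw new RuntimeException('pdftotext produced no output for ' . $pdfPath);
    }
    return pagesToXml($text);
}
```

Calling `pdfToXml('invoice.pdf')` (a placeholder file name) would return the XML as a string. Keeping the command building and the XML wrapping in separate functions means the wrapping can be tried without the external program installed.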


    1 Recommendation

    PHP DOC DOCX PDF to Text Converter: Convert DOCX, DOC, PDF to plain text


    by Dave Smith Reputation 6255 - 2 years ago (2015-06-14)

    The innovation nomination description indicates that this class will extract document elements in addition to text, which is what you will need to extract tables.

    • 8 Comments
    • 1. by Manuel Lemos - 2 years ago (2015-06-15)

      I think the original poster wants a solution that preserves the original document structure. So, just extracting text may not be enough for him.

    • 2. by adam berger - 2 years ago (2015-06-15)

      An interesting project. I would be happy to try the same class to convert PDF to XML. I am waiting for results :)

    • 3. by adam berger - 2 years ago (2015-06-15) in reply to comment 2 by adam berger

      I suggest you first convert the document to HTML in a cache and then convert that to XML. This can be done on the fly with a cache.

    • 4. by Manuel Lemos - 2 years ago (2015-06-15) in reply to comment 3 by adam berger

      Well, XHTML is still HTML and XML.
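
To illustrate the HTML-then-XML idea from the two comments above, here is a minimal sketch that re-serialises an HTML fragment as well-formed XML with PHP's DOM extension. It assumes some earlier step has already produced the HTML from the PDF; `htmlToXml` is a made-up name, not part of any class discussed here.

```php
<?php
// Sketch: re-serialise an HTML fragment as well-formed XML using PHP's DOM extension.
function htmlToXml(string $html): string
{
    $doc = new DOMDocument();
    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The XML prolog prefix is a common trick to make loadHTML read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The HTML parser wraps fragments in <html><body>; serialise the body as XML.
    $body = $doc->getElementsByTagName('body')->item(0);
    return $body === null ? '' : $doc->saveXML($body);
}

echo htmlToXml('<table><tr><td>Widget<td>9.50</table>');
```

The parser closes the unclosed `td` elements, so the result can be loaded by any XML parser. It is still HTML vocabulary, though; mapping table fields to custom tags would be a further step.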

    • 5. by Dave Smith - 2 years ago (2015-06-15) in reply to comment 1 by Manuel Lemos

      If the comments for the innovation nomination of this class are correct, and I am not misreading them, the class should be able to get the document elements, not just text. That is the basis of my recommendation.

    • 6. by Manuel Lemos - 2 years ago (2015-06-16) in reply to comment 5 by Dave Smith

      What the nomination comments say is that extracting document elements is not a trivial task. That class just extracts text using a simple approach.

    • 7. by Dave Smith - 2 years ago (2015-06-16) in reply to comment 6 by Manuel Lemos

      Okay, looks like I was confused. Better to have tried and failed than not to have tried at all :)

      Looks like adam berger will attempt the non-trivial task.

    • 8. by Manuel Lemos - 2 years ago (2015-06-16) in reply to comment 7 by Dave Smith

      That is OK, maybe my wording was not ideal either.

