Working with Table Data in Documents: Tips and Tricks for LLMs

Eason
4 min read · Oct 28, 2023

Among language model solutions, RAG (Retrieval-Augmented Generation) stands out as an approach that combines retrieval and generation techniques. RAG retrieves relevant pre-existing information and passes it to the model, which improves the accuracy and relevance of the generated responses.

I have received inquiries about how to handle table data in document files, such as PDFs and Word documents, so that a Large Language Model (LLM) can understand it better. This is a common issue: document pre-processing typically treats all text in a table as unstructured text, which makes it hard for the LLM to recover the meaning or structure of the data. As a result, the LLM's answers may be inadequate. It is important to note that the issue does not lie with the LLM or with RAG, but with how the data is prepared. To get good results from an LLM, the data must be prepared in a form the LLM can understand.
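To make the problem concrete, here is a minimal sketch of what gets lost when a table is flattened. The table contents and the helper names `flatten` and `to_markdown` are hypothetical and made up for illustration; a real pre-processing pipeline may behave differently.

```python
# Hypothetical table, invented for illustration only.
table = [
    ["Plan", "Monthly price", "Seats"],
    ["Basic", "$10", "1"],
    ["Team", "$40", "5"],
]


def flatten(rows):
    """Mimic naive pre-processing: every cell ends up in one text stream."""
    return " ".join(cell for row in rows for cell in row)


def to_markdown(rows):
    """Keep the header/row structure by serializing as a Markdown table."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


# The flattened form no longer says which value belongs to which column.
print(flatten(table))
# The Markdown form keeps each value next to its header and row.
print(to_markdown(table))
```

In the flattened string, nothing tells the model that "$40" is a monthly price or that it belongs to the "Team" row, while the Markdown version keeps that relationship explicit.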

Tables are often used in documents such as research papers and product brochures to present information more clearly. One example is a table from the research paper "The Value of Open AI and Chat GPT for the Current Learning Environments and The Potential Future Uses", available on ResearchGate.

Let's work through a solution together. The key step is to identify all the tables within a document, for example, the tables in a PDF document. Several Python libraries can read PDF files, such as PyMuPDF and PyPDF2, and PyMuPDF can also detect tables. For the purpose of this example, I will use PyMuPDF to identify and extract table…
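As a rough sketch of that detection step, and not necessarily the exact code used in this article, the snippet below relies on `Page.find_tables()`, which I am assuming is available (it was added in PyMuPDF 1.23.0). The function names `clean_rows` and `extract_tables` and the file name in the usage comment are hypothetical.

```python
def clean_rows(rows):
    """Replace empty (None) cells with "" and collapse line breaks in cells."""
    return [
        [(cell or "").replace("\n", " ").strip() for cell in row]
        for row in rows
    ]


def extract_tables(pdf_path):
    """Return (page_number, bbox, rows) for every table PyMuPDF detects."""
    # PyMuPDF is imported lazily so clean_rows stays usable without it.
    import fitz  # pip install pymupdf (1.23.0 or newer assumed)

    found = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # find_tables() returns a finder object; .tables lists the hits.
            for table in page.find_tables().tables:
                rows = clean_rows(table.extract())  # list of rows of cell text
                found.append((page.number + 1, tuple(table.bbox), rows))
    return found


# Example usage (hypothetical file name):
# for page_no, bbox, rows in extract_tables("paper.pdf"):
#     print(f"Table on page {page_no} at {bbox}: {len(rows)} rows")
```

Keeping the bounding box alongside the rows makes it possible to later remove the table region from the page's plain text, so the same cells are not fed to the LLM twice in two different forms.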
