Document To Knowledge Graph Visualizer

Overview

This project uses AI to transform unstructured documents, specifically PDFs, into structured knowledge graphs. Each document is converted and processed, its content is extracted as RDF triples and ontologies, and the result is visualized as a knowledge graph, which makes the information easier to explore and query.

Key Features

  • AI-Driven Text Chunking: After converting PDFs to HTML, the content is split into manageable, logically coherent chunks. LangChain's RecursiveCharacterTextSplitter performs the initial split, and OpenAI's API then refines the chunks based on context so that each chunk stays on one topic and is cleanly separated from its neighbors.
  • AI-Powered Ontology Generation: The AI generates RDF triples and T-box ontologies from the text chunks. OpenAI's models transform the raw text into structured ontological components, aiming for logically and contextually accurate triples and classes. The AI also generalizes the T-box ontology, merging the per-chunk results into one cohesive structure.
  • Automated A-box Creation: Using the AI-generated T-box as a foundation, the project automatically constructs an A-box ontology, organizing individual instances and relationships according to the structured schema defined by the T-box.
  • Knowledge Graph Visualization: The resulting A-box ontology is uploaded and visualized in GraphDB, allowing users to explore and query the structured data efficiently.
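The T-box and A-box split described above can be illustrated with a small hand-written sketch. It is not output from this project: the `http://example.org/kg#` namespace, the `Actor`, `Film`, and `actedIn` names, and both helper functions are assumptions chosen for illustration, borrowing the actor from the Demo section. It uses only the Python standard library and emits N-Triples.

```python
# Hypothetical namespace and vocabulary, chosen only for this illustration.
EX = "http://example.org/kg#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"

# T-box: schema-level statements (classes and properties).
tbox = [
    (EX + "Actor", RDF_TYPE, OWL + "Class"),
    (EX + "Film", RDF_TYPE, OWL + "Class"),
    (EX + "actedIn", RDF_TYPE, OWL + "ObjectProperty"),
    (EX + "actedIn", RDFS + "domain", EX + "Actor"),
    (EX + "actedIn", RDFS + "range", EX + "Film"),
]

# A-box: instance-level statements that use the T-box vocabulary.
abox = [
    (EX + "Tom_Hanks", RDF_TYPE, EX + "Actor"),
    (EX + "Forrest_Gump", RDF_TYPE, EX + "Film"),
    (EX + "Tom_Hanks", EX + "actedIn", EX + "Forrest_Gump"),
]


def to_ntriples(triples):
    """Serialize (subject, predicate, object) IRI triples as N-Triples lines."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


def undeclared_classes(tbox, abox):
    """Return classes used by A-box individuals that the T-box does not declare."""
    declared = {s for s, p, o in tbox if p == RDF_TYPE and o == OWL + "Class"}
    used = {o for s, p, o in abox if p == RDF_TYPE}
    return used - declared


if __name__ == "__main__":
    print(to_ntriples(tbox + abox))
    print("Undeclared classes:", undeclared_classes(tbox, abox))
```

A check like `undeclared_classes` is one simple way to confirm that every instance in the A-box is typed by a class the T-box actually defines.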

Technologies Used

  • Python: For all data processing, including PDF conversion, text chunking, and ontology generation.
  • LangChain: For the initial text splitting with RecursiveCharacterTextSplitter.
  • OpenAI API: For AI-driven text chunking and ontology generation, ensuring that the knowledge graph is both comprehensive and logically structured.
  • GraphDB: For visualizing and querying the resulting knowledge graph.
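One way to load an ontology into GraphDB from Python is through the RDF4J-style REST endpoint that GraphDB exposes, `POST /repositories/<repository-id>/statements`. The sketch below only builds the request. The base URL `http://localhost:7200` (GraphDB's default port), the repository id `documents`, and the function name are assumptions, and the project itself may upload through the GraphDB Workbench instead.

```python
import urllib.request


def build_upload_request(base_url, repository_id, turtle_data):
    """Build a POST request that adds Turtle data to a GraphDB repository.

    base_url and repository_id are placeholders for your own installation.
    """
    url = f"{base_url.rstrip('/')}/repositories/{repository_id}/statements"
    return urllib.request.Request(
        url,
        data=turtle_data.encode("utf-8"),
        headers={"Content-Type": "text/turtle"},
        method="POST",
    )


if __name__ == "__main__":
    # Hypothetical A-box fragment; replace it with the generated ontology.
    turtle = (
        "@prefix ex: <http://example.org/kg#> .\n"
        "ex:Tom_Hanks a ex:Actor .\n"
    )
    request = build_upload_request("http://localhost:7200", "documents", turtle)
    print(request.get_method(), request.full_url)
    # To actually send it against a running GraphDB instance:
    # with urllib.request.urlopen(request) as response:
    #     print(response.status)
```

Building the request separately from sending it keeps the sketch runnable without a GraphDB server and makes the target URL easy to inspect.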

AI Integration

The project makes extensive use of AI, particularly for:

  • Text Chunking: AI is used to intelligently split large text blocks into coherent chunks that maintain logical flow and thematic consistency.
  • Ontology Generation: AI transforms text into RDF triples and ontologies, ensuring that the data is structured in a way that makes sense both contextually and logically.
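The idea behind the initial splitting step can be sketched in plain Python. This is a simplified stand-in for recursive character splitting, not LangChain's RecursiveCharacterTextSplitter implementation, and it leaves out the OpenAI-based refinement. The function name, the separator order, and the chunk size are assumptions.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split text into chunks of at most chunk_size characters.

    Coarse separators (paragraphs) are tried first; finer ones are used only
    for pieces that are still too long. Separators at chunk boundaries are dropped.
    """
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: split it with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = (
        "Hanks made his breakthrough with leading roles in a series of comedies.\n\n"
        "He won two consecutive Academy Awards for Best Actor."
    )
    for chunk in recursive_split(sample, chunk_size=40):
        print(repr(chunk))
```

In the pipeline described here, chunks like these would then be passed to the language model, which can merge or re-split them so that each chunk covers a single topic.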

Demo

Example Text:

Thomas Jeffrey Hanks (born July 9, 1956) is an American actor and filmmaker. Known for both his comedic and dramatic roles, he is one of the most popular and recognizable film stars worldwide, and is regarded as an American cultural icon. Hanks's films have grossed more than $4.9 billion in North America and more than $9.96 billion worldwide, making him the fourth-highest-grossing actor in North America.

Hanks made his breakthrough with leading roles in a series of comedies: Splash (1984), The Money Pit (1986), Big (1988) and A League of Their Own (1992). He won two consecutive Academy Awards for Best Actor, playing a gay lawyer suffering from AIDS in Philadelphia (1993) and the title character in Forrest Gump (1994). Hanks collaborated with Steven Spielberg on five films: Saving Private Ryan (1998), Catch Me If You Can (2002), The Terminal (2004), Bridge of Spies (2015) and The Post (2017), as well as the World War II miniseries Band of Brothers (2001), The Pacific (2010) and Masters of the Air (2024). He has also frequently collaborated with directors Ron Howard, Nora Ephron, and Robert Zemeckis.

GraphDB:

Class Hierarchy

Class Relationships

Explore the source code on GitHub:

GitHub Repository