ZPI Day

BDDM

Bibliographic Data Disambiguation Module

Members:   Yigit, Martin and
Project mentor:   Krystian Wojtkiewicz

Project Objectives

Business Goals:

  • Improve Data Quality and Integrity: The main goal is to create a more accurate and reliable bibliographic database. By correctly disambiguating authors and publications, the system will ensure trustworthy data, the foundational asset of any academic information system.
  • Reduce Operational Costs: The project aims to significantly decrease the man-hours required for manual data curation. By automating the routine cases of data linking and flagging only the truly ambiguous ones, the system will reduce costs associated with staff time and increase overall productivity.
  • Develop an Intelligent Disambiguation Technology: A core technological objective is to build a module that goes beyond simple string matching. The system will leverage external APIs and a weighted scoring algorithm to create a robust, evidence-based process for entity resolution.

Tasks:

To achieve these goals, the following specific outcomes will be delivered, with their success verified through clear metrics and testing.

Outcomes:

  • Development of a Data Disambiguation Algorithm: The core product will be a backend algorithm capable of comparing an imported publication with existing database entries and calculating a "match probability" score based on multiple features (author names, title, co-authors, venue, year).
  • Integration of External APIs: The algorithm will be integrated with key academic databases, specifically ORCID, DOI, and Crossref, to fetch supplementary data and strengthen the confidence of its matching decisions.
  • Creation of an Administrator Review Dashboard: A simple web application will be developed to serve as the "human-in-the-loop" interface. This dashboard will display potential matches that fall into an ambiguous score range, allowing an administrator to manually confirm or reject the proposed link.
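The "match probability" scoring described above can be sketched as a weighted sum of per-feature similarities. The feature set, weights, and similarity functions below are illustrative assumptions, not the final design; the real module would tune the weights empirically.

```python
from difflib import SequenceMatcher

# Illustrative feature weights (assumption); they must sum to 1.0.
WEIGHTS = {"author": 0.35, "title": 0.35, "coauthors": 0.15, "venue": 0.10, "year": 0.05}

def text_sim(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_probability(imported: dict, existing: dict) -> float:
    """Weighted sum of per-feature similarities, in [0, 1]."""
    co_a, co_b = set(imported["coauthors"]), set(existing["coauthors"])
    scores = {
        "author": text_sim(imported["author"], existing["author"]),
        "title": text_sim(imported["title"], existing["title"]),
        # Jaccard overlap of co-author sets.
        "coauthors": len(co_a & co_b) / max(len(co_a | co_b), 1),
        "venue": text_sim(imported["venue"], existing["venue"]),
        "year": 1.0 if imported["year"] == existing["year"] else 0.0,
    }
    return sum(WEIGHTS[f] * scores[f] for f in WEIGHTS)
```

An identical pair scores 1.0 and a fully dissimilar pair approaches 0.0, which maps directly onto the acceptance thresholds defined under Metrics and Verification.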

Metrics and Verification:

Key Performance Indicators (KPIs): The success of the algorithm will be primarily measured by:

  • Precision: What percentage of the connections made automatically by the system are correct? (Target: >95%)
  • Recall: What percentage of all possible correct connections in a test set did the system successfully identify? (Target: >90%)
  • Reduction in Manual Effort: The percentage decrease in publications requiring manual review compared to the current process.
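Precision and recall can be computed directly by comparing the system's predicted links against the curated ground-truth set. Representing links as (publication, author) pairs is an illustrative assumption:

```python
def precision_recall(predicted: set, ground_truth: set) -> tuple:
    """Precision and recall of predicted links against a curated ground truth."""
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

For example, if the system proposes three links of which two appear in a three-link ground truth, both precision and recall are 2/3, below the stated targets of 95% and 90%.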

Verification Method:

  • Test Plan: The system will be validated against a pre-prepared, manually-curated "ground truth" dataset containing known matches and non-matches. The test plan will include unit tests for the scoring functions, integration tests for the API connections, and User Acceptance Testing (UAT) for the admin dashboard.
  • Acceptance Criteria: The project will be considered successful when the system can autonomously merge high-confidence matches (e.g., score > 0.85), correctly flag mid-confidence matches (e.g., score between 0.6 and 0.85) for review, and the administrator dashboard functions as specified for manual resolution.
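The acceptance criteria above imply a simple three-way triage of each scored candidate pair. A minimal sketch, using the thresholds stated in the criteria:

```python
AUTO_MERGE_THRESHOLD = 0.85  # score above this: merge autonomously
REVIEW_THRESHOLD = 0.60      # score in [0.60, 0.85]: flag for the admin dashboard

def triage(score: float) -> str:
    """Route a candidate pair based on its match-probability score."""
    if score > AUTO_MERGE_THRESHOLD:
        return "auto-merge"
    if score >= REVIEW_THRESHOLD:
        return "flag-for-review"
    return "reject"
```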

Description

Context:

In the academic and research field, institutions rely on databases to manage, track, and showcase their scholarly output. The current process often involves importing publication data from various external sources. However, this data frequently lacks a standardized format, leading to significant inconsistencies, especially in how author names, affiliations, and publication titles are recorded.

User problems:

For database administrators and librarians, the task of manually verifying and linking each new publication is extremely slow and error-prone. An admin might have to decide if "K. Wojtkiewicz" is the same person as an existing "Krystian Wojtkiewicz," or determine if a paper is a new entry or just a republished version of an existing one. This manual process is not scalable and is costly in terms of staff time, leading to backlogs and inconsistent data.

Limitations of existing solutions:

Many current systems rely on simple, exact-match logic, which is insufficient to handle real-world variations. They fail to recognize legitimate connections between entries like "J. Smith" and "Jane Smith" or a pre-print and its final journal version. Existing solutions often lack the intelligence to query external, authoritative sources (like ORCID, Crossref, or DBLP) to gather more evidence for a potential match, representing a critical lack of automation in the data validation process.
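Going beyond exact matching starts with a variant-tolerant comparison, for example reducing each name to a (surname, first initial) key so that "J. Smith" and "Jane Smith" align. This is only a sketch of a candidate signal, not a confirmed-identity test; the key format is an illustrative assumption:

```python
def name_key(name: str) -> tuple:
    """Reduce 'K. Wojtkiewicz' or 'Krystian Wojtkiewicz' to ('wojtkiewicz', 'k')."""
    parts = name.replace(".", "").split()
    return (parts[-1].lower(), parts[0][0].lower()) if parts else ("", "")

def could_be_same(a: str, b: str) -> bool:
    """True if two name variants share a surname and first initial."""
    return name_key(a) == name_key(b)
```

A key collision alone is not evidence of identity (many people share a surname and initial), which is why the engine gathers further evidence before linking.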

Artifacts

Final products:

  • Bibliographic Disambiguation Engine: A machine learning (ML) weighted-scoring algorithm designed to parse a database and resolve author ambiguity.
  • Performance Showcase Web Application: A public-facing web app that visualizes and demonstrates the performance and accuracy of the core disambiguation engine.

Supporting tools:

  • Admin Dashboard: A web-based interface allowing administrators to monitor the disambiguation process, review system status, and provide manual input or corrections as needed.
  • Populated Bibliographic Database: The core dataset, compatible with DBLP and ORCID schemas, populated via custom data scraping.
  • Data Scraping and Integration Modules: Scripts and modules developed to extract and ingest data from DBLP and ORCID into the local database.
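As one concrete integration point, the Crossref REST API serves publication metadata for a DOI at `https://api.crossref.org/works/{doi}`. The snippet below sketches how an ingestion module might extract the fields the matcher needs from such a response; the sample payload is abridged and illustrative.

```python
def parse_crossref_work(message: dict) -> dict:
    """Extract matcher-relevant fields from a Crossref 'message' object."""
    return {
        "title": (message.get("title") or [""])[0],
        "authors": [f"{a.get('given', '')} {a.get('family', '')}".strip()
                    for a in message.get("author", [])],
        "venue": (message.get("container-title") or [""])[0],
        "doi": message.get("DOI", ""),
    }

# Abridged example of the response shape returned under the "message" key
# by GET https://api.crossref.org/works/{doi}:
sample = {
    "DOI": "10.1000/example",
    "title": ["An Example Article"],
    "container-title": ["Journal of Examples"],
    "author": [{"given": "Jane", "family": "Smith"}],
}
```

The same parsed shape can then feed the scoring algorithm, regardless of whether the record arrived from Crossref, DBLP, or ORCID.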

Characteristics:

  • Functionality: The core ML engine disambiguates authors using a three-stage pipeline. First, it efficiently filters potential matches using a "blocking key" (e.g., last name and initial) to create a small candidate list. Second, it constructs detailed profiles (pulling publications, affiliations, etc.) from the database for each candidate pair. Finally, a custom Dynamic-Weight Algorithm analyzes and scores the similarity between these profiles to determine and link a match. This engine is supported by a public web app (for performance demonstration) and an admin dashboard (for monitoring and manual input).
  • Target Deployment Environment: A client-server architecture with a database server, a backend application server (running the ML engine), and a web server (serving the frontend and admin panel).
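The first pipeline stage above, blocking, can be sketched as follows. The key format (surname plus first initial) and the record shape are illustrative assumptions; the point is that only records sharing a blocking key ever reach the expensive profile-comparison stage.

```python
from collections import defaultdict

def blocking_key(author_name: str) -> str:
    """'Krystian Wojtkiewicz' -> 'wojtkiewicz_k' (surname + first initial)."""
    parts = author_name.replace(".", "").split()
    return f"{parts[-1].lower()}_{parts[0][0].lower()}" if parts else ""

def candidate_blocks(records: list) -> dict:
    """Group records by blocking key; only blocks with 2+ records need scoring."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec["author"])].append(rec)
    return {key: recs for key, recs in blocks.items() if len(recs) > 1}
```

This reduces the comparison space from all pairs in the database to pairs within each small block, which is what makes the profile-building and scoring stages tractable.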

Beneficiaries

End Users:

The primary end users are researchers, academics, and students.

  • Improved Accuracy of Scholarly Records: Their publication lists within the institution's system will be more accurate and comprehensive. This is crucial for generating CVs, applying for grants, and undergoing performance reviews, as it ensures their entire body of work is correctly attributed to them without manual correction.
  • Enhanced Discovery: A clean database makes it easier for others to find their work, potentially leading to increased citations and collaboration opportunities.

Internal Teams:

Database administrators, librarians, and IT departments will experience direct operational improvements.

  • Time and Cost Savings: The automation of the disambiguation process will drastically reduce the hours of manual, repetitive work required to clean and link incoming data. This frees up staff for more high-value tasks and reduces operational costs.
  • Reduced Risk and Increased Data Integrity: The system minimizes human error in data entry, leading to a more reliable and trustworthy database. This is critical for internal reporting, strategic planning, and university analytics.

External Organizations:

Partner institutions, funding agencies, and university ranking bodies will also benefit.

  • Reliable Data for Evaluation: These organizations rely on accurate publication data to assess institutional and individual research output. A cleaner database provides them with more trustworthy information for evaluations, funding decisions, and compiling international rankings.
  • New Integration Opportunities: A well-structured and accurate database is easier to integrate with other systems, such as national research portals or collaborative platforms, fostering greater interoperability within the academic ecosystem.

Communities:

The broader academic and research community stands to gain from the project's impact.

  • Improved Scholarly Metrics: By contributing to a cleaner data landscape, the project helps improve the accuracy of large-scale citation analysis and scientometrics. This ensures that the measurement of research impact is based on more reliable data.
  • Fostering Knowledge Discovery: Accurate author and publication links strengthen the "scholarly graph," making it easier to identify experts in a field, discover research trends, and understand the collaborative networks that drive scientific progress.

Tech Stack

Jira, Python, React, TensorFlow, PyTorch, JavaScript, HTML5, CSS3, PostgreSQL, NumPy, Adobe Photoshop, Figma, Canva, D3.js, Pandas, Git, GitHub, IntelliJ IDEA, Swagger, Postman
Roadmap
Repositories