Semiautomated process for generating knowledge graphs for marginalized community doctoral-recipients

DOI: https://doi.org/10.1108/IJWIS-02-2022-0046
Published date: 13 October 2022
Pages: 413-431
Subject Matter: Information & knowledge management, Information & communications technology, Information systems, Library & information science, Information behaviour & retrieval, Metadata, Internet
Author: Neha Keshan, Kathleen Fontaine, James A. Hendler
Neha Keshan, Kathleen Fontaine and James A. Hendler
Department of Computer Science, Tetherless World Constellation,
Rensselaer Polytechnic Institute, Troy, NY
Abstract
Purpose: This paper aims to describe InDO, the Institute Demographic Ontology, and to demonstrate the
InDO-based semiautomated process for both generating and extending a knowledge graph that provides a
comprehensive resource for marginalized US graduate students. The knowledge graph currently consists of
instances related to the semistructured data tables of the National Science Foundation Survey of Earned
Doctorates (NSF SED) 2019 analysis report. These tables contain summary statistics of an institute's doctoral
recipients based on a variety of demographics. Incorporating institute Wikidata links ultimately produces a
table of unique, clearly readable data.
Design/methodology/approach: The authors use a customized semantic extract transform and loader
(SETLr) script to ingest data from 2019 US doctoral-granting institute tables and preprocessed NSF SED
Tables 1, 3, 4 and 9. The generated InDO knowledge graph is evaluated using two methods. First, the authors
compare the SPARQL results of the competency questions from both the semiautomatically and manually
generated graphs. Second, the authors expand the questions to provide a better picture of an institute's
doctoral-recipient demographics within study fields.
Findings: With some preprocessing and restructuring of the highly interlinked NSF SED tables
into a more parsable format, one can build the required knowledge graph using a semiautomated
process.
Originality/value: The InDO knowledge graph allows the integration of US doctoral-granting institutes'
demographic data based on NSF SED data tables and its presentation in machine-readable form using a new
semiautomated methodology.
Keywords: Semiautomation process, Knowledge graphs, Institute demographics,
Graduate mobility, NSF doctoral recipients survey data
Paper type: Research paper
© Neha Keshan, Kathleen Fontaine and James A. Hendler. Published by Emerald Publishing Limited.
This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may
reproduce, distribute, translate and create derivative works of this article (for both commercial and
non-commercial purposes), subject to full attribution to the original publication and authors. The full
terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcode
This work is part of the vision of Building a Social Machine for Graduate Mobility. We would
like to thank Dean Stanley Dunn, Dean of Graduate Education, who provided expert insights into this
issue. We would like to thank the members of the Tetherless World Constellation Lab at Rensselaer
Polytechnic Institute, especially John S. Erickson and Jamie P. McCusker, who provided insights and
expertise that greatly assisted this research. This work was funded in part by the RPI-IBM AI
Research Collaboration, a member of the IBM AI Horizons network.
Received 28 February 2022
Revised 1 June 2022
Accepted 5 July 2022
International Journal of Web Information Systems
Vol. 18 No. 5/6, 2022
pp. 413-431
Emerald Publishing Limited
1744-0084
DOI 10.1108/IJWIS-02-2022-0046
The current issue and full text archive of this journal is available on Emerald Insight at:
https://www.emerald.com/insight/1744-0084.htm
1. Introduction
Newly minted doctoral students face the long-standing problem of "what comes next?".
This transition from being a student toward their chosen career path is referred to as
graduate mobility (Keshan, 2021). The preparation for graduate mobility does not start
when one is approaching graduation but rather much earlier, perhaps even as early as
the time of program selection. To help students have a smooth transition from graduate
school to their career, it is important for them to have an adequate amount of
information for doctoral graduate school selection. The information should include the
demographics of past doctoral recipients and the career paths they chose. Students can
use this information along with the general program ranking to make an informed
decision about which graduate program to join. Therefore, the question of "what comes
next?" is connected to the question of "where is the best for me?" during doctoral
program selection (Keshan et al., 2021). In general, doctoral programs are challenging
for all students but can be especially challenging for students from marginalized
communities: groups of students traditionally under-represented based on ethnicity,
race, language, gender identity, age, physical ability and/or immigration status (Gay,
2004; Sevelius et al., 2020). It has been shown that marginalized students have to go the
extra mile to prove their worth.
Previous work (Keshan et al., 2021) proposed an Institute Demographic Ontology
(InDO) designed to help with this problem. The ontology was mainly generated
manually using a traditional methodology (Kendall and McGuinness, 2019). This paper
builds on that work by describing a new, semiautomated process for generating an
Institute Demographic knowledge graph, based on the InDO ontology, to integrate the
statistical data from the various NSF SED survey results (Foley, 2021). Notably, the
National Science Foundation (NSF) recently (December 2021) launched the Survey of Earned
Doctorates Restricted Data Analysis System (SED RDAS), which allows users to
create their own tables for SED data from 2017 to 2020. In this restrictive model,
security protocols in the NSF system do not allow the user to acquire institute-specific
demographics with respect to the year. However, the institute-specific data is available
through the NSF website as part of their SED analysis results across multiple tables for
the years 1958 to 2020. These tables (Figure 1) can be integrated with one another using
semantic techniques to make the statistical data more machine-readable and, therefore,
more accessible, providing a more comprehensive picture of any US doctorate-granting
institute's demographics. Because the system integrates only the institute data already
available in the published results tables, it does so without compromising
student privacy.
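As a rough illustration of what integrating such tables can look like, the following minimal Python sketch turns rows from two toy tables into subject-predicate-object triples that share one identifier per institute. It is not the authors' SETLr pipeline: the namespace, the column names, the institute names and all counts are invented placeholders, and plain tuples stand in for an RDF library.

```python
from collections import defaultdict

# Hypothetical namespace; the real InDO IRIs are not given in this section.
EX = "http://example.org/indo/"

# Toy rows standing in for two separately published tables.
# Institute names, column names and counts are invented, not NSF figures.
table_a = [
    {"institute": "Example University", "male": 120, "female": 95},
    {"institute": "Sample Institute of Technology", "male": 60, "female": 40},
]
table_b = [
    {"institute": "Example University", "temporary_visa_holders": 30},
    {"institute": "Sample Institute of Technology", "temporary_visa_holders": 25},
]


def institute_iri(name):
    """Mint one IRI per institute so rows from different tables share a subject."""
    return EX + name.strip().replace(" ", "_")


def rows_to_triples(rows, key="institute"):
    """Turn every non-key column of every row into a (subject, predicate, object) triple."""
    triples = set()
    for row in rows:
        subject = institute_iri(row[key])
        for column, value in row.items():
            if column != key:
                triples.add((subject, EX + column, value))
    return triples


def merge_tables(*tables):
    """Union the triples of all tables into one graph (a set of triples)."""
    graph = set()
    for rows in tables:
        graph |= rows_to_triples(rows)
    return graph


def describe(graph, name):
    """Collect every predicate/value pair the merged graph holds for one institute."""
    subject = institute_iri(name)
    facts = defaultdict(list)
    for s, p, o in graph:
        if s == subject:
            facts[p[len(EX):]].append(o)
    return dict(facts)


graph = merge_tables(table_a, table_b)
print(describe(graph, "Example University"))
```

Because both tables resolve an institute name to the same identifier, a single lookup returns facts that were originally spread across separate tables. A real pipeline would additionally need to reconcile spelling variants of institute names before minting identifiers.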
In this paper, we describe a semiautomated linked-data representation of the NSF SED
statistical data, the knowledge representation of this statistical, demographic data and the
usefulness of linking it with Wikidata [1]. Wikidata is a free and open knowledge base that
can be processed by both humans and machines. The content of Wikidata, available under a
free Creative Commons license, is interlinkable to other open data sets on the linked data
Web. Our current InDO-based semiautomatically generated knowledge graph includes data
points from Tables 1, 3, 4 and 9 of the published NSF SED 2019 analysis results. One
hundred and ninety-four of the 448 doctoral-granting US institutes have their respective
Wikidata nodes added to allow users to access our resources in conjunction with other
linked data already available on the Web. Finally, as part of the evaluation, we compared
Blazegraph workbench results obtained from the semiautomatically generated knowledge
graph and the manually generated knowledge graph. We also added new competency
questions to provide a better picture of an institute's demographics based on broad study