Workshop on federating knowledge base construction
Workshop and submission details (from the conference site)
Please answer these questions while listening to talks:
What original knowledge do you want others to use?
What refinements can you contribute upstream?
What conversations about models and approaches can you make visible?
What tools are you making that can augment human curation?
This is an editable document; feel free to add notes below
The OKN (Open Knowledge Network) framework and concept were launched this year in collaboration w/ NSF; we hope to form collaborations here among groups that want to contribute to such a network.
Example OKN components: Datalift, 2010-14, mandated by France
Francois mentioned the KG workshop at Columbia; Juan's trip report on that is up at: http://www.juansequeda.com/blog/2019/05/11/2019-knowledge-graph-conference-trip-report/
More on Knowledge Graph Embeddings: https://mnick.github.io/project/knowledge-graph-embeddings/
Looking for collaborators on graph embedding this summer; has some model problems to share that would make good next steps. McCusker and McGuinness at RPI are particularly interested in those problems (a PhD student in my group, especially).
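A minimal sketch of the translational (TransE-style) idea behind many KG embedding models, for anyone new to the topic; the entities, relation, and vectors below are made up, not taken from any model mentioned here:

```python
import numpy as np

# TransE-style intuition: a triple (head, relation, tail) is plausible when
# head + relation lands close to tail in the embedding space. These vectors
# are random stand-ins; real models learn them from a training KG.
rng = np.random.default_rng(0)
dim = 50
entities = {"drug_a": rng.normal(size=dim), "disease_b": rng.normal(size=dim)}
relations = {"treats": rng.normal(size=dim)}

def transe_distance(head, relation, tail):
    """Lower distance = more plausible triple under a TransE-style model."""
    return np.linalg.norm(entities[head] + relations[relation] - entities[tail])

print(transe_distance("drug_a", "treats", "disease_b"))
```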
Focused on software attribution: we want to credit all sorts of work, and software needs this keenly; you also need to know your software tools well to understand and reproduce data. (cf. Whole Tale, etc.)
Say you find a GitHub repo and a Docker image. That's not enough; I need other details for the transformations. Data don't map directly into the software and need to be massaged in.
Consider harder reuse cases: dependencies, sample runs, invocations, folder context, defaults, volumes, logins. Capture the semantic structure of the invocation!
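A rough sketch of what "capturing the semantic structure of an invocation" could look like as a structured record; every field name and value here is illustrative, not a term from OntoSoft, DockerPedia, or any published schema:

```python
# Illustrative invocation descriptor; all names and values are placeholders.
invocation = {
    "software": "example-org/hydro-model",            # hypothetical repo
    "docker_image": "example-org/hydro-model:1.2",    # hypothetical image tag
    "command": ["python", "run_model.py", "--config", "config.yml"],
    "dependencies": ["numpy", "netCDF4"],
    "inputs": [{"path": "data/rainfall.csv", "format": "CSV", "units": "mm/day"}],
    "outputs": [{"path": "out/runoff.nc", "format": "NetCDF"}],
    "defaults": {"timestep": "1h"},
    "volumes": {"./data": "/workspace/data"},         # folder context a repo alone omits
    "requires_login": False,
}
```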
We started w/ the Model Metadata Registry (OntoSoft [pdf, website])
Can be extended to schema.org, data cubes, NASA QUDT, DockerPedia, SciVar Ontology.
Work done so far: making a KG for software via this registry; adding automatic unit transforms and software image descriptions; requirements (??); automated software composition, i.e., finding out what software is available (with a possible data transformation).
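One way to picture the automatic unit transforms step. This sketch leans on the pint library purely as a stand-in; the registry work above references QUDT and the SciVar ontology rather than pint:

```python
import pint  # stand-in converter for the sketch; not what the registry itself uses

ureg = pint.UnitRegistry()

def convert(value, from_unit, to_unit):
    """Convert a dataset value into the units a target model expects."""
    return (value * ureg(from_unit)).to(to_unit).magnitude

# e.g. a dataset reports precipitation in mm/day but the model wants m/s
print(convert(12.0, "millimeter / day", "meter / second"))
```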
If we generate new metadata, we can push updates to Wikidata (WD) and other specialized metadata repositories. How far are we from OKN?
> Get licenses right! resources, metadata.
> Automate generation (now mostly manual): build the botnets (see the Wikidata sketch after this list)
> Decentralized creation
> Enrich prov traces
> Better interfaces for lay users? Faceted search?
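A very small sketch of what "push updates to Wikidata" can look like with pywikibot (P31 = "instance of", Q5 = "human", Q42 used purely as an example item). A real bot would also need a bot account, community approval, rate limiting, and the licensing checks above:

```python
import pywikibot

# Connect to Wikidata and load an example item (Q42 = Douglas Adams).
site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()
item = pywikibot.ItemPage(repo, "Q42")

# Push one generated statement: "instance of" (P31) -> "human" (Q5).
claim = pywikibot.Claim(repo, "P31")
claim.setTarget(pywikibot.ItemPage(repo, "Q5"))
item.addClaim(claim, summary="bot: adding generated statement (example only)")
```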
Avoid a future recession or depression.
Defragment corporate knowledge bases for research and policy analysis.
Ex: how do patent filings impact long-term profitability?
Which companies are most likely to cause the next financial crisis?
Looking at patents: entity recognition, resolution, and linking, using Dedupe.io.
We're looking for ways to capture sequences over time to help embed/classify items that change (companies, entities, terms). We see a benefit from getting a few expert signals vs. lots of weak ones; combine with statistical (and weak?) models.
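A toy illustration of the matching step (not the Dedupe.io API): one strong "expert" signal, here a made-up shared SEC CIK identifier, combined with a weak string-similarity signal; the weights and records are arbitrary:

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Weak signal: fuzzy similarity between company-name strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Combine one expert signal with one weak signal (weights are arbitrary)."""
    strong = 1.0 if rec_a.get("cik") and rec_a.get("cik") == rec_b.get("cik") else 0.0
    weak = name_similarity(rec_a["name"], rec_b["name"])
    return 0.7 * strong + 0.3 * weak

a = {"name": "Acme Corp.", "cik": "0000123456"}        # hypothetical records
b = {"name": "ACME Corporation", "cik": "0000123456"}
print(match_score(a, b))
```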
OKN proposal: workshops for regulators+policymakers; tech contributors; corporate integration!
USPTO suggested that the sale of patents is relevant. Corporations don't have to tell you when they sell patents. How can we guess at this?
Beyond patents: could also look into R&D data, visa applications. [David C? has visa data]
People mined patent literature for emerging tech: FUSE (via a gov grant?). Worth following up. [Andrew also worked on this!]
bit.ly/akbc-foodkg (An RPI project using WhyIs)
Brings together Recipes, nutrition, food taxonomies
Links to existing ontologies
Straightforward to use, modular and reusable
Provenance of facts in FoodKG
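A made-up example of the kind of linked statement the lines above describe (recipe -> ingredient -> taxonomy/nutrition). The namespace, property names, and calorie value are placeholders, not FoodKG's actual vocabulary:

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/foodkg/")  # placeholder namespace
g = Graph()

recipe = EX["recipe/pancakes"]
ingredient = EX["ingredient/wheat-flour"]

g.add((recipe, RDF.type, EX.Recipe))
g.add((recipe, EX.hasIngredient, ingredient))                        # recipe -> ingredient
g.add((ingredient, EX.mappedToTaxonomy, EX["taxonomy/flour-term"]))  # hypothetical taxonomy link
g.add((ingredient, EX.caloriesPer100g, Literal(364)))                # illustrative nutrition fact

print(g.serialize(format="turtle"))
```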
Licensing is a bottleneck; in many cases it's expensive (for a collaboration).
But interested in research collabs — check in w/ Chef App! and yummy.com and more. What sources are open? [dbpedia was too noisy]
-> check out https://github.com/DHLab-nl/historical-recipe-web
Freebase -> GKG (Google Knowledge Graph): 500M entities + 3B facts ('12) -> 1B entities / 70B facts ('19)
"How do we make computers as predictably inconsistent as people?” —> How long is Harry Potter? The book or movie? What units?
How do we interpret facets of a single “entity”?
How can we get better knowledge out of what is essentially a fact-graph?
Q: Don’t you find meaningful consensus? e.g. in Wikidata?
Q: Does the community propose schemas often? A: yes, it’s no longer just participant companies.
A: schema.org is not particularly good at specialized schemas; this is the tension to resolve (communities often have well-defined, specialized needs).
4 Challenges:
Informally defined relations between classes and instances
Multiple textual representations
Different semantics for the same textual relationship
Ambiguity / Entity Resolution
Collective alignment using data and PSL (Probabilistic Soft Logic). PSL gives a probability distribution over the set of all possible alignments.
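For reference, the distribution PSL defines is a hinge-loss Markov random field, roughly of this form (r ranges over the ground rules, w_r are rule weights, ell_r is the rule's relaxed violation, p_r is 1 or 2):

```latex
P(\mathbf{Y}\mid\mathbf{X}) = \frac{1}{Z(\mathbf{X})}\exp\Big(-\sum_{r} w_r\,\phi_r(\mathbf{Y},\mathbf{X})\Big),
\qquad
\phi_r(\mathbf{Y},\mathbf{X}) = \big(\max\{0,\ \ell_r(\mathbf{Y},\mathbf{X})\}\big)^{p_r}
```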
(full author list: James P. McCusker, Sabbir M. Rashid, Nkechinyere Agu, Neha Keshan, Deborah L. McGuinness)
WhyIs - http://tetherless-world.github.io/whyis/
Importance of Provenance in KG
Deduction: an inference engine using SPARQL WHERE & CONSTRUCT (see the sketch below)
bit.ly/get-whis
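A minimal, generic sketch of the "deduction via SPARQL WHERE + CONSTRUCT" idea, written with rdflib rather than WhyIs's own machinery; the URIs and the rule itself are made up:

```python
from rdflib import Graph

g = Graph()
# Toy data: a drug targets a protein; the protein is implicated in a disease.
g.parse(data="""
@prefix ex: <http://example.org/> .
ex:drugA ex:targets ex:proteinX .
ex:proteinX ex:implicatedIn ex:melanoma .
""", format="turtle")

# A deduction rule as a CONSTRUCT query: the WHERE clause is the premise,
# the CONSTRUCT template is the inferred conclusion.
inferred = g.query("""
PREFIX ex: <http://example.org/>
CONSTRUCT { ?drug ex:candidateTreatmentFor ?disease }
WHERE {
  ?drug ex:targets ?protein .
  ?protein ex:implicatedIn ?disease .
}
""")

for triple in inferred:
    g.add(triple)  # materialize the inferred triples back into the graph

print(g.serialize(format="turtle"))
```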
Systems biology graph with bio interactions: we can infer a probability that an interaction occurs. That let us find 26 potential treatments for melanoma that had been discarded because they weren't working in existing patients (but were specific to certain mutations). https://peerj.com/articles/cs-106/
Also computing trust scores on network data for similar content.
Q (provenance): how do we capture the time of access? What does that look like?
A: a creation timestamp for each nanopub helps.
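One way a nanopub-style provenance graph can carry that timestamp, sketched with rdflib; the URIs and the date are placeholders:

```python
from rdflib import ConjunctiveGraph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

NP = Namespace("http://example.org/nanopub/")      # placeholder nanopub URIs
PROV = Namespace("http://www.w3.org/ns/prov#")

g = ConjunctiveGraph()
prov_graph = g.get_context(NP["np1/provenance"])   # named graph for provenance

# Record when the assertion was generated and where it came from.
prov_graph.add((NP["np1/assertion"], PROV.generatedAtTime,
                Literal("2019-05-16T10:30:00Z", datatype=XSD.dateTime)))
prov_graph.add((NP["np1/assertion"], PROV.wasDerivedFrom,
                URIRef("http://example.org/source-dataset")))

print(g.serialize(format="trig"))
```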
Separate underlays from overlays:
Capture observations & Assertions
Define shared services:
Disambiguation, resolution
Event feeds, decentralized archiving
Standard cloud services
Please share/link summaries of your own notes here!