Indicate which project you’re part of after each item
Layers of analysed metadata extracted + inferred from patents — KFG, III
Metadata + inferences from scholarship + citation graphs — Scholia, SemScholar
Public identifiers (and a list of multi-identifiers) for entities — Wikidata, GRID
Your sources may welcome corrections and refinements. Imagine a global event stream for these updates: you don’t have to negotiate with each source about whether or how to accept your refinements. You just announce that they happened, and curators can find and apply them later.
Article disambiguation: Article names, authors, titles — from fatcat, MAG;
Patent disambiguation: from III, Lens;
Citation graph refinement — from MAG, SemSchol?;
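A minimal sketch of what one entry in such an event stream could look like, assuming a JSON-lines feed. The record fields, function names, and example identifiers are all hypothetical; none of the sources above define this format.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass
class RefinementEvent:
    """One announced refinement. All field names are illustrative assumptions."""
    source: str       # dataset the refinement applies to, e.g. "fatcat"
    kind: str         # e.g. "article-disambiguation"
    subject_id: str   # record identifier within the source (placeholder format)
    claim: dict       # the proposed correction itself
    asserted_by: str  # project announcing the refinement
    timestamp: str    # ISO 8601, UTC


def make_event(source, kind, subject_id, claim, asserted_by, now=None):
    """Build an event; `now` can be passed in to make output reproducible."""
    now = now or datetime.now(timezone.utc)
    return RefinementEvent(source, kind, subject_id, claim, asserted_by,
                           now.isoformat())


def to_stream_line(event):
    """Serialize an event as one JSON line for an append-only stream."""
    return json.dumps(asdict(event), sort_keys=True)


def pending_for(stream_lines, source):
    """What a curator might do: pull every event that targets their source."""
    return [e for e in map(json.loads, stream_lines) if e["source"] == source]
```

The point of the design is that the announcer only appends; whether and when a refinement gets applied stays entirely with the curator reading the stream.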
Share embeddings and related libraries
Publish preprocessed Wikipedia data: extend
This can be as part of provenance of a collection, or as hooks / spaces for conversation among parties.
Property crosswalks: LD4L + Wikidata — property discussions on Wikidata
Authority file crosswalks: LOC + Wikidata — via LOC ID addition to Wikidata (as of May 20) [but conversations are the exception]
Bot/script evaluation for AKBC: Wikidata/WP bot approval discussions (address false positive issues + ways to resolve them)
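A rough sketch of an authority-file crosswalk, assuming the input is a list of (LOC ID, Wikidata QID) pairs, for example exported from Wikidata's "Library of Congress authority ID" property (P244). The function name and the IDs in the usage note are made up for illustration. Ambiguous mappings are set aside rather than guessed, since those are the false-positive cases that need a conversation.

```python
from collections import defaultdict


def build_crosswalk(pairs):
    """Split (loc_id, qid) pairs into clean 1:1 mappings and conflicts.

    Returns (crosswalk, conflicts):
      crosswalk: loc_id -> qid, only where exactly one QID was seen
      conflicts: loc_id -> sorted list of QIDs, where several were seen
    """
    by_loc = defaultdict(set)
    for loc_id, qid in pairs:
        by_loc[loc_id].add(qid)

    crosswalk = {}
    conflicts = {}
    for loc_id, qids in by_loc.items():
        if len(qids) == 1:
            crosswalk[loc_id] = next(iter(qids))
        else:
            # Leave these for human review instead of picking one.
            conflicts[loc_id] = sorted(qids)
    return crosswalk, conflicts
```

For instance, `build_crosswalk([("loc-a", "Q-1"), ("loc-b", "Q-2"), ("loc-b", "Q-3")])` would map `loc-a` cleanly and report `loc-b` as a conflict with two candidate QIDs.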
And what's needed to integrate these into existing popular workflows or world models?
Tool templates for high-volume human curation: Toolforge (Wikidata)
Embedding services: (OKE/Data Chefs)
model embedding with learned vocabulary,
compute embedding over a dataset, or over the entire OKN
publish + store embedding : an archival store, a reference and index, so it may be found and consumed
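The three steps above can be sketched end to end, under heavy assumptions. A word-count vector over a vocabulary learned from the data stands in for a real learned embedding model, and two plain dicts stand in for the archival store and the index. All function names here are hypothetical.

```python
import hashlib
import json
from collections import Counter


def learn_vocabulary(docs, max_size=1000):
    """Step 1: learn a vocabulary from the dataset (most frequent tokens first)."""
    counts = Counter(tok for text in docs.values() for tok in text.lower().split())
    # Sort by descending frequency, then alphabetically, so output is deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tok for tok, _ in ordered[:max_size]]


def embed(text, vocab):
    """Step 2: embed one document as token counts over the vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[tok] for tok in vocab]


def publish(docs, store, index, name):
    """Step 3: store the embedding under its content hash and index it by name."""
    vocab = learn_vocabulary(docs)
    payload = {
        "vocab": vocab,
        "vectors": {doc_id: embed(text, vocab) for doc_id, text in docs.items()},
    }
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    key = hashlib.sha256(blob).hexdigest()  # stable reference to this exact release
    store[key] = blob                       # archival store: immutable, hash-addressed
    index[name] = key                       # index: human-readable name -> reference
    return key
```

Addressing each release by content hash means a consumer can cite exactly which embedding they used, while the index entry for a name can move on to newer releases.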
An OKN search index? web crawling + indexing infra.
Include Common Crawl or better. It is currently very unwieldy.
Current monthly dumps are ~ ok.
Connect wiki draft spaces (e.g. for autogenerated articles!)
Consider a federated space for articles!
Build a fair-use + NC-friendly draft space for GLAMs + research on Common Crawl, etc.?
+ Write up OKN improvements from different perspectives
+ Write to ML enthusiasts from the conference to get feedback on which formats would make their work easier
++ Getting more data into UCI? (via N.Monath)
+ Think about how to get these data slices and services into popular tools
List challenges: it’s fine to use shared identifiers. Once you start talking about similar things, you’ll have arguments over standard representations…