More cutting edge — Research tools for researchers — Feb 2020

Aaron Tay
Mar 8, 2020


It’s a new year, and here are more research tools to keep an eye on. Most of the items on the list are either brand new or have interesting new features to look out for.

Prior versions — July 2019, Dec 2018, July 2018

1. Semantic Scholar is enhanced with Microsoft Academic Graph data and now covers more than just Computer Science and Life Sciences. JISC’s CORE has done the same with MAG, Crossref and PubMed.

Semantic Scholar isn’t new, but until recently it only covered limited domains. With the recent tie-up with Microsoft Academic, it has increased its coverage from 40 million records to 175 million records, covering almost every domain.

Why consider Semantic Scholar as your search tool

Semantic Scholar, as the name denotes, tries to use machine learning to derive semantic meaning from the full text of articles.

This might sound like marketing talk, but it does have a few nifty features.

Firstly, it tries to determine the sentiment of citations. Sure, a paper might have 200 citations, but people cite things for different reasons.

Scite, which we will mention later, does this by classifying each cite as a “mentioning”, “supporting” or “contradicting” citation.

Edit: Bianca Kramer notes that scite also indicates the paper section (Intro, Methods, Results, Discussion, Other Sections) the citation comes from.

Scite allows filtering citations by the paper sections where the citation comes from.

While interesting, it is unclear how accurate such automated classifications are, and Semantic Scholar tries a different tack.

It instead classifies citations by whether they cite results, cite methods or cite background. This might be useful if a paper has many cites but you want to look only at the ones that cite the paper for reasons beyond background.

My understanding is that, unlike Scite, which allows filtering of citations literally by the paper section they come from, Semantic Scholar’s classification into results, methods or background is based on semantics, though I wouldn’t be surprised if there is some correlation with the section of the paper the citations appear in.

Semantic Scholar shows citations by reason for citing

The algorithm in Semantic Scholar can also identify influential cites. For technical details, refer to the fascinating paper “Identifying Meaningful Citations”, where they claim to be able to do so with a high recall of 0.9 but moderate precision of 0.65, by taking into account various factors such as the section of the paper where the cite appears and the number of times it appears.
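If you want to poke at these classifications yourself, Semantic Scholar exposes them through its public API. Below is a minimal Python sketch, assuming the /v1/paper/{DOI} endpoint and the isInfluential and intent fields as I understand them at the time of writing (the DOI is an arbitrary example); check the API documentation before relying on it.

```python
import requests

# Look up a paper by DOI on the public Semantic Scholar API and keep only
# the citations flagged as influential. Field names are assumptions based
# on the API as documented at the time of writing.
DOI = "10.1038/nature14539"  # arbitrary example DOI

resp = requests.get(f"https://api.semanticscholar.org/v1/paper/{DOI}")
resp.raise_for_status()
paper = resp.json()

citations = paper.get("citations", [])
influential = [c for c in citations if c.get("isInfluential")]
print(f"{len(influential)} influential cites out of {len(citations)}")

for cite in influential[:5]:
    # 'intent' holds the reason-for-citing labels (e.g. methodology, background)
    print(cite.get("title"), cite.get("intent"))
```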

Semantic Scholar also encourages you to create your own profile with your papers, which enables fun visualizations that try to tell you which authors have influenced you the most and which authors you influence the most, based on the influential citations made.

Profile showing authors that influenced you and authors you influenced the most

It is important to note that with the incorporation of Microsoft Academic Graph into Semantic Scholar, Semantic Scholar now has two systems of auto-generated subject tags. Not only are these subject tags auto-generated, they are arranged in a hierarchy.

Topics auto-generated by Semantic Scholar

In the case of the tags generated by Semantic Scholar, which appear at the article level, you can even browse to broader or narrower subjects.

On the other hand, the Microsoft Academic tags in Semantic Scholar appear to be used only at the top level for filtering to the major domains.

One level filtering using Microsoft Academic tags

Other nice features in Semantic Scholar include filters to detect review articles and meta-analyses (auto-generated?).

Semantic Scholar filters for review and meta-analysis

Another nice feature is how it links publication types like journal articles and conference papers to “supplementary material” like blogs, news, slides and videos. So you may not have access to the full text of an article, but you may be able to get a sense of the work from the supplementary content around it, e.g. slides. Or see clinical trials and news alongside the publication of results.

Blog posts linked to publication

More lengthy reading: Unpaywall Journals, InstantILL, expansion of Semantic Scholar — new developments announced during open access 2019 week that caught my eye, and The rise of the “open” discovery indexes? Lens.org, Semantic Scholar and Scinapse

This idea of enhancing existing data you have with MAG is catching on.

In the UK, the JISC-backed CORE claims to be “the world’s largest collection of open access research papers”, crawling and harvesting over 9,800 data providers, including both subject and institutional repositories as well as journal publishers.

Unlike some aggregators of open content, they harvest not just metadata via OAI-PMH but also full text, and provide APIs for content discovery, text and data mining, and more.
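To give a flavour of the content discovery API, here is a minimal sketch in Python. The v2-style endpoint and parameter names are my assumptions from CORE’s documentation at the time, and the API key is a placeholder (keys are free on registration).

```python
import requests

# Search CORE for articles matching a query. Endpoint and parameters are
# assumptions based on the CORE v2 API; replace the key with your own.
API_KEY = "YOUR-CORE-API-KEY"  # placeholder
query = "institutional repositories"

resp = requests.get(
    f"https://core.ac.uk/api-v2/articles/search/{query}",
    params={"apiKey": API_KEY, "page": 1, "pageSize": 10},
)
resp.raise_for_status()

for article in resp.json().get("data", []):
    print(article.get("title"), "-", article.get("doi"))
```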

As someone who has studied how well institutional repositories in Singapore are indexed by aggregators, I have been really impressed by how well they have done in indexing our repositories (and identifying full text) with little support from us.

However, it is well known that repository content, particularly in institutional repositories, has very poor quality metadata in general. Many records lack DOIs; some don’t even list all the authors, or, in the case of institutional repositories, change the order of the authors to favour the one from the institution, among other things.

As noted in the blog post — CORE raises repository data quality by consolidating information from external datasets — they have done the huge back-breaking task of processing and linking data from not only Crossref, but also MAG, Unpaywall, ORCID and PubMed!

That’s linking over 130 million CORE records (18 million with full text) to 100 million Crossref items and nearly 200 million Microsoft Academic items!

This has paid off handsomely. Prior to enhancement with Crossref, only 25 million CORE records had DOIs; now 81 million do, a substantial jump. There are also other gains, such as metadata on publishers.

Though this metadata enhancement does not come with any additional CORE search functions (the CORE search engine is somewhat limited), it does give users better search results and helps the work of those who use the API.

2. Scite shows citation context and is in Europe PMC, Wiley

Scite, the machine learning startup that uses deep learning to classify citations by whether they are “supporting”, “contradicting” or merely “mentioning”, has further expanded to cover 465 million citation statements (for context, Web of Science has around 1 billion).

You can try the browser plugin.

They have also started to pick up partners, including BMJ, Karger, Europe PMC and, in a huge move, Wiley — one of the big five publishers.

“This partnership will allow scite to extract citation statements from nearly 9M articles, the entire corpus published by Wiley, thus further enriching the information the platform provides. Wiley, in turn, will use scite analytics to better understand their publishing and editorial practices through what scite calls ‘smart citations’ — citations that not merely tell how many times an article is cited, but also provide the context for each citation and a classification, such as whether it provides supporting or contradicting evidence for the cited claim.”
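For the curious, scite also exposes these tallies programmatically. A minimal sketch, assuming the public tallies endpoint and field names as I understand them (the DOI is an arbitrary example):

```python
import requests

# Fetch scite's smart citation tallies for a DOI. The endpoint and field
# names are assumptions based on scite's public API at the time of writing.
DOI = "10.1002/anie.201915678"  # arbitrary example DOI

resp = requests.get(f"https://api.scite.ai/tallies/{DOI}")
resp.raise_for_status()
tally = resp.json()

for key in ("supporting", "mentioning", "contradicting"):
    print(key, tally.get(key))
```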

Something to watch…

3. Improvement in search features in Europe PMC and Lens.org

If you are into Life Sciences, Europe PMC is something you shouldn’t miss.

It is one of the most advanced and interesting academic search engines I have seen and tends to implement some of the most cutting edge features.

The redesigned Europe PMC, launched in Dec 2019, makes it even better.

I like the ability to search within sections, as well as the links to supplemental, supporting, and related or curated data, for example.

Searching within sections in Europe PMC
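For those who prefer the API, the same section-scoped searching is available through Europe PMC’s REST service. A minimal sketch; the METHODS: field name is my assumption from the query syntax docs, so treat it as illustrative:

```python
import requests

# Query Europe PMC's REST API, scoping the search to the Methods section
# of the full text. The METHODS: field name is an assumption from the
# documented query syntax; the endpoint is Europe PMC's public search service.
resp = requests.get(
    "https://www.ebi.ac.uk/europepmc/webservices/rest/search",
    params={"query": 'METHODS:"flow cytometry"', "format": "json", "pageSize": 5},
)
resp.raise_for_status()

for hit in resp.json()["resultList"]["result"]:
    print(hit.get("title"))
```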

They also link to reviews of the work and annotations.

Links to reviews

Since then, they have added smart cites via scite and a redesigned display of preprint versions showing their relationships.

Smart cites in Europe PMC, also notice the data citations!
Redesigned interface in Europe PMC for preprint versions

Of course, this only works for preprint servers that link to published versions (with metadata records in Crossref), but this is still a good step. Edit: See also the following comment.

I suspect the whole preprint confusion will need a solution like this, and academic search engines like Primo/Summon will follow suit.

Lens.org is the only other academic search engine that can equal if not top Europe PMC in terms of features.

It is improving fast; some improvements are interesting, like the ability to filter by number of authors, which supplements the ability to search by first author or last author only.

Or the ability to query the API with a GET request. I understand a lot more ambitious features are being planned, such as improvements around the UI and controlled vocabulary. Definitely watching this closely.
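As a sketch of what such a GET query could look like (the endpoint, parameter names and token handling here are my assumptions from the Lens API documentation, and the token is a placeholder you would request from Lens):

```python
import requests

# Simple GET against the Lens scholarly search API. Endpoint, parameters
# and token handling are assumptions from the Lens docs.
TOKEN = "YOUR-LENS-TOKEN"  # placeholder

resp = requests.get(
    "https://api.lens.org/scholarly/search",
    params={"query": "bibliometrics", "size": 5, "token": TOKEN},
)
resp.raise_for_status()

for doc in resp.json().get("data", []):
    print(doc.get("title"))
```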

4. Scholarcy preprint healthcheck API

This is a tool made by the Scholarcy team to check preprints. Upload any PDF, and it uses machine learning to extract the content into structured data.

For example, it extracts affiliations, highlights key findings, figures out study subjects and statistics, and tries to identify sections like data availability, ethical compliance etc.

Most interestingly, it extracts references and runs them against various whitelists and sources like DOAJ and Crossref.

Scholarcy Preprint healthcheck API

Edit: Phil Gooch of Scholarcy has noted that “you can check manuscripts/preprints for citations of retracted, refuted or confirmed studies. This is done via the scite and EuropePMC APIs.”

It’s still unclear how useful this is, as it is just a proof-of-concept tool. But one idea would be for peer reviewers or journal editors to run manuscripts through this as a first pass.
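To make the upload-and-extract workflow concrete, here is a minimal sketch of what such a call could look like. The endpoint URL and response fields are hypothetical stand-ins for illustration, not Scholarcy’s documented API:

```python
import requests

# Hypothetical sketch of a preprint healthcheck call: upload a PDF, get back
# structured data. The URL and response fields below are made up for
# illustration and are not Scholarcy's actual API.
HEALTHCHECK_URL = "https://api.scholarcy.example/healthcheck"  # hypothetical

with open("preprint.pdf", "rb") as f:
    resp = requests.post(HEALTHCHECK_URL, files={"file": f})
resp.raise_for_status()

report = resp.json()
print("Key findings:", report.get("key_findings"))            # hypothetical field
print("Data availability:", report.get("data_availability"))  # hypothetical field
# Extracted references could then be checked against Crossref, DOAJ and the
# scite/EuropePMC APIs, as described above.
```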

5. Google Dataset Search out of beta — search for dataset types and Dimensions (including free version) adds datasets

As frequent readers of this blog know, the main general bibliometric databases are Web of Science, Scopus and Google Scholar. Dimensions and Microsoft Academic are the ones that are aiming to join the party.

Dimensions’ main difference from Web of Science and Scopus is that it tries to be inclusive, covering all publications whether they are prestigious or not. It also tries to go beyond publications to index and link to grants, patents, clinical trials, policy documents etc.

They have now added datasets to the list, and this is part of the free version.

Datasets in Dimensions

Do note that they claim only 1.4 million datasets, which seems a little low compared to what you see in competitors like Mendeley Data (20.5 million) and Google Dataset Search.

Part of it is because this is a new feature in Dimensions, but part of it is simply that the other two cover the much broader category of “research data”, which includes datasets, images, software and more, while Dimensions only includes datasets.

Over at Google land, I was surprised they announced Google Dataset Search was out of beta!

Given the amount of time products like Gmail and Google Scholar were in beta, this seems a very fast turnaround (it was announced in Oct 2018).

In any case, very little has changed, though you now have additional filters like document format and usage rights.
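For background, Google Dataset Search finds datasets by crawling pages that carry schema.org Dataset markup, and properties like the download format feed filters such as the new document format one. A sketch of that markup, written as a Python dict for illustration (all names and URLs below are made up):

```python
import json

# Minimal illustration of schema.org Dataset markup, which Google Dataset
# Search crawls as JSON-LD embedded in web pages. Values are made up.
dataset_markup = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example repository metadata survey",  # hypothetical
    "description": "Illustrative dataset description.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "distribution": [{
        "@type": "DataDownload",
        "encodingFormat": "text/csv",  # the kind of format you can now filter on
        "contentUrl": "https://example.org/data.csv",  # hypothetical
    }],
}
print(json.dumps(dataset_markup, indent=2))
```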

Conclusion

These are just some of the new changes in academic search and discovery tools that caught my eye in the last few months. Clearly, as the winds of change blow in scholarly communication, our research tools are going to adapt…
