Data Problems

There are several problems with the data that, if corrected, would improve the topic modeling results.

  • Many items have no abstract.
  • There are many spelling errors, as well as stray characters that were probably punctuation or special characters that Excel did not convert properly from the metadata.
  • Not all titles are descriptive. Many records have very short titles, often made up of codes or cryptic words rather than words that can be lemmatized. Items with such titles and no abstract may have negatively influenced the model. Titles were nevertheless included because enough of them were descriptive.
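
The problems listed above could in principle be screened for before modeling. What follows is a minimal sketch of such a screen, not the procedure used for this data set. The record fields (`title`, `abstract`), the three-word threshold, and all function names are hypothetical and would need to be adapted to the actual metadata export.

```python
import re

# Hypothetical threshold: how many plain words a title needs
# before it is treated as descriptive. Tune against real data.
MIN_WORDS = 3

# A "plain word" here is a token made only of ASCII letters.
WORD_RE = re.compile(r"^[A-Za-z]+$")


def has_abstract(record):
    """Return True when the record has a non-empty abstract."""
    abstract = record.get("abstract")
    return bool(abstract and abstract.strip())


def is_descriptive_title(title, min_words=MIN_WORDS):
    """Heuristic: descriptive titles contain at least `min_words` plain words.

    Codes such as "RPT-0042" or "X1" contain digits or hyphens and
    therefore do not count toward the total.
    """
    tokens = (title or "").split()
    words = [t.strip(".,;:!?()\"'") for t in tokens]
    plain = [w for w in words if WORD_RE.match(w)]
    return len(plain) >= min_words


def has_replacement_chars(text):
    """Flag text containing U+FFFD, a common sign of a failed encoding conversion."""
    return "\ufffd" in (text or "")


def keep_record(record):
    """Keep a record if it has an abstract or a descriptive title."""
    return has_abstract(record) or is_descriptive_title(record.get("title"))
```

Under these assumptions, a record with a cryptic title and no abstract is dropped, while a record with either a usable abstract or a descriptive title is kept. The letter-only test is deliberately crude: it would also reject legitimate titles containing numbers or accented characters, so a real screen would likely need a looser rule.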