拓端tecdat: Topic Modeling of NASA Metadata with Text Mining in R

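What is topic modeling?

Getting and tidying the NASA metadata

Making a DocumentTermMatrix

LDA topic modeling

Exploring the model

Which topic does each document belong to?

Connecting topic modeling to keywords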

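NASA has more than 32,000 datasets, and we are interested in understanding the connections among them, as well as their connections to other important datasets at government organizations outside NASA. Let's use topic modeling to classify the description fields and then connect the results to the keywords.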

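Topic modeling is a method for unsupervised classification of documents. It models each document as a mixture of topics and each topic as a mixture of words. The method used here is latent Dirichlet allocation (LDA), although there are other ways to fit a topic model. In this article, each dataset description is one document. We will see whether these description texts can be modeled as topics.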

Let's download the metadata for the 32,000+ NASA datasets.

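library(jsonlite)
library(dplyr)
library(tidyr)
# Assumption: the text does not give the source URL; NASA's public data.json catalog is used here
metadata <- fromJSON("https://data.nasa.gov/data.json")
names(metadata$dataset)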

##  [1] "_id"                "@type"              "accessLevel"        "accrualPeriodicity"
##  [5] "bureauCode"         "contactPoint"       "description"        "distribution"      
##  [9] "identifier"         "issued"             "keyword"            "landingPage"       
## [13] "language"           "modified"           "programCode"        "publisher"         
## [17] "spatial"            "temporal"           "theme"              "title"             
## [21] "license"            "isPartOf"           "references"         "rights"            
## [25] "describedBy"

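# Description text, one row per dataset
nasadesc <- tibble(id = metadata$dataset$`_id`$`$oid`, desc = metadata$dataset$description)
# Keywords, unnested to one row per dataset-keyword pair
nasakeyword <- tibble(id = metadata$dataset$`_id`$`$oid`, 
                      keyword = metadata$dataset$keyword) %>%
        unnest(keyword)
# Upper-case the keywords so that case variants are counted together
nasakeyword <- nasakeyword %>% mutate(keyword = toupper(keyword))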

As a quick check, what are the most common keywords?

nasakeyword %>% group_by(keyword) %>% count(sort = TRUE)

## # A tibble: 1,616 x 2
##                    keyword     n
##                      <chr> <int>
## 1            EARTH SCIENCE 14386
## 2                   OCEANS 10033
## 3                  PROJECT  7463
## 4             OCEAN OPTICS  7324
## 5               ATMOSPHERE  7323
## 6              OCEAN COLOR  7270
## 7                COMPLETED  6452
## 8  ATMOSPHERIC WATER VAPOR  3142
## 9             LAND SURFACE  2720
## 10               BIOSPHERE  2449
## # ... with 1,606 more rows

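To do topic modeling, we need to create a DocumentTermMatrix, a special kind of matrix from the tm package (a "document-term matrix" is, of course, just a general concept). Rows correspond to documents (here, the description texts) and columns correspond to terms (that is, words); it is a sparse matrix.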

Let's clean up the text with stop words to remove some of the nonsense "words" left over from HTML or other character encoding.

## # A tibble: 1,909,215 x 3
##                          id     word     n
##                       <chr>    <chr> <int>
## 1  55942a8ec63a7fe59b4986ef     suit    82
## 2  55942a8ec63a7fe59b4986ef    space    69
## 3  56cf5b00a759fdadc44e564a     data    41
## 4  56cf5b00a759fdadc44e564a     leak    40
## 5  56cf5b00a759fdadc44e564a     tree    39
## 6  55942a8ec63a7fe59b4986ef pressure    34
## 7  55942a8ec63a7fe59b4986ef   system    34
## 8  55942a89c63a7fe59b4982d9       em    32
## 9  55942a8ec63a7fe59b4986ef       al    32
## 10 55942a8ec63a7fe59b4986ef    human    31
## # ... with 1,909,205 more rows
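The code that produced the word counts above is not shown. A minimal sketch of one way to get a table of this shape, assuming the tidytext package, is given below. The toy descriptions, the name `my_stop_words`, and the short list of custom stop words are all illustrative assumptions, not the author's actual inputs.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for nasadesc (hypothetical ids and text)
toy_desc <- tibble(
  id   = c("a", "b"),
  desc = c("Soil moisture data collected over land &nbsp; surfaces",
           "The space suit pressure system protects the human crew")
)

# Custom stop words (assumed list) on top of tidytext's built-in stop_words
my_stop_words <- bind_rows(
  stop_words,
  tibble(word = c("nbsp", "amp", "gt", "lt"), lexicon = "custom")
)

word_counts <- toy_desc %>%
  unnest_tokens(word, desc) %>%              # one lowercase word per row
  anti_join(my_stop_words, by = "word") %>%  # drop stop words and HTML debris
  count(id, word, sort = TRUE)               # word frequency per document

word_counts
```

Tokenizing first and removing stop words afterwards means HTML leftovers such as `nbsp` can be handled in the same way as ordinary stop words.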

Now let's make the DocumentTermMatrix.

## <<DocumentTermMatrix (documents: 32003, terms: 35911)>>
## Non-/sparse entries: 1909215/1147350518
## Sparsity           : 100%
## Maximal term length: 166
## Weighting          : term frequency (tf)
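The casting step itself is not shown. One plausible way to do it, assuming the counts are in a tidy one-row-per-document-word table, is `cast_dtm()` from tidytext (which needs the tm package installed). The toy counts below are hypothetical.

```r
library(dplyr)
library(tidytext)

# Hypothetical per-document word counts, same shape as the table above
toy_counts <- tibble(
  id   = c("a", "a", "b", "b", "b"),
  word = c("soil", "moisture", "space", "suit", "soil"),
  n    = c(3L, 2L, 4L, 1L, 1L)
)

# Rows become documents, columns become terms, cells hold the counts
toy_dtm <- toy_counts %>% cast_dtm(id, word, n)

toy_dtm
dim(toy_dtm)  # documents x terms
```

Only the non-zero cells are stored, which is why a matrix as large as the one above remains manageable despite being almost entirely empty.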

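Now let's use the topicmodels package to create an LDA model. How many topics should we tell the algorithm to find? This question is much like the one in k-means clustering: we don't know ahead of time. We can try a few different values and see how well the model fits the text. Let's start with 8 topics.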

## A LDA_VEM topic model with 8 topics.

This is a stochastic algorithm, so its results can differ depending on where the algorithm starts.
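Because the result depends on the starting point, it helps to fix a seed when fitting. The sketch below fits a two-topic model to a tiny hypothetical corpus with `LDA()` from topicmodels; the data, `k = 2`, and the seed value are illustrative assumptions, not the settings used for the NASA model.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical word counts for four tiny documents
toy_counts <- tibble(
  id   = rep(c("d1", "d2", "d3", "d4"), each = 3),
  word = c("soil", "moisture", "land",
           "soil", "land", "surface",
           "suit", "space", "pressure",
           "space", "suit", "system"),
  n    = c(5L, 3L, 2L, 4L, 3L, 1L, 6L, 4L, 2L, 5L, 3L, 1L)
)

toy_dtm <- cast_dtm(toy_counts, id, word, n)

# k = 2 topics; fixing the seed makes the random start reproducible
toy_lda <- LDA(toy_dtm, k = 2, control = list(seed = 1234))
toy_lda
```

Rerunning with a different seed, or with a different `k`, is a cheap way to see how stable the topics are.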

Let's tidy the model and see what we can find.

## # A tibble: 287,288 x 3
##    topic  term         beta
##    <int> <chr>        <dbl>
## 1      1  suit 2.591273e-40
## 2      2  suit 9.085227e-61
## 3      3  suit 1.620165e-61
## 4      4  suit 2.081683e-64
## 5      5  suit 9.507092e-05
## 6      6  suit 5.747629e-04
## 7      7  suit 1.808279e-63
## 8      8  suit 4.545037e-40
## 9      1 space 2.332248e-05
## 10     2 space 2.641815e-40
## # ... with 287,278 more rows

The β column gives the probability of that term being generated from that topic.
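A table like the one above can be obtained with tidytext's `tidy()` method for LDA models, using `matrix = "beta"`. The sketch below uses a small hypothetical corpus and also checks a property that follows from the definition: within each topic, the β values sum to 1.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical word counts for four tiny documents
toy_counts <- tibble(
  id   = rep(c("d1", "d2", "d3", "d4"), each = 2),
  word = c("soil", "land", "soil", "moisture",
           "suit", "space", "space", "pressure"),
  n    = c(4L, 2L, 3L, 3L, 5L, 2L, 4L, 1L)
)

toy_lda <- toy_counts %>%
  cast_dtm(id, word, n) %>%
  LDA(k = 2, control = list(seed = 1234))

# One row per (topic, term) pair with its probability beta
toy_beta <- tidy(toy_lda, matrix = "beta")

# Each topic is a probability distribution over terms
toy_beta %>% group_by(topic) %>% summarise(total = sum(beta))
```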

What are the top 10 terms for each topic?

top_terms

## # A tibble: 80 x 3
##    topic         term        beta
##    <int>        <chr>       <dbl>
## 1      1         data 0.047596842
## 2      1          set 0.014857522
## 3      1         soil 0.013231077
## 4      1         land 0.007874196
## 5      1        files 0.007835032
## 6      1     moisture 0.007799017
## 7      1      surface 0.006913904
## 8      1         file 0.006495391
## 9      1    collected 0.006350559
## 10     1 measurements 0.005521037
## # ... with 70 more rows
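The step that builds `top_terms` is not shown. A minimal sketch of the idea with dplyr follows; the β values are made up and only two terms per topic are kept to keep the output short.

```r
library(dplyr)

# Hypothetical tidied model output: one row per topic-term pair
toy_beta <- tibble(
  topic = rep(1:2, each = 4),
  term  = rep(c("data", "soil", "suit", "space"), times = 2),
  beta  = c(0.50, 0.30, 0.15, 0.05,
            0.40, 0.05, 0.25, 0.30)
)

toy_top_terms <- toy_beta %>%
  group_by(topic) %>%
  slice_max(beta, n = 2) %>%   # keep the 2 highest-beta terms per topic
  ungroup() %>%
  arrange(topic, desc(beta))

toy_top_terms
```

In this toy table "data" ranks first in both topics, which mirrors how a very frequent word can dominate several topics at once.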

Let's look at this visually.

library(ggplot2)
ggplot(top_terms, aes(beta, term, fill = as.factor(topic))) +
        geom_col(show.legend = FALSE, alpha = 0.8) +  # horizontal bars; geom_barh() would require ggstance
        facet_wrap(~topic, ncol = 4, scales = "free")

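We can see how dominant the word "data" is in these description texts. There do appear to be meaningful differences between these collections of terms, from terms about soil and land to terms about design, systems, and technology. Further exploration is definitely needed to find the right number of topics and do a better job here. Also, could the title and description words be combined for topic modeling?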

Let's find out which topics are associated with which description fields (that is, documents).

lda_gamma

## # A tibble: 256,024 x 3
##                    document topic        gamma
##                       <chr> <int>        <dbl>
## 1  55942a8ec63a7fe59b4986ef     1 7.315366e-02
## 2  56cf5b00a759fdadc44e564a     1 9.933126e-02
## 3  55942a89c63a7fe59b4982d9     1 1.707524e-02
## 4  56cf5b00a759fdadc44e55cd     1 4.273013e-05
## 5  55942a89c63a7fe59b4982c6     1 1.257880e-04
## 6  55942a86c63a7fe59b498077     1 1.078338e-04
## 7  56cf5b00a759fdadc44e56f8     1 4.208647e-02
## 8  55942a8bc63a7fe59b4984b5     1 8.198155e-05
## 9  55942a6ec63a7fe59b496bf7     1 1.042996e-01
## 10 55942a8ec63a7fe59b4986f6     1 5.475847e-05
## # ... with 256,014 more rows

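The γ column here is the probability that each document belongs to each topic. Notice that some values are very low and some are higher. How are the probabilities distributed?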

ggplot(lda_gamma, aes(gamma, fill = as.factor(topic))) +
        geom_histogram(alpha = 0.8, show.legend = FALSE) +
        facet_wrap(~topic, ncol = 4) +
        scale_y_log10()

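The y-axis is plotted on a log scale here so that the smaller counts are visible. Most documents are classified as belonging to exactly one topic: many documents fall cleanly into topic 2, while assignment to topics 1 and 5 is less clear-cut. Some topics have fewer documents. For any individual document, we can find the topic to which it has the highest probability of belonging.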
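Picking the most probable topic for each document can be sketched as follows, assuming the document-topic probabilities come from `tidy()` with `matrix = "gamma"`. The corpus is hypothetical and far smaller than the NASA data.

```r
library(dplyr)
library(tidytext)
library(topicmodels)

# Hypothetical word counts for four tiny documents
toy_counts <- tibble(
  id   = rep(c("d1", "d2", "d3", "d4"), each = 2),
  word = c("soil", "land", "soil", "moisture",
           "suit", "space", "space", "pressure"),
  n    = c(4L, 2L, 3L, 3L, 5L, 2L, 4L, 1L)
)

toy_lda <- toy_counts %>%
  cast_dtm(id, word, n) %>%
  LDA(k = 2, control = list(seed = 1234))

# One row per (document, topic) pair with its probability gamma
toy_gamma <- tidy(toy_lda, matrix = "gamma")

# For each document, keep only the topic with the highest gamma
toy_assign <- toy_gamma %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()

toy_assign
```

Setting `with_ties = FALSE` guarantees exactly one row per document even if two topics happen to have equal probability.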

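Let's connect these topic models to the keywords and see what happens. We join this data frame to the keywords and then look at which keywords are associated with which topic.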

lda_gamma

## # A tibble: 1,012,727 x 4
##                    document topic        gamma                     keyword
##                       <chr> <int>        <dbl>                       <chr>
## 1  55942a8ec63a7fe59b4986ef     1 7.315366e-02        JOHNSON SPACE CENTER
## 2  55942a8ec63a7fe59b4986ef     1 7.315366e-02                     PROJECT
## 3  55942a8ec63a7fe59b4986ef     1 7.315366e-02                   COMPLETED
## 4  56cf5b00a759fdadc44e564a     1 9.933126e-02                    DASHLINK
## 5  56cf5b00a759fdadc44e564a     1 9.933126e-02                        AMES
## 6  56cf5b00a759fdadc44e564a     1 9.933126e-02                        NASA
## 7  55942a89c63a7fe59b4982d9     1 1.707524e-02 GODDARD SPACE FLIGHT CENTER
## 8  55942a89c63a7fe59b4982d9     1 1.707524e-02                     PROJECT
## 9  55942a89c63a7fe59b4982d9     1 1.707524e-02                   COMPLETED
## 10 56cf5b00a759fdadc44e55cd     1 4.273013e-05                    DASHLINK
## # ... with 1,012,717 more rows

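Let's keep the document-topic entries where a document clearly belongs to a topic (probability > 0.9) and then find the top keywords for each topic.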

top_keywords

## Source: local data frame [1,240 x 3]
## Groups: topic [8]
## 
##    topic       keyword     n
##    <int>         <chr> <int>
## 1      2   OCEAN COLOR  4480
## 2      2  OCEAN OPTICS  4480
## 3      2        OCEANS  4480
## 4      1 EARTH SCIENCE  3469
## 5      5       PROJECT  3464
## 6      5     COMPLETED  3057
## 7      8 EARTH SCIENCE  2229
## 8      3   OCEAN COLOR  1968
## 9      3  OCEAN OPTICS  1968
## 10     3        OCEANS  1968
## # ... with 1,230 more rows
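The join, filter, and count steps described above can be sketched with dplyr alone. The two toy tables below are hypothetical stand-ins for the document-topic probabilities and the keyword table.

```r
library(dplyr)

# Hypothetical document-topic probabilities
toy_gamma <- tibble(
  document = c("d1", "d1", "d2", "d2", "d3", "d3"),
  topic    = c(1L, 2L, 1L, 2L, 1L, 2L),
  gamma    = c(0.95, 0.05, 0.97, 0.03, 0.40, 0.60)
)

# Hypothetical keywords, one row per document-keyword pair
toy_keyword <- tibble(
  id      = c("d1", "d1", "d2", "d3"),
  keyword = c("OCEANS", "OCEAN COLOR", "OCEANS", "PROJECT")
)

toy_top_keywords <- toy_gamma %>%
  inner_join(toy_keyword, by = c("document" = "id")) %>%  # attach keywords
  filter(gamma > 0.9) %>%                                 # confident assignments only
  count(topic, keyword, sort = TRUE)                      # keyword frequency per topic

toy_top_keywords
```

Document `d3` drops out entirely because neither of its topic probabilities exceeds 0.9, so its keyword does not count toward any topic.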

Let's visualize these as well.

ggplot(top_keywords, aes(n, keyword, fill = as.factor(topic))) +
        geom_col(show.legend = FALSE, alpha = 0.8) + facet_wrap(~topic, scales = "free")