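On data mining: quantitative methods in R, covering regression, dummy variables and interaction terms, hypothesis tests, the F-test, AIC and BIC, applied to student test score data, with self-test questions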


Original link: http://tecdat.cn/?p=27578

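Regression assumptions

Omitted variable bias

If the _true_ model contains _X_1 and _X_2 but we leave out _X_2, then, under certain conditions, the estimated effect of _X_1 will be biased. Omitted variable bias (OVB) requires both cor(X1, X2) != 0 and cor(X2, y) != 0: the omitted variable must be correlated with the included regressor and must itself be related to y.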
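The two conditions can be illustrated with a small simulation. This is only a sketch on made-up data: the variable names, the sample size and the coefficients (2, 3 and 0.7) are assumptions chosen for illustration, not values from the school data.

```r
# Hypothetical simulated data: x2 is correlated with x1 and affects y.
set.seed(1)
n  <- 5000
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n)               # cor(x1, x2) != 0
y  <- 1 + 2 * x1 + 3 * x2 + rnorm(n)    # x2 is related to y

full  <- lm(y ~ x1 + x2)   # correctly specified model
short <- lm(y ~ x1)        # x2 omitted

b_full  <- coef(full)["x1"]    # close to the true value 2
b_short <- coef(short)["x1"]   # close to 2 + 3 * 0.7 = 4.1

# In-sample identity: short slope = full slope + b2 * (slope of x2 on x1)
delta   <- coef(lm(x2 ~ x1))["x1"]
implied <- b_full + coef(full)["x2"] * delta

print(c(full = unname(b_full), short = unname(b_short),
        implied = unname(implied)))
```

If either correlation were zero, the second term of the identity would vanish and the short regression would recover the same slope as the full one.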

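Homoskedasticity

To make valid inferences we assume that the error variance is constant. If it is not, we risk drawing wrong inferences. Heteroskedasticity does not bias the coefficients; it only affects the standard errors (SE). Remedy: robust SEs.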
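As a rough sketch of what "robust SEs" means, the heteroskedasticity-consistent (HC1) sandwich estimator can be computed by hand in base R. The data below are simulated and all names are hypothetical. In practice a dedicated package would normally be used instead of this manual calculation.

```r
# Hypothetical data whose error variance grows with x (heteroskedasticity).
set.seed(2)
n <- 2000
x <- runif(n, 1, 10)
y <- 1 + 0.5 * x + rnorm(n, sd = 0.3 * x)

fit <- lm(y ~ x)
X   <- model.matrix(fit)   # n x k design matrix
e   <- resid(fit)          # OLS residuals

bread <- solve(crossprod(X))   # (X'X)^{-1}
meat  <- crossprod(X * e)      # sum over i of e_i^2 * x_i x_i'

# HC1 sandwich estimator with the n / (n - k) small-sample correction
vc_hc1 <- (n / (n - ncol(X))) * bread %*% meat %*% bread

se_classical <- sqrt(diag(vcov(fit)))
se_robust    <- sqrt(diag(vc_hc1))
print(rbind(classical = se_classical, robust = se_robust))
```

The coefficient estimates are identical in both rows of the comparison; only the standard errors differ.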

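Endogeneity

If _X_ affects _Y_ but _Y_ also affects _X_, we have endogeneity, which makes the estimator biased.

Dummy variables and interactions

Dummy variables

A variable that can take only two values, for example a class-size category (small class, large class). It is also called an indicator variable or binary variable.

What happens when we estimate this model?

$\text{score}_i = \beta_0 + \beta_1 \text{big}_i + \varepsilon_i$

$y_i = \beta_0 + \beta_1 d_i + \varepsilon_i$

What is the prediction for small classes?

What is the prediction for large classes?

Example: school data

What is the expected score for small classes?
◦ $\hat{\beta}_0$

What is the expected score for large classes?
◦ $\hat{\beta}_0 + \hat{\beta}_1$

What is the expected difference between small and large classes?

◦ $\hat{\beta}_1$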
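These three answers can be checked numerically: in a regression on a single dummy, the intercept equals the mean of the group coded 0 and the slope equals the difference in group means. The sketch below uses simulated scores; the names `big` and `score` and all numbers are hypothetical.

```r
# Hypothetical data: 50 small classes (big = 0) and 50 large classes (big = 1).
set.seed(3)
big   <- rep(c(0, 1), each = 50)
score <- 660 - 8 * big + rnorm(100, sd = 15)

fit <- lm(score ~ big)
b   <- coef(fit)

mean_small <- mean(score[big == 0])
mean_large <- mean(score[big == 1])

# Intercept = mean of small classes; intercept + slope = mean of large classes
print(c(b0 = unname(b[1]), mean_small = mean_small))
print(c(b0_plus_b1 = unname(b[1] + b[2]), mean_large = mean_large))
```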
 

> summary(mol.mll)

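Dummy variables and regression

What happens when we add a dummy variable to a model with a continuous explanatory variable?

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

$y_i = \beta_0 + \beta_1 x_i + \beta_2 d_i + \varepsilon_i$

With _d_ = 1 for large classes and _d_ = 0 for small classes, we get for large classes:

$y_i = (\beta_0 + \beta_2) + \beta_1 x_i + \varepsilon_i$

For small classes we get:

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

Illustration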
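A minimal sketch of the parallel-lines result: with an additive dummy, both groups share the slope on x and differ only by a constant shift equal to the dummy coefficient. The data are simulated and the names `x`, `d`, `y` are placeholders, not the variables of the school data.

```r
# Hypothetical data: x is a student-teacher ratio, d = 1 marks large classes.
set.seed(4)
n <- 200
x <- runif(n, 14, 26)
d <- rbinom(n, 1, 0.5)
y <- 700 - 2 * x - 5 * d + rnorm(n, sd = 8)

fit <- lm(y ~ x + d)
b   <- coef(fit)

# Predictions at x = 15 and x = 16 for each group
grid  <- data.frame(x = c(15, 16))
small <- predict(fit, newdata = cbind(grid, d = 0))
large <- predict(fit, newdata = cbind(grid, d = 1))

slope_small <- unname(diff(small))    # change in y per unit of x, small classes
slope_large <- unname(diff(large))    # the same change for large classes
gap         <- unname(large - small)  # vertical distance between the two lines

print(c(slope_small = slope_small, slope_large = slope_large))
print(gap)
```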

School data

> model2 <- lm(testscr ~ STratio + big.school, data = dt1)

> summary(model2)

 

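What is the marginal effect of one additional student per teacher?

◦ $\hat{\beta}_{\text{STratio}}$

What is the effect of being a large class?

◦ $\hat{\beta}_{\text{big.school}}$

Is the effect of STratio the same for small and large classes?

◦ Yes, $\hat{\beta}_{\text{STratio}}$ is the same for every district (parallel lines)

Adding a dummy variable can change everything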

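Interaction terms

Regression model

In a multiple regression model, $\hat{\beta}_1$ describes the marginal effect of _X_1 _while controlling_ for the effect of _X_2. The built-in assumption is that _X_1 has the same effect for all observations.

Interactions

One way to relax this assumption is to allow the effect to vary.

We do this with an interaction: we add the product of the explanatory variables to the model:

$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} \cdot X_{2i} + \varepsilon_i$

Figure 1

Figure 2

Figure 3

Interactions: dummy variables and regression

  • Why assume that the effect ($\beta_1$) is constant across all subgroups?

  • Let's allow STratio to have a different effect depending on big.school:

$y_i = \beta_0 + \beta_1 x_i + \beta_2 d_i + \beta_3 d_i \cdot x_i + \varepsilon_i$

With _d_ = 1 for large classes and _d_ = 0 for small classes, we get for large classes:

$y_i = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) x_i + \varepsilon_i$

For small classes:

$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$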
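With the interaction, the slope for small classes is the coefficient on x, and the slope for large classes is that coefficient plus the interaction coefficient. The sketch below checks this on simulated data (all names and numbers are hypothetical) by comparing against separate regressions per group.

```r
# Hypothetical data in which the slope on x differs between the two groups.
set.seed(5)
n <- 400
x <- runif(n, 14, 26)
d <- rbinom(n, 1, 0.5)
y <- 700 - 1 * x + 10 * d - 1.5 * d * x + rnorm(n, sd = 8)

fit <- lm(y ~ x * d)   # expands to x + d + x:d
b   <- coef(fit)

slope_small <- unname(b["x"])             # slope when d = 0
slope_large <- unname(b["x"] + b["x:d"])  # slope when d = 1

# A fully interacted model reproduces the slopes of separate regressions
sep_small <- unname(coef(lm(y ~ x, subset = d == 0))["x"])
sep_large <- unname(coef(lm(y ~ x, subset = d == 1))["x"])

print(c(slope_small = slope_small, sep_small = sep_small))
print(c(slope_large = slope_large, sep_large = sep_large))
```

One reason to prefer the single interacted model over two separate regressions is that the interaction coefficient comes with its own standard error, so the difference in slopes can be tested directly.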

> screenreg(list(model1, model2, model3))  # screenreg() comes from the texreg package

 

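STratio & large class

What dummy variables can do

Qualitative information

We can include qualitative information (nominal variables) in a regression model.

Allowing flexible models

We can use interactions with dummy variables to allow X to have different marginal effects in different subgroups.

Interactions: $x$ and $x^2$

Why add $x^2$?

Sometimes we expect the effect of $x$ on $y$ to be nonlinear.

If all $x_i > 0$:

  • If $\beta_x$ is positive and $\beta_{x^2}$ is negative, you get an inverted U shape.

  • If $\beta_x$ is negative and $\beta_{x^2}$ is positive, you get a U shape.

  • If both are positive (or both negative), you get a curve that rises (or falls) ever more steeply, with no turning point for $x > 0$.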
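For the inverted-U case, the marginal effect of $x$ is $\beta_x + 2\beta_{x^2}x$, so the curve turns at $x^* = -\beta_x / (2\beta_{x^2})$. A minimal sketch on simulated data, where the coefficients 4 and -0.1 are assumptions that put the true turning point at 20:

```r
# Hypothetical data with an inverted-U relationship between x and y.
set.seed(6)
x <- runif(300, 1, 30)
y <- 5 + 4 * x - 0.1 * x^2 + rnorm(300, sd = 3)

fit <- lm(y ~ x + I(x^2))   # I() protects the square inside the formula
b   <- coef(fit)

# Turning point of the fitted parabola: -b_x / (2 * b_x2)
turning_point <- unname(-b["x"] / (2 * b["I(x^2)"]))

# Marginal effect of x at a given value x0: b_x + 2 * b_x2 * x0
marginal <- function(x0) unname(b["x"] + 2 * b["I(x^2)"] * x0)

print(turning_point)
print(c(at_10 = marginal(10), at_28 = marginal(28)))
```

The marginal effect is positive to the left of the turning point and negative to the right of it, which is what produces the inverted U.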

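Let's look at large classes

> screenreg(list(model4, model5))

What is the effect of _X_?

Because the model includes an $x^2$ term, a one-unit change in $x$ no longer has a constant effect. The change in the expected value of $y$ between two chosen values of $x$ is called a first difference.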
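A first difference can be sketched in base R with `predict()`, as a point-estimate counterpart to a simulation-based approach. The data are simulated; the variable name `STratio` is borrowed from the text, but the coefficients and sample size are hypothetical.

```r
# Hypothetical data with a quadratic effect of STratio on the score.
set.seed(7)
STratio <- runif(420, 14, 26)
score   <- 600 + 8 * STratio - 0.25 * STratio^2 + rnorm(420, sd = 10)

fit <- lm(score ~ STratio + I(STratio^2))
b   <- coef(fit)

# First difference: change in the predicted score when moving from `from` to `to`
first_diff <- function(from, to) {
  p <- predict(fit, newdata = data.frame(STratio = c(from, to)))
  unname(p[2] - p[1])
}

fd_14 <- first_diff(14, 15)   # one more student per teacher, starting at 14
fd_21 <- first_diff(21, 22)   # the same one-unit step, starting at 21
print(c(fd_14 = fd_14, fd_21 = fd_21))
```

The same one-unit step gives different first differences at 14 and at 21, so any statement about the effect has to name the value of x it refers to.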

Solution for x = 14

> s.out <- sim(z.out, x = x.low, x1 = x.high)  # Zelig: simulate the first difference

> summary(s.out)

Now x = 21

> x.low <- setx(z.out, STratio = 21)

> x.high <- setx(z.out, STratio = 22)

 

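Consequences of interactions

• The direction of the effect (positive or negative) is hard to read off the output directly
• The marginal effect is not constant: any statement about the effect depends on the value of x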
 

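Joint hypothesis testing

Testing models against each other

Distinguishing between models

We have several tools to assess whether the data are better described by model A or model B.
• F-test: tells you whether the more complex of two nested models fits the data significantly better
• AIC: tells you which model is preferable, very similar to the F-test
• BIC: similar to AIC, but penalizes complexity more heavily
! AIC and BIC are special because the two models do not need to be nested in each other. It is enough that they are estimated on the same sample.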

F-test: does school size matter?

> screenreg(list(model6, model7, model8), stars = c(0.01, 0.05, 0.1))

> anova
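A nested-model F-test can be sketched in base R by passing the restricted and the unrestricted model to `anova()`. The example below uses simulated data with hypothetical names and also recomputes the statistic by hand from the residual sums of squares, $F = \frac{(RSS_r - RSS_u)/q}{RSS_u/(n-k)}$, where $q$ is the number of restrictions.

```r
# Hypothetical data: x2 and x3 both matter, so the restricted model should be rejected.
set.seed(8)
n  <- 300
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.4 * x2 + 0.3 * x3 + rnorm(n)

restricted   <- lm(y ~ x1)             # nested in the unrestricted model
unrestricted <- lm(y ~ x1 + x2 + x3)   # same observations, two extra terms

tab <- anova(restricted, unrestricted)
print(tab)

# Manual F statistic for the q = 2 restrictions (coefficients on x2 and x3 are zero)
rss_r    <- sum(resid(restricted)^2)
rss_u    <- sum(resid(unrestricted)^2)
q        <- 2
df_u     <- df.residual(unrestricted)
F_manual <- ((rss_r - rss_u) / q) / (rss_u / df_u)
p_manual <- pf(F_manual, q, df_u, lower.tail = FALSE)
print(c(F = F_manual, p = p_manual))
```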

 

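F-test in the OLS output

Model 1:

Model 2:

> summary(model6)

Why penalize complexity?

The more complex the model, the better it fits the data.
More complex means less parsimonious. Because of the risk of overfitting the data, we want parsimonious models for prediction.

Theory testing vs. prediction

When we test a theory by testing hypotheses, we use the F-test.
When making predictions, we want to use BIC to distinguish between models.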
 

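AIC and BIC

Akaike information criterion and Bayesian information criterion
AIC and BIC both reflect how well the model fits the data and how many explanatory variables it contains (complexity). The measures fall when the fit improves and rise when more variables are added. Lower values are better.
• When trying to distinguish between two models, BIC and AIC may give conflicting recommendations
• If you are testing a theory, use the F-test if the models are nested, otherwise use AIC
• If you are making predictions, use AIC or BIC
◦ BIC is the more conservative measure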
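Both criteria can be written as $-2\log L + \text{penalty} \cdot k$, with a penalty of 2 per parameter for AIC and $\log(n)$ for BIC. The sketch below recomputes them from the log-likelihood on simulated data (hypothetical names; `x2` is irrelevant by construction) and compares them with R's built-in `AIC()` and `BIC()`.

```r
# Hypothetical data: x2 has no effect on y.
set.seed(9)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + x1 + rnorm(n)

m_small <- lm(y ~ x1)
m_big   <- lm(y ~ x1 + x2)

# Information criteria computed from the log-likelihood
ic <- function(m) {
  ll <- logLik(m)
  k  <- attr(ll, "df")   # estimated parameters, including the error variance
  c(AIC = -2 * as.numeric(ll) + 2 * k,
    BIC = -2 * as.numeric(ll) + log(nobs(m)) * k)
}

print(rbind(small = ic(m_small), big = ic(m_big)))

# Cost of one extra parameter: 2 under AIC, log(n) under BIC
extra_penalty <- (BIC(m_big) - BIC(m_small)) - (AIC(m_big) - AIC(m_small))
print(extra_penalty)   # equals log(n) - 2
```

Because $\log(n) > 2$ as soon as $n \geq 8$, BIC charges more for each additional parameter in any realistic sample, which is the sense in which it is the more conservative measure.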
 

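F-test: example

• Remember: the F-test rejected the more parsimonious model
• AIC only barely favors the more complex model (3059.089 > 3057.248)
• BIC rejects the more complex model

> screenreg(list(model6, model8), stars = c(0.01, 0.05, 0.1))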

 

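F-test, AIC and BIC

• When using the F-test:

◦ One model must be nested in the other

◦ Both models must be estimated on the same observations

• AIC and BIC can handle non-nested models

• AIC and BIC are only valid when both models are estimated on the same sample

Self-test questions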

A politician argues that immigrant children (defined as children for whom English is not their first language) cause entire districts to perform badly on standardized tests. Use the California school data to verify whether there is any relationship between the share of immigrant children and the average test score.

There is a measure for how CME or LME a country is (market index = 0 is fully CME, 1 is fully LME). The theory predicts that being close to either ideal type should cause lower unemployment rates due to consistent institutional arrangements. Create a model to test this statement and include at least three additional explanatory variables. Interpret the results statistically and substantively.

Estimate a model explaining why somebody self-identifies as left or right on a 1-10 scale. Use at least two variables and in addition age. Is age a significant factor? Interpret the results statistically and substantively. In a second step, control for party affiliation and verify whether the results from the original model change or still hold.

Datasets
1) SwissData2011.dta
Post-referendum survey among people living in Switzerland. The following list of variables is your codebook:
• VoteYes: Is 1 if someone voted yes and is 0 if someone voted no.
• male: Is 1 for men and 0 for women.
• age: Age in years.
• LeftRight: Left-Right self placement where low values indicate that a respondent is more to the left.
• GovTrust: Variable indicates a respondent’s trust in government. Little or no trust is -1, neither YES nor NO is 0, and +1 if somebody trusts the government.
• ReligFreq: How frequently does a respondent attend a religious service? Never (0), only on special occasions (1), several times a year (2), once a month (3), and once a week (4).
• university: Binary indicator (dummy) whether respondent has a university degree (1) or not (0).
• party: Indicates which party a respondent supports. Liberals (1), Christian Democrats (2), Social Democrats (3), Conservative Right (4), and Greens (5).
• income: Income measured in ten different brackets (higher values indicate higher income). You may assume that this variable is interval-scaled.
• german: Binary indicator (dummy) whether respondent’s mother tongue is German (1) or not (0).
• suburb: Binary indicator (dummy) whether respondent lives in a suburban neighborhood (1) or not (0).
• urban: Binary indicator (dummy) whether respondent lives in a city (1) or not (0).
• cars: Number of cars the respondent’s household owns.
• old voter: Variable indicating whether a respondent is older than 60 years (1) or not (0).
• cantonnr: Variable indicating in which of the 26 Swiss cantons a respondent lives.
• nodenomination: Share of citizens in a canton which do not have a denomination.
• urbanization: Share of citizens in a canton which live in urban areas.
 

2) "violence data v2.dta"
Cross-national data set covering the years from 1980 to 1997.
• code: Country code, alpha, 3-digit
• country: Country name
• africa: Dummy variable for Sub-Saharan African countries. (According to World Bank definition)
• agovdem80: Anti-government demonstrations: Any peaceful public gathering of at least 100 people.
• assass80: Assassinations: Number of assassinations per thousand population, decade average.
• blck80: Black Market Premium: Log of 1+ black market premium, decade average.
• cabchg80: Major Cabinet Changes: The number of times in a year that a new premier is named.
• compolt80: Dummy =1 for countries with genocidal incident involving political victims.
• constchg80: Major Constitutional Changes: The number of basic alterations in a state’s constitution.
• corrupti: Knack and Keefer measure of corruption (1980-89)
• coups80: Coups d’Etat: The number of extraconstitutional or forced changes in the top government.
• deathsPC80: Deaths from Political Violence per One Million Citizens
• democ80: Measure of democracy (Gastil’s Political Rights)
• elf60: Index of ethnolinguistic fractionalization, 1960. Measures probability that two randomly drawn individuals come from the same ethno-linguistic group.
• govtcris80: Major Government Crises: Any rapidly developing situation that threatens to bring government down.
• gunn1: Gunnemark1: Percent of population not speaking the official language.
• gunn2: Gunnemark2: Percent of population not speaking the most widely used language.
• gyp80: Growth rate of real per capita GDP.
• latinca: Dummy variable for Latin America and the Caribbean.
• lly80: Financial Depth: Ratio of liquid liabilities of the financial system to GDP, dec
• lrgdp80: Log of initial Income: Log of real per capita GDP measured at the start of each decade.
• lrgdpsq80: Log of initial (Income squared): Log of Initial real per capita GDP squared.
• lschool80: Log of Schooling: Log of 1 + average years of school attainment.
• ltelpw80: Log of Telephones per worker: Log of telephones per 1000 workers
• muller: Probability of two randomly selected individuals speaking different languages.
• newspaperPC80: Newspapers per capita
• pavroad80: Paved Roads (percent of total)
• pop80: Country Population
• purges80: Purges: Any systematic elimination by jailing or execution of political opposition.
• racialt: Racial tension for 1984, 1 (low tension) to 6 (high tension)
• radiosPC80: Radios per thousand population
• revols80: Revolutions: Any illegal or forced change in the top governmental elite.
• riots80: Riots: Any violent demonstration or clash of more than 100 citizens.
• roberts: Probability that two randomly selected individuals do not speak the same language.
• rulelaw80: Rule of Law: Law and order tradition (0-6 scale)
• surp80: Fiscal Surplus/GDP: Decade average of ratio of central government surplus (+) to expenditure (-).
• tvPC80: Televisions per thousand population
• war80: Dummy for war on national territory during the decade
• warciv80: Dummy for civil war
• wdiinfmt80: Infant Mortality Rate: Number of deaths of infants under one year old per 1000 births.
• wdilabag80: Labor force in farming/forestry/hunting/fishing as a percent of total labor force.
• wditfert80: Fertility: The average number of children born alive to a woman in her lifetime.
 

3) WDIdata.csv
Variable descriptions and definitions are available in the WDIdata Description.csv file. This file is available together with the exam datasets.

Question 1
This question consists of three parts (a, b, c) and you must answer all three parts. Use the dataset SwissData2011.dta and answer all parts of the question. You are hired as a political consultant by a Swiss political party and they ask you to write a report on left-right self-identification in the Swiss electorate.
a) Provide mean and median for age and left-right self-identification, respectively.
b) Using a t-test you should answer the following question: Are German-speaking voters more to the right than non-German speaking voters?
c) Using a t-test you should answer the following question: Are older people more to the right than younger people? Note: use variable old voter.

Question 2
This question consists of three parts (a, b, c) and you must answer all three parts. Use the dataset "violence data v2.dta" and answer all parts of the question. A widely acknowledged measure for the extent of poverty is the child mortality rate (wdiinfmt80). Your task is to build a statistical model explaining child mortality rates.
a) Find at least six theoretically important predictors in the supplied dataset and explain why you would expect them to have an effect on wdiinfmt80. Estimate a linear regression model and present your findings. Interpret your findings substantively and statistically. You should also discuss the model quality. (Note: Do not use latinca or africa in your model.)
b) Add two dummy variables for Africa and Latin America (africa, latinca) to your model from part (a) in Question 2. Does the inclusion of these two variables change any of your findings? Explain how one can statistically test whether these two variables should be included in the model. Implement your suggested test, present and interpret the test results. Explain how the inclusion of africa and latinca may help you deal with the omitted variable bias problem.
c) Civil wars unleash a number of negative consequences and one of them can be wide-spread poverty. In this third part you will analyze why wars occur. You should develop a logit model to explain why a country experienced a civil war (warciv80). Estimate the model and include at least four reasonable explanatory variables (no need to justify your choice, but make sure that one variable is continuous). Assess the quality of your model by looking at the correct predictions of civil wars from your model. Provide substantive and statistical interpretations of the effects and provide at least one figure to illustrate the key insight of your model.

Question 3
This question consists of four parts (a, b, c, d) and you must answer all four parts. Natural resource wealth is often associated with poor socio-economic outcomes. This is the so-called "resource curse" hypothesis that is most often associated with a working paper by Sachs & Warner (1995) which was further developed in their 2001 article: "Natural Resources and Economic Development: The curse of natural resources," European Economic Review 45: 827-838. Three main elements in the "curse" are negative effects of natural resource wealth on (1) economic growth, (2) risk of conflict, and (3) political regime type. The original research paper relied on a cross-sectional design. You’ve been hired by a think tank to produce a report on the empirical evidence for the resource curse hypothesis using panel data. Furthermore, the original article used a general concept of natural resources. It can be argued that natural resources differ in their impact on economic growth. For example, the effect of oil abundance may be more pronounced than the effect of natural gas abundance. In extension of the original Sachs & Warner argument your contract with the think tank states that you need to empirically assess whether the resource curse is detrimental to economic growth and whether various types of natural resources have a differential effect on economic growth.
For the exercise we prepared a sample of data from World Bank Development Indicators (WDI) WDIdata.csv. Please note that WDI variables may differ from variables used in the original study. Instead you should choose variables for your model conceptually based on your interpretation of the Sachs & Warner theory and any other readings you may have encountered. For the natural resource wealth, WDI provides data on rents received by the state from specific natural resources as a percentage of GDP. You may find variable description in the WDIdata Description.csv file.
Please note that some variables may have unequal coverage across countries and years as some of the data may be missing. That means that depending on your choice of the variables you may end up with a smaller effective estimation sample size than expected.
In your analysis you need to address the following questions:
a) Estimate the effect of resource curse on GDP growth rates for the full panel of countries and years. You should estimate individual, time, and twoway fixed effects models. Explain how your results change across these three models, and why that may be the case.
You should estimate your model using the relevant R packages as discussed in class and present model results in a minimally formatted table as specified in the exam description.
b) Explain how you would address the issue of serial correlation in your data. Implement both approaches that we covered in class, heteroskedasticity and autocorrelation consistent (HAC) standard errors and a lagged dependent variable (LDV), and discuss your findings. For the model with lagged dependent variable (LDV) also calculate immediate and long-term effects of natural resources on GDP growth.
You should estimate your model using the relevant R packages as discussed in class and present model results in a minimally formatted table as specified in the exam description. In addition, you need to show your calculations of the immediate (short-term) and long-term effects of natural resources on GDP growth.
c) Explain how you would address the issue of cross-sectional dependence. Implement the Driscoll-Kraay estimator (SCC estimator) to address cross-sectional dependence and explain your results. Please note that depending on the sparsity of your data (how much of the data is missing) the test for cross-sectional dependence may not have the expected performance.

