Nonreactive and big data in the social sciences:
methods and approaches
Mavletova Aigul Maratovna
E-mail: amavletova@hse.ru
Moscow 2018
About the instructors
Mavletova Aigul Maratovna
Associate Professor, Department of Sociology, HSE; Senior Research
Fellow, Laboratory for Comparative Social Research, HSE.
Research interests:
•Different methodological aspects of surveys.
•Web surveys and mobile web surveys.
•Risk communication.
•Media and public opinion.
•Study of values.
Milosh Maria - teaching assistant
Research assistant, International Center for the Study of Institutions
and Development
About the course
•Introduction to R.
•Data collection from web in R.
•Practice-oriented course.
About the course
Lecture – Introduction to the course. Introduction to big data in
social sciences.
Seminar 1. Introduction to R. Installing packages in R. Opening
files. Working directory. Basic operations and objects in R. R
Markdown.
Seminar 2. Introduction to R. Regular expressions and essential
string functions. Basic data visualization. Basic text mining.
Seminar 3. Network analysis in R.
Seminar 4. Webscraping in R.
Seminar 5. Collecting Twitter data.
Seminar 6. Collecting data in Vkontakte. Collecting data in
Facebook.
What this course is NOT about
•No machine learning.
•No such cool things as SQL, Python, etc.
•No deep discussion of theories: that part is on your own; check the
program for the literature.
Home assignments
Form of assessment
Home assignment 1. Introduction to network analysis.
Home assignment 2. Web Scraping in R.
Home assignment 3. Data collection in Twitter.
Home assignment 4. Data collection in VK and Facebook.
Each student must complete 3 of the 4 home assignments.
R & R studio
• R is a language and environment for statistical computing and graphics
• Freely available and maintained by volunteers
• R is extensible; it can be expanded by installing “packages”
How to get it:
• http://www.r-project.org/ (or Google “Download R”)
• Free to install
Highly recommended:
• R Studio: a free IDE for R http://www.rstudio.com/
• If you install R and R Studio, then you only need to run R
Studio
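A minimal sketch of how installing and loading a package could look in practice. The helper name ensure_package is hypothetical (it is not part of R); it only wraps the standard install.packages() and requireNamespace() calls, and installing anything requires an internet connection.

```r
# Hypothetical helper: install a package only if it is not available yet.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from CRAN; needs an internet connection
  }
  # Report (invisibly) whether the package can now be loaded
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "stats" ships with R, so nothing is downloaded in this example.
ensure_package("stats")
library(stats)  # load the package into the current session
```

A package is installed once per computer, but library() has to be called in every new R session that uses it.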
R & R studio
• R is command-line driven
• Most analyses require writing a script, which is sourced into
the R console
• R Studio makes this process easier
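As a rough illustration of what "sourcing a script" means, the sketch below writes a two-line script to a temporary file and runs it with source(). The file contents and the object names x and x_mean are made up for this example; in practice the script would be a .R file you wrote yourself.

```r
# Create a small throwaway script file (normally you would write this in an editor)
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "x <- c(2, 4, 6)      # a numeric vector",
  "x_mean <- mean(x)    # its arithmetic mean"
), script_path)

# Run every line of the script, as if it had been typed into the console
source(script_path)

# Objects created by the script are now available in the session
print(x_mean)
```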
What’s so special about R?
• Free
• Over 4000 packages that add functionality
• Produces nice print-ready graphics
• Open-source
• Easy to install
Assumptions, Goals, Expectations
Assumptions
• No experience or some experience with R
Goals
To acquire the following competences:
•Skills to write basic scripts in R.
•Skills to collect online data via R.
•Skills of basic analysis of collected data.
Expectations
• You must practice and use R to learn successfully
Introduction to Big data
Big data
The 3 Vs:
•Volume
•Velocity
•Variety
Big data in social sciences
“The promise of the “big data” revolution is that in these data are the
answers to fundamental questions of businesses, governments, and social
sciences. Many of the most boisterous claims come from computational
fields, which have little experience with the difficulty of social scientific
inquiry.
As social scientists, we may reassure ourselves that we know better. Our
extensive experience with observational data means that we know that large
datasets alone are insufficient for solving the most pressing of society’s
problems. We even may have taught courses on how selection,
measurement error, and other sources of bias should make us skeptical of a
wide range of problems.”
Big data in social sciences
- Social media or social networking data
- Mobile devices, sensors
- Transaction data: credit cards, highway/public transport passes, loyalty cards, phone records, browsing behavior, etc.
- Etc.
Big data in social sciences
The three research designs used in the social sciences are possible here as well:
- Experimental
- Non-experimental, observational
- Quasi-experimental
Experimental studies
Experiments
Internet censorship in China: Phase 1
Censorship magnitude = the percent censored within a volume burst minus the percent censored outside all bursts.
On average, censorship magnitude is 27% for collective action, but −1% and −4% for policy and news.
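The definition of censorship magnitude can be worked through in a few lines of R. All post counts below are invented for illustration and are not taken from the study; the function name pct_censored is likewise hypothetical.

```r
# Hypothetical counts for one topic area (illustrative only, not study data)
posts_in_burst    <- 500   # posts written during a volume burst
censored_in_burst <- 200   # of those, how many were censored
posts_outside     <- 1000  # posts written outside all bursts
censored_outside  <- 150   # of those, how many were censored

# Percent of posts that were censored
pct_censored <- function(censored, total) 100 * censored / total

# Censorship magnitude: percent censored inside the burst
# minus percent censored outside all bursts
censorship_magnitude <- pct_censored(censored_in_burst, posts_in_burst) -
  pct_censored(censored_outside, posts_outside)

print(censorship_magnitude)
```

A positive value means posts are censored more heavily during a burst than at other times; a value near zero or below means the burst attracts no extra censorship.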
Internet censorship in China: Phase 2
Experimental study:
- 100 social media sites, including 97 of the top blogging sites
in the country, representing 87% of blog posts.
- 3 rounds of experiments (18 to 28 April, 24 to 29 June, and
30 June to 4 July 2013) during which social media posts were
written in real time about current issues.
- 12 events, 4 of them collective action events.
Internet censorship in China: Phase 2
The causal effect on censorship of posts for or against the
government. Posts that support the government are not more or
less likely to be censored than posts that oppose the government,
within the same topic.
Experiments
- Participants: all users aged 18+ in the US who accessed the Facebook website on 2 November 2010, the day of the US congressional elections.
Users were randomly assigned to one of three groups:
- ‘Social message’ group (n = 60,055,176): a statement at the top of the ‘News Feed’, plus the profile pictures of up to six randomly selected friends who had already clicked the “I Voted” button
- ‘Informational message’ group (n = 611,044): no display of friends
- Control group (n = 613,096)
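Random assignment to three groups of very unequal size can be sketched with sample(). The group sizes come from the slide; treating their shares as assignment probabilities, the number of simulated users, and the seed are assumptions made only for this illustration, not a description of how Facebook implemented the experiment.

```r
set.seed(2010)  # arbitrary seed so the example is reproducible

# Group sizes as reported on the slide
group_sizes <- c(social = 60055176, informational = 611044, control = 613096)

# Assumption: use each group's share of all users as its assignment probability
assignment_prob <- group_sizes / sum(group_sizes)

# Randomly assign a hypothetical sample of 10,000 users
n_users  <- 10000
assigned <- sample(names(group_sizes), size = n_users,
                   replace = TRUE, prob = assignment_prob)

# How many simulated users ended up in each group
print(table(assigned))
```

Because assignment is random, the groups differ only by chance before the treatment, which is what allows differences in turnout to be read as effects of the message.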
Experiments
• Effect on clicking the “I Voted” button
• The effect on voting, through examination of public voting records:
- Those who received the social message were 0.39% more likely to vote than users who received no message at all.
- Turnout among those who received the informational message was identical to turnout among those in the control group.
Experiments
The Facebook social message increased turnout by about
340,000 votes.
Strong ties between friends proved much more influential
than weak ties: “Close friends exerted about four times
more influence on the total number of validated voters
mobilized than the message itself.”
Experiments
N = 689,003
•Condition 1: exposure to friends’ positive emotional content in News
Feed was reduced,
•Condition 2: exposure to friends’ negative emotional content in News
Feed was reduced.
•Control condition
Figure: mean number of positive (upper) and negative (lower) emotion words (percent) generated by people, by condition.
Experiments
“This was part of ongoing research companies do to test different products,
and that was what it was; it was poorly communicated”
Observational studies
Observational studies
- Does social media expose individuals to more diverse viewpoints?
- Or has it led to the creation of “echo chambers”, in which individuals are exposed only to information from like-minded individuals?
Observational studies
•Researchers examined how 10.1 million U.S. Facebook
users interact with socially shared news.
•The sample: people who self-reported their ideological affiliation on
Facebook.
•7 million distinct Web links shared by U.S. users over a
6-month period between 7 July 2014 and 7 January 2015
Observational studies
Of the news stories shared by liberals’ friends, 24% are
cross-cutting, compared with 35% for conservatives.
Observational studies
“Despite the differences in what individuals consume
across ideological lines, our work suggests that individuals
are exposed to more cross-cutting discourse in social
media”.
Observational studies
Michal Kosinski, David Stillwell, and Thore Graepel. Private traits and
attributes are predictable from digital records of human behavior.
PNAS, April 9, 2013, vol. 110, no. 15.
myPersonality
•myPersonality (2007-2012, www.mypersonality.org/wiki) was
a popular Facebook application that allowed users to take
psychometric tests, and allowed researchers to record their
psychological and Facebook profiles.
•The database contains more than 6,000,000 test results,
together with more than 4,000,000 individual Facebook
profiles.
•Nearly 7.5 million people have completed a questionnaire.
•Users could rate the personalities of their FB friends. There
are over 300,000 friend ratings.
•About 40% of users gave access to the data on their FB
profiles
myPersonality
Can Likes on FB predict the following?
- sexual orientation
- ethnic origin
- political views
- religion
- personality
- intelligence
- satisfaction with life
- substance use (alcohol, drugs, cigarettes)
- whether an individual’s parents stayed together until the individual was 21 years old
- basic demographic attributes such as age, gender, relationship status, and size and density of the friendship network
myPersonality
Sample: 58,466 volunteers from the United States, obtained through
the myPersonality Facebook application
(www.mypersonality.org/wiki). The data included their Facebook
profile information, a list of their Likes (170 Likes per
person on average), psychometric test scores, and survey
information.
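The slides do not describe the modelling itself. The published study reports reducing a user-by-Like matrix with singular value decomposition and then fitting regression models; the sketch below imitates that idea on simulated data only. The matrix size, the number of components, and the way the trait is generated are all assumptions chosen to keep the example small.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated user-by-Like matrix: 1 = user liked the page, 0 = did not
n_users <- 200
n_likes <- 50
likes <- matrix(rbinom(n_users * n_likes, size = 1, prob = 0.2), nrow = n_users)

# Simulated binary trait that depends on the first three Likes (pure invention)
linear_part <- -1 + 1.5 * likes[, 1] + 1.5 * likes[, 2] - 1.5 * likes[, 3]
trait <- rbinom(n_users, size = 1, prob = plogis(linear_part))

# Reduce the Like matrix to 5 components with singular value decomposition
centered   <- scale(likes, center = TRUE, scale = FALSE)
decomp     <- svd(centered, nu = 5, nv = 0)
components <- decomp$u %*% diag(decomp$d[1:5])

# Logistic regression of the trait on the components
fit  <- glm(trait ~ components, family = binomial)
pred <- predict(fit, type = "response")  # predicted probabilities

# Share of users classified correctly (in-sample, so optimistic)
accuracy <- mean((pred > 0.5) == trait)
print(accuracy)
```

A real analysis would evaluate predictions on users not used for fitting, for example with cross-validation, rather than in-sample as done here.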
Some problems and challenges
- Population bias: socio-demographic characteristics are correlated with presence on social media.
- Self-selection within samples: partisans are more likely to post about politics (Barbera & Rivero, 2014).
- Proprietary algorithms for public data: Twitter, VK, and Facebook each impose different limits on downloading data.
- Bots.
- Human behavior is changing. How reliable can our analysis be over time? Machine learning techniques are data-driven but not theoretically driven.
- Web site algorithms (Google, Facebook, etc.) are changing.
Some problems and challenges
“Quantity of data does not mean that one can ignore
foundational issues of measurement and construct validity and
reliability and dependencies among data…
The core challenge is that most big data are not the output of
instruments designed to produce valid and reliable data
amenable for scientific analysis.”
Some problems and challenges
What should we consider as social scientists?
- Transparency and replicability: Google, Facebook, etc. should be more transparent so that academics can replicate data and analyses (the Google Flu case).
- Study the algorithm: Twitter, Facebook, Google, etc. are constantly changing because of the actions of millions of engineers and consumers. Researchers need a better understanding of how these changes occur over time.
- Use big data to understand the unknown (e.g. for Google Flu: understand the prevalence of flu at very local levels, which is not practical for the CDC to widely produce).
Ethics
- Can we collect and analyze data? Can we run experiments? Do we need informed consent for a particular study (e.g. on Facebook)?
- Most personal data can be de-anonymized.
Working in R
Tips for working in R
• R is case-sensitive
• Comment your code so you remember what it does; comments
are preceded with #
• R scripts are simply text files with a .R extension
• Use Ctrl (or Alt) + Enter to run the code (Cmd/Alt + Enter on Mac)
• Use the Tab key to let R/R Studio finish typing commands for
you
• Use Shift + down arrow to highlight lines or blocks of code
• Don’t be afraid of errors; you won’t break R
• If you get stuck, Google is your friend
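A few of these tips can be tried directly in the console. The object names below are made up for the example; the point is only to show comments, case sensitivity, and a harmless error.

```r
# Comments start with '#'; R ignores the rest of the line.
my_value <- 10   # lower-case name
My_Value <- 20   # a different object: R is case-sensitive

total <- my_value + My_Value
print(total)

# Using the wrong capitalisation produces an error, but nothing breaks.
# tryCatch() is used here only so the example keeps running after the error.
result <- tryCatch(MY_VALUE, error = function(e) "object not found")
print(result)
```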