Nonreactive and big data in the social sciences: methods and approaches

⌛ Year: 2018
🏢️ HSE

Mavletova Aigul Maratovna
E-mail: amavletova@hse.ru
Moscow 2018

About the instructors
Mavletova Aigul Maratovna
Associate Professor, Department of Sociology, HSE; Senior Research Fellow, Laboratory for Comparative Social Research, HSE.
Research interests:
• Different methodological aspects of surveys.
• Web surveys and mobile web surveys.
• Risk communication.
• Media and public opinion.
• Study of values.
Milosh Maria, teaching assistant
Research Assistant, International Center for the Study of Institutions and Development

About the course
• Introduction to R.
• Data collection from web in R.
• Practice-oriented course.

Lecture. Introduction to the course. Introduction to big data in social sciences.
Seminar 1. Introduction to R. Installing packages in R. Opening files. Working directory. Basic operations and objects in R. R Markdown.
Seminar 2. Introduction to R. Regular expressions and essential string functions. Basic data visualization. Basic text mining.
Seminar 3. Network analysis in R.
Seminar 4. Webscraping in R.
Seminar 5. Collecting Twitter data.
Seminar 6. Collecting data in Vkontakte. Collecting data in Facebook.

What this course is NOT about
• No machine learning.
• No such cool things as SQL, Python, etc.
• No deep discussion of theories – it’s on your own – check the program for the literature.

Home assignments
Form of assessment:
Home assignment 1. Introduction to network analysis.
Home assignment 2. Web Scraping in R.
Home assignment 3. Data collection in Twitter.
Home assignment 4. Data collection in VK and Facebook.
Each student should perform 3 out of 4 home assignments.
R & RStudio
• R is a language and environment for statistical computing and graphics
• Freely available and maintained by volunteers
• R is extensible; it can be expanded by installing “packages”
How to get it:
• http://www.r-project.org/ (or Google “Download R”)
• Free to install
Highly recommended:
• RStudio: a free IDE for R, http://www.rstudio.com/
• If you install R and RStudio, then you only need to run RStudio

• R is command-line driven
• Most analyses require writing a script, which is sourced into the R console
• RStudio makes this process easier
What’s so special about R?
• Free
• Over 4000 packages that add functionality
• Produces nice print-ready graphics
• Open-source
• Easy to install

Assumptions, Goals, Expectations
Assumptions
• No experience or some experience with R
Goals
To acquire the following competences:
• Skills to write basic scripts in R.
• Skills to collect online data via R.
• Skills of basic analysis of collected data.
Expectations
• You must practice and use R to learn successfully

Introduction to Big data

Big data
3 V:
• Volume
• Velocity
• Variety

Big data in social sciences
“The promise of the “big data” revolution is that in these data are the answers to fundamental questions of businesses, governments, and social sciences. Many of the most boisterous claims come from computational fields, which have little experience with the difficulty of social scientific inquiry. As social scientists, we may reassure ourselves that we know better. Our extensive experience with observational data means that we know that large datasets alone are insufficient for solving the most pressing of society’s problems.
We even may have taught courses on how selection, measurement error, and other sources of bias should make us skeptical of a wide range of problems.”

Big data in social sciences
- Social media or social networking data
- Mobile devices, sensors
- Transaction data: credit cards, highway/public transport passes, loyalty cards, phone records, browsing behavior, etc.

Big data in social sciences
The three research designs used in the social sciences are possible here as well:
- Experimental
- Non-experimental, observational
- Quasi-experimental

Experimental studies

Experiments

Internet censorship in China: Phase 1
Censorship magnitude = the percent censored within a volume burst minus the percent censored outside all bursts.
On average, censorship magnitude is 27% for collective action, but −1% and −4% for policy and news.

Internet censorship in China: Phase 2
Experimental study:
- 100 social media sites, including 97 of the top blogging sites in the country, representing 87% of blog posts.
- 3 rounds of experiments (18 to 28 April, 24 to 29 June, and 30 June to 4 July 2013) during which social media posts were written in real time about current issues.
- 12 events, 4 of them collective action events.

Internet censorship in China: Phase 2
Causal effect on censorship of posts for or against the government: posts that support the government are not more or less likely to be censored than posts that oppose the government, within the same topic.

Experiments
- All users aged 18+ in the US who accessed the Facebook website on 2 November 2010, the day of the US congressional elections.
Users were randomly assigned to one of three groups:
- ‘Social message’ group (n=60,055,176): a statement at the top of ‘News Feed’, plus up to six randomly selected profile pictures of friends who had already clicked the “I Voted” button.
- ‘Informational message’ group (n=611,044): no display of friends.
- Control group (n=613,096).

Two outcomes were measured:
• The effect on clicking the “I Voted” button.
• The effect on voting, through examination of public voting records.
- Those who received the social message were 0.39% more likely to vote than users who received no message at all.
- Turnout among those who received the informational message was identical to turnout among those in the control group.

The Facebook social message increased turnout by about 340,000 votes. Strong ties between friends proved much more influential than weak ties: “Close friends exerted about four times more influence on the total number of validated voters mobilized than the message itself.”

Experiments
N = 689,003
• Condition 1: exposure to friends’ positive emotional content in News Feed was reduced.
• Condition 2: exposure to friends’ negative emotional content in News Feed was reduced.
• Control condition.

Figure: Mean number of positive (upper) and negative (lower) emotion words (percent) generated by people, by condition.

“This was part of ongoing research companies do to test different products, and that was what it was; it was poorly communicated”

Observational studies
- Does social media create opportunities to expose individuals to more diverse viewpoints?
- Or has it led to the creation of “echo chambers” (in which individuals are exposed only to information from like-minded individuals)?

• Researchers examined how 10.1 million U.S. Facebook users interact with socially shared news.
• The sample consisted of people who self-reported their ideological affiliation on Facebook.
• 7 million distinct Web links shared by U.S.
users over a 6-month period between 7 July 2014 and 7 January 2015.

Of the news stories shared by liberals’ friends, 24% are cross-cutting, compared with 35% for conservatives.

“Despite the differences in what individuals consume across ideological lines, our work suggests that individuals are exposed to more cross-cutting discourse in social media”.

Observational studies
Michal Kosinski, David Stillwell, and Thore Graepel. Private traits and attributes are predictable from digital records of human behavior. PNAS, April 9, 2013, vol. 110, no. 15.

myPersonality
• myPersonality (2007-2012, www.mypersonality.org/wiki) was a popular Facebook application that allowed users to take psychometric tests, and allowed researchers to record their psychological and Facebook profiles.
• The database contains more than 6,000,000 test results, together with more than 4,000,000 individual Facebook profiles.
• Nearly 7.5 million people have completed a questionnaire.
• Users could rate the personalities of their FB friends. There are over 300,000 friend ratings.
• About 40% of users gave access to the data on their FB profiles.

myPersonality
Can likes on FB predict the following?
- sexual orientation
- ethnic origin
- political views
- religion
- personality
- intelligence
- satisfaction with life
- substance use (alcohol, drugs, cigarettes)
- whether an individual’s parents stayed together until the individual was 21 years old
- basic demographic attributes such as age, gender, relationship status, and size and density of the friendship network

myPersonality
The sample: 58,466 volunteers from the United States, obtained through the myPersonality Facebook application (www.mypersonality.org/wiki). The data included their Facebook profile information, a list of their Likes (n = 170 Likes per person on average), psychometric test scores, and survey information.

Some problems and challenges
- Population bias: socio-demographic characteristics are correlated with presence on social media.
- Self-selection within samples: partisans are more likely to post about politics (Barbera & Rivero, 2014).
- Proprietary algorithms for public data: Twitter, VK, and Facebook each impose different limits on downloading data.
- Bots.
- Human behavior is changing. How reliable can our analysis be over time? Machine learning techniques are data-driven but not theoretically driven.
- Web site algorithms (Google, Facebook, etc.) are changing.

Some problems and challenges
“Quantity of data does not mean that one can ignore foundational issues of measurement and construct validity and reliability and dependencies among data… The core challenge is that most big data are not the output of instruments designed to produce valid and reliable data amenable for scientific analysis.”

Some problems and challenges
What should we think about as social scientists:
- Transparency and replicability (Google, Facebook, etc. should be more transparent so that academics can replicate data and analyses; see the Google Flu case).
- Study the algorithm: Twitter, Facebook, Google etc.
are constantly changing because of the actions of millions of engineers and consumers. Researchers need a better understanding of how these changes occur over time.
- Use big data to understand the unknown (e.g. for Google Flu: understand the prevalence of flu at very local levels, which is not practical for the CDC to produce widely).

Ethics
- Can we collect and analyze data? Can we run experiments? Should we obtain informed consent for a particular study (on Facebook)?
- Most personal data can be de-anonymized.

Working in R

Tips for working in R
• R is case-sensitive
• Comment your code so you remember what it does; comments are preceded with #
• R scripts are simply text files with a .R extension
• Use Ctrl (or Alt) + Enter to run the code (Cmd/Alt + Enter on Mac)
• Use the Tab key to let R/RStudio finish typing commands for you
• Use Shift + down arrow to highlight lines or blocks of code
• Don’t be afraid of errors; you won’t break R
• If you get stuck, Google is your friend
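
A few of these tips can be tried directly in the R console. The snippet below is only a minimal sketch: the object names (`total`, `Total`, `scores`, `lookup`) are made up for illustration and do not come from the course materials.

```r
# Comments start with '#'; R ignores everything after it on the line.

# R is case-sensitive: `total` and `Total` are two different objects.
total <- 3
Total <- 10
sum_of_both <- total + Total

# A small numeric vector and a basic operation on it.
scores <- c(4, 8, 15)
average_score <- mean(scores)

# Errors do not break R. Here a name typed in the wrong case raises an
# "object not found" error, which tryCatch() turns into a plain message.
lookup <- tryCatch(
  TOTAL,
  error = function(e) "error caught: check the spelling and case"
)

print(sum_of_both)
print(average_score)
print(lookup)
```

Saving these lines in a text file with a .R extension and running them line by line from RStudio is one way to practice the shortcuts listed above.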