805 Columbus Avenue
615 Interdisciplinary Science and Engineering Complex (ISEC)
Boston, MA 02120
ATTN: Christo Wilson, 635 ISEC
360 Huntington Avenue
Boston, MA 02115
Professor Wilson’s research interests are broadly focused on security, privacy, and transparency on the Web. Specific areas of interest include online tracking, the impact of algorithms on the Web, online social networks, and crowdturfing and propaganda on social media.
- PhD in Computer Science, University of California, Santa Barbara
- MS in Computer Science, College of Engineering at University of California, Santa Barbara
- BS in Computer Science, College of Creative Studies at University of California, Santa Barbara
Christo Wilson is an Associate Professor in the College of Computer and Information Science at Northeastern University. Professor Wilson received his PhD from the University of California, Santa Barbara, working under Professor Ben Y. Zhao. He was the recipient of the Outstanding Dissertation Award from UCSB in 2012 and received a Best Paper Award at SIGCOMM in 2011. He received an NSF CAREER Award in 2016, and his work is funded by Verisign, the Data Transparency Lab, the Knight Foundation, and the European Commission.
Professor Wilson performed the first large-scale measurements of Facebook in 2008 to understand how users form friendships and interact. These insights about the behavior of normal people enabled Professor Wilson to develop novel techniques for combating spam and fake accounts on social networks, even when the attacks are perpetrated by real people instead of automated software bots. These techniques have been successfully deployed on LinkedIn and Renren. Professor Wilson has shared anonymized social network datasets with over 500 research groups around the world and continues to open-source the code and data from his work examining algorithms and personalization on the Web.
Professor Wilson helped organize the first annual ACM Conference on Online Social Networks (COSN), and continues to serve on the program committees for several conferences, including WWW, IMC, and ICWSM. His work has been covered extensively in the press, including the CBS Evening News, Good Morning America, The Wall Street Journal, The Boston Globe, and The Washington Post.
Towards Methodologies and Tools for Conducting Algorithm Audits
This project will develop methodologies and tools for conducting algorithm audits. An algorithm audit uses controlled experiments to examine an algorithmic system, such as an online service or big data information archive, and ascertain (1) how it functions, and (2) whether it may cause harm.
Examples of documented harms by algorithms include discrimination, racism, and unfair trade practices. Although there is rising awareness of the potential for algorithmic systems to cause harm, actually detecting this harm in practice remains a key challenge. Given that most algorithms of concern are proprietary and non-transparent, there is a clear need for methods to conduct black-box analyses of these systems. Numerous regulators and governments have expressed concerns about algorithms, as well as a desire to increase transparency and accountability in this area.
This research will develop methodologies to audit algorithms in three domains that impact many people: online markets, hiring websites, and financial services. Auditing algorithms in these three domains will require solving fundamental methodological challenges, such as how to analyze systems with large, unknown feature sets, and how to estimate feature values without ground-truth data. To address these broad challenges, the research will draw on insights from prior experience auditing personalization algorithms. Each domain also brings unique challenges that will be addressed individually. For example, novel auditing tools will be constructed that leverage extensive online and offline histories. These new tools will allow examination of systems that were previously inaccessible to researchers, including financial services companies. Methodologies, open-source code, and datasets will be made available to other academic researchers and regulators.
This project includes two integrated educational objectives: (1) to create a new computer science course on big data ethics, teaching how to identify and mitigate harmful side-effects of big data technologies, and (2) to produce web-based versions of the auditing tools that are accessible and informative to the general public. These public tools will increase transparency around specific, prominent algorithmic systems and promote general education about the proliferation and impact of algorithmic systems.
Towards Confederated Web-Based Services
Users today have access to a broad range of free, web-based social services. All of these services operate under a similar model: Users entrust the service provider with their personal information and content, and in return, the service provider makes their service available for free by monetizing the user-provided information and selling the results to third parties (e.g., advertisers). In essence, users pay for these services by providing their data (i.e., giving up their privacy) to the provider.
This project is using cloud computing to re-architect web-based services in order to enable end users to regain privacy and control over their data. In this approach—a confederated architecture—each user provides the computing resources necessary to support her use of the service via cloud providers. All user data is encrypted and not exposed to any third-parties, users retain control over their information, and users access the service via a web browser as normal.
The incredible popularity of today’s web-based services has led to significant concerns over privacy and user control over data. Addressing these concerns requires rethinking the currently popular web-based business models, and, unfortunately, existing providers are disincentivized from doing so. The impact of this project will potentially be felt by the millions of users who use today’s popular services, who will be provided with an alternative to the business models of today.
Towards Transparency of Personalization on the Web
This project will develop new research methods to map and quantify the ways in which online search engines, social networks and e-commerce sites use sophisticated algorithms to tailor content to each individual user. This “personalization” may often be of value for the user, but it also has the potential to distort search results and manipulate the perceptions and behavior of the user. Given the popularity of personalization across a variety of Web-based services, this research has the potential for extremely broad impact. Being able to quantify the extent to which Web-based services are personalized will lead to greater transparency for users, and the development of tools to identify personalized content will allow users to access information that may be hard to access today.
Personalization is now a ubiquitous feature on many Web-based services. In many cases, personalization provides advantages for users, because personalization algorithms are likely to return results that are relevant to the user. At the same time, the increasing levels of personalization in Web search and other systems are leading to growing concerns over the Filter Bubble effect, where users are only given results that the personalization algorithm thinks they want, while other important information remains inaccessible. From a computer science perspective, personalization is simply a tool that is applied to information retrieval and ranking problems. However, sociologists, philosophers, and political scientists argue that personalization can result in inadvertent censorship and “echo chambers.” Similarly, economists warn that unscrupulous companies can leverage personalization to steer users towards higher-priced products, or even implement price discrimination, charging different users different prices for the same item. As the pervasiveness of personalization on the Web grows, it is clear that techniques must be developed to understand and quantify personalization across a variety of Web services.
This research has four primary thrusts: (1) To develop methodologies to measure personalization of mobile content. The increasing popularity of browsing the Web from mobile devices presents new challenges, as these devices have access to sensitive content like the user’s geolocation and contacts. (2) To develop systems and techniques for accurately measuring the prevalence of several personalization trends on a large number of e-commerce sites. Recent anecdotal evidence has shown instances of problematic sales tactics, including price steering and price discrimination. (3) To develop techniques to identify and quantify personalized political content. (4) To measure the extent to which financial and health information is personalized based on location and socio-economic status. All four of these thrusts will develop new research methodologies that may prove effective in other areas of research as well.
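As a rough illustration of what "quantifying personalization" can mean in practice, the sketch below compares the result lists returned to different accounts for the same query. It is a minimal example under stated assumptions, not the project's actual methodology: the use of Jaccard overlap, the two-control design for estimating baseline noise, and the function names `jaccard` and `personalization_score` are all hypothetical choices made for this example.

```python
def jaccard(a, b):
    """Jaccard similarity of two result lists, ignoring order."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty result lists are treated as identical
    return len(a & b) / len(a | b)


def personalization_score(control_a, control_b, treatment):
    """Difference attributable to the treatment, beyond baseline noise.

    control_a and control_b are results for two identical accounts that
    issue the same query at the same time; any difference between them
    is treated as noise (assumption of this sketch). The treatment
    account differs from the controls in exactly one feature.
    """
    noise = 1.0 - jaccard(control_a, control_b)
    difference = 1.0 - jaccard(control_a, treatment)
    return max(0.0, difference - noise)


if __name__ == "__main__":
    # Hypothetical result lists (URLs shortened to single letters).
    control_a = ["a", "b", "c", "d"]
    control_b = ["a", "b", "c", "e"]
    treatment = ["a", "b", "x", "y"]
    print(personalization_score(control_a, control_b, treatment))
```

Subtracting the control-versus-control difference matters because many services return slightly different results to identical requests; without that baseline, ordinary noise would be mistaken for personalization.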
PUBLICATIONS PRODUCED AS A RESULT OF THIS RESEARCH
David Lazer, Ryan Kennedy, Gary King, Alessandro Vespignani. “The Parable of Google Flu: Traps in Big Data Analysis,” Science, v.343, 2014, p. 1203.
R. Epstein, R. Robertson, D. Lazer, C. Wilson. “Suppressing the Search Engine Manipulation Effect (SEME),” Proceedings of the ACM on Human Computer Interaction, November 2017.
Internet search rankings have a significant impact on consumer choices, mainly because users trust and choose higher-ranked results more than lower-ranked results. Given the apparent power of search rankings, we asked whether they could be manipulated to alter the preferences of undecided voters in democratic elections. Here we report the results of five relevant double-blind, randomized controlled experiments, using a total of 4,556 undecided voters representing diverse demographic characteristics of the voting populations of the United States and India. The fifth experiment is especially notable in that it was conducted with eligible voters throughout India in the midst of India’s 2014 Lok Sabha elections just before the final votes were cast. The results of these experiments demonstrate that (i) biased search rankings can shift the voting preferences of undecided voters by 20% or more, (ii) the shift can be much higher in some demographic groups, and (iii) search ranking bias can be masked so that people show no awareness of the manipulation. We call this type of influence, which might be applicable to a variety of attitudes and beliefs, the search engine manipulation effect. Given that many elections are won by small margins, our results suggest that a search engine company has the power to influence the results of a substantial number of elections with impunity. The impact of such manipulations would be especially large in countries dominated by a single search engine company.
James Larisch, David Choffnes, Dave Levin, Bruce M. Maggs, Alan Mislove, and Christo Wilson. In Proceedings of IEEE Symposium on Security and Privacy (Oakland 2017). San Jose, CA, May 2017.
Currently, no major browser fully checks for TLS/SSL certificate revocations. This is largely due to the fact that the deployed mechanisms for disseminating revocations (CRLs, OCSP, OCSP Stapling, CRLSet, and OneCRL) are each either incomplete, insecure, inefficient, slow to update, not private, or some combination thereof. In this paper, we present CRLite, an efficient and easily-deployable system for proactively pushing all TLS certificate revocations to browsers. CRLite servers aggregate revocation information for all known, valid TLS certificates on the web, and store them in a space-efficient filter cascade data structure. Browsers periodically download and use this data to check for revocations of observed certificates in real-time. CRLite does not require any additional trust beyond the existing PKI, and it allows clients to adopt a fail-closed security posture even in the face of network errors or attacks that make revocation information temporarily unavailable. We present a prototype of CRLite that processes TLS certificates gathered by Rapid7, the University of Michigan, and Google’s Certificate Transparency on the server-side, with a Firefox extension on the client-side. Comparing CRLite to an idealized browser that performs correct CRL/OCSP checking, we show that CRLite reduces latency and eliminates privacy concerns. Moreover, CRLite has low bandwidth costs: it can represent all certificates with an initial download of 10 MB (less than 1 byte per revocation) followed by daily updates of 580 KB on average. Taken together, our results demonstrate that complete TLS/SSL revocation checking is within reach for all clients.
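The "filter cascade" mentioned in the abstract above can be illustrated with a small sketch. This is a generic Bloom filter cascade written for illustration, not CRLite's implementation: the class and function names, the SHA-256-based hashing, and the sizing parameters (`bits_per_item`, `n_hashes`) are all assumptions of this example. The idea is that when the full set of certificates is known in advance, each level stores the false positives of the previous level, so lookups become exact for every known certificate.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; sizing parameters are illustrative only."""

    def __init__(self, n_items, bits_per_item=10, n_hashes=4, salt=""):
        self.size = max(8, n_items * bits_per_item)
        self.n_hashes = n_hashes
        self.salt = salt  # a different salt per level decorrelates levels
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, item):
        for k in range(self.n_hashes):
            data = f"{self.salt}:{k}:{item}".encode()
            digest = hashlib.sha256(data).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


def build_cascade(revoked, valid):
    """Build levels until no false positives remain.

    Level 0 holds the revoked set; each later level holds the members
    of the opposite set that the previous level wrongly matched.
    """
    include = set(revoked)
    exclude = set(valid) - include  # the two sets must be disjoint
    levels = []
    while include:
        bf = BloomFilter(len(include), salt=str(len(levels)))
        for item in include:
            bf.add(item)
        levels.append(bf)
        false_positives = {x for x in exclude if x in bf}
        include, exclude = false_positives, include
    return levels


def is_revoked(levels, cert_id):
    """Exact for any cert_id that was in `revoked` or `valid` at build time."""
    for depth, bf in enumerate(levels):
        if cert_id not in bf:
            # A miss at an odd depth means the previous level's match was real.
            return depth % 2 == 1
    return len(levels) % 2 == 1


if __name__ == "__main__":
    revoked = {f"cert-{i}" for i in range(200)}        # hypothetical IDs
    valid = {f"cert-{i}" for i in range(200, 5000)}
    cascade = build_cascade(revoked, valid)
    print(len(cascade), is_revoked(cascade, "cert-7"), is_revoked(cascade, "cert-4000"))
```

Because each level is much smaller than the one before it, the total size stays close to that of the first filter, which is what makes the structure attractive for pushing a complete revocation list to clients. For identifiers outside the two build-time sets, the answer is not guaranteed to be correct.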
Zhenhua Li, Weiwei Wang, Tianyin Xu, Xin Zhong, Xiang-Yang Li, Yunhao Liu, Christo Wilson, and Ben Y. Zhao. In Proceedings of the 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2016). Santa Clara, CA, March 2016.
As mobile cellular devices and traffic continue their rapid growth, providers are taking larger steps to optimize traffic, with the hopes of improving user experiences while reducing congestion and bandwidth costs. This paper presents the design, deployment, and experiences with Baidu TrafficGuard, a cloud-based mobile proxy that reduces cellular traffic using a network-layer VPN. The VPN connects a client-side proxy to a centralized traffic processing cloud. TrafficGuard works transparently across heterogeneous applications, and effectively reduces cellular traffic by 36% and overage instances by 10.7 times for roughly 10 million Android users in China. We discuss a large-scale cellular traffic analysis effort, how the resulting insights guided the design of TrafficGuard, and our experiences with a variety of traffic optimization techniques over one year of deployment.
Aniko Hannak, Claudia Wagner, David Garcia, Alan Mislove, Markus Strohmaier, and Christo Wilson. In 20th ACM Conference on Computer-Supported Cooperative Work and Social Computing (CSCW 2017). Portland, OR, February 2017.
Online freelancing marketplaces have grown quickly in recent years. In theory, these sites offer workers the ability to earn money without the obligations and potential social biases associated with traditional employment frameworks. In this paper, we study whether two prominent online freelance marketplaces – TaskRabbit and Fiverr – are impacted by racial and gender bias. From these two platforms, we collect 13,500 worker profiles and gather information about workers’ gender, race, customer reviews, ratings, and positions in search rankings. In both marketplaces, we find evidence of bias: we find that gender and race are significantly correlated with worker evaluations, which could harm the employment opportunities afforded to the workers. We hope that our study fuels more research on the presence and implications of discrimination in online environments.
Bashir, Muhammad Ahmad, Arshad, Sajjad, Robertson, William, and Wilson, Christo. Tracing Information Flows Between Ad Exchanges Using Retargeted Ads. USENIX Security Symposium, Austin, TX, USA, August 2016.
Numerous surveys have shown that Web users are concerned about the loss of privacy associated with online tracking. Alarmingly, these surveys also reveal that people are unaware of the amount of data sharing that occurs between ad exchanges, and thus underestimate the privacy risks associated with online tracking. In reality, the modern ad ecosystem is fueled by a flow of user data between trackers and ad exchanges. Although recent work has shown that ad exchanges routinely perform cookie matching with other exchanges, these studies are based on brittle heuristics that cannot detect all forms of information sharing, especially under adversarial conditions. In this study, we develop a methodology that is able to detect client- and server-side flows of information between arbitrary ad exchanges. Our key insight is to leverage retargeted ads as a tool for identifying information flows. Intuitively, our methodology works because it relies on the semantics of how exchanges serve ads, rather than focusing on specific cookie matching mechanisms. Using crawled data on 35,448 ad impressions, we show that our methodology can successfully categorize four different kinds of information sharing behavior between ad exchanges, including cases where existing heuristic methods fail. We conclude with a discussion of how our findings and methodologies can be leveraged to give users more control over what kind of ads they see and how their information is shared between ad exchanges.