Don't try to block out the sun with your fingers!
Nicolas Rodriguez
OWASP AppSec Latam 2012
Tags: webapps, owasp


The massive adoption of social networks that store personal data has led to many privacy challenges.
Users of these applications trust them and are usually unaware of the privacy risks involved.
Social applications like Facebook, Twitter and LinkedIn must strike a balance between being functional
and limiting information-harvesting activities. The privacy options you set for your account must
be enforced at all costs. These web applications use a combination of techniques to limit automatic
information extraction, which may include an arrangement of security tokens, URL rewriting and large
amounts of JavaScript code, all aimed at making it hard, if not impossible, for a standard web crawler
to navigate the social network's content.
A frequent approach across most protections against information harvesting is to make it difficult
for an automated system to navigate the application, somehow distinguishing automated from human
activity. In most of these protections, automated navigation is understood as the process of fetching
a page, parsing its contents and extracting the target URLs, then starting the process over again.
Some additionally fingerprint what is called the expected navigation flow and behavior, aiming to
detect abnormal activity.
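The fetch-parse-extract cycle described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the URLs and HTML are made up for the example and are not from the talk.

```python
# Minimal sketch of the "classic" crawler loop these protections target:
# fetch a page, parse it, extract the target URLs, repeat.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags, resolved against a base URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    """One parse step of the cycle: raw HTML in, candidate URLs out."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

# One iteration of the cycle on a canned page (illustrative HTML).
page = '<a href="/profile/1">Alice</a> <a href="http://other.example/x">ext</a>'
print(extract_links(page, "http://social.example/"))
# A real crawler would fetch each extracted URL (e.g. with urllib.request)
# and repeat -- exactly the pattern these protections try to detect.
```

This kind of crawler never executes JavaScript, which is precisely why heavy client-side code and rewritten URLs are effective against it.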
In recent years, test-driven development (TDD) tools have provided a novel and practical way to
interact programmatically with web browsers, letting developers and testers harness the browser's
power through easy-to-write automation scripts when developing and testing web applications.
In this talk we will show how test-driven development can be used to write a new generation of web
crawlers capable of using the most powerful tool available for the job: the web browser. We also
present a target-based solution that works in a real-world scenario.
The techniques described in the talk shed some light on how information can be harvested by driving
a browser natively, as a user would. They make use of Selenium WebDriver, a suite of tools for
automating web browsers, together with Python and Mozilla Firefox.
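A browser-driven crawler in that spirit might look like the sketch below. This is a hypothetical illustration, not the talk's actual code: the pacing heuristic, page limit and URL filter are assumptions. Because a real Firefox renders each page, tokens, rewritten URLs and JavaScript are all handled exactly as for a human visitor.

```python
# Hypothetical sketch: crawling by steering a real Firefox via Selenium
# WebDriver, so the rendered DOM (after JavaScript) is what gets scraped.
import random
import time

def human_pause(base=1.0, spread=2.0):
    """Sleep a randomized interval so navigation timing resembles a person's."""
    delay = base + random.uniform(0.0, spread)
    time.sleep(delay)
    return delay

def crawl_with_browser(start_url, max_pages=10):
    # Deferred import: selenium is only needed when actually driving a browser.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    seen, queue, visited = set(), [start_url], []
    try:
        while queue and len(visited) < max_pages:
            url = queue.pop(0)
            if url in seen:
                continue
            seen.add(url)
            driver.get(url)       # the browser executes all client-side code
            human_pause()         # avoid machine-like request timing
            visited.append((url, driver.title))
            # Links come from the rendered DOM, not the raw HTML response.
            for anchor in driver.find_elements(By.TAG_NAME, "a"):
                href = anchor.get_attribute("href")
                if href and href.startswith(start_url):
                    queue.append(href)
    finally:
        driver.quit()
    return visited
```

Calling `crawl_with_browser("http://social.example/")` would return a list of `(url, title)` pairs for the pages visited; the example site URL is, of course, illustrative.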
We conclude that the techniques analyzed, aimed at limiting information harvesting, were not effective
at stopping a web crawler built on the premises presented here. Additional mitigations are discussed
as a simple way to make the application flow less predictable and more robust against information
harvesting.
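One way to make the flow less predictable, sketched here purely as an assumption (the abstract does not specify which mitigations are discussed), is to bind every internal URL to the visitor's session with an HMAC tag, so links harvested in one session do not resolve in another.

```python
# Hypothetical mitigation sketch: per-session URL rewriting with an HMAC tag.
# Function names, parameter format and tag length are illustrative assumptions.
import hashlib
import hmac
import secrets

def new_session_key():
    """Fresh secret generated when a session is created (e.g. at login)."""
    return secrets.token_bytes(32)

def rewrite_url(path, session_key):
    """Append a session-bound tag when rendering links into a page."""
    tag = hmac.new(session_key, path.encode(), hashlib.sha256).hexdigest()[:16]
    return f"{path}?t={tag}"

def check_url(path, tag, session_key):
    """Server-side check on each request: reject tags from other sessions."""
    expected = hmac.new(session_key, path.encode(), hashlib.sha256).hexdigest()[:16]
    return hmac.compare_digest(expected, tag)

# A link rewritten under one session's key fails verification under another's,
# so a harvested URL list is worthless outside the session that produced it.
key_a, key_b = new_session_key(), new_session_key()
link = rewrite_url("/profile/42", key_a)
path, tag = link.split("?t=")
print(check_url(path, tag, key_a), check_url(path, tag, key_b))
```

Note that a browser-driven crawler like the one presented in the talk would still obtain valid tags for its own session, so this only raises the cost of sharing or replaying harvested links, not of harvesting itself.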