Wikipedia announced
that they had decided to give away their search data to the public for
free. Yes, they would simply give away search data to anyone who wanted
to download it. Shortly after the announcement, they said they had
“temporarily taken down this data to make additional improvements to the
anonymization protocol related to the search queries.”
My first reaction when I saw that Wikipedia was releasing this
information was: privacy issue! Imagine how people use Wikipedia. They
may search for family information, medical conditions, religious
beliefs, political beliefs and so on. If you can tie those search
patterns to the same user (for example, by IP address), you can in principle
work out who the searcher is and build a profile of the user and their
beliefs and tastes. Great for marketers, but potentially horrible for
the privacy of the searcher.
Back in 2006, AOL released this kind of data and was blasted for doing so. In fact, the New York Times profiled one of those searchers, Searcher No. 4417749, to prove this point. Heck, there was even a movie made around this leaked search data.
So when I heard Wikipedia was doing the same, I was a bit surprised.
Why did they decide to release it? Well, they listed three reasons:
(1) it provides valuable feedback to our editor community, who can
use it to detect topics of interest that are currently insufficiently
covered.
(2) we can improve our search index by benchmarking improvements against
real queries.
(3) we give outside researchers the opportunity to discover gems in the
data.
The data includes:
- Server hostname
- Timestamp (UTC)
- Wikimedia project
- URL encoded search query
- Total number of results
- Lucene score of best match
- Interwiki result
- Namespace (coded as integer)
- Namespace (human-readable)
- Title of best matching article
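To make the field list above concrete, here is a minimal Python sketch of how one record might be read. The announcement as quoted here does not say how the file is laid out, so the tab delimiter, the field order (taken from the list above), the `SearchRecord` and `parse_record` names, and the sample line are all assumptions for illustration, not the actual Wikimedia format.

```python
from dataclasses import dataclass
from urllib.parse import unquote_plus


@dataclass
class SearchRecord:
    """One search-log row; field order mirrors the list in the post (assumed)."""
    hostname: str
    timestamp: str          # UTC timestamp, kept as text because its format is unknown
    project: str
    query: str              # decoded from its URL-encoded form
    total_results: int
    best_score: float       # Lucene score of the best match
    interwiki: str
    namespace_id: int
    namespace_name: str
    best_title: str


def parse_record(line: str, sep: str = "\t") -> SearchRecord:
    """Split one line into the ten fields and decode the query.

    The separator is an assumption; change `sep` if the real dump differs.
    """
    parts = line.rstrip("\n").split(sep)
    if len(parts) != 10:
        raise ValueError(f"expected 10 fields, got {len(parts)}")
    host, ts, project, raw_query, total, score, interwiki, ns_id, ns_name, title = parts
    return SearchRecord(
        hostname=host,
        timestamp=ts,
        project=project,
        query=unquote_plus(raw_query),  # turns '+' and %XX escapes back into text
        total_results=int(total),
        best_score=float(score),
        interwiki=interwiki,
        namespace_id=int(ns_id),
        namespace_name=ns_name,
        best_title=title,
    )


# Hypothetical sample row, invented purely to exercise the parser.
SAMPLE = "search1\t2012-09-19T12:00:00\tenwiki\theart+disease%3F\t42\t1.5\t-\t0\tMain\tHeart disease"
record = parse_record(SAMPLE)
```

Note that the decoded query is exactly what the searcher typed. None of the listed fields names a user, yet the query text plus a timestamp can still say a lot about a person, which is the privacy concern described above.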
Again, Wikipedia has pulled down the data until they can figure out how to better anonymize it.