Google can only see what you make available on the front end of your site, so it can only index what you choose to make available.
Slightly off topic: Let's think for a minute about everything Automattic knows about the average WPTavern member (and WordPress user):
- Blog IP/URL, blog setup and what plugins we use, server setup and in many cases wp.com stats;
- Many websites use Akismet, so all your comments are checked;
- Most of us have a Gravatar, so Automattic does know at least 80% of the blogs we comment on. Every comment can easily be linked to your IP and your blogs/URLs. You could be really wary and never comment on your own blog, or use a totally different Gravatar account for your own blog, another email for wp.com stats, one more for PollDaddy, one more to comment on other sites, and so on, but that's BS: it would be really simple to query and link all of these together with a rather high precision rate;
- Automattic could easily know how we vote on any PollDaddy poll (I assume hardly anyone uses JAP or Tor; most people do not use proxies and have unique IPs when surfing);
- Many users will have the same login/password for wp.com as for their own blogs, and probably even for their day-to-day email.
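To make the Gravatar point concrete: Gravatar has historically keyed avatars on an MD5 hash of the trimmed, lowercased email address, so the same address yields the same hash on every blog. Below is a minimal Python sketch of how comments could be grouped by that hash. The addresses, blog names and function names are made up for illustration, and nothing here claims this is what Automattic actually does.

```python
import hashlib
from collections import defaultdict


def gravatar_hash(email: str) -> str:
    """MD5 of the trimmed, lowercased address (the classic Gravatar scheme)."""
    return hashlib.md5(email.strip().lower().encode("utf-8")).hexdigest()


def link_comments(comments):
    """Group (blog, email) pairs by avatar hash; returns hash -> set of blogs."""
    seen = defaultdict(set)
    for blog, email in comments:
        seen[gravatar_hash(email)].add(blog)
    return dict(seen)


# Hypothetical sample data, not taken from any real site.
comments = [
    ("blog-a.example", "Jane@Example.com "),
    ("blog-b.example", "jane@example.com"),
    ("blog-c.example", "other@example.com"),
]
linked = link_comments(comments)
```

The first two comments land under the same key even though the address was typed differently, which is all it takes to tie them to one commenter.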
I trust Automattic and wouldn't use their products if I didn't, but... yes, privacy should be a very important concern.
The information wordpress.org gets is not available on the internet. Some is, but not the totality. Intranets may be locked up but they still send information back to WordPress. Depending on what is written in custom plugins and themes, that could contain a whole heap of business intelligence.
In the past, I've written plugins that included my name, URL, email and phone number. I sure haven't done this since 2.3 but I saw similar information which also included product names in plugins yesterday (which is what brought this to a head for me).
If someone hacks into the wordpress.org servers, which has happened in the past, they could potentially get personally-identifiable information belonging to thousands (or millions) of sites.
If someone from within WordPress decided to grab that data and market it, there is nothing anyone can do about it. WordPress is a volunteer open source project that is not a legal entity and there is nobody to sue (although some courts might hold the core team accountable).
If someone got server access they could potentially sort the data and use this for targeted mass attacks against sites running vulnerable scripts.
wordpress.org is not Automattic so the Automattic privacy policy doesn't apply, but equally, Automattic is not entitled to even see the data that is being sent back.
Loads of "potentials" in there but the collection of such data does pose a risk. This wouldn't be a problem if people were made aware of what is being collected and given the option to opt in. Then the risk is theirs to suck up.
But aren't you complaining about the storing of URLs? And none of your reasons above seem to relate to that but to the other information that they're storing instead.
I'm actually opposed to any unnecessary data being captured. However, gotta be realistic here. There's a snowball's chance in hell that the team will accept a patch to make sure wordpress.org only captures the minimum data necessary to perform its update checks. Removing the blog URL, which is the single most significant piece of personal information (when combined with the other data), is a very simple and easy fix.
It would take only a few minutes.
Also, the blog URL is the only part of the disclosure that people got really upset over in 2007. While I may want all the data to be anonymised I'm not pushing my personal beliefs here. For me personally, it doesn't matter much - I remove all that stuff anyway. But when people are blocking update notifications over privacy issues it can affect me when I am asked to fix hacked sites. I'd rather that they didn't have any reason to disable the update checks.
Can anybody give an actual plausible scenario in which somebody could use this data for evil purposes?
Examine the info sent:
- The plugin information is sorta required to do a plugin update check.
- The version information is useful for statistical purposes (and yes, they do statistics on them, that's how they know 11% of WP owners still run PHP 4).
- The blog URL is basically a unique identifier. Sure, you could do a unique hash or something, but why bother? It's just a URL. This is not top secret information here.
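The "unique hash" alternative mentioned above is not much work either. Here is a rough Python sketch, assuming a per-site secret that is generated once and never leaves the server. The function name and the secret are illustrative, not anything WordPress actually does.

```python
import hashlib
import hmac


def site_identifier(blog_url: str, site_secret: bytes) -> str:
    """Return a stable, opaque ID for a site without revealing its URL.

    Assumption: site_secret is a random value created once per install
    and kept locally, so the ID cannot be recomputed from the URL alone.
    """
    normalized = blog_url.strip().rstrip("/")
    return hmac.new(site_secret, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


secret = b"example-secret-generated-at-install"  # hypothetical value
ident = site_identifier("http://example.com/", secret)
```

A plain unsalted hash of the URL could be confirmed by anyone who already suspects the URL, which is why the sketch keys the hash with a local secret. The server still gets a stable identifier for counting installs.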
And if somebody is stupid enough to block update notifications because of their paranoia, then I for one am perfectly okay with them a) getting hacked, b) getting annoyed about it, and then c) not using WordPress any more.
Note: There is a filter on the blog URL, BTW. It's a matter of three lines of code to make it "anonymous" while still doing update checks. Anybody blocking update checks based on paranoia needs to find somebody who knows WTF they're doing instead.
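For illustration only, here is the idea in Python rather than as an actual WordPress filter. It assumes the update check identifies the site with a user-agent string of the form `WordPress/<version>; <blog URL>`. The function name is made up.

```python
def anonymize_user_agent(user_agent: str) -> str:
    """Drop everything after the first ';', keeping only the product/version part."""
    product, _sep, _url = user_agent.partition(";")
    return product.strip()


ua = anonymize_user_agent("WordPress/2.8; http://example.com/")
```

The version survives, so the update check can still work out what is out of date, but the site no longer names itself in the request.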
Perhaps this is worthy of a separate thread, but I'm highly confused by this attitude towards privacy. I feel that if you don't want information made public, then keep it private. If information is discoverable, then it's already public. Disclosing it again doesn't make it more public...
I once made some kind of offhanded comment about anonymity on the internet. Some joker, using nothing more than "Otto", determined my name, found pictures of me, etc, etc. Didn't really bother me, as all this info is of course publicly available, but it did seem to upset some other people.
On a separate topic, one forum I used to participate on made IP information of posters public (if you knew what link to click on, that is). Somebody made a comment about how nobody knew who he was. I got his IP, found his hometown, then went through his past posts for a few minutes and found a clue which gave me his real name. Using that, I was able to produce a Google Map satellite image of his house. I zoomed out a bit (no identifying marks), copied the image, and posted it with the obligatory line of "Oh yeah, well I can see your house from here". Hella funny, of course, but the point is that within a few minutes, you can probably find out anything about anybody.
Try it on yourself sometime, using only one piece of information, see what you can discover. I know for a fact that somebody can get my address easily enough, but not my actual phone number. I've kept that particular information on a need to know only type of level. I've been considering using a Google Voice number for online interactions, actually...
Privacy is not dead, but it is wholly within your own control. It's a matter of disclosure. If you disclose something just once, then it's there forever. And companies like Google have made forever easily searchable. And yes, people can and do aggregate that searchable information. It's not a matter of whether it's the right thing to do or not... It can be done, so it will be done. This is how you have to treat privacy, because otherwise you're just depending on the kindness of strangers.
Well, I'm soon to be implementing a WP site in a high-security intranet. It needs to be able to see out, but it would be better if WP didn't send out a suite of information that could be used by somebody looking for vulnerabilities, such as a vulnerable PHP or MySQL version.
The URL wouldn't be useful to anyone outside the intranet, but as a part of a concerted attack it could well be very useful.
Don't forget, not all WP sites are personal blogs. They can be used for the dissemination of classified information within high security organisations. If WP's default behaviour is to phone home with information about plugins and, though I don't know this for sure, who wrote them, then that could give other vectors for an interested attacker to find entry points.
Ultimately I'm ambivalent about phoning home - it's easy to check what is and is not sent, and it's easy to block, I'm just saying, that's all.
If you have a high security organization in which software can send whatever it wants out to the internet, then I'm going to say you need to fire your security guys.
Okay, I grant you that intranet uses are there. However, if you're not allowing the internet to contact that intranet site, then it really doesn't make a whole heck of a lot of difference. The URL is going to be an internal name probably (and probably something generic like "blog" or "site" or what have you). The version and plugin info is basically useless for vulnerability information, since the site isn't accessible from outside the firewall anyway... I mean, I don't see any *real* threat here. Yes, yes, it's fine to argue in the theoretical, but that's all it is. Show me the actual problem, is all I'm saying.
Also, the majority of WP uses are internet-facing ones. Should we eliminate useful functionality for the majority in order to placate the extreme minority?
Short version: If somebody can already get into your network, then having them able to hack a WordPress installation on the inside via a plugin using plugin information that the system sent to wordpress.org seems like a rather unlikely threat.