Python Crawlers: the urllib Module

Date: 2023-03-10 06:59:06
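  urllib is a high-level web communication library. Its core job is to act like a web browser or other client: it requests a resource and returns a file-like object. urllib supports several protocols, such as HTTP, FTP, and Gopher, and it can also access local files. In practice it is mostly used to write crawlers, and the rest of this article shows how to write a simple crawler with it. Content generated dynamically by JavaScript, such as lazily loaded images, needs more advanced techniques; all the examples here target static HTML pages.

  Everything below applies to Python 2.7. Other versions differ (in Python 3 these functions live in urllib.request and urllib.parse), so check the official documentation for the version you use.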

  First, suppose I want to write a crawler that fetches the images on a website. The work splits into the four steps covered in sections 1 to 4 below: open the target website, operate on the returned file-like object, process the page source that was read, and download the resources.

  Of course, you can also process and analyze the scraped data in other ways. For example, you could build a price-comparison site that collects quotes from many sites and merges them. The steps above therefore generalize to crawlers of any kind; only the data being extracted and what is done with it change.

  Following this general outline, let's go through what each step involves.


1. Open the target website

urllib.urlopen(url[, data[, proxies[, context]]])

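  Sends a request to the given url and returns a file-like object. (Note: the remote request is issued at this point, so the call needs network access and consumes traffic.)

  url : The full path of a remote resource, usually a website. (Include the protocol prefix, for example http://www.baidu.com/; the http:// cannot be omitted.)

      If the URL has no scheme, or its scheme is file:, the function opens a local file. If the connection cannot be made, an IOError is raised.

  data : An optional parameter for http:// URLs. Supplying it makes the request a POST (the default is GET). The value must be in the standard application/x-www-form-urlencoded format, which urlencode() produces.

  proxies : Configures proxies; see the official documentation if you need it. The example from the official documentation: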

# Use http://www.someproxy.com:3128 for HTTP proxying
proxies = {'http': 'http://www.someproxy.com:3128'}
filehandle = urllib.urlopen(some_url, proxies=proxies)
# Don't use any proxies
filehandle = urllib.urlopen(some_url, proxies={})
# Use proxies from environment - both versions are equivalent
filehandle = urllib.urlopen(some_url, proxies=None)
filehandle = urllib.urlopen(some_url)

  context : For HTTPS connections, this can be set to an ssl.SSLContext instance that configures SSL.

  In most cases, setting only url is enough.

  For example:

f = urllib.urlopen('http://www.baidu.com/')

  This gives us a file-like object, which we can then read in various ways.
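
  Since urlopen() raises IOError when the address cannot be opened, the call is often wrapped in a small helper. The following is a minimal sketch, not part of urllib itself: the name fetch is hypothetical, and the fallback import is my own assumption, added only so that the snippet also runs on Python 3, where urlopen lives in urllib.request.

```python
try:
    from urllib import urlopen            # Python 2.7, as used in this article
except ImportError:
    from urllib.request import urlopen    # Python 3 location (assumption for portability)


def fetch(url):
    """Return the body of url, or None if it cannot be opened."""
    try:
        f = urlopen(url)
    except IOError:
        # Raised when the remote address or local file cannot be opened
        return None
    try:
        return f.read()
    finally:
        f.close()  # always release the underlying connection or file


# The failure path can be seen without any network access:
print(fetch('file:/no/such/file/for-this-demo.txt'))  # prints None
```

  A real call would look like fetch('http://www.baidu.com/'), which does go over the network.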


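2. Operate on the file-like object

  The following methods behave the same as on ordinary file objects. They are listed here; for details, see my notes on Python file operations: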

1. read([size]) -> read at most size bytes, returned as a string.

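  Reads the whole response and returns it as a string; if size is given, reads at most size bytes.

2. readline([size]) -> next line from the file, as a string.

  Reads one line and returns it as a string.

3. readlines([size]) -> list of strings, each a line from the file.

  Reads the whole response and returns a list with one element per line.

4. readinto() -> Undocumented. Don't use this; it may go away.

  An undocumented method that may be removed; ignore it.

5. close() -> None or (perhaps) an integer. Close the file.

  Closes the file; returns None or an integer indicating the close status.

The object can also be iterated over directly, just like a file object.

In addition to the file-style methods above, the object has these extra methods: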

1. info()

  Returns metadata about the response. For HTTP, this is the headers of the response message.

Example:

f = urllib.urlopen('http://www.so.com/')
print f.info()

2. geturl()

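  Returns the real URL of the page that was retrieved. If the server redirected the request, this is the URL after the redirect.

3. getcode()

  Returns the status code of the response, for example 200 for a successful request. Returns None if the URL was not opened over HTTP.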
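
  The methods above can be tried without a network connection, because urlopen() also opens local files. The sketch below writes a small temporary file and reads it back through urlopen(). The file contents are made up for the demonstration, and the fallback import is an assumption so that the snippet also runs on Python 3.

```python
import os
import tempfile

try:
    from urllib import urlopen, pathname2url           # Python 2.7
except ImportError:
    from urllib.request import urlopen, pathname2url   # Python 3 (assumption for portability)

# Create a small local file to stand in for a web page
fd, path = tempfile.mkstemp(suffix='.txt')
os.write(fd, b'first line\nsecond line\nthird line\n')
os.close(fd)

f = urlopen('file:' + pathname2url(path))
first = f.readline()      # one line
rest = f.readlines()      # the remaining lines, as a list
code = f.getcode()        # None, because this is not an HTTP response
final_url = f.geturl()    # the URL that was actually opened
headers = f.info()        # header-like metadata (for HTTP: the response headers)
f.close()
os.remove(path)

print(first)
print(rest)
print(code)
```

  With an http:// URL the same calls apply; getcode() would then return a status code such as 200.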


3. Process the page source

  From here on it is ordinary string handling. In most cases the re module is used to filter the data, for example:

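import re
import urllib

f = urllib.urlopen('http://www.baidu.com/')
b = f.read()
f.close()
# Capture the src of <img> tags whose address starts with // and ends in jpg, gif, or png
p = re.compile(r'<img.*?src="//(.*?\.(?:jpg|gif|png))".*?>', re.I)
result = p.findall(b)
print result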

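  This tries to collect image addresses from the Baidu home page. There is plenty of room for improvement; it is for demonstration only.

  The use of Python's re module is not repeated here.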
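
  One possible improvement is to accept any src value and turn it into an absolute URL. The sketch below does that on a made-up HTML string, so it needs no network access. The function name extract_image_urls, the pattern, and the sample markup are all my own illustrative choices, and a regular expression remains a rough tool for HTML. The fallback import is an assumption so that the snippet also runs on Python 3.

```python
import re

try:
    from urlparse import urljoin          # Python 2.7
except ImportError:
    from urllib.parse import urljoin      # Python 3 (assumption for portability)

# Capture the src attribute of <img> tags that point at jpg, gif, or png files
IMG_SRC = re.compile(r'<img[^>]*?\bsrc="([^"]+\.(?:jpg|gif|png))"', re.I)


def extract_image_urls(html, base_url):
    """Return absolute URLs for the images referenced in html."""
    return [urljoin(base_url, src) for src in IMG_SRC.findall(html)]


# Hypothetical page source, standing in for the result of f.read()
sample_html = (
    '<p><img class="logo" src="//www.example.com/img/logo.png">'
    '<IMG SRC="/static/a.JPG" alt=""></p>'
    '<img src="b.bmp">'
)

print(extract_image_urls(sample_html, 'http://www.example.com/index.html'))
```

  urljoin() resolves protocol-relative (//host/path) and site-relative (/path) addresses against the page URL, so the results can be passed straight to a download function.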


4. Download the resources

  Once the regular expression has picked out the image URLs we want, we can start downloading. urllib provides functions for this.

1. urllib.urlretrieve(url[, filename[, reporthook[, data]]])

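  Downloads the given url to a local file. If url points to a local file, or a valid cached copy of the object exists, nothing is downloaded. (The cache is the one kept by earlier urlretrieve() calls, not a check for a same-named file in the target directory.) Returns a tuple (filename, headers): filename is the name of the local file, and headers is the object that the info() method above returns.

  url : The target URL.

  filename : The name to save the file under locally, as an absolute or relative path. If omitted, the data is saved to a generated temporary file.

  reporthook : A callback function. It is called once when the connection is established and once after each block is read. It receives three arguments: 1. the number of blocks transferred so far; 2. the block size in bytes; 3. the total size of the file (this may be -1 when the size is not reported).

  data : An optional parameter for http:// URLs. Supplying it makes the request a POST (the default is GET). The value must be in the standard application/x-www-form-urlencoded format, which urlencode() produces.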

2. urllib.urlcleanup()

  Clears the cache that may have been built up by previous calls to urlretrieve().
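
  To make the reporthook arguments concrete, here is a minimal sketch that records download progress as percentages. It copies a temporary local file through a file: URL so that it runs without network access; the names report and progress and the 20000-byte test file are made up for the demonstration, and the fallback import is an assumption so that the snippet also runs on Python 3.

```python
import os
import shutil
import tempfile

try:
    from urllib import urlretrieve, urlcleanup, pathname2url           # Python 2.7
except ImportError:
    from urllib.request import urlretrieve, urlcleanup, pathname2url   # Python 3

progress = []  # percentages recorded by the hook


def report(block_count, block_size, total_size):
    """reporthook: record how much of the file has been transferred."""
    if total_size > 0:
        done = min(block_count * block_size, total_size)
        progress.append(100.0 * done / total_size)
    else:
        progress.append(None)  # total size unknown


workdir = tempfile.mkdtemp()
source = os.path.join(workdir, 'source.bin')
target = os.path.join(workdir, 'copy.bin')
with open(source, 'wb') as out:
    out.write(b'x' * 20000)

filename, headers = urlretrieve('file:' + pathname2url(source), target, report)
with open(target, 'rb') as saved:
    copied = saved.read()

urlcleanup()            # drop anything urlretrieve() cached
shutil.rmtree(workdir)  # remove the demo files

print(filename == target)
print('%s -> %s' % (progress[0], progress[-1]))
```

  For a real download, replace the file: URL with an image URL collected in the previous step and pass the path where the image should be saved.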


Other commonly used functions in the module:

1. urllib.quote(string[, safe])

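  Encodes string for use in a URL. safe lists extra characters that are left unchanged (default safe='/').

  URLs can contain only a limited set of ASCII characters, so other characters must be escaped. For example, the space in 'scolia good' becomes %20 in a URL: 'scolia%20good'.

  The conversion rules are:

    Letters, digits, and the characters _ . - are never converted, and neither is the slash by default, because safe='/'. Every other character is replaced by a percent sign (%) followed by its two-digit hexadecimal ASCII code, written "%xx".

Example: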

f = urllib.quote('scolia good')
print f

2. urllib.quote_plus(string[, safe])

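  Almost the same as quote(), except that spaces are replaced with a + sign instead of %20. Its default is safe='', so slashes are escaped too.

Example: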

f = urllib.quote_plus('scolia good')
print f

3. urllib.unquote(string)

  Decodes a URL-encoded string; the inverse of urllib.quote(string[, safe]).

4. urllib.unquote_plus(string)

  As above, but the inverse of urllib.quote_plus(string[, safe]).
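
  The four functions can be compared side by side. The sketch below encodes one made-up string three ways and decodes it again; the sample string is my own choice, and the fallback import is an assumption so that the snippet also runs on Python 3, where these functions live in urllib.parse.

```python
try:
    from urllib import quote, quote_plus, unquote, unquote_plus          # Python 2.7
except ImportError:
    from urllib.parse import quote, quote_plus, unquote, unquote_plus    # Python 3

text = 'scolia good/a_b-c.d'

path_style = quote(text)          # '/' is kept because safe defaults to '/'
strict = quote(text, safe='')     # no extra safe characters, so '/' is escaped
form_style = quote_plus(text)     # space becomes '+', and '/' is escaped

print(path_style)   # scolia%20good/a_b-c.d
print(strict)       # scolia%20good%2Fa_b-c.d
print(form_style)   # scolia+good%2Fa_b-c.d

# Each decoder undoes its matching encoder
print(unquote(path_style) == text)        # True
print(unquote_plus(form_style) == text)   # True
```

  As a rule of thumb, quote() suits the path part of a URL and quote_plus() suits form values and query strings.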

5. urllib.urlencode(query[, doseq])

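  Converts a Python dictionary (or a sequence of two-element tuples) into a URL-encoded query string, suitable for the data parameter described above.

Example: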

aDict = {'name': 'Georgina Garcia', 'hmdir': '~ggarcia'}
print urllib.urlencode(aDict)

  Note: passing data makes the request a POST. To send a GET request instead, append ? to the url followed by the encoded string.

For example:

GET method:

import urllib
params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query?%s" % params)
print f.read()

POST method:

import urllib
params = urllib.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
f = urllib.urlopen("http://www.musi-cal.com/cgi-bin/query", params)
print f.read()
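
  The difference between the two requests comes down to where the encoded string goes. The sketch below builds both forms without sending anything; build_get_url is a hypothetical helper, www.example.com is a placeholder host, and the fallback import is an assumption so that the snippet also runs on Python 3. A list of tuples is used instead of a dictionary so that the parameter order is predictable.

```python
try:
    from urllib import urlencode           # Python 2.7
except ImportError:
    from urllib.parse import urlencode     # Python 3


def build_get_url(base_url, params):
    """Append params to base_url as a query string (GET style)."""
    return base_url + '?' + urlencode(params)


params = [('spam', 1), ('eggs', 2), ('q', 'python urllib')]

# GET: the encoded parameters become part of the URL
get_url = build_get_url('http://www.example.com/cgi-bin/query', params)

# POST: the same encoded string is passed as the data argument of urlopen()
# (on Python 3 it would also have to be converted to bytes first)
post_body = urlencode(params)

# doseq=True expands a sequence value into repeated keys
multi = urlencode([('tag', ['a', 'b'])], doseq=True)

print(get_url)     # http://www.example.com/cgi-bin/query?spam=1&eggs=2&q=python+urllib
print(post_body)   # spam=1&eggs=2&q=python+urllib
print(multi)       # tag=a&tag=b
```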

6. urllib.pathname2url(path)

  Converts a local path name to the form used in the path component of a URL. The result is not a complete URL, and it has already been escaped with quote().

Example:

a = r'd:\abc\def\123.txt'
print urllib.pathname2url(a)

7. urllib.url2pathname(path)

  Converts the path component of a URL to a local path name; the inverse of the function above.

a = r'www.xxx.com/12%203/321/'
print urllib.url2pathname(a)

  This function decodes the path with unquote().
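
  Because the two functions are inverses, a path converted with pathname2url() should come back unchanged from url2pathname(). The sketch below checks this for a made-up relative path containing spaces. The exact URL form depends on the operating system, so only the round trip is relied on here; the fallback import is an assumption so that the snippet also runs on Python 3.

```python
import os

try:
    from urllib import pathname2url, url2pathname           # Python 2.7
except ImportError:
    from urllib.request import pathname2url, url2pathname   # Python 3

# A relative path with spaces, built with the local path separator
local_path = os.path.join('my docs', 'report 1.txt')

url_path = pathname2url(local_path)    # spaces are escaped as %20
round_trip = url2pathname(url_path)    # escapes are decoded again

print(url_path)
print(round_trip == local_path)
```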


For everything else, see the official documentation: https://docs.python.org/2/library/urllib.html