Fetching the internet…

This is a simple tutorial on fetching web pages using PHP code. It starts with an example that fetches a page, and then moves on to sending GET and POST requests :).

First, a few things from the PHP manual. This is just a summary. You can read the whole thing at http://in.php.net/manual/en/book.curl.php

“PHP supports libcurl, a library created by Daniel Stenberg, that allows you to connect and communicate to many different types of servers with many different types of protocols. libcurl currently supports the http, https, ftp, gopher, telnet, dict, file, and ldap protocols. libcurl also supports HTTPS certificates, HTTP POST, HTTP PUT, FTP uploading (this can also be done with PHP’s ftp extension), HTTP form based upload, proxies, cookies, and user+password authentication.

These functions have been added in PHP 4.0.2.”

First, check whether you have curl installed with PHP. If not, instructions for installing libcurl are given at the above link.
The following code checks whether the curl extension is loaded in PHP or not:

<?php
//Check if the extension is loaded
if( !extension_loaded('curl') ){
    echo "Oops... Curl extension is not loaded. :-(";
}
else {
    echo "Wow... Curl extension is loaded. :-)";
}
?>

Run this code both under the CLI and under Apache, because sometimes PHP doesn’t have curl loaded under the CLI but does have it loaded when it runs under Apache.
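To see at a glance which of the two environments you are looking at, the check can also print the SAPI name. Here is a small sketch of that idea. describe_curl_status() is just a name I made up for illustration; php_sapi_name() and extension_loaded() are standard PHP functions.

```php
<?php
// Hypothetical helper (not part of PHP): builds a one-line report for a SAPI.
function describe_curl_status(string $sapi, bool $loaded): string
{
    $state = $loaded ? 'loaded' : 'NOT loaded';
    return "SAPI '" . $sapi . "': curl extension is " . $state;
}

// php_sapi_name() returns e.g. "cli" on the command line,
// or "apache2handler" when PHP runs as an Apache module.
echo describe_curl_status(php_sapi_name(), extension_loaded('curl')), "\n";
```

Run the same file once from the command line and once through the web server, and compare the two lines.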

So, now that we have curl, let’s proceed to fetching a page. :-) How about fetching our Google page? ;-)

<?php
// create a new cURL resource
$ch = curl_init();
// set URL and other appropriate options
curl_setopt($ch, CURLOPT_URL, "http://www.google.co.in/"); //www.google.com gives a 301 moved response.
// grab URL and pass it to the browser
curl_exec($ch);
// close cURL resource, and free up system resources
curl_close($ch);
?>

curl_setopt is THE function for all the different kinds of fetching we might need to do. You had better look at the options curl gives us at http://in.php.net/manual/en/function.curl-setopt.php.
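Two of those options are worth a quick look, since by default curl_exec() prints the page straight to the output. CURLOPT_RETURNTRANSFER makes curl_exec() return the page as a string instead, and CURLOPT_FOLLOWLOCATION follows redirects such as the 301 mentioned above. This is only a sketch: fetch_options() and fetch_page() are names I made up, and the redirect limit and timeout values are arbitrary.

```php
<?php
// Hypothetical helper: the options used for a plain page fetch.
function fetch_options(string $url): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true, // return the page instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow 301/302 redirects
        CURLOPT_MAXREDIRS      => 5,    // arbitrary limit on redirects
        CURLOPT_TIMEOUT        => 10,   // seconds, arbitrary
    ];
}

// Hypothetical helper: returns the page body, or null if the fetch failed.
function fetch_page(string $url): ?string
{
    $ch = curl_init();
    curl_setopt_array($ch, fetch_options($url));
    $body = curl_exec($ch);
    if ($body === false) {
        // curl_error() gives a human-readable reason for the failure.
        error_log('curl error: ' . curl_error($ch));
        $body = null;
    }
    curl_close($ch);
    return $body;
}
```

With this, something like $html = fetch_page("http://www.google.com/"); should give you the page in a variable, ready to be parsed instead of printed.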

Also, we need to handle the GET and POST methods of a form differently. Sending a GET request is quite easy: all you need to do is change the URL so that it carries the GET request parameters. For a POST request, you need one more curl_setopt call to specify the POST data. Another important point is that even though this fetches the JavaScript code, it can’t run it. So it’ll be a little difficult to fetch pages loaded through Ajax. Again, remember, I’m not saying impossible. Just difficult. That’s all.

First, we’ll send a GET request to the Google server.

<?php
$ch = curl_init();
//First, we set the request to GET.
curl_setopt ($ch, CURLOPT_HTTPGET, true);
//Then we add the data to the request URL
curl_setopt($ch, CURLOPT_URL, "http://www.google.co.in/search?q=sp2hari&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a");
//Change the browser agent so that it corresponds to Firefox on a Windows box. It is better to use a fetcher name of your own instead of Firefox here.
curl_setopt ($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.1) Gecko/2008070208 (CK) Firefox/3.0.1");
//You might want to set the referer, so that in case google checks it, you get proper results. Though google doesn't.
curl_setopt ($ch, CURLOPT_REFERER, "http://www.google.com");
//We don't want the header in the output. We need only the HTML content
curl_setopt ($ch, CURLOPT_HEADER, false);
curl_exec($ch);
curl_close($ch);
?>
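Typing the query string by hand gets error-prone once the values contain spaces or special characters. A small sketch of building the URL from an array with http_build_query(), which URL-encodes each value for you. build_get_url() is a name I made up, and only a few of the parameters from the request above are shown.

```php
<?php
// Hypothetical helper: appends GET parameters to a base URL.
function build_get_url(string $base, array $params): string
{
    // http_build_query() URL-encodes every key and value.
    return $base . '?' . http_build_query($params, '', '&');
}

$url = build_get_url('http://www.google.co.in/search', [
    'q'  => 'sp2hari',
    'ie' => 'utf-8',
    'oe' => 'utf-8',
]);
echo $url, "\n";
// http://www.google.co.in/search?q=sp2hari&ie=utf-8&oe=utf-8
```

The resulting string can be passed to CURLOPT_URL exactly as in the GET example.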

One more thing I want to mention here is that this doesn’t fetch the image, CSS or JS files that the page refers to. It just gets the HTML content. You can, however, send a request to the image, JS or CSS file directly and then save it locally.
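Saving such a file locally could look roughly like this. CURLOPT_FILE tells curl to write the response body into an open file handle instead of printing it. local_name_for_url() and save_url() are names I made up, and the fallback file name is arbitrary.

```php
<?php
// Hypothetical helper: picks a local file name from the path part of the URL.
function local_name_for_url(string $url): string
{
    $path = parse_url($url, PHP_URL_PATH);
    $name = is_string($path) ? basename($path) : '';
    return $name !== '' ? $name : 'download.bin'; // arbitrary fallback name
}

// Hypothetical helper: streams the response body straight into a file in $dir.
function save_url(string $url, string $dir): bool
{
    $fp = fopen($dir . '/' . local_name_for_url($url), 'wb');
    if ($fp === false) {
        return false;
    }
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FILE, $fp); // write the body to $fp, not to output
    $ok = curl_exec($ch);
    curl_close($ch);
    fclose($fp);
    return $ok !== false;
}
```

You would call it with the URL of the image and a writable directory, for example save_url($imageUrl, '/tmp'), where $imageUrl is whatever address you picked out of the fetched HTML.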

Now how about sending a POST request? POST is a little more complicated, since we should know exactly what we need to POST. This can be worked out by reading the form, but the best way is to use a Firefox plugin called “Tamper Data”.

In the example below, I’m posting the data to the Indian Railways site and fetching the list of trains from Bangalore to Trichy. 😉

Simple steps to use Tamper Data to get the exact data to be sent to the Indian Railways site:
1. Install the Tamper Data plugin in Firefox
2. Browse to http://www.indianrail.gov.in/cgi_bin/inet_srcdestnm_cgi_date.cgi
3. Enter Source Station code as SBC
4. Enter the Destination Station code as TPJ
5. Enter the class as Sleeper Class
6. Enter the date as you want.
7. Don’t click the submit button for now.
8. Open Tamper Data plugin from Options -> Tamper Data
9. Click the Submit Button in the form
10. Analyze the first request.

From Tamper Data, you can see that the data to be posted is “CurrentMonth=4&CurrentDate=1&CurrentYear=2006&lccp_src_stncode=SBC&lccp_dstn_stncode=TPJ&lccp_classopt=SL&lccp_day=01&lccp_month=10”

The following code fetches the data

<?php
$ch = curl_init();
//First, we set the request to POST.
curl_setopt ($ch, CURLOPT_POST, true);
//Set the url to the action file of the form
curl_setopt($ch, CURLOPT_URL, "http://www.indianrail.gov.in/cgi_bin/inet_srcdest_cgi_date.cgi");
//Set the post data from the value we got from Tamper Data
curl_setopt($ch, CURLOPT_POSTFIELDS, "CurrentMonth=4&CurrentDate=1&CurrentYear=2006&lccp_src_stncode=SBC&lccp_dstn_stncode=TPJ&lccp_classopt=SL&lccp_day=01&lccp_month=10");
//Change the browser agent so that it corresponds to Firefox on a Windows box. It is better to use a fetcher name of your own instead of Firefox here.
curl_setopt ($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.1) Gecko/2008070208 (CK) Firefox/3.0.1");
//You might want to set the referer, so that in case indianrailways checks it, you get proper results.
curl_setopt ($ch, CURLOPT_REFERER, "http://www.indianrail.gov.in/src_dest_trns.html");
//We don't want the header in the output. We need only the HTML content
curl_setopt ($ch, CURLOPT_HEADER, false);
curl_exec($ch);
curl_close($ch);
?>
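If you want to change one field (say, the day) without editing that long string by hand, you can split it into an array, change what you need, and join it back. A sketch of that, where with_fields() is a name I made up. I’m also assuming, from its name only, that lccp_day is the day of travel.

```php
<?php
// The string captured with Tamper Data, copied from the example above.
$captured = 'CurrentMonth=4&CurrentDate=1&CurrentYear=2006&lccp_src_stncode=SBC'
          . '&lccp_dstn_stncode=TPJ&lccp_classopt=SL&lccp_day=01&lccp_month=10';

// Hypothetical helper: returns the POST string with some fields replaced.
function with_fields(string $postData, array $changes): string
{
    parse_str($postData, $fields);              // split into name => value pairs
    $fields = array_replace($fields, $changes); // overwrite the chosen fields
    return http_build_query($fields, '', '&');  // re-encode for CURLOPT_POSTFIELDS
}

// Same search, but with lccp_day set to 15 (assumed to be the travel day).
$postData = with_fields($captured, ['lccp_day' => '15']);
echo $postData, "\n";
```

The resulting $postData goes into CURLOPT_POSTFIELDS just like the hard-coded string.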

This gives the HTML page of the result page. Parse it, tamper it and screw it as much as you want and get the list of trains. You can even fetch and use this HTML content to fetch Availability, Fare or Timings. One thing you should remember is that you are trying to do the job of a browser. So all you need to do is to send a proper request. That’s the only thing you should worry about. You need not bother whether the server’s code is in asp, php or jsp.
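For the parsing part, PHP’s DOM extension is one option. The sketch below pulls the text out of every table row that has td cells. I haven’t shown the real markup of the result page here, so treat this as a starting point only: extract_table_rows() is a name I made up, and the HTML in the usage note is a made-up sample, not actual output from the site.

```php
<?php
// Hypothetical helper: returns the trimmed text of each <td>, grouped by <tr>.
function extract_table_rows(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    $doc = new DOMDocument();
    // Real-world pages are rarely valid HTML, so collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows  = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//tr') as $tr) {
        $cells = [];
        foreach ($xpath->query('./td', $tr) as $td) {
            $cells[] = trim($td->textContent);
        }
        if ($cells !== []) { // skip rows with no <td>, such as header rows
            $rows[] = $cells;
        }
    }
    return $rows;
}
```

Feeding it a sample such as '<table><tr><th>No</th><th>Name</th></tr><tr><td> 00001 </td><td>Sample Express</td></tr></table>' gives one row with the two cell texts. For the real page you would first capture the HTML into a string (with CURLOPT_RETURNTRANSFER) and then adjust the XPath to the table you actually need.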

Happy Fetching 🙂
