At times you need to visit websites, log in, navigate through pages, select portions of HTML, click on links, check for the existence of a form, submit the form, and so on, and do all of these things programmatically. What you need is a programmable web browser: one you can start and then have a cup of tea while it does the job.
The Java SE API has the HTMLEditorKit, which you can use to parse HTML pages. I have used it once, but it is very limited in capabilities. It is meant for parsing, not for implementing simple or complex navigation scenarios.
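To give a feel for what parsing with HTMLEditorKit looks like, here is a minimal sketch, assuming all you want is the href of every anchor in a page. The class name LinkLister, the helper extractLinks and the sample HTML are made up for illustration; only the javax.swing.text.html classes come from Java SE.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class, not part of the original project
class LinkLister {

    /** Collects the href value of every anchor tag found in the given HTML. */
    static List<String> extractLinks(String html) throws IOException {
        final List<String> links = new ArrayList<String>();

        // The parser reports each start tag to this callback
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        Reader reader = new StringReader(html);
        try {
            // The last argument tells the parser to ignore charset declarations
            new ParserDelegator().parse(reader, callback, true);
        } finally {
            reader.close();
        }
        return links;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample page
        String html = "<html><body>"
                + "<a href=\"/contact\">Contact</a> "
                + "<a href=\"/about\">About</a>"
                + "</body></html>";
        System.out.println(extractLinks(html));
    }
}
```

Notice that the callback only sees tags as they stream past. There is no notion of clicking a link, keeping a session or submitting a form, which is exactly the limitation mentioned above.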
I have long been a fan of httpunit for things of this nature. I have used it to navigate through pages and fetch content in really complex scenarios. It's so powerful that it even knows how to execute JavaScript. httpunit is designed to work with JUnit, so you can use it to write JUnit test cases for your project. Let's not worry about test cases, though. Let's look at how to use it for some simple navigation and for parsing the response pages. Look at the code sample given below, and follow the comments to understand what each important line does.
package org.swview.mybrowser;

import java.io.IOException;

import org.xml.sax.SAXException;

import com.meterware.httpunit.HttpUnitOptions;
import com.meterware.httpunit.TableCell;
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebLink;
import com.meterware.httpunit.WebResponse;
import com.meterware.httpunit.WebTable;

public class ProgrammaticBrowser {

    /**
     * @param args
     */
    public static void main(String[] args) {
        // Don't throw exceptions when JavaScript errors occur
        HttpUnitOptions.setExceptionsThrownOnScriptError(false);

        // Here's the browser
        WebConversation wc = new WebConversation();

        try {
            // Fetch a page
            WebResponse response = wc.getResponse("http://www.swview.org/");

            // Get the link with the text "Contact"
            WebLink link = response.getLinkWith("Contact");

            // Click on the link and get the next response
            response = link.click();

            // Get all tables
            WebTable[] tables = response.getTables();

            // Get the first table
            WebTable firstTable = tables[0];

            // Get the cell at first row, second column
            TableCell emailCell = firstTable.getTableCell(0, 1);

            // Print the content as text
            System.out.println(emailCell.getText());
        } catch (IOException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        }
    }
}
Provided that httpunit.jar and the other jar files httpunit depends on are on the classpath, you can run the compiled class like this:
java org.swview.mybrowser.ProgrammaticBrowser
You can even manipulate JavaScript-generated content. For example, you can click on a button generated by JavaScript. You can also switch to another browser window that pops up when you click on a link. I love it!
See below to download this simple application as an Eclipse project with httpunit libraries packed inside.