Jsoup Tutorials

1. Introduction
2. Input
   2.1 From String
   2.2 From URL
   2.3 From File
3. Parsing Data
   3.1 parsing element by id
   3.2 parsing element by tag
   3.3 parsing element by class
   3.4 parsing element by attributes
   3.5 parsing sibling elements
   3.6 parsing parent and children elements
4. Selectors
   4.1 find elements by id
   4.2 find elements by tag
   4.3 find elements by tag in a namespace
   4.4 find elements by class name
   4.5 find elements by attribute
   4.6 find elements by attribute value
   4.7 find elements by attribute start, end or contains
   4.8 find elements by attribute value with regular expression
5. Selector combinations
   5.1 find elements with id
   5.2 find elements with class
   5.3 find elements with attribute
6. Pseudo selectors
   6.1 find elements whose sibling index is less than
   6.2 find elements whose sibling index is greater than
   6.3 find elements whose sibling index is equal to
   6.4 find elements that contain elements matching the selector
   6.5 find elements that do not match the selector
   6.6 find elements that contain the given text (case-insensitive)
   6.7 find elements that directly contain the given text
   6.8 find elements whose text matches the specified regular expression
   6.9 find elements whose own text matches the specified regular expression
7. Extract attributes, text, and HTML from elements
8. Modifying Data
   8.1 setting attribute values
   8.2 setting html of an element
   8.3 setting text content of elements

Introduction:

1. Jsoup is an open-source Java library for parsing HTML.
2. It can parse an HTML page from a string, a file, or a URL.
3. It can extract any part of the parsed page using DOM methods or CSS-style selectors.

Reading Content From String:

[java highlight="11,12"]
package in.javadomain;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class StringJsoup {

    public static void main(String[] args) {

        String input = "<html><head></head><body><span id=\"content\">I am span content</span></body></html>";
        Document document = Jsoup.parse(input); // parse the HTML string into a DOM
        Elements spanContent = document.select("span#content"); // the span whose id is "content"
        System.out.println(spanContent.text()); // print its text without the tags

    }
}
[/java]

Output:
[plain gutter="false"]
I am span content
[/plain]
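
The same lookup can also be written with `getElementById`, which returns a single `Element` (or `null` when nothing matches) rather than an `Elements` collection. The following is a minimal sketch of that variant; the class name `IdLookup` and the helper `textById` are illustrative names chosen here, not part of the example above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class IdLookup {

    // Hypothetical helper: returns the text of the element with the given id,
    // or an empty string when no element has that id.
    static String textById(String html, String id) {
        Document doc = Jsoup.parse(html);
        Element element = doc.getElementById(id); // null if the id is absent
        return element == null ? "" : element.text();
    }

    public static void main(String[] args) {
        String input = "<html><head></head><body><span id=\"content\">I am span content</span></body></html>";
        System.out.println(textById(input, "content")); // prints "I am span content"
        System.out.println(textById(input, "missing").isEmpty()); // prints "true"
    }
}
```

Checking for `null` matters here because, unlike `select`, which returns an empty collection, `getElementById` has nothing to return when the id does not exist.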

Reading Content From File:

test.html:
[html]
<html>
<head>
</head>
<body>
<div id="content">
I am div content
</div>
</body>
</html>
[/html]

Reading the above file using Jsoup:
[java]
package in.javadomain;

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FileJsoup {

    public static void main(String[] args) {

        try {
            File inputFile = new File("D:\\test.html"); // adjust the path for your system
            Document fileContent = Jsoup.parse(inputFile, "UTF-8"); // read and parse the file as UTF-8
            System.out.println(fileContent); // print the whole parsed document
        } catch (IOException e) {
            e.printStackTrace();
        }

    }
}
[/java]

Output:
[plain gutter="false"]
<html>
<head>
</head>
<body>
<div id="content">
I am div content
</div>
</body>
</html>
[/plain]
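
The example above depends on a file at a fixed Windows path. As a sketch that runs on any system, the variant below writes the HTML to a temporary file first and then parses it with the same `Jsoup.parse(File, String)` call. The class name `TempFileJsoup` and the helper `parseViaTempFile` are made up for this illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class TempFileJsoup {

    // Hypothetical helper: writes the HTML to a temporary file, parses that file
    // with jsoup, and removes the file again.
    static Document parseViaTempFile(String html) {
        try {
            Path tempFile = Files.createTempFile("test", ".html");
            try {
                Files.write(tempFile, html.getBytes(StandardCharsets.UTF_8));
                return Jsoup.parse(tempFile.toFile(), "UTF-8");
            } finally {
                Files.deleteIfExists(tempFile);
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers do not need their own try/catch.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><head></head><body><div id=\"content\">I am div content</div></body></html>";
        Document doc = parseViaTempFile(html);
        System.out.println(doc.getElementById("content").text()); // prints "I am div content"
    }
}
```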

Parsing the div content with its tags:
[java highlight="16"]
package in.javadomain;

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class FileJsoup {

    public static void main(String[] args) {

        try {
            File inputFile = new File("D:\\test.html"); // adjust the path for your system
            Document fileContent = Jsoup.parse(inputFile, "UTF-8");
            Elements divContent = fileContent.select("div#content"); // the div whose id is "content"
            System.out.println(divContent); // prints the matched element, tags included
        } catch (IOException e) {
            e.printStackTrace();
        }

    }
}
[/java]

Output:
[plain gutter="false"]
<div id="content">
I am div content
</div>
[/plain]
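
The selector `div#content` combines a tag name with an id, so it matches only the div with that id. Dropping the id part and selecting on the tag alone returns every div in the document. The sketch below illustrates this on an inline HTML string; the class `TagSelect`, the helper `divIds`, and the second div with id `footer` are assumptions added for the illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TagSelect {

    // Hypothetical helper: collects the id of every <div>, in document order.
    static List<String> divIds(String html) {
        Document doc = Jsoup.parse(html);
        Elements divs = doc.select("div"); // tag-only selector: matches every div
        List<String> ids = new ArrayList<>();
        for (Element div : divs) {
            ids.add(div.id()); // empty string when the div has no id
        }
        return ids;
    }

    public static void main(String[] args) {
        String html = "<html><head></head><body>"
                + "<div id=\"content\">I am div content</div>"
                + "<div id=\"footer\">I am footer content</div>"
                + "</body></html>";
        System.out.println(divIds(html)); // prints "[content, footer]"
    }
}
```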

Parsing the div content without its tags (text only):
[java highlight="18"]
package in.javadomain;

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class FileJsoup {

    public static void main(String[] args) {

        try {
            File inputFile = new File("D:\\test.html"); // adjust the path for your system
            Document fileContent = Jsoup.parse(inputFile, "UTF-8");
            Elements divContent = fileContent.select("div#content"); // the div whose id is "content"
            System.out.println(divContent.text()); // text() drops the tags and keeps only the text
        } catch (IOException e) {
            e.printStackTrace();
        }

    }
}
[/java]

Output:
[plain gutter="false"]
I am div content
[/plain]
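
The two examples above differ only in whether the tags are kept. As a rough side-by-side sketch, the code below reads the text, an attribute, the inner HTML, and the outer HTML of one element. The class `ExtractParts`, the helper `contentElement`, and the extra `class` attribute and `<b>` tag in the sample markup are assumptions added for illustration. The exact whitespace printed by `html()` and `outerHtml()` depends on jsoup's output settings, so no output is shown for those two calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ExtractParts {

    // Hypothetical helper: parses the HTML and returns the element with id "content".
    static Element contentElement(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("#content").first(); // null when nothing matches
    }

    public static void main(String[] args) {
        String html = "<html><head></head><body>"
                + "<div id=\"content\" class=\"box\">I am <b>div</b> content</div>"
                + "</body></html>";
        Element div = contentElement(html);

        System.out.println(div.text());        // text only, tags stripped: "I am div content"
        System.out.println(div.attr("class")); // value of the class attribute: "box"
        System.out.println(div.html());        // markup inside the div
        System.out.println(div.outerHtml());   // markup including the div tag itself
    }
}
```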
