Extracting Names from a Webpage using JSoup and Apache OpenNLP in Java

In this article, we will look at how to extract names from a webpage using the Jsoup library and the Apache OpenNLP Name Finder API in Java. Jsoup is a Java library for parsing HTML and XML documents, while Apache OpenNLP is a machine learning toolkit for natural language processing tasks such as named entity recognition (NER).

Prerequisites

Before we begin, you will need to have the following installed on your machine:

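  • Java Development Kit (JDK) 8 or higher (OpenNLP 2.x releases require a newer JDK than 8, so check the requirements of the version you declare)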

  • Apache Maven 3.0 or higher (for managing dependencies)

Setting Up the Project

First, create a new Maven project by running the following command in a terminal (most Java IDEs can also generate a Maven project for you):

mvn archetype:generate -DgroupId=com.example.app -DartifactId=Name-Extractor -DarchetypeArtifactId=maven-archetype-quickstart -DarchetypeVersion=1.4

Installing the Dependencies

To use Jsoup and the Apache OpenNLP Name Finder API, we will need to add the following dependencies to our project:

  • Jsoup: a Java library for parsing HTML

  • Apache OpenNLP: a natural language processing toolkit

We can use Maven to manage these dependencies. To install them, add the following dependency declarations to your pom.xml file:

<dependencies>
  <dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.13.1</version>
  </dependency>
  <dependency>
    <groupId>org.apache.opennlp</groupId>
    <artifactId>opennlp-tools</artifactId>
    <version>2.7.0</version>
  </dependency>
</dependencies>

You can then run the following command to build the project, which also downloads the declared dependencies:

mvn clean install

Downloading the Name Finder Model

Next, we will need to download the Name Finder model from the Apache OpenNLP website. The Name Finder model is a machine learning model that is trained to identify names in text. In this example, we will be using the English person Name Finder model, en-ner-person.bin, which is linked from the site's models page. Save the model file to the root of your project directory: the code below loads it by a relative path, which is resolved against the directory the program is run from.

Parsing the HTML of a Webpage with Jsoup

First, we need to parse the HTML of the webpage that we want to extract names from. We can use the Jsoup library to do this.

To parse the HTML of a webpage, we will use the Jsoup.connect method to establish a connection to the webpage and the get method to retrieve the HTML content:

String url = "https://www.example.com";
Document doc = Jsoup.connect(url).get();

The Document object returned by the get method represents the HTML of the webpage and provides methods for traversing and extracting information from the page.

In this example, we are interested in extracting the text of the body element of the webpage. We can use the body method to get a reference to the body element and the text method to get the text content of the element:

String text = doc.body().text();

With the text of the body element in hand, we can now use the Apache OpenNLP Name Finder API to extract names from the text.

Extracting Names with the Apache OpenNLP Name Finder API

The Apache OpenNLP Name Finder API is a tool for identifying named entities, such as people's names, within a piece of text. To use the Name Finder API, we first need to load a model that has been trained to recognize names.

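The following code shows how to load a Name Finder model using the TokenNameFinderModel class and the FileInputStream class. The stream is opened in a try-with-resources block so that it is closed even if loading fails:

TokenNameFinderModel model;
// try-with-resources closes the stream even if loading fails
try (InputStream modelIn = new FileInputStream("en-ner-person.bin")) {
    model = new TokenNameFinderModel(modelIn);
}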

Once the model has been loaded, we can create a Name Finder instance using the NameFinderME class and the model:

NameFinderME nameFinder = new NameFinderME(model);

Before we can use the Name Finder to identify names in the text, we need to split the text into tokens. A simple way to do this is to split on whitespace with the split method of the String class:

String[] tokens = text.split("\\s+");
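One thing to keep in mind is that splitting on whitespace leaves punctuation attached to the neighbouring word, so a name at the end of a sentence becomes a token such as "Smith." and may be harder for the model to recognise. OpenNLP ships its own tokenizers (for example SimpleTokenizer) for this purpose. The sketch below uses only the JDK to illustrate the difference; the class TokenizeSketch and its tokenize helper are hypothetical names made up for this example and are not part of OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenizeSketch {

    // A run of letters or digits is one token; every other
    // non-whitespace character becomes a token of its own.
    private static final Pattern TOKEN =
            Pattern.compile("[\\p{L}\\p{N}]+|[^\\s\\p{L}\\p{N}]");

    // Hypothetical helper (not an OpenNLP API): separates punctuation from words.
    public static String[] tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher matcher = TOKEN.matcher(text);
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String text = "Hello, John Smith.";

        // Whitespace split: punctuation stays glued to the words.
        System.out.println(String.join(" | ", text.split("\\s+")));
        // prints: Hello, | John | Smith.

        // Regex tokenizer: punctuation becomes separate tokens.
        System.out.println(String.join(" | ", tokenize(text)));
        // prints: Hello | , | John | Smith | .
    }
}
```

Note that this simple pattern also breaks hyphenated names and names with apostrophes into several tokens, so it is an illustration rather than a replacement for a trained tokenizer.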

With the tokens prepared, we can now use the Name Finder to identify names in the text. The find method of the NameFinderME class returns an array of Span objects, each of which marks the start and end token positions of a named entity in the text:

Span[] names = nameFinder.find(tokens);
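The example above passes the tokens of the whole page to find in one call. The OpenNLP documentation describes the name finder as working on one tokenized sentence at a time, so results may improve if the page text is split into sentences first. OpenNLP has its own sentence detector, but it needs a separate model; as a lighter alternative, the sketch below uses the JDK's BreakIterator instead. The class SentenceSplitSketch and its sentences helper are hypothetical names for this example only.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitSketch {

    // Hypothetical helper: splits text into trimmed, non-empty sentences
    // using the sentence boundaries reported by the JDK's BreakIterator.
    public static List<String> sentences(String text) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        iterator.setText(text);

        List<String> result = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "John Smith lives in Paris. Mary Jones works in London.";
        for (String sentence : sentences(text)) {
            // In the name extractor, each sentence would be tokenized and
            // passed to nameFinder.find(...) separately.
            System.out.println(sentence);
        }
    }
}
```

With this approach, the tokenizing and find steps would sit inside the loop, and the names from each sentence would be collected as they are found. BreakIterator is rule-based, so abbreviations such as titles can still produce wrong sentence breaks.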

Finally, we can convert the name spans to strings, then iterate over them and print them:

// Convert the name spans to strings
String[] nameStrings = Span.spansToStrings(names, tokens);
for (String name : nameStrings) {
    System.out.println(name);
}
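A name that appears several times on the page is printed once for every occurrence. If we only want each name once, we can collect the strings into a set before printing. The following is a minimal sketch using only the JDK; the class UniqueNames and its unique helper are hypothetical names, and the hard-coded array stands in for the result of Span.spansToStrings.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class UniqueNames {

    // Hypothetical helper: drops repeated names while keeping the
    // order in which each name was first seen.
    public static Set<String> unique(String[] names) {
        return new LinkedHashSet<>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        // Stand-in for the array returned by Span.spansToStrings(names, tokens)
        String[] found = {"John Smith", "Mary Jones", "John Smith"};

        for (String name : unique(found)) {
            System.out.println(name);
        }
        // prints:
        // John Smith
        // Mary Jones
    }
}
```

A LinkedHashSet is used here rather than a HashSet so that the names come out in page order.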

Here is the complete code for extracting names from a webpage using Jsoup and Apache OpenNLP:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class NameExtractor {

    public static void main(String[] args) throws IOException {
        // Load the Name Finder model; try-with-resources closes the stream
        TokenNameFinderModel model;
        try (InputStream modelIn = new FileInputStream("en-ner-person.bin")) {
            model = new TokenNameFinderModel(modelIn);
        }
        NameFinderME nameFinder = new NameFinderME(model);

        // Specify the URL to extract names from and fetch the page
        String url = "https://www.example.com";
        Document doc = Jsoup.connect(url).get();

        // Get the text content of the body element
        String text = doc.body().text();

        // Split the text into tokens
        String[] tokens = text.split("\\s+");

        // Pass the tokens to the Name Finder to get the name spans
        Span[] names = nameFinder.find(tokens);

        // Convert the name spans to strings and print them
        String[] nameStrings = Span.spansToStrings(names, tokens);
        for (String name : nameStrings) {
            System.out.println(name);
        }
    }
}

Conclusion

Thanks for reading! I hope you have now learnt how to use Jsoup and Apache OpenNLP to extract names from a webpage. If you have any questions or suggestions for improvement, please let me know in the comments below.