API for reading HTML

gandja99 · September 28, 2011, 11:13am

Folks,

I'm developing an application that reads information from a website, filters out what matters, and shows the user only what is relevant to them. For this I'm using Apache's HttpClient, and I have a few questions:

1 - Some sites return the HTML content and others don't. Why does that happen? Is there a way to get the HTML of any site? In the example below, if I pass Google's URL the returned length is -1; if I pass Terra's URL it works. Is there a solution?

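[code]
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class ReadHtml {
    public static void main(String[] args) throws Exception {
        HttpClient client = new DefaultHttpClient();
        HttpGet get = new HttpGet("http://www.terra.com.br");
        HttpResponse resp = client.execute(get);
        HttpEntity entity = resp.getEntity();
        if (entity != null) {
            long len = entity.getContentLength();
            System.out.println(len);
            if (len != -1) {
                System.out.println(EntityUtils.toString(entity));
            } else {
                // Empty branch: when the length is reported as -1, nothing is printed
            }
        }
    }
}
[/code]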

2 - In .NET there is an API that organizes the HTML I'm working with in code, so that I can navigate it hierarchically. For example, I can grab a div and put it in a separate object, then work with a smaller set of information. Is there something similar in Java?

alissonvla · September 28, 2011, 12:05pm

Hey,

This link lists some frameworks for doing that: http://java-source.net/open-source/html-parsers. I just can't tell you which one is the best.

See you

kiko_lp_St_jimmy · September 28, 2011, 12:41pm

Try this example to fetch the HTML. I tested it with Google and it worked.

[code]
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class TestHttpGet {

    public void executeHttpGet() throws Exception {
        BufferedReader in = null;
        try {
            HttpClient client = new DefaultHttpClient();
            HttpGet request = new HttpGet();
            request.setURI(new URI("http://www.google.com.br/"));
            HttpResponse response = client.execute(request);
            // Read the body straight from the stream, without checking getContentLength()
            in = new BufferedReader(
                    new InputStreamReader(response.getEntity().getContent()));
            StringBuilder sb = new StringBuilder();
            String line;
            String NL = System.getProperty("line.separator");
            while ((line = in.readLine()) != null) {
                sb.append(line).append(NL);
            }
            String page = sb.toString();
            System.out.println(page);
        } finally {
            // Always close the reader, even if the request or the read fails
            if (in != null) {
                try {
                    in.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        TestHttpGet teste = new TestHttpGet();
        teste.executeHttpGet();
    }
}
[/code]
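
A note on why the first question sees -1: as far as I know, HttpEntity.getContentLength() returns a negative number when the length is not known in advance, which typically happens when the server sends no Content-Length header (for example, with chunked transfer encoding). It does not mean there is no HTML. The example above works because it reads the stream until it ends instead of relying on the length. Below is a minimal sketch of that idea using only the standard library; the class BodyReader and the method readAll are hypothetical names for illustration, and the UTF-8 charset is an assumption (a real page may declare a different one).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BodyReader {

    // Hypothetical helper: drains the stream until end-of-stream,
    // so it never needs to know the content length beforehand.
    public static String readAll(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        // read() returns -1 only when the stream has no more data
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), charset);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a response body of unknown length (assumed UTF-8)
        byte[] bytes = "<html><body>ok</body></html>".getBytes(StandardCharsets.UTF_8);
        InputStream fake = new ByteArrayInputStream(bytes);
        System.out.println(readAll(fake, StandardCharsets.UTF_8));
    }
}
```

With HttpClient, the stream to pass in would be entity.getContent(). In the code from the first post, simply dropping the `len != -1` check and calling EntityUtils.toString(entity) in both cases should also work, since that method reads the stream to the end.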

system · December 29, 2015, 7:36am
