ElasticSearch PHP spider and page indexing

Here is a screenshot:
[Screenshot from 2015-04-29 22:15:43]
The demo site below runs on the Amazon EC2 free tier and will be shut down before 2016/4/1:
http://owen-wen.twbbs.org/

I have also created a GitHub repository for this simple example:
https://github.com/wenchiching/elasticsearch_webspider

The PHP spider comes from http://phpcrawl.cuab.de/
The PHP DOM parser comes from http://simplehtmldom.sourceforge.net/

This commit shows how to retrieve web pages and index them into ElasticSearch:
https://github.com/wenchiching/elasticsearch_webspider/commit/65f628441d98471f4d37c030ad08bda31efa67e9?diff=split

When I wanted to index more pages, I ran into a problem: a page that had already been indexed gets indexed again, which produces redundant documents in the index.

This happens because I did not specify an ID when indexing a page. But if I want to index a page with an ID, which ID should I use? I have no idea yet.

Can I use an incrementing integer as the ID? No, that would just index the same page again under another integer ID.
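
One possible way out, offered as an assumption and not something taken from the repository, is to derive the ID from the page URL itself, so that indexing the same URL twice overwrites the earlier document instead of adding a new one. The sketch below hashes the URL with sha1. The helper names page_id and build_index_params, the index name "pages", and the type name "page" are all hypothetical.

```php
<?php
// Hypothetical helper: derive a stable document ID from the page URL.
// The same URL always yields the same 40-character hex string.
function page_id($url)
{
    // Drop the fragment (#...) because it does not change the fetched page.
    $hashPos = strpos($url, '#');
    if ($hashPos !== false) {
        $url = substr($url, 0, $hashPos);
    }
    return sha1(trim($url));
}

// Hypothetical helper: build the parameter array for $client->index().
// The index name "pages" and the type name "page" are assumptions.
function build_index_params($url, $title, $content)
{
    return array(
        'index' => 'pages',
        'type'  => 'page',
        'id'    => page_id($url),
        'body'  => array(
            'url'     => $url,
            'title'   => $title,
            'content' => $content,
        ),
    );
}

// Two crawls of the same URL produce the same ID.
$first  = build_index_params('http://example.com/a', 'A', 'first crawl');
$second = build_index_params('http://example.com/a', 'A', 'second crawl');
var_dump($first['id'] === $second['id']); // bool(true)
```

Passing such an array to $client->index() a second time replaces the stored document and bumps its version, so a re-crawl updates the page rather than duplicating it. Whether a URL hash is the right key depends on the site; for example, pages reachable under several URLs would still be indexed once per URL.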

Head first elasticsearch-php

First read https://wenchiching.wordpress.com/2015/04/02/head-first-elasticsearch/ to install ElasticSearch.

Then follow the quick start: http://www.elastic.co/guide/en/elasticsearch/client/php-api/current/_quickstart.html

1. Create a file named composer.json with the following content:

{
    "require": {
        "elasticsearch/elasticsearch": "~1.0"
    }
}

2. Download Composer and install the PHP library with the commands below:

curl -s http://getcomposer.org/installer | php
php composer.phar install

3. Create a PHP file (for example, example.php) with the following content:

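<?php
// Load the Composer autoloader so the client classes are available.
require 'vendor/autoload.php';

// Connect to the default host, localhost:9200.
$client = new Elasticsearch\Client();

// Describe the document to index and where to store it.
$params = array();
$params['body']  = array('testField' => 'abc');
$params['index'] = 'my_index';
$params['type']  = 'my_type';
$params['id']    = 'my_id';

// Index the document; $ret holds the response as an array.
$ret = $client->index($params);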

4. Test whether the document was indexed:

curl -XPOST 'localhost:9200/my_index/_search?pretty' -d '
{
  "query": { "match": { "testField": "abc" } }
}'

If indexing succeeded, you will see something like this:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.30685282,
    "hits" : [ {
      "_index" : "my_index",
      "_type" : "my_type",
      "_id" : "my_id",
      "_score" : 0.30685282,
      "_source":{"testField":"abc"}
    } ]
  }
}
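
The same check can also be done from PHP. With the elasticsearch-php client, $client->search() returns the response as an associative array with the same structure as the JSON above. The sketch below builds the search parameters for the match query and pulls the IDs and sources out of such a response. The helper names build_match_search and extract_sources are hypothetical, and the sample response is a trimmed copy of the output above decoded with json_decode, so no running server is needed to try it.

```php
<?php
// Hypothetical helper: build the parameter array for $client->search()
// that mirrors the curl match query from step 4.
function build_match_search($index, $field, $value)
{
    return array(
        'index' => $index,
        'body'  => array(
            'query' => array(
                'match' => array($field => $value),
            ),
        ),
    );
}

// Hypothetical helper: map each hit's _id to its _source document.
function extract_sources(array $response)
{
    $sources = array();
    foreach ($response['hits']['hits'] as $hit) {
        $sources[$hit['_id']] = $hit['_source'];
    }
    return $sources;
}

$params = build_match_search('my_index', 'testField', 'abc');
// With a running server this would be: $response = $client->search($params);

// Stand-in for the server response, trimmed from the output shown above.
$json = '{"hits":{"total":1,"hits":[{"_index":"my_index","_type":"my_type",'
      . '"_id":"my_id","_score":0.30685282,"_source":{"testField":"abc"}}]}}';
$response = json_decode($json, true);

print_r(extract_sources($response));
```

Keeping the parameter building and the result handling in small functions makes it possible to test them without a server; only the single $client->search() call needs ElasticSearch to be running.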