Here is a screenshot:
The demo site below runs on the Amazon EC2 free tier and will be shut down before 2016/4/1.
http://owen-wen.twbbs.org/
I have also created a GitHub repository for this simple example:
https://github.com/wenchiching/elasticsearch_webspider
The PHP spider comes from http://phpcrawl.cuab.de/
The PHP DOM parser comes from http://simplehtmldom.sourceforge.net/
https://github.com/wenchiching/elasticsearch_webspider/commit/65f628441d98471f4d37c030ad08bda31efa67e9?diff=split
This commit shows how to retrieve web pages and index them into Elasticsearch.
When I tried to index more pages, I ran into a problem: a page that has already been indexed gets indexed again, which produces redundant documents in the index.
This happens because I didn't specify an ID when indexing a page. So if I want to index a page with an ID, what ID should I use? I have no idea.
Can I use an incrementing integer as the ID? No, that would just index the same page again under another integer ID.
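
To make the problem concrete, here is a small sketch in Python (not the PHP code from the repository). A plain dict stands in for the Elasticsearch index, and the function names and the example URL are made up for illustration. It shows why an incrementing integer produces duplicates when the same page is crawled twice. For comparison, it also tries one possible alternative that is my assumption and not something settled in this post: deriving the ID from the page URL, so that indexing the same URL again overwrites the existing entry instead of adding a new one.

```python
import hashlib


def index_with_counter(index, next_id, url, body):
    """Store a page under the next integer ID and return the following ID."""
    index[str(next_id)] = {"url": url, "body": body}
    return next_id + 1


def index_with_url_id(index, url, body):
    """Store a page under an ID derived from its URL (assumed alternative)."""
    doc_id = hashlib.sha1(url.encode("utf-8")).hexdigest()
    index[doc_id] = {"url": url, "body": body}  # same URL -> same key -> overwrite
    return doc_id


# Hypothetical page; the dicts below only imitate an index keyed by document ID.
page_url = "http://example.com/a"
page_body = "hello"

counter_index = {}
next_id = 1
for _ in range(2):  # the crawler visits the same page twice
    next_id = index_with_counter(counter_index, next_id, page_url, page_body)

url_index = {}
for _ in range(2):  # same two visits, but the ID comes from the URL
    index_with_url_id(url_index, page_url, page_body)

print(len(counter_index))  # 2: the same page is stored twice
print(len(url_index))      # 1: the second visit replaced the first
```

The dict behaves like Elasticsearch in the one respect that matters here: indexing a document with an ID that already exists replaces the old document rather than creating a second one.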