- Connect to the Elasticsearch server.
>>> from elasticsearch import Elasticsearch
>>> conn = Elasticsearch(hosts='127.0.0.1:8001')   # connect to localhost (127.0.0.1) on port 8001
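- Optional sanity check (a small sketch, assuming the server above is reachable): ping the cluster and fetch its basic info.
>>> conn.ping()   # True if the cluster answers, False otherwise
>>> conn.info()   # dict with cluster name, version, and so on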
- Construct a query that matches all documents and sorts them by created_at in ascending order.
>>> the_query = { 'query': { 'match_all': { } },
                  'sort' : { 'created_at': { 'order': 'asc' } } }
- Construct a query that looks for documents containing 'elasticsearch' in the 'tweet' field.
>>> the_query = { 'query': { 'match': { 'tweet': 'elasticsearch' } } }
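- A small sketch of actually running one of the queries above; the index name 'tweets' and document type 'tweet-type' are taken from the scan-and-scroll example further down, so adjust them to your own index.
>>> result = conn.search( index='tweets', doc_type='tweet-type', body=the_query )
>>> hits = result['hits']['hits']                          # list of matching documents
>>> tweets = [ hit['_source']['tweet'] for hit in hits ]   # the matched tweet texts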
- Construct a query that returns only the _id of every document (no stored fields).
>>> the_query = { 'query' : { 'match_all': { } },
                  'fields': [] }
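- Pulling the _id values out of the response for that query (same assumed index and type names as above):
>>> result = conn.search( index='tweets', doc_type='tweet-type', body=the_query )
>>> ids = [ hit['_id'] for hit in result['hits']['hits'] ]   # only the document ids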
- Construct a query to obtain all documents created on or after 1 January 1970.
>>> import datetime
>>> the_query = { 'query': { 'filtered': {
                      'query' : { 'match_all': { } },
                      'filter': { 'range': { 'created_at': { 'gte': datetime.datetime(1970,1,1) } } }
                  } } }
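- A hedged sketch of counting how many documents match that range query (again assuming the 'tweets' index and 'tweet-type' type); the client's JSON serializer normally handles the datetime object, and an ISO string such as '1970-01-01T00:00:00' works as well.
>>> res = conn.count( index='tweets', doc_type='tweet-type', body=the_query )
>>> res['count']   # number of documents created on or after 1970-01-01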
- Construct a query to obtain documents containing 'politician' in the user_category field and 'lobbyist' in the retweet_user_category field, sorted by created_at in ascending order.
>>> the_query = { 'query': { 'bool': { 'must': [
                      { 'match': { 'user_category': 'politician' } },
                      { 'match': { 'retweet_user_category': 'lobbyist' } }
                  ] } },
                  'sort': { 'created_at': { 'order': 'asc' } } }
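- If only a page of results is needed at a time, the search call also accepts size and from_ parameters (a sketch, with the same assumed index and type names):
>>> page = conn.search( index='tweets', doc_type='tweet-type',
                        body=the_query, from_=0, size=100 )   # first page of up to 100 hits
>>> page['hits']['total']                                     # total number of matching documents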
- Convert a date given as a unicode string (u'2014-08-11T11:02:58', i.e. 11 August 2014 at 11:02:58) into a datetime object.
>>> import unicodedata
>>> import datetime
>>> the_date_unicode = u'2014-08-11T11:02:58'
>>> the_date_str = unicodedata.normalize( 'NFKD', the_date_unicode ).encode( 'ascii', 'ignore' )   # unicode -> plain byte string
>>> the_date_time = datetime.datetime.strptime( the_date_str, '%Y-%m-%dT%H:%M:%S' )
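- For ASCII-only timestamps like this one, strptime usually accepts the unicode string directly (and always does on Python 3), so the normalization step can be skipped:
>>> datetime.datetime.strptime( u'2014-08-11T11:02:58', '%Y-%m-%dT%H:%M:%S' )
datetime.datetime(2014, 8, 11, 11, 2, 58)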
- Utilize 'scan and scroll' to process a huge number of documents. Each scroll request pulls one batch (size=1000, applied per shard when search_type='scan') and the scroll context is kept alive for two minutes ('2m') between requests.
>>> scanResp = conn.search( index='tweets', doc_type='tweet-type',
                            body=the_query, search_type='scan', scroll='2m', size=1000 )
>>> scrollId = scanResp['_scroll_id']   # the initial 'scan' response carries no hits, only a scroll id
>>> doc_num = 0
>>> response = conn.scroll( scroll_id=scrollId, scroll='2m' )
>>> while len( response['hits']['hits'] ) > 0:
...     for item in response['hits']['hits']:
...         # process the document (item) as you wish
...         doc_num += 1
...     scrollId = response['_scroll_id']   # each response returns the scroll id for the next request
...     response = conn.scroll( scroll_id=scrollId, scroll='2m' )
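- When the loop finishes (or you stop early), it is polite to release the scroll context rather than wait for it to expire; a small sketch using the client's clear_scroll call:
>>> conn.clear_scroll( scroll_id=scrollId )
>>> print( 'processed %d documents' % doc_num )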
- MySQL: create two related InnoDB tables (customers and accounts, linked by a foreign key) in the test database and insert one sample row into each.
USE test;
DROP TABLE IF EXISTS accounts, customers;
CREATE TABLE customers(
customer_id INT NOT NULL AUTO_INCREMENT,
name VARCHAR(20) NOT NULL,
address VARCHAR(20) NOT NULL,
city VARCHAR(20) NOT NULL,
state VARCHAR(20) NOT NULL,
PRIMARY KEY( customer_id )
) ENGINE=INNODB;
CREATE TABLE accounts(
account_id INT NOT NULL AUTO_INCREMENT,
customer_id INT NOT NULL,
account_type ENUM( 'savings', 'credit' ) NOT NULL,
balance FLOAT( 9 ) NOT NULL,
PRIMARY KEY ( account_id ),
FOREIGN KEY ( customer_id ) REFERENCES customers( customer_id )
) ENGINE=INNODB;
INSERT INTO customers( customer_id, name, address, city, state ) VALUES ( 1, 'Hendra', 'Carolina Mc', 'Amsterdam', '1098XK' );
INSERT INTO accounts( account_id, customer_id, account_type, balance ) VALUES ( 1, 1, 'savings', 10.5 );
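- A small sketch of reading the joined data back from Python, assuming the mysql-connector-python package is installed and that user 'root' with an empty password can reach the test database (adjust the credentials to your setup):
>>> import mysql.connector
>>> cnx = mysql.connector.connect( host='127.0.0.1', user='root', password='', database='test' )
>>> cur = cnx.cursor()
>>> cur.execute( 'SELECT c.name, a.account_type, a.balance FROM customers c JOIN accounts a ON a.customer_id = c.customer_id' )
>>> cur.fetchall()   # expected: [(u'Hendra', u'savings', 10.5)]
>>> cur.close()
>>> cnx.close()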
- All the 4000 words essential for an educated vocabulary (anki)
- Introduction (42 MB) (rar)
- Bayesian Network Fundamentals (114 MB) (rar)
Each dataset has its own query list and relevance judgments (rar)
- adi dataset (language: English, less than 1 MB)
- eng dataset (language: English, about 8.5 MB)
- ina dataset (language: Indonesian, about 1 MB)
- med dataset (language: English, about 1 MB)
- npl dataset (language: English, about 3 MB)