Tags: Natural Language Processing, Dimensionality Reduction, Unsupervised Learning.
Dataset: Stack Exchange Data Dump
The Stack Exchange data dump was analyzed and two major data-driven features were proposed to improve its overall operation. A successful exchange of knowledge happens when posters ask good questions and the question reaches the relevant scholar. However, novice users sometimes attach unrelated tags to their questions. To solve this problem, we first created meta-tags for each question by projecting the question into a high-dimensional space and finding correct tags with an unsupervised learning algorithm. Second, we found related posts for unique tags - a feature missing in the current live version of Stack Exchange - that could be instantly suggested to posters. Third, we discussed building a Virtual Title Assistant to help new users ask better questions; on further analysis of the data, we found that the initial form of the title itself does not seem to impact the overall user experience. All these proposals are made with an emphasis on data modelling and visualization.
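As a toy illustration of the meta-tagging idea (a hedged sketch, not the actual pipeline - the tag names, vocabulary, and centroids below are made up), a question can be projected into term space as a bag-of-words vector and assigned to the nearest cluster centroid:

```python
import math
from collections import Counter

def vectorize(text):
    """Project a question into term space as a bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical cluster centroids, as if learned offline by k-means
centroids = {
    "databases": vectorize("sql query index postgres table join"),
    "machine-learning": vectorize("model training loss neural network gradient"),
}

def suggest_meta_tag(question):
    """Assign the question to the meta-tag of the nearest centroid."""
    return max(centroids, key=lambda tag: cosine(vectorize(question), centroids[tag]))

print(suggest_meta_tag("how do I speed up a slow sql join query"))
```

In the real pipeline the centroids would come from an unsupervised algorithm run over the full dump, and the projection would be TF-IDF or embedding based rather than raw counts.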
During the 2016 US presidential election there was said to be evidence of foreign interference by means of propaganda. The disinformation was allegedly spread over a number of social media platforms, including Twitter. This made it an opportune setting to apply text-analysis techniques to try to uncover evidence of the tampering. Preprocessing of the dataset involved stemming and lemmatization, along with the removal of HTTP links, non-English words, XML tags, etc. This corpus was then used to train a custom embedding (a Word2Vec model) with 300 neurons in the hidden layer, yielding 300-dimensional vectors. t-SNE was used to reduce the dimensions to 2D for visualization. An in-depth bias analysis was carried out comparing the pre-trained network to the custom embedding in both positive and negative tones.
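The cleaning step described above might look roughly like this (a sketch only; the exact regexes and token rules used in the project are assumptions):

```python
import re

def clean_tweet(text):
    """Rough preprocessing: strip links and markup tags, keep lowercase alphabetic tokens."""
    text = re.sub(r"http\S+", " ", text)   # remove http links
    text = re.sub(r"<[^>]+>", " ", text)   # remove xml/html tags
    return re.findall(r"[a-zA-Z]+", text.lower())

print(clean_tweet("Vote NOW!!! http://t.co/xyz <b>#election</b> 2016"))
```

The resulting token lists would then feed a Word2Vec trainer configured for 300-dimensional vectors, with stemming/lemmatization and a non-English-word filter applied before training.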
Tags: Graphs, Graphical data modelling, Dimensionality Reduction.
Data: SNAP reddit-hyperlink graph (55,863 nodes [subreddits] and 858,490 edges [hyperlinks])
Technologies: PostgreSQL, psycopg2, Neo4j, snap.py, graphviz, Tensorflow, Tensorboard, Python etc.
Social Network: The Reddit Hyperlink Network dataset is treated as a directed graph, and initially Neo4j was used for graph modelling. Upon further analysis, the data models were redesigned to store the graph as an edge list, taking advantage of the relational model to better represent the relationships between nodes (subreddits). This reduced the cost of operations, eased relationship analysis, and allowed the use of other relational constraints, such as uniqueness of data. From the edge list, snap.py was used to build a graph object, and graphviz was used for image rendering. Well-connected subreddits were identified by their connection strength and word count. The data also showed that most connections represent trolling references spreading negativity. To improve the personalized experience, an NSFW/child filter was built using the sentiment features. Closely related subreddits were also identified (for recommendation) to further improve the user experience.
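The edge-list representation maps naturally to a relational table with a uniqueness constraint on (source, target). A minimal sketch of the connection-strength and troll-filter ideas on a toy edge list (the subreddit names and sentiment encoding here are invented for illustration):

```python
from collections import Counter

# Toy directed edge list: (source_subreddit, target_subreddit, sentiment)
# sentiment: +1 for a positive hyperlink, -1 for a negative (hostile) one
edges = [
    ("askreddit", "funny", 1),
    ("drama", "askreddit", -1),
    ("drama", "funny", -1),
    ("news", "worldnews", 1),
    ("drama", "worldnews", -1),
]

strength = Counter()       # connection strength: links touching a subreddit
negative_out = Counter()   # count of negative outgoing links per subreddit
for src, dst, s in edges:
    strength[src] += 1
    strength[dst] += 1
    if s < 0:
        negative_out[src] += 1

# Toy filter: flag subreddits whose negative outgoing links exceed a threshold
flagged = [name for name, k in negative_out.items() if k >= 2]

print(strength.most_common(2))
print(flagged)
```

In the actual system the sentiment features came from the dataset itself, and the aggregation would be a SQL GROUP BY over the PostgreSQL edge-list table rather than an in-memory Counter.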
Tags: Spatio-temporal Data Modelling, Relational DB
Data: NYC taxi trip data set
Technologies: PostgreSQL, psycopg2, pyshp, Python etc.
The NYC taxi dataset has records of NYC yellow taxi rides with relevant trip information, including geospatial attributes (pickup/drop-off location) and temporal attributes (time of pickup/drop-off). To facilitate information retrieval at this scale, computationally efficient data modelling was carried out by taking advantage of relational database constraints (such as enforcing value ranges and informing the database of the required presence or uniqueness of data). Shapefiles were used to render taxi zones, and an aggregated centre-point model was created to reduce nested-loop calculations to a single run-through rather than one per visualization run. A few interesting insights were obtained that could recommend top routes to a driver looking to maximize route frequency and tips and minimize losses from refunds. The frequency of trips in Manhattan, the northern half of Brooklyn, and the airports was also shown graphically. A general cycle of routes was identified between the best airport (LaGuardia) and optimal zone locations (northern Brooklyn, outskirts of Manhattan), providing a new driver with the data necessary to get started. Recommendations for customer service improvements were also made.
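The route-recommendation idea can be sketched as a single aggregation pass over trip records (toy data; the zone names, tip values, and ranking rule below are illustrative assumptions, and the real version runs as SQL aggregates in PostgreSQL):

```python
from collections import defaultdict

# Toy trip records: (pickup_zone, dropoff_zone, tip_usd)
trips = [
    ("LaGuardia", "Midtown", 8.0),
    ("LaGuardia", "Midtown", 6.5),
    ("Midtown", "LaGuardia", 5.0),
    ("Williamsburg", "Midtown", 3.0),
    ("LaGuardia", "Midtown", 7.0),
]

# Single run-through: route -> [trip_count, total_tips]
stats = defaultdict(lambda: [0, 0.0])
for pickup, dropoff, tip in trips:
    stats[(pickup, dropoff)][0] += 1
    stats[(pickup, dropoff)][1] += tip

# Recommend the route with the highest frequency, breaking ties by average tip
best = max(stats, key=lambda r: (stats[r][0], stats[r][1] / stats[r][0]))
print(best)
```

The aggregated centre-point model mentioned above works the same way: compute the per-zone aggregate once, store it, and let every visualization read the precomputed table instead of re-scanning the trips.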
The MovieLens dataset contains ratings from different users for movies released over a period of time. To fully analyse and understand public viewing patterns and other insights, the MovieLens dataset was integrated with the IMDb dataset, one of the largest movie datasets available, by designing relational data models that maximize space and time efficiency, taking full advantage of foreign-key relationships and database constraints. From this new data source, insights into shifts in users' viewing patterns, bias in the MovieLens dataset, viewership by genre, etc. were derived. To quickly assess the quality or watchability of a movie once released, we gave weight to users whose average ratings were close to the general public's (the IMDb score) within some confidence interval, labelled them power users, and assigned proportionate weight based on their past ratings. This way, ratings of a new movie converge quickly to the actual quality of the movie.
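A minimal sketch of the power-user weighting, under assumed numbers (the tolerance band, the 3x weight, and all ratings here are invented; the real scheme derived them from the data):

```python
# Toy data: IMDb consensus scores and per-user (movie, rating) histories
imdb = {"A": 7.8, "B": 6.4, "C": 8.1}
ratings = {
    "u1": [("A", 8.0), ("B", 6.5), ("C", 8.0)],  # tracks IMDb closely
    "u2": [("A", 3.0), ("B", 9.5), ("C", 4.0)],  # far from consensus
}

TOLERANCE = 0.5  # hypothetical confidence band around the IMDb score

def is_power_user(user_ratings):
    """Power user: mean absolute deviation from IMDb stays inside the band."""
    dev = sum(abs(r - imdb[m]) for m, r in user_ratings) / len(user_ratings)
    return dev <= TOLERANCE

power_users = [u for u, rs in ratings.items() if is_power_user(rs)]

# Early ratings of a new movie, weighted toward power users (3x is illustrative)
new_ratings = {"u1": 7.5, "u2": 2.0}
weights = {u: (3.0 if u in power_users else 1.0) for u in new_ratings}
est = sum(weights[u] * r for u, r in new_ratings.items()) / sum(weights.values())
print(power_users, est)
```

Because the power user's vote counts three times, the estimate sits much closer to 7.5 than a plain average would, which is what makes the early rating converge faster to the movie's eventual quality.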
We simulate an attack scenario in which an adversary attacks a recommendation system by manipulating the reward signals to control the actions chosen by the system. For our attack scenario, we took the well-known MovieLens dataset and applied collaborative filtering to model the behaviour of the algorithm. The algorithm detects the attack when the reward deviates beyond a certain threshold and raises an alert.
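The threshold-based detection can be sketched as a trailing-mean monitor on the reward stream (window size, threshold, and reward values below are illustrative assumptions, not the project's tuned parameters):

```python
def detect_attack(rewards, window=5, threshold=2.0):
    """Flag time steps whose reward deviates from the trailing mean by more than threshold."""
    alerts = []
    for t in range(window, len(rewards)):
        baseline = sum(rewards[t - window:t]) / window
        if abs(rewards[t] - baseline) > threshold:
            alerts.append(t)
    return alerts

# Normal rewards around 1.0, then one manipulated spike injected by the adversary
rewards = [1.0, 1.1, 0.9, 1.0, 1.05, 1.0, 5.0, 1.0]
print(detect_attack(rewards))
```

The spike at step 6 deviates from the trailing mean by far more than the threshold, so that step is flagged while the surrounding normal steps are not.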
This work focuses on applying Thompson Sampling, an online learning algorithm, in the linear generalization setting on any sequence of observed states. The Thompson Sampling approach proposed by Aditya Gopalan uses a Gaussian prior and a Gaussian likelihood. The results show that it reduces to Follow the Perturbed Leader (FPL), as in Kalai et al. (2005), with Gaussian noise as opposed to exponentially distributed noise. The regret bound for the Gaussian Thompson Sampling algorithm and its proof are briefly covered in this note by combining all the relevant information.
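As a toy illustration of the Gaussian-prior/Gaussian-likelihood mechanism (a stateless multi-armed bandit, not the linear generalization setting analyzed in the note; arm means and round count are invented), each arm keeps a conjugate Gaussian posterior, and the algorithm plays the arm whose posterior sample is largest:

```python
import random

random.seed(0)

# Gaussian Thompson Sampling sketch: prior N(0, 1) per arm, unit-variance
# Gaussian likelihood, so the posterior has closed-form conjugate updates.
K = 3
true_means = [0.2, 0.5, 0.9]  # hypothetical arm means
n = [0] * K                    # pull counts
sum_r = [0.0] * K              # reward sums

def posterior_sample(arm):
    mean = sum_r[arm] / (n[arm] + 1)   # posterior mean (prior mean 0)
    var = 1.0 / (n[arm] + 1)           # posterior variance shrinks with pulls
    return random.gauss(mean, var ** 0.5)

for t in range(2000):
    arm = max(range(K), key=posterior_sample)  # sample each posterior, play the max
    reward = random.gauss(true_means[arm], 1.0)
    n[arm] += 1
    sum_r[arm] += reward

print(n)  # pull counts should concentrate on the best arm
```

The Gaussian posterior sample plays the same role as the additive perturbation in FPL, which is the reduction the note discusses: sampling from the posterior is equivalent to following the empirical leader perturbed by Gaussian noise.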