JTidy - JTidy



JTidy - JTidy

JTidy is a Java port of HTML Tidy, an HTML syntax checker and pretty printer. Like its non-Java cousin, JTidy can be used as a tool for cleaning up malformed and faulty HTML. In addition, JTidy provides a DOM interface to the document that is being processed, which effectively lets you use JTidy as a DOM parser for real-world HTML.

JTidy was written by Andy Quick, who later stepped down from the maintainer position. Now JTidy is maintained by a group of volunteers.


Read full article from JTidy - JTidy


Z IN ASCII -- Solr Distributed Search and the Stale Check



Z IN ASCII — Solr Distributed Search and the Stale Check

25 Feb 2013 Since version 1.3.0 Solr has provided support for Distributed Search. Distributed Search allows one to run a query against a set of shards, handling the coordination and collation of the results on behalf of the user. This allows an index to be split across multiple nodes while keeping the convenience of a non-distributed search. As an index grows there inevitably comes a time when a single machine cannot adequately take full responsibility for it. The problems range from disk space constraints to inadequate processing power to large document sets with common terms. At some point, splitting the index makes sense because it allows sharing the load as well as parallelizing some of the work. Distributed Search, for the most part, hides the fact that the query is distributed. Yes, the shards must be passed in, so the user must know which shards make up the full index,

Read full article from Z IN ASCII — Solr Distributed Search and the Stale Check


TortoiseSVN - Extended context menu



TortoiseSVN - Extended context menu

Extended context menu: Many people don't know that the Windows shell provides an extended context menu if the shift key is pressed when the menu is shown. For example, on Vista the extended menu has a few additional entries. The screenshots below show this: the left menu is the normal, plain menu and the right menu is the extended menu which you get if you hold down the shift key while right clicking. As you can see, the extended menu has the additional entry "Open Command Window Here" which opens a console window with the current path set to that folder, and the entry "Copy as Path" which copies the path of the file/folder to the clipboard. However, if you try this on the explorer tree view on the left, it won't work.

Read full article from TortoiseSVN - Extended context menu


serialization - Serialize Lucene's OpenBitSet 4.9.0 using Java - Stack Overflow



serialization - Serialize Lucene's OpenBitSet 4.9.0 using Java - Stack Overflow

  • Extend OpenBitSet to get access to protected long[] bits and protected int wlen. These are the only ones that provide state.
  • Implement Externalizable, and (de)serialize those two fields in readExternal and writeExternal.
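
A minimal sketch of that approach, assuming Lucene 4.x's org.apache.lucene.util.OpenBitSet (the subclass name here is made up; Externalizable also requires the public no-arg constructor):

    import java.io.*;
    import org.apache.lucene.util.OpenBitSet;

    // Hypothetical subclass exposing OpenBitSet's state for (de)serialization.
    public class SerializableOpenBitSet extends OpenBitSet implements Externalizable {

        public SerializableOpenBitSet() { super(); }            // required by Externalizable

        public SerializableOpenBitSet(long numBits) { super(numBits); }

        @Override
        public void writeExternal(ObjectOutput out) throws IOException {
            out.writeInt(wlen);               // number of long words actually in use
            out.writeInt(bits.length);
            for (long word : bits) {
                out.writeLong(word);
            }
        }

        @Override
        public void readExternal(ObjectInput in) throws IOException {
            wlen = in.readInt();
            bits = new long[in.readInt()];
            for (int i = 0; i < bits.length; i++) {
                bits[i] = in.readLong();
            }
        }
    }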

  • Read full article from serialization - Serialize Lucene's OpenBitSet 4.9.0 using Java - Stack Overflow


    Salmon Run: Lucene Search within Search with BitSets



    Salmon Run: Lucene Search within Search with BitSets

    Saturday, April 14, 2007 The Search within Search functionality is typically used when presenting results from an index in response to a user-defined query. In addition to the search results returned to the user, we may want to show the user that he can drill down to get more focused search results for his query. So, for example, if the user searched for "cancer" on a medical search engine, in addition to pages that discuss cancer, we may want to show him how many occurrences of "brain cancer", "lung cancer", etc., he can get from our index. These would be represented as a set of navigation links on the page. Clicking one of these links would spawn a fresh search with the augmented query term. If you use Lucene, you will know that the popular ways of doing this are to either use a BooleanQuery or to search using a QueryFilter. A less popular, but incredibly powerful, way to count facet hits is to use BitSets returned by the QueryFilter.

    Read full article from Salmon Run: Lucene Search within Search with BitSets


    How to implement row level access control in Lucene



    How to implement row level access control in Lucene

    09.28.08 In the example below there are two fields. DATA: contains any data that you want your users to be able to search (note: you can have as many data fields as you like). ACL_FIELD: the field used to determine which users have access to this document (note: you can have as many access control fields as you like). All you have to do is build the access control query for each user and submit your user's query unchanged.

        public class TestIndexerSearcher {
            public static void main(String[] args) throws Exception {
                Directory directory = new RAMDirectory();
                IndexWriter indexWriter = new IndexWriter(directory, new StandardAnalyzer());
                indexWriter.addDocument(buildDocument("DATA:sametoken", "ACL_FIELD:access"));
                indexWriter.addDocument(buildDocument("DATA:sametoken", "ACL_FIELD:noaccess"));
                indexWriter.optimize();
                indexWriter.close();
                IndexSearcher indexSearcher = new IndexSearcher(directory);
                QueryParser parser = new QueryParser("DATA", new StandardAnalyzer());
                Query query = parser.
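
    A hedged sketch of how the per-user access-control clause could be built with the same era of Lucene API as the snippet above (BooleanQuery.Builder would replace this in Lucene 5+; the field name ACL_FIELD follows the excerpt, the class and method names are made up):

        import org.apache.lucene.index.Term;
        import org.apache.lucene.search.*;

        class AclQueries {
            // Wrap the parsed user query with an ACL clause so only documents
            // the user may see can match (field name follows the excerpt above).
            static Query secure(Query userQuery, String aclValue) {
                BooleanQuery secured = new BooleanQuery();   // BooleanQuery.Builder in Lucene 5+
                secured.add(userQuery, BooleanClause.Occur.MUST);
                secured.add(new TermQuery(new Term("ACL_FIELD", aclValue)),
                            BooleanClause.Occur.MUST);
                return secured;
            }
        }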

    Read full article from How to implement row level access control in Lucene


    Changing Bits: Fast search filters using flex



    Changing Bits: Fast search filters using flex

    Fast search filters using flex A filter in Lucene is a bit set that restricts the search space for any query; you pass it into IndexSearcher's search method. It's effective for a number of use cases, such as document security, index partitions, facet drill-down, etc. To apply a filter, Lucene must compute the intersection of the documents matching the query against the documents allowed by the filter. Today, we do that in IndexSearcher like this: while (true) { if (scorerDoc == DocIdSetIterator.NO_MORE_DOCS) { } } We call this the "leapfrog approach": the query and the filter take turns trying to advance to each other's next matching document, often jumping past the target document. When both land on the same document, it's collected. Unfortunately, for various reasons this implementation is inefficient (these are spelled out more in LUCENE-1536 ): The advance method for most queries is costly. The advance method for most filters is usually cheap.

    Read full article from Changing Bits: Fast search filters using flex


    Nest - Introduction



    Nest - Introduction

    You've reached the documentation page for Elasticsearch.Net and NEST, the two official .NET clients for Elasticsearch. So why two clients, I hear you say?

    Elasticsearch.Net is a very low-level, dependency-free client that has no opinions about how you build and represent your requests and responses. It has abstracted enough so that all the Elasticsearch API endpoints are represented as methods, but not so much as to get in the way of how you want to build your json/request/response objects. It also comes with built-in, configurable/overridable cluster failover retry mechanisms. Elasticsearch is elastic, so why not your client?

    NEST is a high level client that has the advantage of having mapped all the request and response objects, comes with a strongly typed query DSL that maps 1 to 1 with the Elasticsearch query DSL, and takes advantage of specific .NET features such as covariant results. NEST internally uses, and still exposes, the low level Elasticsearch.Net client.

    Please read the getting started guide for both.


    Read full article from Nest - Introduction


    Elasticsearch.org All About Elasticsearch Filter BitSets | Blog | Elasticsearch



    Elasticsearch.org All About Elasticsearch Filter BitSets | Blog | Elasticsearch

    When building queries in Elasticsearch, you'll often find yourself composing sets of filters. Say you need to filter users that are: Gender: Male, Age: 23-26, Language: English. Generally you should place these filters inside of a Boolean Filter. But wait… isn't there an And Filter? What about the Or Filter and Not Filters? Are they simply an alternative syntax to the Bool filter? The short answer is: No, they are very distinct. More importantly, they can have a big impact on your query performance. BitSets: The key to understanding the difference will require a detour into BitSets. Fundamentally, a BitSet is just an array of bits that represent state. A position in the bit array is either a 1 or a 0. Filters don't score documents – they simply include or exclude. If a document matches a filter, it is represented with a one in the BitSet; otherwise a zero. This means that Elasticsearch can store an entire segment's filter state ("who matches this particular filter?

    Read full article from Elasticsearch.org All About Elasticsearch Filter BitSets | Blog | Elasticsearch


    Changing Bits: Your test cases should sometimes fail!



    Changing Bits: Your test cases should sometimes fail!

    Your test cases should sometimes fail! I'm an avid subscriber of the delightful weekly (sometimes) Python-URL! email, highlighting the past week's interesting discussions across the numerous Python lists . Each summary starts with the best quote from the week; here's last week's quote : "So far as I know, that actually just means that the test suite is insufficient." - Peter Seebach, when an application passes all its tests. I wholeheartedly agree: if your build always passes its tests, that means your tests are not tough enough! Ideally the tests should stay ahead of the software, constantly pulling you forwards to improve its quality. If the tests keep passing, write new ones that fail! Or make existing ones evil-er. You'll be glad to know that Lucene/Solr's tests do sometimes fail, as you can see in the Randomized testing Our test infrastructure has gotten much better, just over the past 6 months or so, through heavy use of randomization.

    Read full article from Changing Bits: Your test cases should sometimes fail!


    svn - How to view all revisions in TortoiseSVN? - Stack Overflow



    svn - How to view all revisions in TortoiseSVN? - Stack Overflow

    I'm really sorry, but did you try to click the Show All button? Or did I misunderstand something?

    Also, you may change the number of revisions shown at the beginning: Settings --> General --> Dialogs 1 --> Default number of messages


    Read full article from svn - How to view all revisions in TortoiseSVN? - Stack Overflow


    ShawnHeisey - Solr Wiki



    ShawnHeisey - Solr Wiki

    GC Tuning for Solr The secret to GC tuning: Eliminating full garbage collections. A full garbage collection is slow, no matter what collector you're using. If you can set up the options for GC such that both young and old generations are always kept below max size by their generation-specific collection algorithms, performance is almost guaranteed to be good. G1 (Garbage First) Collector With the newest Oracle Java versions, G1GC is looking extremely good. Do not try these options unless you're willing to run the latest Java 7 or Java 8, preferably from Oracle. Experiments with early Java 7 releases were very disappointing. My testing has been with Oracle 7u72, and I have been informed on multiple occasions that Oracle 8u40 will have much better G1 performance. I typically will use a Java heap of 6GB or 7GB. These settings were created as a result of a discussion with Oracle employees on the hotspot-gc-use mailing list : JVM_OPTS=" \ -XX:+UseG1GC \ -XX:
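
    The excerpt cuts off mid-option. Purely as an illustration of the shape of such settings (these are commonly used G1 flags, not necessarily the wiki's exact list), the start script might export something like:

        JVM_OPTS=" \
          -XX:+UseG1GC \
          -XX:+ParallelRefProcEnabled \
          -XX:G1HeapRegionSize=8m \
          -XX:MaxGCPauseMillis=250 \
          -XX:InitiatingHeapOccupancyPercent=75 \
          -XX:+AggressiveOpts"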

    Read full article from ShawnHeisey - Solr Wiki


    Code Reuse in Google Chrome Browser - good coders code, great reuse



    Code Reuse in Google Chrome Browser - good coders code, great reuse

    As everyone already knows, Google released a new open-source web browser called Chrome. Having an interest in code reuse, I downloaded the source code and examined all the open-source libraries used. The Google Chrome browser is an excellent example of code reuse. I found that they use at least 25 different software libraries! Here is the full list of libraries, along with relative paths to source code and short library descriptions. Many of the libraries have been patched by googlers; look for README.google files in each library directory for information about changes. Library /src/v8 Google's open source JavaScript engine. V8 implements ECMAScript as specified in ECMA-262, 3rd edition, and runs on Windows XP and Vista, Mac OS X 10.5 (Leopard), and Linux systems that use IA-32 or ARM processors. V8 can run standalone, or can be embedded into any C++ application. /src/testing/gtest Google's framework for writing C++ tests on a variety of platforms (Linux, Mac OS X, Windows,

    Read full article from Code Reuse in Google Chrome Browser - good coders code, great reuse


    [SOLR-6705] SOLR Start script generates warnings on java 8 due ot experimental options - ASF JIRA



    [SOLR-6705] SOLR Start script generates warnings on java 8 due ot experimental options - ASF JIRA

    Java Bugs in various JVMs affecting Lucene / Solr: Sometimes Lucene runs afoul of bugs in JVM implementations from different vendors. In certain cases we whittle it down to a small test case, open an issue with the vendor, and hopefully it gets fixed. In other cases we know the bug is in the JVM but we haven't narrowed it enough to open a bug with the vendor. Sometimes we can work out a simple workaround in Lucene. We try to open a Lucene mirror bug to provide details on how Lucene is affected, iterate on a compact test case, etc. Oracle Java / Sun Java / OpenJDK Bugs: If you are affected by one of these issues that Oracle's Java has yet to accept or resolve, or simply have some spare votes, please consider adding your vote to the bug (on Oracle's bug page). Do not, under any circumstances, run Lucene with the G1 garbage collector. Lucene's test suite fails with the G1 garbage collector on a regular basis, including bugs that cause index corruption.

    Read full article from [SOLR-6705] SOLR Start script generates warnings on java 8 due ot experimental options - ASF JIRA


    Introducing streaming k-means in Spark 1.2 - Databricks



    Introducing streaming k-means in Spark 1.2 – Databricks

    January 28, 2015 | by Jeremy Freeman (Howard Hughes Medical Institute) Many real world data are acquired sequentially over time, whether messages from social media users, time series from wearable sensors, or — in a case we are particularly excited about — the firing of large populations of neurons. In these settings, rather than wait for all the data to be acquired before performing our analyses, we can use streaming algorithms to identify patterns over time, and make more targeted predictions and decisions. One simple strategy is to build machine learning models on static data, and then use the learned model to make predictions on an incoming data stream. But what if the patterns in the data are themselves dynamic? That’s where streaming algorithms come in. A key advantage of Spark is that its machine learning library (MLlib) and its library for stream processing (Spark Streaming) are built on the same core architecture for distributed analytics.

    Read full article from Introducing streaming k-means in Spark 1.2 – Databricks


    java - implements Closeable or implements AutoCloseable - Stack Overflow



    java - implements Closeable or implements AutoCloseable - Stack Overflow

    Closeable extends AutoCloseable, and is specifically dedicated to IO streams: it throws IOException instead of Exception, and is idempotent, whereas AutoCloseable doesn't provide this guarantee.

    This is all explained in the javadoc of both interfaces.

    Implementing AutoCloseable (or Closeable) allows a class to be used as a resource of the try-with-resources construct introduced in Java 7, which allows closing such resources automatically at the end of a block, without having to add a finally block which closes the resource explicitly.

    Your class doesn't represent a closeable resource, and there's absolutely no point in implementing this interface: an IOTest can't be closed. It shouldn't even be possible to instantiate it, since it doesn't have any instance method. Remember that implementing an interface means that there is an is-a relationship between the class and the interface. You have no such relationship here.


    Read full article from java - implements Closeable or implements AutoCloseable - Stack Overflow


    Nexus 5 can't connect to Win8 PC via USB. [Solved]



    Nexus 5 can't connect to Win8 PC via USB. [Solved]

    1) First check that MTP is turned on in the Android settings window (in Settings, under Storage, click the 3 dots). This was already checked for me, so it wasn't the problem.

    Read full article from Nexus 5 can't connect to Win8 PC via USB. [Solved]


    codebytes: DelayQueue Usage Java Concurrency



    codebytes: DelayQueue Usage Java Concurrency

    DelayQueue is an unbounded BlockingQueue of objects that implement the Delayed interface. An object can only be taken from the queue when its delay has expired. The queue is sorted so that the object at the head has a delay that has expired for the longest time. If no delay has expired, then there is no head element and poll() will return null.
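
    A small, self-contained sketch of that behavior; the DelayedTask class and its names are made up for illustration:

        import java.util.concurrent.*;

        public class DelayQueueDemo {
            // A task that becomes available only after its delay has expired.
            static class DelayedTask implements Delayed {
                final String name;
                final long triggerTimeNanos;

                DelayedTask(String name, long delayMillis) {
                    this.name = name;
                    this.triggerTimeNanos = System.nanoTime()
                            + TimeUnit.MILLISECONDS.toNanos(delayMillis);
                }

                @Override
                public long getDelay(TimeUnit unit) {
                    return unit.convert(triggerTimeNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
                }

                @Override
                public int compareTo(Delayed other) {
                    return Long.compare(getDelay(TimeUnit.NANOSECONDS),
                                        other.getDelay(TimeUnit.NANOSECONDS));
                }
            }

            public static void main(String[] args) throws InterruptedException {
                DelayQueue<DelayedTask> queue = new DelayQueue<>();
                queue.put(new DelayedTask("long", 2000));
                queue.put(new DelayedTask("short", 500));

                System.out.println(queue.poll());        // null: no delay has expired yet
                System.out.println(queue.take().name);   // blocks ~500 ms, then prints "short"
                System.out.println(queue.take().name);   // prints "long" after ~2 s total
            }
        }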

    Read full article from codebytes: DelayQueue Usage Java Concurrency


    codebytes: Java Concurrency : Usage of BlockingQueue class



    codebytes: Java Concurrency : Usage of BlockingQueue class

    BlockingQueue allows you to put objects into it and take out objects one at a time. So you don't need to worry about concurrent threads' synchronization for queue access. If there are no elements, the accessing thread simply blocks and resumes when elements are added for the thread to access.
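
    A minimal sketch of that blocking behavior using an ArrayBlockingQueue (names and sizes are illustrative):

        import java.util.concurrent.*;

        public class BlockingQueueDemo {
            public static void main(String[] args) throws InterruptedException {
                // Bounded queue: put() blocks when full, take() blocks when empty.
                BlockingQueue<String> queue = new ArrayBlockingQueue<>(2);

                Thread consumer = new Thread(() -> {
                    try {
                        // Blocks until the main thread puts something.
                        System.out.println("took: " + queue.take());
                        System.out.println("took: " + queue.take());
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
                consumer.start();

                Thread.sleep(500);        // let the consumer block on take() first
                queue.put("first");
                queue.put("second");
                consumer.join();
            }
        }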

    Read full article from codebytes: Java Concurrency : Usage of BlockingQueue class


    codebytes: Concurrency : Using the "synchronized" keyword [Java]



    codebytes: Concurrency : Using the "synchronized" keyword [Java]

    Let us suppose you have two fields in a class and a method that modifies the fields non-atomically.
    Let's say you need to increment an int field by two and you use

        field1++;
        Thread.yield(); // This is optional. It just says "I've done the important work; some other thread may be selected for execution now."
        field1++;

    so that the current thread can get its operations paused in the middle while another thread is selected by the JVM
    for execution. According to our assumptions, the value of field1 should always be even; otherwise an exception
    is to be thrown.
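
    A hedged sketch of the fix the post builds toward: doing both increments (and the check) under the same lock keeps the "always even" invariant, even if the scheduler switches threads in between:

        public class EvenCounter {
            private int field1 = 0;

            // Both increments happen under the same lock, so no other thread
            // can observe field1 while it is odd.
            public synchronized void incrementByTwo() {
                field1++;
                Thread.yield();   // another thread may run, but cannot enter a synchronized method on this object
                field1++;
            }

            public synchronized int get() {
                if (field1 % 2 != 0) {
                    throw new IllegalStateException("field1 should always be even");
                }
                return field1;
            }
        }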

    Read full article from codebytes: Concurrency : Using the "synchronized" keyword [Java]


    Producer Consumer Design Pattern with Blocking Queue Example in Java



    Producer Consumer Design Pattern with Blocking Queue Example in Java

    Producer Consumer Design Pattern is a classic concurrency or threading pattern which reduces coupling between producers and consumers. Benefits of the Producer Consumer Pattern: it is indeed a useful design pattern, used most commonly while writing multi-threaded or concurrent code. Here are a few of its benefits: 1) The Producer Consumer Pattern simplifies development: you can code the Producer and the Consumer independently and concurrently; they just need to know the shared object. 2) The Producer doesn't need to know who the consumer is or how many consumers there are. The same is true for the Consumer. 3) The Producer and Consumer can work at different speeds, and there is no risk of the Consumer consuming a half-baked item. In fact, by monitoring consumer speed one can introduce more consumers for better utilization.
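
    A compact sketch of the pattern with a BlockingQueue as the shared object; the poison-pill shutdown value is an assumption added here to keep the example self-contained:

        import java.util.concurrent.*;

        public class ProducerConsumerDemo {
            private static final int POISON_PILL = -1;   // illustrative end-of-stream marker

            public static void main(String[] args) throws InterruptedException {
                BlockingQueue<Integer> queue = new LinkedBlockingQueue<>(10);

                Thread producer = new Thread(() -> {
                    try {
                        for (int i = 0; i < 5; i++) {
                            queue.put(i);                    // blocks if the queue is full
                            System.out.println("produced " + i);
                        }
                        queue.put(POISON_PILL);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });

                Thread consumer = new Thread(() -> {
                    try {
                        while (true) {
                            int item = queue.take();         // blocks if the queue is empty
                            if (item == POISON_PILL) break;
                            System.out.println("consumed " + item);
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });

                producer.start();
                consumer.start();
                producer.join();
                consumer.join();
            }
        }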

    Read full article from Producer Consumer Design Pattern with Blocking Queue Example in Java


    System Prevalence - Wikipedia, the free encyclopedia



    System Prevalence - Wikipedia, the free encyclopedia

    In a prevalent system, state is kept in memory in native format, all transactions are journaled and system images are regularly saved to disk. System images and transaction journals can be stored in language-specific serialization format for speed or in XML format for cross-language portability. The first usage of the term and generic, publicly available implementation of a system prevalence layer was Prevayler, written for Java by Klaus Wuestefeld in 2001. [1] Simply keeping system state in RAM in its normal, natural, language-specific format is orders of magnitude faster and more programmer-friendly than the multiple conversions that are needed when it is stored and retrieved from a DBMS. As an example,

    Read full article from System Prevalence - Wikipedia, the free encyclopedia


    kotek.net



    kotek.net

    3 billion items in Java Map with 16 GB RAM. One rainy evening I meditated about memory management in Java and how effectively Java collections utilise memory. I made a simple experiment: how many entries can I insert into a Java Map with 16 GB of RAM? The goal of this experiment is to investigate the internal overhead of collections, so I decided to use small keys and small values. All tests were made on Linux 64bit Kubuntu 12.04. The JVM was 64bit Oracle Java 1.7.0_09-b05 23.5-b02. There is an option to use compressed pointers (-XX:+UseCompressedOops), which is on by default on this JVM. First is a naive test with java.util.TreeMap. It inserts numbers into the map until it runs out of memory and ends with an exception. JVM settings for this test were -Xmx15G

        import java.util.*;
        Map m = new TreeMap();
        for (long counter = 0; ; counter++) {
            m.put(counter, "");
            if (counter % 1000000 == 0) System.out.println("" + counter);
        }

    This example ended at 172 million entries.

    Read full article from kotek.net


    Journal of a Girl in IT: Java HashMap with millions and millions of records



    Journal of a Girl in IT: Java HashMap with millions and millions of records

    Monday, June 3, 2013 Java HashMap with millions and millions of records. I am just trying to find an answer to this problem: if you have millions and millions of records stored in a map, you are sure to run out of memory. Here are some alternatives: MapDB, JDBM2, Buzz Hash. Some basics on HashMap: an instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets. Your problem is that 1.

    Read full article from Journal of a Girl in IT: Java HashMap with millions and millions of records


    Solr Configuration Best Practices and Troubleshooting Tips : Support



    Solr Configuration Best Practices and Troubleshooting Tips : Support

    Darla Baker. Configuration Best Practices (straying from these recommendations is the root cause of 90% of issues). Hardware: use SSDs if possible. If using mechanical disks, set up at least 4 volumes with a set of dedicated heads for each: OS, commit log, SSTables, Solr data. If using SSDs you can choose either RAID0 or JBOD. Additional cores are helpful because they will increase query/indexing throughput. Hardware/Cluster Sizing: create a CF with the proposed schema and configurations, load one thousand mock/sample records, and extrapolate from those numbers the index size for the expected total record count. Example: if your index size for those thousand records is 1GB, and you expect one million records, then your index size will be 1000GB. Your cluster must be large enough so that the total cluster memory is large enough to cache the total index size, and hot dataset, subtracting for the JVM heap and OS overhead. Assume 1GB of memory for the OS and 14 GB of memory for the JVM heap, or an overhead of 15GB.

    Read full article from Solr Configuration Best Practices and Troubleshooting Tips : Support


    [Solr-user] Disable cache ? - Grokbase



    [Solr-user] Disable cache ? - Grokbase

    I think you could disable Solr caches by setting their size to 0 (deleting
    them won't work, as for example, the FieldValueCache will take default
    values, not sure about the other ones). I don't think you'll be able to
    disable Lucene's Field Cache.

    Read full article from [Solr-user] Disable cache ? - Grokbase


    Use of cache=false and cost parameters | Solr Enterprise Search



    Use of cache=false and cost parameters | Solr Enterprise Search

    Parameter cache=false: setting the cache parameter to false tells Solr not to cache current query results. This parameter can also be used as a filter query (fq) attribute, which tells Solr not to cache filter query results. What do we get from such behavior? Let's imagine the following filter as a part of the query: fq={!frange l=10 u=100}log(sum(sqrt(popularity),100)) If we know that queries with a filter like the above one are rare, we can decide not to cache them and not change cache state for irrelevant data. To do that we add the cache=false attribute in the following way: fq={!frange l=10 u=100 cache=false}log(sum(sqrt(popularity),100)) As I said, adding this additional attribute will result in the filter results not being cached. Parameter cost: an additional feature of Solr 3.4 is the possibility to set filter cost for those filters that we don't want to cache. Filter queries with a specified cost are executed last, after all the cached filters.

    Read full article from Use of cache=false and cost parameters | Solr Enterprise Search


    codebytes: Calculating LCM of N numbers using Euclid's Algorithm



    codebytes: Calculating LCM of N numbers using Euclid's Algorithm

    This program calculates the LCM of N numbers using Euclid's method.

    Visit here for calculation of LCM using Common Factors Grid Method.

    Read about Euclid's method here.
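
    A short sketch of the usual approach: Euclid's algorithm gives the GCD, lcm(a, b) = a / gcd(a, b) * b, and the pairwise LCM is folded over the array:

        public class LcmOfN {
            // Euclid's algorithm: gcd(a, b) = gcd(b, a mod b).
            static long gcd(long a, long b) {
                return b == 0 ? a : gcd(b, a % b);
            }

            // lcm(a, b) = a / gcd(a, b) * b; dividing first reduces the risk of overflow.
            static long lcm(long a, long b) {
                return a / gcd(a, b) * b;
            }

            // LCM of N numbers: fold the pairwise LCM over the array.
            static long lcm(long... numbers) {
                long result = 1;
                for (long n : numbers) {
                    result = lcm(result, n);
                }
                return result;
            }

            public static void main(String[] args) {
                System.out.println(lcm(4, 6, 10));   // 60
            }
        }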

    Read full article from codebytes: Calculating LCM of N numbers using Euclid's Algorithm


    GUI automation using a Robot | Undocumented Matlab



    GUI automation using a Robot | Undocumented Matlab

    Wednesday, September 15th, 2010 Hello there! If you are new here, you might want to subscribe to the RSS feed or email feed for updates on Undocumented Matlab topics. I would like to welcome guest writer Takeshi (Kesh) Ikuma . Kesh has posted several interesting utilities on the Matlab File Exchange, including the award-winning Enhanced Input Dialog Box . Today, Kesh will describe how we can automate GUI actions programmatically. Automating GUI actions, including controlling a mouse and keyboard programmatically, is often useful. This can be used, for example, to demonstrate GUI usage or to perform a recorded macro. Matlab's Handle Graphics interface provides a simple way to manipulate the mouse cursor position, but not to emulate mouse or keyboard clicks. However, we can utilize Java's java.awt.Robot class. This article provides an overview of the Robot class and how it can be used to program mouse movement, button clicks and keyboard strikes. java.awt.

    Read full article from GUI automation using a Robot | Undocumented Matlab


    Write Your Own Custom Automation Using java.awt.Robot | Javalobby



    Write Your Own Custom Automation Using java.awt.Robot | Javalobby

    The following tip shows how to use java.awt.Robot to create your own handy custom made automation. Running this example will open up Firefox in your system and type in twitter.com, loading the page for you. Isn't that cool? Try it yourself.

    Read full article from Write Your Own Custom Automation Using java.awt.Robot | Javalobby


    codebytes: Using java.awt.Robot class



    codebytes: Using java.awt.Robot class

    The java.awt.Robot class is used to generate native system input events for the purposes of test automation, self-running demos, and other applications where control of the mouse and keyboard is needed. The primary purpose of Robot is to facilitate automated testing of Java platform implementations.
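
    A minimal sketch of driving the mouse and keyboard with java.awt.Robot; the coordinates and keystrokes are arbitrary:

        import java.awt.AWTException;
        import java.awt.Robot;
        import java.awt.event.InputEvent;
        import java.awt.event.KeyEvent;

        public class RobotDemo {
            public static void main(String[] args) throws AWTException {
                Robot robot = new Robot();
                robot.setAutoDelay(50);                              // pause 50 ms after each event

                robot.mouseMove(200, 200);                           // move the pointer
                robot.mousePress(InputEvent.BUTTON1_DOWN_MASK);      // a left click
                robot.mouseRelease(InputEvent.BUTTON1_DOWN_MASK);

                robot.keyPress(KeyEvent.VK_H);                       // type "hi"
                robot.keyRelease(KeyEvent.VK_H);
                robot.keyPress(KeyEvent.VK_I);
                robot.keyRelease(KeyEvent.VK_I);
            }
        }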

    Read full article from codebytes: Using java.awt.Robot class


    codebytes: Displaying the contents of an arbitrarily complex nested list - Coding Interview Question



    codebytes: Displaying the contents of an arbitrarily complex nested list - Coding Interview Question

    Write a function (in pseudo-code) called dumpList that takes as its parameters a string and a reference to an arbitrarily complex nested list and prints the value of each list element on a separate line. The value of each line should be preceded by the string and numbers indicating the depth and index of the element in the list. Assume that the list contains only strings and other nested lists.

    Let's take a look at an example of what we want exactly in the dumpList function. Suppose that you are given the following nested list. A nested list is just a list that contains other lists as well – so in the list below you see that it also contains the lists ['a','b','c'] and ['eggs'] :
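
    A hedged Java rendering of that pseudo-code, using List<Object> whose elements are either Strings or further nested lists; the exact output format is an assumption:

        import java.util.Arrays;
        import java.util.List;

        public class DumpList {
            // Prints each element on its own line, prefixed with the given string,
            // the depth, and the element's index within its list.
            static void dumpList(String prefix, List<?> list, int depth) {
                for (int i = 0; i < list.size(); i++) {
                    Object element = list.get(i);
                    if (element instanceof List) {
                        dumpList(prefix, (List<?>) element, depth + 1);
                    } else {
                        System.out.println(prefix + " depth=" + depth + " index=" + i + " value=" + element);
                    }
                }
            }

            public static void main(String[] args) {
                List<Object> nested = Arrays.asList(
                        "spam", Arrays.asList("a", "b", "c"), "ham", Arrays.asList("eggs"));
                dumpList("item:", nested, 0);
            }
        }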

    Read full article from codebytes: Displaying the contents of an arbitrarily complex nested list - Coding Interview Question


    codebytes: Quick Sort With Minor Improvements



    codebytes: Quick Sort With Minor Improvements

    1. Setting the element with index hi (the pivot element) equal to the median of the three elements at lo, hi and (lo+hi)/2, so that the pivot is more likely to lie between the other values.

    2. Using the faster Insertion sort if the number of elements in the sub array is less than some predefined value (the value 5 is used here).
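
    A sketch combining both tweaks (the cutoff of 5 follows the post; the rest is an ordinary Lomuto-partition quicksort):

        import java.util.Arrays;

        public class TunedQuickSort {
            private static final int CUTOFF = 5;   // value used in the post

            static void sort(int[] a) { sort(a, 0, a.length - 1); }

            private static void sort(int[] a, int lo, int hi) {
                if (hi - lo + 1 <= CUTOFF) {       // small subarray: insertion sort is faster
                    insertionSort(a, lo, hi);
                    return;
                }
                medianOfThreeToHi(a, lo, hi);      // pivot (a[hi]) becomes the median of lo, mid, hi
                int p = partition(a, lo, hi);
                sort(a, lo, p - 1);
                sort(a, p + 1, hi);
            }

            // Order a[lo], a[mid], a[hi], then move the median into a[hi] to act as the pivot.
            private static void medianOfThreeToHi(int[] a, int lo, int hi) {
                int mid = lo + (hi - lo) / 2;
                if (a[mid] < a[lo]) swap(a, mid, lo);
                if (a[hi]  < a[lo]) swap(a, hi,  lo);
                if (a[hi]  < a[mid]) swap(a, hi, mid);
                swap(a, mid, hi);                  // median now sits at hi
            }

            // Lomuto partition around the pivot a[hi].
            private static int partition(int[] a, int lo, int hi) {
                int pivot = a[hi], i = lo;
                for (int j = lo; j < hi; j++) {
                    if (a[j] < pivot) swap(a, i++, j);
                }
                swap(a, i, hi);
                return i;
            }

            private static void insertionSort(int[] a, int lo, int hi) {
                for (int i = lo + 1; i <= hi; i++) {
                    int key = a[i], j = i - 1;
                    while (j >= lo && a[j] > key) a[j + 1] = a[j--];
                    a[j + 1] = key;
                }
            }

            private static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }

            public static void main(String[] args) {
                int[] data = {9, 4, 7, 1, 8, 2, 6, 3, 5, 0};
                sort(data);
                System.out.println(Arrays.toString(data));   // [0, 1, 2, ..., 9]
            }
        }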

    Read full article from codebytes: Quick Sort With Minor Improvements


    codebytes: 3-Way-Quicksort with Improvements [Tukey's Ninther]



    codebytes: 3-Way-Quicksort with Improvements [Tukey's Ninther]

    1. If the number of elements is less than 10, use insertion sort.
    2. If the number of elements is more, swap the first element with the mid element.
    3. If the number of elements is even more, swap the first element of the subarray with the median of the three elements (lo, mid, hi).
    4. If the number of elements is even more, swap the first element with Tukey's ninther (the median of the medians of three sampled triples).

    Read full article from codebytes: 3-Way-Quicksort with Improvements [Tukey's Ninther]


    codebytes: Division without using the Division Operator [Java]



    codebytes: Division without using the Division Operator [Java]

    Division without using the Division Operator [Java]

    This program divides two numbers without using the division operator (/).
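
    A small sketch using shift-and-subtract (binary long division); only non-negative operands are handled to keep it short:

        public class DivisionWithoutDivide {
            // Returns the quotient of dividend / divisor using only shifts and subtraction.
            static int divide(int dividend, int divisor) {
                if (divisor == 0) throw new ArithmeticException("divide by zero");
                int quotient = 0;
                // Try the largest shifted divisor first, then smaller ones.
                for (int shift = 30; shift >= 0; shift--) {
                    if ((divisor << shift) > 0 && (divisor << shift) <= dividend) {
                        dividend -= divisor << shift;
                        quotient |= 1 << shift;
                    }
                }
                return quotient;
            }

            public static void main(String[] args) {
                System.out.println(divide(100, 7));   // 14
                System.out.println(divide(42, 6));    // 7
            }
        }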

    Read full article from codebytes: Division without using the Division Operator [Java]


    codebytes: Calculating the value of PI accurately to 5 decimal places



    codebytes: Calculating the value of PI accurately to 5 decimal places

    This program uses the Gregory-Leibniz series for calculating the value of PI. Note that this series is slow and there exist other faster algorithms for calculating the value of PI. 

    A simple infinite series for π is the Gregory–Leibniz series:
        π = 4/1 - 4/3 + 4/5 - 4/7 + 4/9 - 4/11 + 4/13 - ⋯
    As individual terms of this infinite series are added to the sum, the total gradually gets closer to π, and – with a sufficient number of terms – can get as close to π as desired. It converges quite slowly, though – after 500,000 terms, it produces only five correct decimal digits of π.
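
    A tiny sketch of summing that series for 500,000 terms, matching the five-digit claim above:

        public class GregoryLeibnizPi {
            public static void main(String[] args) {
                int terms = 500_000;
                double pi = 0.0;
                // pi = 4/1 - 4/3 + 4/5 - 4/7 + ...
                for (int k = 0; k < terms; k++) {
                    double term = 4.0 / (2 * k + 1);
                    pi += (k % 2 == 0) ? term : -term;
                }
                System.out.printf("approximation: %.7f%n", pi);   // roughly 3.14159..., five correct decimal digits
                System.out.printf("Math.PI:       %.7f%n", Math.PI);
            }
        }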

    Read full article from codebytes: Calculating the value of PI accurately to 5 decimal places


    Apple In New Security Concessions To Beijing - Forbes



    Apple In New Security Concessions To Beijing - Forbes

    Opinions expressed by Forbes contributors are their own. I have lived and worked in China for 15 years, much of that as a journalist for Reuters writing about Chinese companies. I currently live in Shanghai where I teach financial journalism at Fudan University. I write daily on my blog, Young's China Business Blog (www.youngchinabiz.com), commenting on the latest developments at Chinese companies listed in the US, China and Hong Kong. I am also author of a new book about the media in China, "The Party Line: How the Media Dictates Public Opinion in Modern China." Bottom line: Apple's allowance of audits of its products by Chinese inspectors marks its latest compromise to address China's national security concerns, and could mark the start of a more transparent approach on the issue by Beijing. Global gadget leader Apple is deepening its uneasy embrace with Beijing security officials,

    Read full article from Apple In New Security Concessions To Beijing - Forbes


    codebytes: Manage multiple stacks using an ArrayList of Stack objects.



    codebytes: Manage multiple stacks using an ArrayList of Stack objects.

    Q. Imagine a (literal) stack of plates. If the stack gets too high, it might topple. Therefore, in real life, we would likely start a new stack when the previous stack exceeds some threshold. Implement a data structure that mimics this. This data structure should be composed of several stacks and should create a new stack once the previous one exceeds capacity. Its push() and pop() should behave identically to a single stack (that is, pop() should return the same values as it would if there were just a single stack).

    Implement a function popAt(int index) which performs a pop operation on a specific sub-stack.

    Approach:

    1. Use an ArrayList to manage individual stacks.
    2. When a stack gets full, create another one and push the value on to the new stack.
    3. If a stack gets empty, remove it from the ArrayList.
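
    A compact sketch of that structure; the capacity, class name and popAt behavior (dropping an emptied sub-stack) follow the steps above, everything else is illustrative:

        import java.util.ArrayDeque;
        import java.util.ArrayList;
        import java.util.Deque;
        import java.util.List;
        import java.util.NoSuchElementException;

        public class SetOfStacks {
            private final int capacity;
            private final List<Deque<Integer>> stacks = new ArrayList<>();

            public SetOfStacks(int capacity) { this.capacity = capacity; }

            public void push(int value) {
                // Start a new sub-stack when the last one is missing or full.
                if (stacks.isEmpty() || stacks.get(stacks.size() - 1).size() == capacity) {
                    stacks.add(new ArrayDeque<>());
                }
                stacks.get(stacks.size() - 1).push(value);
            }

            public int pop() {
                return popAt(stacks.size() - 1);
            }

            // Pops from a specific sub-stack; removes the sub-stack if it becomes empty.
            public int popAt(int index) {
                if (stacks.isEmpty()) throw new NoSuchElementException("empty");
                Deque<Integer> stack = stacks.get(index);
                int value = stack.pop();
                if (stack.isEmpty()) stacks.remove(index);
                return value;
            }

            public static void main(String[] args) {
                SetOfStacks set = new SetOfStacks(2);
                for (int i = 1; i <= 5; i++) set.push(i);   // sub-stacks: [2,1], [4,3], [5]
                System.out.println(set.popAt(0));           // 2
                System.out.println(set.pop());              // 5
                System.out.println(set.pop());              // 4
            }
        }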

    Read full article from codebytes: Manage multiple stacks using an ArrayList of Stack objects.


    codebytes: Implement a stack whose push, pop and min operations all operate in O(1) time.



    codebytes: Implement a stack whose push, pop and min operations all operate in O(1) time.

    Q. How would you design a stack which, in addition to push and pop, also has a function min which returns the minimum element? Push, pop and min should all operate in O(1) time.

    Algorithm:

    1. Each stack element stores the value of that element and an index of the element that will be the lowest if the current element was removed.
    2. int mI (field in class Stack) stores the index of the smallest element in the stack.
    3. When we add an element we set its lowest-index field to mI. If the current element is smaller than the current lowest, mI is set to the current index.
    4. When we remove an element, if it is the mI index, we set mI to the local mI stored in that element.
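
    A Java sketch of exactly that bookkeeping, array-backed for brevity; mI and the per-element stored index follow the steps above:

        import java.util.Arrays;
        import java.util.EmptyStackException;

        public class MinStack {
            private int[] values = new int[16];
            private int[] prevMinIndex = new int[16];
            private int size = 0;
            private int mI = -1;                 // index of the current minimum, -1 when empty

            public void push(int value) {
                if (size == values.length) {     // grow when full
                    values = Arrays.copyOf(values, size * 2);
                    prevMinIndex = Arrays.copyOf(prevMinIndex, size * 2);
                }
                values[size] = value;
                prevMinIndex[size] = mI;         // min index to restore once this element is popped
                if (mI == -1 || value < values[mI]) {
                    mI = size;                   // this element is the new minimum
                }
                size++;
            }

            public int pop() {
                if (size == 0) throw new EmptyStackException();
                int top = values[--size];
                if (mI == size) {                // we are removing the current minimum
                    mI = prevMinIndex[size];
                }
                return top;
            }

            public int min() {
                if (mI == -1) throw new EmptyStackException();
                return values[mI];
            }

            public static void main(String[] args) {
                MinStack s = new MinStack();
                s.push(5); s.push(2); s.push(7); s.push(1);
                System.out.println(s.min());   // 1
                s.pop();                       // removes 1
                System.out.println(s.min());   // 2
                s.pop();                       // removes 7
                s.pop();                       // removes 2
                System.out.println(s.min());   // 5
            }
        }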

    Read full article from codebytes: Implement a stack whose push, pop and min operations all operate in O(1) time.


    codebytes: An algorithm for returning the node at the beginning of a loop in a singly linked list.



    codebytes: An algorithm for returning the node at the beginning of a loop in a singly linked list.

    Thursday, December 25, 2014 An algorithm for returning the node at the beginning of a loop in a singly linked list. Q. Given a circular linked list, implement an algorithm which returns the node at the beginning of the loop. DEFINITION Circular linked list: a (corrupt) linked list in which a node's next pointer points to an earlier node, so as to make a loop in the linked list. EXAMPLE Input: A -> B -> C -> D -> E -> C (the same C as earlier) Output: C Methods: #1. Use a Set to record every node that you visit. When a node that is already present in the set is encountered, that is the beginning of the loop. If the pointer encounters null, no loop is present. #2. Consider two pointers p1 and p2. p2 travels 2 nodes at a time, p1 travels 1 node at a time. Let them meet (if there is a loop, they will surely meet, but if there isn't, the fast pointer will encounter a null). Now that they have met, set p1 to the head of the list. Now let p1 and p2 travel 1 node at a time.
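
    A minimal sketch of method #2 (Floyd's two-pointer approach); the Node class is illustrative:

        public class LoopStart {
            static class Node {
                int value;
                Node next;
                Node(int value) { this.value = value; }
            }

            // Floyd's cycle detection: returns the node where the loop begins, or null if no loop.
            static Node findLoopStart(Node head) {
                Node slow = head, fast = head;
                while (fast != null && fast.next != null) {
                    slow = slow.next;            // 1 step
                    fast = fast.next.next;       // 2 steps
                    if (slow == fast) {          // they met inside the loop
                        slow = head;             // restart one pointer at the head
                        while (slow != fast) {   // both now advance 1 step; they meet at the loop start
                            slow = slow.next;
                            fast = fast.next;
                        }
                        return slow;
                    }
                }
                return null;                     // fast hit null: no loop
            }

            public static void main(String[] args) {
                Node a = new Node('A'), b = new Node('B'), c = new Node('C'),
                     d = new Node('D'), e = new Node('E');
                a.next = b; b.next = c; c.next = d; d.next = e; e.next = c;   // E points back to C
                System.out.println((char) findLoopStart(a).value);            // C
            }
        }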

    Read full article from codebytes: An algorithm for returning the node at the beginning of a loop in a singly linked list.


    Bitonic Merge | Techblog



    Bitonic Merge | Techblog

    Given an array of size n wherein elements keep on increasing
    monotonically up to a certain location, after which they keep on
    decreasing monotonically, then again keep on increasing, then
    decreasing again and so on. Sort the array in O(n) time and O(1) space.


    or same question can be asked in a different way

    Given an array of n elements and an integer k where k < n, such that
    a[0]...a[k] and a[k+1]...a[n-1] are already sorted. Give an
    algorithm to sort in O(n) time and O(1) space.

    Read full article from Bitonic Merge | Techblog


    Algo Ramblings: Find element in bitonic array



    Algo Ramblings: Find element in bitonic array

    Given an array which is monotonically increasing and then decreasing . Write an algorithm to search for a given element. Expected algorithm O(Log n)

    An array of numbers is bitonic if it consists of a strictly increasing sequence followed by a strictly decreasing sequence. We can solve the problem in 3 log N comparisons by finding the maximum in the array and then doing two binary searches, one on the increasing and one on the decreasing sequence. The maximum can be found in log N comparisons using binary search. Each step compares two adjacent numbers A[i] and A[i+1]. If they are equal, they are both the maximum. If A[i] is smaller, we restrict the search to indices greater than i; if A[i] is bigger, we restrict the search to indices at most i.
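
    A sketch of that 3 log N strategy: one binary search for the peak, then one binary search on each half; the index conventions are an assumption:

        public class BitonicSearch {
            // Index of the maximum element of a bitonic array (strictly increasing, then strictly decreasing).
            static int peakIndex(int[] a) {
                int lo = 0, hi = a.length - 1;
                while (lo < hi) {
                    int mid = (lo + hi) >>> 1;
                    if (a[mid] < a[mid + 1]) lo = mid + 1;   // still rising: peak is to the right
                    else hi = mid;                           // falling: peak is at mid or to the left
                }
                return lo;
            }

            // Binary search on a[lo..hi]; 'ascending' tells which way the slice is ordered.
            static int binarySearch(int[] a, int key, int lo, int hi, boolean ascending) {
                while (lo <= hi) {
                    int mid = (lo + hi) >>> 1;
                    if (a[mid] == key) return mid;
                    boolean goRight = ascending ? a[mid] < key : a[mid] > key;
                    if (goRight) lo = mid + 1; else hi = mid - 1;
                }
                return -1;
            }

            static int find(int[] a, int key) {
                int peak = peakIndex(a);
                int idx = binarySearch(a, key, 0, peak, true);
                return idx != -1 ? idx : binarySearch(a, key, peak + 1, a.length - 1, false);
            }

            public static void main(String[] args) {
                int[] bitonic = {1, 3, 8, 12, 9, 5, 2};
                System.out.println(find(bitonic, 9));    // 4
                System.out.println(find(bitonic, 4));    // -1 (not present)
            }
        }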


    Read full article from Algo Ramblings: Find element in bitonic array


    Trie Tree Word Segmentation - 码农场



    Trie Tree Word Segmentation - 码农场


    Read full article from Trie Tree Word Segmentation - 码农场


    Damerau-Levenshtein distance - Wikipedia, the free encyclopedia



    Damerau–Levenshtein distance - Wikipedia, the free encyclopedia

    The Damerau–Levenshtein distance differs from the classical Levenshtein distance by including transpositions among its allowable operations. The classical Levenshtein distance only allows insertion, deletion, and substitution operations.[5] Modifying this distance by including transpositions of adjacent symbols produces a different distance measure, known as the Damerau–Levenshtein distance

    Read full article from Damerau–Levenshtein distance - Wikipedia, the free encyclopedia


    A Java Implementation for Extracting Subject-Verb-Object from Chinese Sentences - 码农场



    A Java Implementation for Extracting Subject-Verb-Object from Chinese Sentences - 码农场

    Using dependency relations, we can extract the main components of a sentence (the "extract the backbone" exercise familiar from primary school and civil-service exams), which enables a degree of semantic understanding. My impression is that in Chinese most sentences have a subject, predicate and object; few lack a subject or an object, and almost none lack all three. So I guess a subject-predicate-object phrase can serve as the backbone of a sentence: at retrieval time, a backbone match can be given a higher score, or the backbone can be used for intelligent recommendation.

    Read full article from A Java Implementation for Extracting Subject-Verb-Object from Chinese Sentences - 码农场


    (1) What Is The Best Nosql Database In Terms Of Performance? - Quora



    (1) What Is The Best Nosql Database In Terms Of Performance? - Quora

    What is the best NoSQL database in terms of performance? There are many new NoSQL databases like MongoDB, CouchDB, etc. Which one is the best when you look from the performance angle? There's a lot of NoSQL databases for a reason: they address different needs. From the performance angle, the fastest one will be the one whose data model perfectly fits your application data model. For example, if you need to model a social graph, a graph database like Dex or Neo4J is likely to smoke everything else. If you only need to quickly retrieve a value according to a key, go for something like Membase, KumoFS or Kyoto Tycoon. If you need structures like lists, sets, ordered sets, and hashes, Redis is your best bet. If you need to associate a slightly more complex structure (document) to a key, go for MongoDB or CouchDB. With a single query,

    Read full article from (1) What Is The Best Nosql Database In Terms Of Performance? - Quora


    NOSQL Databases



    NOSQL Databases

    Non-Relational Universe! Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open-source and horizontally scalable. The original intention has been modern web-scale databases. The movement began early 2009 and is growing rapidly. Often more characteristics apply such as: schema-free, easy replication support, simple API, eventually consistent / BASE (not ACID), a huge amount of data and more. So the misleading term "nosql" (the community now translates it mostly with "not only sql") should be seen as an alias to something like the definition above. [based on 7 sources, 15 constructive feedback emails (thanks!) and 1 disliking comment . Agree / Disagree? Tell me so! By the way: this is a strong definition and it is out there here since 2009!] List Of NoSQL Databases [currently 150] Core NoSQL Systems: [Mostly originated out of a Web 2.0 need] Wide Column Store / Column Families [OpenNeptune, Qbase, KDI] Key Value / Tuple Store : A fast,

    Read full article from NOSQL Databases


    Solr or Elasticsearch--That Is the Question



    Solr or Elasticsearch--That Is the Question

    Otis Gospodnetić That is the common question I hear: Which one is better, Solr or Elasticsearch? Which one is faster?  Which one scales better?  Which one can do X, and Y, and Z?  Which one is easier to manage?  Which one should we use?  Which one do you recommend? These are all great questions, though not always with clear and definite, universally applicable answers. So which one do we recommend you use? How do you choose in the end? Well, let me share how I see Apache Solr and Elasticsearch past, present, and future, and let's do a bit of comparing and contrasting, and hopefully help you make the right choice for your particular needs. Early Days: Youth Vs. Experience Apache Solr is a mature project with a large and active development and user community behind it, as well as the Apache brand.  First released to open-source in 2006, Solr has long dominated the search engine space and was the go-to engine for anyone needing search functionality.

    Read full article from Solr or Elasticsearch--That Is the Question


    Solr vs ElasticSearch: Part 5 - Management API Capabilities | Sematext Blog



    Solr vs ElasticSearch: Part 5 – Management API Capabilities | Sematext Blog

    [Note: for those of you who don’t have the time or inclination to go through all the technical details, here’s a high-level, up-to-date (2015) Solr vs. Elasticsearch overview ] In previous posts, all listed below, we’ve discussed general architecture, full text search capabilities and facet aggregation possibilities. However, till now we have not discussed any of the administration and management options and things you can do on a live cluster without any restart. So let’s get into it and see what Apache Solr and ElasticSearch have to offer. ElasticSearch: As you probably know, ElasticSearch offers a single way to talk to it – its HTTP REST API – JSON structured queries and responses. In most cases, especially during query time, it is very handy, because it lets you perfectly control the structure of your queries and thus control the logic. Apache Solr: On the other hand we have Apache Solr.

    Read full article from Solr vs ElasticSearch: Part 5 – Management API Capabilities | Sematext Blog


    String.intern in Java 6, 7 and 8 - string pooling  - Java Performance Tuning Guide



    String.intern in Java 6, 7 and 8 - string pooling  - Java Performance Tuning Guide

    This article describes how the String.intern method was implemented in Java 6 and what changes were made in it in Java 7 and Java 8. First of all I want to thank Yannis Bres for inspiring me to write this article. 07 June 2014 update: added 60013 as a default string pool size since Java 7u40 (instead of Java 8), added -XX:+PrintFlagsFinal. This is an updated version of this article including -XX:StringTableSize=N. String pooling (aka string canonicalisation) is a process of replacing several String objects with equal value but different identity with a single shared String object. You can achieve this goal by keeping your own Map (with possibly soft or weak references depending on your requirements) and using map values as canonicalised values. Or you can use String.intern(). At times of Java 6 using String.intern() was forbidden by many standards due to a high possibility to get an OutOfMemoryException String.

    Read full article from String.intern in Java 6, 7 and 8 - string pooling  - Java Performance Tuning Guide


    Cheney's algorithm - Wikipedia, the free encyclopedia



    Cheney's algorithm - Wikipedia, the free encyclopedia

    Cheney's algorithm, first described in a 1970 ACM paper by C.J. Cheney, is a method of garbage collection in computer software systems. In this scheme, the heap is divided into two equal halves, only one of which is in use at any one time. Garbage collection is performed by copying live objects from one semispace (the from-space) to the other (the to-space), which then becomes the new heap. The entire old heap is then discarded in one piece. Cheney's algorithm reclaims items as follows: Object references on the stack are checked. One of the two following actions is taken for each object reference that points to an object in from-space: If the object has not yet been moved to the to-space,

    Read full article from Cheney's algorithm - Wikipedia, the free encyclopedia


    Applications of Breadth First Traversal - GeeksforGeeks



    Applications of Breadth First Traversal - GeeksforGeeks

    We have earlier discussed the Breadth First Traversal algorithm for graphs. We have also discussed applications of Depth First Traversal. In this article, applications of Breadth First Search are discussed. 1) Shortest Path and Minimum Spanning Tree for an unweighted graph: in an unweighted graph, the shortest path is the path with the least number of edges. With Breadth First, we always reach a vertex from a given source using the minimum number of edges. Also, in the case of unweighted graphs, any spanning tree is a Minimum Spanning Tree and we can use either Depth or Breadth First Traversal for finding a spanning tree. 2) Peer to Peer Networks: in Peer to Peer Networks like BitTorrent, Breadth First Search is used to find all neighbor nodes. 3) Crawlers in Search Engines: crawlers build an index using Breadth First. The idea is to start from the source page and follow all links from the source and keep doing the same. Depth First Traversal can also be used for crawlers, but the advantage with Breadth First Traversal is,
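
    A small sketch of point 1: BFS distances in an unweighted graph, with an adjacency-list representation assumed:

        import java.util.ArrayDeque;
        import java.util.Arrays;
        import java.util.Queue;

        public class BfsShortestPath {
            // Returns the minimum number of edges from 'source' to every vertex (-1 if unreachable).
            static int[] shortestDistances(int[][] adj, int source) {
                int[] dist = new int[adj.length];
                Arrays.fill(dist, -1);
                Queue<Integer> queue = new ArrayDeque<>();
                dist[source] = 0;
                queue.add(source);
                while (!queue.isEmpty()) {
                    int u = queue.poll();
                    for (int v : adj[u]) {
                        if (dist[v] == -1) {          // first time we reach v: shortest distance found
                            dist[v] = dist[u] + 1;
                            queue.add(v);
                        }
                    }
                }
                return dist;
            }

            public static void main(String[] args) {
                // Edges: 0-1, 0-2, 1-3, 2-3, 3-4
                int[][] adj = { {1, 2}, {0, 3}, {0, 3}, {1, 2, 4}, {3} };
                System.out.println(Arrays.toString(shortestDistances(adj, 0)));   // [0, 1, 1, 2, 3]
            }
        }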

    Read full article from Applications of Breadth First Traversal - GeeksforGeeks


    A Sudoku Solver in Java implementing Knuth’s Dancing Links Algorithm



    A Sudoku Solver in Java implementing Knuth's Dancing Links Algorithm

    A Sudoku Solver in Java implementing Knuth's Dancing Links Algorithm For the Harker Research Symposium Version: 1.2 Knuth's paper on Dancing Links can be found here or follow the credits links below. Dr. Donald Knuth's Dancing Links Algorithm solves an Exact Cover situation. The Exact Cover problem can be extended to a variety of applications that need to fill constraints. Sudoku is one such special case of the Exact Cover problem. The Journey In early 2006, I participated in the ACSL Competition. The prompt of the competition was to create a simple Sudoku solver. I had never solved a Sudoku puzzle so I researched on methods to approach the puzzle. I came across several strategy guides, Sudoku forums, and computer solvers. Hinted amongst the computer programs was the dancing links algorithm. However, given the time and simplicity required for the competition, I reverted to a simple brute force candidate elimination algorithm to solve the simple Sudoku given by the ACSL. However,

    Read full article from A Sudoku Solver in Java implementing Knuth's Dancing Links Algorithm


    mihneagiurgea/fuxia · GitHub



    mihneagiurgea/fuxia · GitHub

    Install and Run: difficulty_level can be [1,2,3,4,5]; without the simulate flag, the program will not simulate picloud on the localhost and will try to connect to the site. Running Tests: test files are in the test/ folder and can be run using nosetests: $ nosetests test/ -v There are several input boards we use for testing, each with a different difficulty level. All can be found in the subfolder: fixtures/ Project Description: The Distributed Sudoku Generator Project generates Sudoku games having different levels of difficulty. Traditionally the Sudoku games are divided into five categories, according to human-perceived difficulty. We have decided to keep the same five levels in our implementation of the generator: extremely easy, easy, medium, difficult, evil. "Generating Sudoku puzzles is easy. Generating evil Sudoku puzzles is... EVIL." Four factors affecting the difficulty level are taken into consideration in this metric, respectively as follows: the total amount of given cells,

    Read full article from mihneagiurgea/fuxia · GitHub


    Notre Dame researcher helps make Sudoku puzzles less puzzling // News // Notre Dame News // University of Notre Dame



    Notre Dame researcher helps make Sudoku puzzles less puzzling // News // Notre Dame News // University of Notre Dame

    Author: William G. Gilroy Published: October 11, 2012 A Sudoku puzzle. For anyone who has ever struggled while attempting to solve a Sudoku puzzle, University of Notre Dame complex networks researcher Zoltan Toroczkai and Notre Dame postdoctoral researcher Maria Ercsey-Ravasz are riding to the rescue. They can not only explain why some Sudoku puzzles are harder than others, they have also developed a mathematical algorithm that solves Sudoku puzzles very quickly, without any guessing or backtracking. Toroczkai and Ercsey-Ravasz, of Romania's Babeş-Bolyai University, began studying Sudoku as part of their research into the theory of optimization and computational complexity. They note that most Sudoku enthusiasts use what is known as a "brute force" system to solve problems, combined with a good deal of guessing. Brute force systems essentially deploy all possible combinations of numbers in a Sudoku puzzle until the correct answer is found. While the method is successful,

    Read full article from Notre Dame researcher helps make Sudoku puzzles less puzzling // News // Notre Dame News // University of Notre Dame


    davidbau.com Sudoku Generator



    davidbau.com Sudoku Generator

    Sudoku Generator Here is a free Sudoku generator that can generate puzzles of varying difficulty in PDF , Postscript , plaintext , and HTML . It is a nice example of the website fun you can have with 250 lines of Python over a Labor day weekend; it also makes a handy command-line Sudoku solver... What is Sudoku? Have you ever played Sudoku ? A few months ago my dad raved about a new game in the newspaper that he liked solving better than the daily crossword, and so I picked up one of the five hundred or so Sudoku books at the bookstore. I was quickly hooked. The rules of Sudoku are simple: finish filling in the squares of a 9x9 grid so that the digits 1-9 appear exactly once in each of the nine rows, columns, and 3x3 blocks. Puzzles are designed so that there is only one correct way to fill in the 81 squares, and they tend to be just hard enough to be satisfying: not too easy, not impossible. Ways to Play Sudoku is a good solo game; if you are having trouble getting into it,

    Read full article from davidbau.com Sudoku Generator


    2D matrix with 0s and 1s. Try ... | CareerCup



    2D matrix with 0s and 1s. Try ... | CareerCup

    For example: [[1,1,1,0] [1,1,0,0] [0,0,0,1]] returns 3, because there is one country for the 1s, one for the 0s, and one for the last 1. This is my algo:

        countries = 0
        for (element in matrix)
            if element is not visited
                countries++
                dfs from element and mark similar elements in path as visited
        print countries

    We can use DFS to find the connected components. We will not want to step on an alien element while doing a DFS. The number of distinct DFS forests will be the number of countries.

        bool visited[m][n];
        int xi[] = {0, 1, 1, 1, 0, -1, -1, -1};
        int xj[] = {1, 1, 0, -1, -1, -1, 0, 1};
        int number_of_countries(int **country_map, int m, int n) {
            int i, j;
            int count = 0;
            init();
            for(i=0; i
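
    A self-contained Java version of the same idea, using the 8-directional neighbour offsets from the snippet above; the iterative stack-based DFS is a choice made here to avoid deep recursion:

        import java.util.ArrayDeque;
        import java.util.Deque;

        public class CountryCounter {
            private static final int[] DI = {0, 1, 1, 1, 0, -1, -1, -1};
            private static final int[] DJ = {1, 1, 0, -1, -1, -1, 0, 1};

            // Counts connected regions of equal values ("countries") in the matrix.
            static int countCountries(int[][] map) {
                int m = map.length, n = map[0].length;
                boolean[][] visited = new boolean[m][n];
                int countries = 0;
                for (int i = 0; i < m; i++) {
                    for (int j = 0; j < n; j++) {
                        if (!visited[i][j]) {
                            countries++;
                            dfs(map, visited, i, j);
                        }
                    }
                }
                return countries;
            }

            // Iterative DFS marking every reachable cell holding the same value as (si, sj).
            private static void dfs(int[][] map, boolean[][] visited, int si, int sj) {
                Deque<int[]> stack = new ArrayDeque<>();
                stack.push(new int[]{si, sj});
                visited[si][sj] = true;
                while (!stack.isEmpty()) {
                    int[] cell = stack.pop();
                    for (int d = 0; d < 8; d++) {
                        int ni = cell[0] + DI[d], nj = cell[1] + DJ[d];
                        if (ni >= 0 && ni < map.length && nj >= 0 && nj < map[0].length
                                && !visited[ni][nj] && map[ni][nj] == map[si][sj]) {
                            visited[ni][nj] = true;
                            stack.push(new int[]{ni, nj});
                        }
                    }
                }
            }

            public static void main(String[] args) {
                int[][] map = { {1, 1, 1, 0},
                                {1, 1, 0, 0},
                                {0, 0, 0, 1} };
                System.out.println(countCountries(map));   // 3
            }
        }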

    Read full article from 2D matrix with 0s and 1s. Try ... | CareerCup


    GameInternals - Understanding Pac-Man Ghost Behavior



    GameInternals - Understanding Pac-Man Ghost Behavior

    All theory, no practice GameInternals aims to spread knowledge of interesting game mechanics beyond the game-specific enthusiast communities. Each post focuses on a specific game mechanic that would normally only be known to high-level players of a particular game, and attempts to explain it in a manner that would be understandable even by readers unfamiliar with that game. GameInternals articles are researched and written by Chad Birch , a gamer and programmer from Calgary, Alberta, Canada. If you have any suggestions for games or mechanics that would make good article topics, please feel free to email me . Posted on December 2, 2010 It only seems right for me to begin this blog with the topic that inspired me to start it in the first place. Not too long ago, I came across Jamey Pittman's "Pac-Man Dossier" , which is a ridiculously-detailed explanation of the mechanics of Pac-Man. I found it absolutely fascinating,

    Read full article from GameInternals - Understanding Pac-Man Ghost Behavior


    Pacxon - A Brief Introduction to Pacman Clones



    Pacxon - A Brief Introduction to Pacman Clones


    Read full article from Pacxon - A Brief Introduction to Pacman Clones


    Pacxon - A Brief Introduction to The Rules of Pacm



    Pacxon - A Brief Introduction to The Rules of Pacm

    Pacman Rules There can't be many people in the western world who haven't heard of the iconic Pacman game. Originally created in 1980 the concept was straightforward. Let's break it down as follows: Pacman Monsters That can't possibly be it - how and why would Pacman, the Boss of the Operation, spend his time chasing little yellow dots round the playing area. Of course, let's not forget the third element - Monsters! Well of course, every self-respecting game must have Monsters, so in they go! Simple Concept The most successful games are easily understandable and draw people in because they feel they can play without looking stupid. The principle of Pacman is similar in that it is a simple game which players want to try, and feel because it is so straightforward they have a realistic chance of winning. In the early days video games were a new and largely untested idea with no-one anticipating the huge demand.

    Read full article from Pacxon - A Brief Introduction to The Rules of Pacm


    The Programmers Idea Book - 200 Software Project Ideas and Tips to Developing Them - Ebook : The Coders Lexicon



    The Programmers Idea Book – 200 Software Project Ideas and Tips to Developing Them – Ebook : The Coders Lexicon

  • 200 programming project ideas for all skill levels
  • 10 different project categories
  • Over 100 pages of project ideas
  • Expert tips for tackling each programming project
  • Projects for any programming language (platform independent)
  • Programs that you can get started on in minutes!

  • Read full article from The Programmers Idea Book – 200 Software Project Ideas and Tips to Developing Them – Ebook : The Coders Lexicon


    Word Search | N00tc0d3r



    Word Search | N00tc0d3r

    tinyurl is a URL service where users enter a long URL and the service returns a shorter, unique URL such as "http://tiny.me/5ie0V2". The highlighted part can be any string with 6 letters containing [0-9, a-z, A-Z]. That is, 62^6 ~= 56.8 billion unique strings.


    Read full article from Word Search | N00tc0d3r


    Find a Word in a Matrix | Linear Space-Time



    Find a Word in a Matrix | Linear Space-Time


    Read full article from Find a Word in a Matrix | Linear Space-Time


    [Google] Boggle Solver (Search Words From Matrix) - Woodstock Blog



    [Google] Boggle Solver (Search Words From Matrix) - Woodstock Blog

    The best solution is to use a Trie, then do a DFS search. However, it might not be as intuitive as it seems.

    The idea is from this answer (however, that author admits his solution does not handle 'visited' nodes properly, meaning the same cell might be visited again to produce a word).
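    As a rough illustration of the Trie-plus-DFS approach with proper 'visited' handling (a sketch of my own, not the linked author's code; it assumes a lowercase a-z board and, for brevity, 4-neighbour moves, whereas Boggle proper also allows diagonals):

        import java.util.*;

        public class BoggleSolver {
            // Minimal trie node: children indexed by letter; the full word is stored at terminal nodes.
            static class TrieNode {
                TrieNode[] next = new TrieNode[26];
                String word;  // non-null only at the end of a dictionary word
            }

            static TrieNode buildTrie(String[] dict) {
                TrieNode root = new TrieNode();
                for (String w : dict) {
                    TrieNode node = root;
                    for (char c : w.toCharArray()) {
                        int i = c - 'a';
                        if (node.next[i] == null) node.next[i] = new TrieNode();
                        node = node.next[i];
                    }
                    node.word = w;
                }
                return root;
            }

            public static List<String> findWords(char[][] board, String[] dict) {
                Set<String> found = new HashSet<>();
                TrieNode root = buildTrie(dict);
                for (int r = 0; r < board.length; r++)
                    for (int c = 0; c < board[0].length; c++)
                        dfs(board, r, c, root, found);
                return new ArrayList<>(found);
            }

            // Walk the board and the trie in lockstep; a cell is marked visited by
            // temporarily overwriting it so it cannot be reused within one path.
            private static void dfs(char[][] b, int r, int c, TrieNode node, Set<String> found) {
                if (r < 0 || r >= b.length || c < 0 || c >= b[0].length) return;
                char ch = b[r][c];
                if (ch == '#' || node.next[ch - 'a'] == null) return;  // visited, or no matching trie branch
                node = node.next[ch - 'a'];
                if (node.word != null) found.add(node.word);
                b[r][c] = '#';                 // mark visited for this path
                dfs(b, r - 1, c, node, found);
                dfs(b, r + 1, c, node, found);
                dfs(b, r, c - 1, node, found);
                dfs(b, r, c + 1, node, found);
                b[r][c] = ch;                  // un-mark on backtrack
            }
        }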


    Read full article from [Google] Boggle Solver (Search Words From Matrix) - Woodstock Blog


    algorithm - word search in java 2d array - Stack Overflow



    algorithm - word search in java 2d array - Stack Overflow

    I am trying to create a simple word search for a class assignment, and I have managed to figure out how to search east (left to right) and west (right to left), but I am having trouble figuring out how to search south (top to bottom).

    The code I have works for one file that I read in, but the second file throws an ArrayIndexOutOfBoundsException. Is there anything specific in my code that would make it un-scalable?

    My corrected code looks like this:
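    The corrected code itself does not appear in this excerpt; purely as a generic illustration (not the poster's code) of how a bounds check keeps a southward search inside the grid, a helper might look like this, where searching south is simply dRow = 1, dCol = 0:

        // Hypothetical helper: checks whether `word` occurs starting at (row, col) and
        // stepping by (dRow, dCol) one cell per letter, with explicit bounds checks so a
        // south search (dRow = 1, dCol = 0) cannot run off the end of the grid.
        static boolean matchesAt(char[][] grid, String word, int row, int col, int dRow, int dCol) {
            for (int k = 0; k < word.length(); k++) {
                int r = row + k * dRow;
                int c = col + k * dCol;
                if (r < 0 || r >= grid.length || c < 0 || c >= grid[r].length) return false;
                if (grid[r][c] != word.charAt(k)) return false;
            }
            return true;
        }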


    Read full article from algorithm - word search in java 2d array - Stack Overflow


    Given a 2D matrix of character... | CareerCup



    Given a 2D matrix of character... | CareerCup

    1) Create a hash table keyed by character, where each value holds the coordinates at which that character appears in the matrix (a minimal sketch of this step appears after the list).
    2) If the search string is "rat", look up its first character in the hash table and get that character's coordinates.
    3) Then look up the second character, get its coordinates, and determine the relationship "r" between the coordinates of the first and second characters: whether the second lies straight to the left, straight to the right, or on a diagonal.
    4) For the third character, check whether the relationship between its coordinate and the previous character's coordinate matches "r"; if it does, continue, otherwise return false.
    5) Continue this process until the end of the string.

    Time complexity:
    1) Building the hash table: O(m*n)
    2) Searching for one string occurrence: O(l), where l is the length of the string
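    Step 1, sketched in Java (the names here are illustrative, not from the original answer):

        import java.util.*;

        public class CharIndex {
            // Map each character to the list of (row, col) coordinates where it occurs; built in O(m*n).
            static Map<Character, List<int[]>> build(char[][] matrix) {
                Map<Character, List<int[]>> index = new HashMap<>();
                for (int r = 0; r < matrix.length; r++)
                    for (int c = 0; c < matrix[r].length; c++)
                        index.computeIfAbsent(matrix[r][c], k -> new ArrayList<>())
                             .add(new int[]{r, c});
                return index;
            }
        }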

    Read full article from Given a 2D matrix of character... | CareerCup


    Improving Sort Performance in Apache Spark: It's a Double | Cloudera Engineering Blog



    Improving Sort Performance in Apache Spark: It’s a Double | Cloudera Engineering Blog

    by Sandy Ryza and Saisai (Jerry) Shao, January 14, 2015. Cloudera and Intel engineers are collaborating to make Spark's shuffle process more scalable and reliable. Here are the details of the approach's design. What separates computation engines like MapReduce and Apache Spark (the next-generation data processing engine for Apache Hadoop) from embarrassingly parallel systems is their support for "all-to-all" operations. As distributed engines, MapReduce and Spark operate on sub-slices of a dataset partitioned across the cluster. Many operations process single data-points at a time and can be carried out fully within each partition. All-to-all operations must consider the dataset as a whole; the contents of each output record can depend on records that come from many different partitions. In Spark, operations such as groupByKey, sortByKey, and reduceByKey fall into this category. In these distributed computation engines, the "shuffle" refers to the repartitioning and aggregation of data during an all-to-all operation. Understandably, most of the performance, scalability, and reliability issues that show up in production Spark deployments arise in the shuffle.
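    For a concrete sense of what an all-to-all operation looks like in user code, here is a minimal word count using Spark's Java API (an illustration of mine, not taken from the article); reduceByKey is the step that forces a shuffle:

        import org.apache.spark.SparkConf;
        import org.apache.spark.api.java.JavaPairRDD;
        import org.apache.spark.api.java.JavaRDD;
        import org.apache.spark.api.java.JavaSparkContext;
        import scala.Tuple2;
        import java.util.Arrays;

        public class ShuffleExample {
            public static void main(String[] args) {
                SparkConf conf = new SparkConf().setAppName("shuffle-example").setMaster("local[2]");
                JavaSparkContext sc = new JavaSparkContext(conf);
                JavaRDD<String> words = sc.parallelize(Arrays.asList("a", "b", "a", "c", "b", "a"));
                JavaPairRDD<String, Integer> counts = words
                        .mapToPair(w -> new Tuple2<String, Integer>(w, 1))  // map side: purely local, per-partition work
                        .reduceByKey(Integer::sum);                          // shuffle: records with the same key are brought together
                counts.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));
                sc.stop();
            }
        }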

    Read full article from Improving Sort Performance in Apache Spark: It’s a Double | Cloudera Engineering Blog


    双倍提升Apache Spark排序性能-CSDN.NET



    双倍提升Apache Spark排序性能-CSDN.NET

    Published 9 hours ago | Source: Cloudera | Authors: Sandy Ryza and Saisai (Jerry) Shao. What distinguishes computation engines like MapReduce and Apache Spark (the next-generation data processing engine for Apache Hadoop) from common embarrassingly parallel systems is their support for "all-to-all" operations. Like many distributed engines, MapReduce and Spark operate on sub-slices of a partitioned dataset; many operations process only a single slice of the data at a time, and the data they touch usually lives entirely within that slice. All-to-all operations must treat the dataset as a whole, and each output record may be summarized from records spread across different partitions. Spark's shuffle operations such as groupByKey, sortByKey, and reduceByKey are common examples. In these distributed computation engines, the "shuffle" is the repartitioning and aggregation of data during an all-to-all operation. Unsurprisingly, most of the performance, scalability, and stability problems we observe in production Spark deployments arise during the shuffle. How Spark currently works: both MapReduce and Spark use a "pull" model for the shuffle. In each map task, data is written to local disk, and the reduce tasks then read that data via remote requests. Because the shuffle is all-to-all, the records output by any map task may be consumed by any reduce. A job's map-side shuffle follows one principle: all results destined for the same reduce are written into adjacent groups, so that fetching the data later is simpler. Spark's default shuffle implementation (hash-based shuffle) opens a separate file for each reduce task during the map phase. This wins on simplicity, but in practice it has problems: Spark must sustain a large memory footprint, or it incurs a large amount of random disk I/O. Moreover, if M and R denote the numbers of map and reduce tasks in a shuffle, hash-based shuffle produces a total of M*R temporary files. Shuffle consolidation reduces that number to C*R (where C is the number of map tasks that can run concurrently), but even with this change, when the number of running reducers is too large

    Read full article from 双倍提升Apache Spark排序性能-CSDN.NET


    Embarrassingly parallel - Wikipedia, the free encyclopedia



    Embarrassingly parallel - Wikipedia, the free encyclopedia

    In parallel computing, an embarrassingly parallel workload, or embarrassingly parallel problem, is one for which little or no effort is required to separate the problem into a number of parallel tasks. This is often the case where there exists no dependency (or communication) between those parallel tasks.[1]

    Embarrassingly parallel problems (also called "perfectly parallel" or "pleasingly parallel") tend to require little or no communication of results between tasks, and are thus different from distributed computing problems that require communication between tasks, especially communication of intermediate results. They are easy to perform on server farms which do not have any of the special infrastructure used in a true supercomputer cluster. They are thus well suited to large, Internet-based distributed platforms such as BOINC, and do not suffer from parallel slowdown. The diametric opposite of embarrassingly parallel problems are inherently serial problems, which cannot be parallelized at all.


    Read full article from Embarrassingly parallel - Wikipedia, the free encyclopedia


    Inspired by Actual Events: JDK 7: The New Objects Class



    Inspired by Actual Events: JDK 7: The New Objects Class

    It was announced approximately 18 months ago that JDK 7 would include a new java.util.Objects class that would "hold commonly-written utility methods." As part of this announcement, Joe Darcy asked the community, "What other utility methods would have broad enough use and applicability to go into a common java.util class?" There were forums and posts on the matter and I blogged about this forthcoming class. The JDK 7 preview release includes this class and it can be tried out now. In this post, I look at use of most of the methods provided by this class and look at how NetBeans 6.9 already uses this class in some of its auto-generated methods. The java.util.Objects class is new to JDK 7 and its Javadoc states that the class is "since 1.7" and describes the class as:
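    As a quick illustration (my own, not taken from the post) of the kind of null-safe utility calls the Objects class provides, a typical value class might use it like this; the Person class is made up:

        import java.util.Objects;

        public class Person {
            private final String name;
            private final Integer age;

            public Person(String name, Integer age) {
                this.name = Objects.requireNonNull(name, "name must not be null");
                this.age = age;
            }

            @Override
            public boolean equals(Object o) {
                if (this == o) return true;
                if (!(o instanceof Person)) return false;
                Person other = (Person) o;
                // Objects.equals is null-safe, so no explicit null checks are needed.
                return Objects.equals(name, other.name) && Objects.equals(age, other.age);
            }

            @Override
            public int hashCode() {
                return Objects.hash(name, age);
            }

            @Override
            public String toString() {
                return "Person{name=" + name + ", age=" + Objects.toString(age, "unknown") + "}";
            }
        }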

    Read full article from Inspired by Actual Events: JDK 7: The New Objects Class


    Zero-copy - Wikipedia, the free encyclopedia



    Zero-copy - Wikipedia, the free encyclopedia

    Zero-copy versions of operating system elements, such as device drivers, file systems, and network protocol stacks, greatly increase the performance of certain application programs and more efficiently utilize system resources. Performance is enhanced by allowing the CPU to move on to other tasks while data copies proceed in parallel in another part of the machine. Also, zero-copy operations reduce the number of time-consuming mode switches between user space and kernel space. System resources are utilized more efficiently since using a sophisticated CPU to perform extensive copy operations, which is a relatively simple task, is wasteful if other simpler system components can do the copying.

    As an example, reading a file and then sending it over a network the traditional way requires four data copies and four CPU context switches, if the file is small enough to fit in the file cache. Two of those data copies use the CPU. Sending the same file via zero copy reduces the context switches to two, and eliminates either half, or all CPU data copies.[1]

    Zero-copy protocols are especially important for high-speed networks in which the capacity of a network link approaches or exceeds the CPU's processing capacity. In such a case the CPU spends nearly all of its time copying transferred data, and thus becomes a bottleneck which limits the communication rate to below the link's capacity. A rule of thumb used in the industry is that roughly one CPU clock cycle is needed to process one bit of incoming data.


    Read full article from Zero-copy - Wikipedia, the free encyclopedia


    Efficient data transfer through zero copy



    Efficient data transfer through zero copy
    Many Web applications serve a significant amount of static content, which amounts to reading data off of a disk and writing the exact same data back to the response socket. This activity might appear to require relatively little CPU activity, but it's somewhat inefficient: the kernel reads the data off of disk and pushes it across the kernel-user boundary to the application, and then the application pushes it back across the kernel-user boundary to be written out to the socket. In effect, the application serves as an inefficient intermediary that gets the data from the disk file to the socket.
    Each time data traverses the user-kernel boundary, it must be copied, which consumes CPU cycles and memory bandwidth.

    Applications that use zero copy request that the kernel copy the data directly from the disk file to the socket, without going through the application. Zero copy greatly improves application performance and reduces the number of context switches between kernel and user mode.
    The Java class libraries support zero copy on Linux and UNIX systems through the transferTo() method in java.nio.channels.FileChannel. You can use the transferTo() method to transfer bytes directly from the channel on which it is invoked to another writable byte channel, without requiring data to flow through the application.

    If you re-examine the traditional scenario, you'll notice that the second and third data copies are not actually required. The application does nothing other than cache the data and transfer it back to the socket buffer. Instead, the data could be transferred directly from the read buffer to the socket buffer.

    The transferTo() method transfers data from the file channel to the given writable byte channel. Internally, it depends on the underlying operating system's support for zero copy; in UNIX and various flavors of Linux, this call is routed to the sendfile() system call, which transfers data from one file descriptor to another:

    Listing 1. Copying bytes from a file to a socket
    File.read(fileDesc, buf, len);
    Socket.send(socket, buf, len);
    Although Listing 1 is conceptually simple, internally, the copy operation requires four context switches between user mode and kernel mode, and the data is copied four times before the operation is complete.
    Figure 1. Traditional data copying approach
    Figure 2 shows the context switching:
    Figure 2. Traditional context switches


    Figure 3. Data copy with transferTo()
    Figure 4 shows the context switches when the transferTo() method is used:
    Figure 4. Context switching with transferTo()
    The transferTo() method causes the file contents to be copied into a read buffer by the DMA engine. Then the data is copied by the kernel into the kernel buffer associated with the output socket.
    The third copy happens as the DMA engine passes the data from the kernel socket buffers to the protocol engine.
    This is an improvement: we've reduced the number of context switches from four to two and reduced the number of data copies from four to three (only one of which involves the CPU). But this does not yet get us to our goal of zero copy. We can further reduce the data duplication done by the kernel if the underlying network interface card supports gather operations. In Linux kernels 2.4 and later, the socket buffer descriptor was modified to accommodate this requirement. This approach not only reduces multiple context switches but also eliminates the duplicated data copies that require CPU involvement. The user-side usage still remains the same, but the intrinsics have changed:
    The transferTo() method causes the file contents to be copied into a kernel buffer by the DMA engine.
    No data is copied into the socket buffer. Instead, only descriptors with information about the location and length of the data are appended to the socket buffer. The DMA engine passes data directly from the kernel buffer to the protocol engine, thus eliminating the remaining final CPU copy.
    Figure 5 shows the data copies using transferTo() with the gather operation:
    Figure 5. Data copies when transferTo() and gather operations are used
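    As a minimal sketch of the user-side API described above (my own example, not from the article; the host, port, and file name are made up), sending a file over a socket with FileChannel.transferTo() looks roughly like this:

        import java.io.IOException;
        import java.net.InetSocketAddress;
        import java.nio.channels.FileChannel;
        import java.nio.channels.SocketChannel;
        import java.nio.file.Paths;
        import java.nio.file.StandardOpenOption;

        public class ZeroCopySend {
            public static void main(String[] args) throws IOException {
                try (FileChannel file = FileChannel.open(Paths.get("data.bin"), StandardOpenOption.READ);
                     SocketChannel socket = SocketChannel.open(new InetSocketAddress("example.com", 9000))) {
                    long position = 0;
                    long remaining = file.size();
                    while (remaining > 0) {
                        // transferTo() may move fewer bytes than requested, so loop until everything is sent;
                        // on Linux/UNIX this call is routed to sendfile(), avoiding copies through user space.
                        long sent = file.transferTo(position, remaining, socket);
                        position += sent;
                        remaining -= sent;
                    }
                }
            }
        }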

    Read full article from Efficient data transfer through zero copy

    8 Chrome Extensions To Supercharge Your Omnibox Searches



    8 Chrome Extensions To Supercharge Your Omnibox Searches

    Without extensions you can bookmark pages, perform basic calculations and conversions, add various search engines for better search results, and receive alerts (security, pop-ups, extensions). In this post, we are interested on how to further enhance the Omnibox to help speed up everyday tasks and searches with the help of extensions. Let's take a look at Chrome extensions that can power more customized and optimized searches for better results. 1. Better Omnibox Better Omnibox is an extension that improves the relevance of your search results, thanks to a combination of your bookmarks and search history. To use Better Omnibox, type "#", press tab, and then type your search query. When you select one of the results displayed below the Omnibox, that result will rank higher in future queries. Usage: # [TAB or space] [search query] 2. Omnibox Site Search Are you looking for an easy way to perform a search site for any website that you visit? With Omnibox Site Search ,

    Read full article from 8 Chrome Extensions To Supercharge Your Omnibox Searches


    Exporting Result Sets - Apache Solr Reference Guide - Apache Software Foundation



    Exporting Result Sets - Apache Solr Reference Guide - Apache Software Foundation

    The cases where this functionality may be useful include: session analysis, distributed merge joins, time series roll-ups, aggregations on high-cardinality fields, fully distributed field collapsing, and sort-based stats. Field requirements: all the fields being sorted and exported must have docValues set to true. For more information, see the section on DocValues. Defining the /export request handler: to export the full sorted result set you'll want to use a request handler explicitly configured to only run the "query" component, using the export-specific "rq" and "wt" params. An "/export" request handler with the appropriate configuration is included in the techproducts example solrconfig.xml. If, however, you would like to add it to an existing solrconfig.xml, you can add a section like this: Note that this request handler's properties are defined as "invariants",

    Read full article from Exporting Result Sets - Apache Solr Reference Guide - Apache Software Foundation


    Dev Time: Prefix and Suffix Matches in Solr



    Dev Time: Prefix and Suffix Matches in Solr

    Prefix and Suffix Matches in Solr: search engines are all about looking up strings. The user enters a query term that is then retrieved from the inverted index. Sometimes a user is looking for a value that is only a substring of values in the index, and the user might be interested in those matches as well. This is especially important for languages like German, which contain compound words like Semmelknödel, where Knödel means dumpling and Semmel specializes the kind. Wildcards: for demoing the approaches I am using a very simple schema. Documents consist of a text field and an id. The configuration as well as a unit test is also available on Github.

    Read full article from Dev Time: Prefix and Suffix Matches in Solr

    UCLA Knowledge Base : Why are Lucene's stored fields so slow to access



    UCLA Knowledge Base : Why are Lucene's stored fields so slow to access

    Problem: I have a Lucene index that has some large fields (about 50 KB each) and some small fields (about 50 bytes each). I need to access (iterate) one of the small fields for, say, 1/10 of the documents. For some reason, this operation is very slow, unreasonably so for such a small field. Cause: Lucene provides a number of "policies" for how to access the fields of a document (see the class org.apache.lucene.document.FieldSelector). They specify when and how fields are loaded from the index. It turns out that the default is to load all fields of a document as soon as the Document is requested by, say, an IndexReader (see the class org.apache.lucene.index.FieldsReader, in particular how it implements the doc(n, FieldSelector) function). Therefore, when you load a small field, the large fields are also loaded, causing a performance problem if you repeat the operation many times. Solution: to use a more selective policy, create a FieldSelector object.
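    A minimal sketch of such a selective policy against the pre-Lucene-4 API (my own example, not from the article; the index path and field name are illustrative):

        import java.io.File;

        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.FieldSelector;
        import org.apache.lucene.document.FieldSelectorResult;
        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.FSDirectory;

        public class SmallFieldScan {
            public static void main(String[] args) throws Exception {
                // Load only the small "id" field and skip everything else, so the
                // large stored fields are never materialized.
                FieldSelector onlyId = new FieldSelector() {
                    public FieldSelectorResult accept(String fieldName) {
                        return "id".equals(fieldName)
                                ? FieldSelectorResult.LOAD_AND_BREAK  // load it, then stop scanning further fields
                                : FieldSelectorResult.NO_LOAD;        // skip this field entirely
                    }
                };
                Directory dir = FSDirectory.open(new File("/path/to/index"));
                IndexReader reader = IndexReader.open(dir);
                try {
                    for (int docId = 0; docId < reader.maxDoc(); docId += 10) {  // roughly 1/10 of the documents
                        Document doc = reader.document(docId, onlyId);
                        System.out.println(doc.get("id"));
                    }
                } finally {
                    reader.close();
                }
            }
        }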

    Read full article from UCLA Knowledge Base : Why are Lucene's stored fields so slow to access


    Dev Time: See Your Solr Cache Sizes: Eclipse Memory Analyzer



    Dev Time: See Your Solr Cache Sizes: Eclipse Memory Analyzer

    See Your Solr Cache Sizes: Eclipse Memory Analyzer. Solr uses different caches to avoid too much IO access and recomputation during requests. When indexing doesn't happen too frequently you can get huge performance gains by employing those caches. Depending on the structure of your index data and the size of the caches, they can become rather large and use a substantial part of your heap memory. In this post I would like to show how you can use the Eclipse Memory Analyzer to see how much space your caches are really using in memory. Configuring the caches: all the Solr caches can be configured in solrconfig.xml, in the query section. You will find definitions like this one: a filter cache configured to use the FastLRUCache, with a maximum size of 8000 items and no autowarming. Solr ships with two commonly used cache implementations, the FastLRUCache and the LRUCache,

    Read full article from Dev Time: See Your Solr Cache Sizes: Eclipse Memory Analyzer

