hacker news with inline top comments    26 Feb 2016    Ask
1
Ask HN: Which tool do you use for API test automation?
2 points by Oras  24 minutes ago   discuss
2
Business co-founder making me uncomfortable
6 points by eruditely  2 hours ago   2 comments top 2
1
dsr_ 1 hour ago 0 replies      
If you're "an employee that was promoted and got 10%" but are not being paid a salary, it is entirely possible that you are not in a fair situation. My recommendation is that you concentrate on that: is this a partnership you want to be in?
2
ddorian43 1 hour ago 0 replies      
No salary and 10% ? Doesn't make sense to me.
3
Ask HN: What is the best way to learn JavaScript for a beginner?
8 points by joshcox  1 hour ago   4 comments top 4
1
sebastianxx 2 minutes ago 0 replies      
Practice as much as you can. Write lots of code. Try to solve problems. Don't be afraid to reach out to people and ask questions. The best way to learn JavaScript and programming is by doing it. JUST DO IT! As for books, I'd recommend "Head First JavaScript"; it's great for beginners. If you're comfortable with screencasts, try the Learn JavaScript in 14 days course on iLoveCoding and then move on to the lessons https://ilovecoding.org/lessons.

If you get stuck, ask questions on Stack Overflow or Reddit. Good luck

2
maxblackwood 20 minutes ago 0 replies      
Eloquent Javascript. It's one of the best introductions to programming I've ever read. http://eloquentjavascript.net/
3
santiagobasulto 1 hour ago 0 replies      
Don't go after "learn javascript". Try to "learn to program" first. The only way to learn anything is BY DOING. You have to sit and code. Are you going to play great basketball just watching the NBA? NO! You have to go outside and play. In the world of coding that translates to: sit and code.

Online resources, there are many. Too many sometimes. Grab anything from Codecademy, Code School, or Treehouse. But remember that's not the only thing you need to know.

If you check my profile, we do remote programming courses where people work together with a real teacher for 6 weeks. We offer scholarships (100% free).

Remember:
* Focus on learning programming. What's the scope of a variable? What's immutability? etc.
* Practice a lot. Code as much as you can.
* Look for a group to work together with.

4
daniel27 46 minutes ago 0 replies      
4
Ask HN: Should each of your products register as their own business?
17 points by itsthisjustin  12 hours ago   5 comments top 2
1
patio11 11 hours ago 2 replies      
Define "normal." I have three LLCs (four if you count one in Japan for purposes of being able to pay myself on payroll now that Starfighter exists); that's probably on the high end among most of my peers. Most small software companies have a single entity and only choose to spin out when a new product becomes a truly independent operational unit, when it receives investment, or (for branding purposes) if it ends up eating the business that spawned it.

Reasons to segregate:

1) The single biggest one is that it firewalls the liability of the businesses from each other. Whether this is important or not for you depends on what the businesses are doing: if it's Regular Internet Stuff then your E&O policy is probably good enough in terms of risk mitigation, but if 1+ of your products are in highly regulated spaces (hello HIPAA, finance, etc) then putting them in their own LLC isn't a crazy solution.

2) If you're religious about doing not just the paper ownership but the business accounts separately for each business, that makes eventually selling or otherwise disposing of them much, much easier. Otherwise you're looking at weeks of work and/or very fun professional services bills when you decide to do the division later.

3) If you have co-founders or investors, or the prospect of getting co-founders or investors, separate legal entities are going to be pretty much required. You don't want them to accidentally get ownership of your side projects; they don't want to own your side projects (ownership is a risk; they know the risks they're signing up for and don't want additional sources of uncontrolled unknown risk).

4) A minor factor, but there is non-zero social friction involved in "We've been talking about my trading name of $FOO but remember that the invoice/contract/etc will be from $BAR, LLC."

Reasons to not segregate:

1) It's a lot of extra work.

2) There's a running cost to keeping an LLC open, both the yearly fees and the operational complexity of maintaining separate books, accounts at various providers, and (if you're doing things in a complicated fashion) keeping up appearances with regards to the LLCs being formally separate from each other.

As an ex-consultant with some accidental knowledge of the payments space: I would be doing double-plus firewalling between any payments startup and anything I'd lose sleep about losing, and I would be happily writing a sizable check right about now to a lawyer rather than taking HN's advice about my compliance obligations and potential sources of risk.

2
mesozoic 12 hours ago 1 reply      
I wouldn't worry about it until you have assets in one entity to protect by having separate LLCs
5
Ask HN: How does your team write documentation? What tools do you use?
64 points by brwr  2 days ago   90 comments top 44
1
skewart 2 days ago 1 reply      
I really like these "how do other people do X?" questions on HN. Thanks for asking it!

I work at a small startup with a roughly 10-person eng team.

When we write docs we focus mainly on architecture and processes. The architecture docs often emerge from a "tech spec" that was written for the development of a feature that required a new service, or substantial changes to an existing one. We keep everything in GitHub, in a README or other markdown files.

We also write API docs for HTTP endpoints. These are written with client developers and their concerns in mind. When doing this for a Rails app we use rspec_api_documentation, which is nice, but it can be annoying to have testing and documentation so tightly coupled. We've talked about changing how we do this, but we always have more pressing things to do.

We never write docs for classes or modules within an app/service.

2
MalcolmDiggs 2 hours ago 0 replies      
We tend to thoroughly document our API (which is the backend behind our mobile apps and website) using ApiDocJs.com or Swagger.io/swagger-ui. Every service and method is thoroughly explained in detail so the front-end folks have a reference to work off of.

The rest of the systems are documented ad-hoc. Some readme files here and there, a large block of comments inside of confusing files, the occasional style guide, etc.

We also have an onboarding guide for new devs (just a PDF) which walks them through our systems, our tools, etc. Nothing fancy, about 10 pages.

3
azdle 2 days ago 3 replies      
All of our docs are written in Markdown in a git repo [1]. That then gets built with a custom static site generator that I wrote [2]. Finally the output gets pushed back to gh for hosting on gh-pages [3].

I'm actually pretty proud of the search that I put together for this setup too: it's all done in the browser, and the indexes are built at compile time and then downloaded in full for a search, which sounds silly, but it works surprisingly well [4]. (A rough sketch of the idea follows the links below.)

[1] https://github.com/exosite/docs/

[2] https://github.com/exosite/docs/blob/master/gulpfile.js

[3] http://docs.exosite.com

[4] http://docs.exosite.com/search.html?q=subscribe
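A minimal sketch of that compile-time index idea, assuming markdown sources under a docs/ directory (this is not azdle's actual gulp setup, which is linked above): walk the sources, build a word-to-page mapping, and dump it as JSON for the browser to download once and search locally.

    import json, os, re
    from collections import defaultdict

    DOCS_DIR = "docs"                # assumption: markdown sources live here
    INDEX_FILE = "search-index.json"

    index = defaultdict(set)         # word -> set of pages containing it

    for root, _, files in os.walk(DOCS_DIR):
        for name in files:
            if not name.endswith(".md"):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8") as f:
                text = f.read().lower()
            for word in set(re.findall(r"[a-z0-9_]{3,}", text)):
                index[word].add(path)

    # The browser fetches this file once and does all lookups client-side.
    with open(INDEX_FILE, "w", encoding="utf-8") as f:
        json.dump({w: sorted(pages) for w, pages in index.items()}, f)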

4
tvvocold 2 days ago 0 replies      
We use flatdoc and Swagger UI for building docs, like: https://open.coding.net

flatdoc is a small JavaScript file that fetches Markdown files and renders them as full pages: https://github.com/rstacruz/flatdoc

Swagger UI is a dependency-free collection of HTML, Javascript, and CSS assets that dynamically generate beautiful documentation from a Swagger-compliant API. http://swagger.io

5
douche 2 days ago 2 replies      
Fucking Word docs. Which are checked into source control, except people (who are nominally developers, or project managers who were supposed to have been developers once) insist on versioning in ye olde rename-and-add-a-number style. With PDFs that are manually generated by exporting said Word documents from Word, and then again checked in, and again checked in in multiple renamed versions. Except sometimes only the PDF is checked in, without the source document...

So we have a doc folder in the repo that is like staring into the maw of Cthulhu and takes up 90% of our build time on the CI server, sucking down that mass of garbage for the checkout.

Saner systems have been proposed, but rejected because the powers that be are too averse to change...

6
imrehg 2 days ago 1 reply      
Word docs converted into PDF for manuals. Others are hand-crafted Photoshop tables/text/graphics turned into PDFs. Sad, sad stuff, IMHO.

Trying to get people onto Sphinx [0], and I use it for some non-sanctioned documentation with good success, but it's unlikely to become official.

I really think version control is important: what changed, who changed it, provisional changes through branches, and removing the bottleneck of "I updated the docs, please everyone check before release and send me your comments". It should be patches, and only patches.

[0]: http://sphinx-doc.org/

7
ericclemmons 2 days ago 1 reply      
Trying something new on this month's project: "developer first experience".

Besides the README.md to get started, the app defaults to a private portal with a component playground (for React), internal docs (for answering "how do I"), and tools for completely removing the need for doc pages at all.

I believe that documentation has to be part of the workflow, so component documentation should be visible while working on the component, tools for workflow should have introductions and helpful hints rather than being just forms and buttons, etc.

So far, this is proving fruitful.

(Side note: wikis are where docs go to die.)

8
intrasight 2 days ago 2 replies      
The first software system that I worked on was the operator consoles for a nuclear power plant. A two-year-long dev project. We used FrameMaker (1990, before Adobe purchased them). It was an awesome tool for technical documentation. Our documentation when printed and bound was three feet wide on a shelf. I think I contributed two inches. It's been all downhill since - both in terms of the tools and the quality of the documentation. Nowadays it's the typical - auto-gen from code plus markdown for narrative.
9
chris_engel 2 days ago 0 replies      
Because I was not happy with the existing stuff, I built an open-source project for creating technical online documentation some time ago, named "docEngine". My goals were:

- Easy editing (namely markdown files in folders)
- Runs on "cheap" hosting/everywhere (built with PHP)
- Supports multiple languages (so you can create docs in english, german, etc.)
- Can have editable try-on-your-own demos embedded into the documentation
- SEO friendly (clean URLs and navigation structure)
- Themeable (themes are separated and run with the Twig templating engine)
- Works on mobiles out of the box
- Supports Plugins/Modules for custom content/behaviour
- Formats reference pages for objects/classes/APIs in a nice way
- Supports easy embedding of disqus for user feedback
- Other stuff I forgot right now

The system powers the knowledge base of my recent app "InSite" for web developers: https://www.insite-feedback.com/en/help

You can see it also in action working - with a different theme - for my javascript UI library "modoJS": http://docs.modojs.com

That page is a bit more complex. It does _not_ use multiple languages there but it makes great use of the reference pages and has many many editable live-demos. It also has some custom modules like a live build script for the javascript library. At one point it even had a complete user-module with payments but I disabled that when modoJS went opensource.

Another instance of docEngine runs for my pet html5 game engine: http://wearekiss.com/gamekit
This one uses the default theme, has most pages in two languages and again incorporates a couple of live demos.

I host a little documentation about the engine itself here, but it's not complete right now: http://wearekiss.com/docEngine
You can also find the github link to the project in the footer of every hosted documentation.

Have fun with it - I give it away for free. Critiques and comments welcome! Everything I have linked was built by myself.

10
angersock 2 days ago 1 reply      
Long ago I learned to love wikis. Mediawiki, Dokuwiki (easy to set up), or Confluence. Hardest part is to keep people from just throwing garbage everywhere--if that happens, people stop referring to the docs, and the system collapses.

The important thing about docs is to keep in mind the audience. This is important because it lets you estimate their mental model and omit things that are redundant: for example, if it's internal documentation for a codebase, there is little need to explicitly list out Doxygen or JSDoc style information, because they have access to the damned source code. External audiences may need certain terms clarified, or some things explained more carefully because they can't just read the source.

I'd say that the biggest thing missing in the documentation efforts I've seen fail is the lack of explanation for the overarching vision/cohesive architecture of the work. This is sometimes because there isn't a single vision, or because the person who has the vision gets distracted snarfing on details that are not helpful without a preexisting schema to hang them on when learning. So, always always always have a high-level document that describes the general engineering problem the project solves, the main business constraints on that solution, and a rough sketch of how the problem is solved.

Ideally, the loss of the codebase should be less of a setback than the loss of the doc.

I will say that, as your documentation improves, you will hate your project more and more--this is the nature of the beast as you drag yourself through the broken shards of your team's engineering.

11
vacri 2 days ago 3 replies      
We used to use a Mediawiki wiki, which only I would edit. You kind of have to be comfortable with mediawiki syntax (does the job for everything but tables, which suck). So we moved to Confluence, which has a WYSIWYG editor, to encourage more people to document things, upload documents, so on and so forth. Again, I am the only one editing it... so our documentation is "very occasionally write something down, and store it on your laptop or in your private google drive, then spend ages searching for it when someone asks".

So whenever a new staffer comes along, I get asked to give them wiki access... but I'm the only one here that uses my edits (only ops staffer). Sure, have some wiki access, for all the good it will do you!

I really don't recommend our model :)

Anyway, this is an important point: documentation is not free. It takes time. Even shitty documentation takes time. If you want good documentation, you need to budget time away from other tasks. When I used to work in support, the field repair engineers would budget 30% of their hours for doing paperwork - not documentation specifically, but it clearly shows that 'writing stuff' is not something that springs as a natural/free parallel to other activity.

12
kenOfYugen 2 days ago 0 replies      
I enjoy Literate CoffeeScript and that's where I picked up the concept of Literate coding.

I believe that literate style of code writing has many benefits in any language.

Basically mix markdown with the codebase and export the documentation from the same file.
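A minimal sketch of that idea in Python (a real literate tool such as Docco or pycco does much more): prose lives in comment lines, everything else stays code, and one pass emits a markdown document from the same file.

    import sys

    def literate_to_markdown(path):
        """Turn '# ' comment lines into prose and everything else into code blocks."""
        out, code = [], []

        def flush():
            if code:
                out.append("```python")
                out.extend(code)
                out.append("```")
                out.append("")
                code.clear()

        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if line.startswith("# "):
                    flush()
                    out.append(line[2:])    # prose line
                elif line.strip():
                    code.append(line)       # source line, kept verbatim
        flush()
        return "\n".join(out)

    if __name__ == "__main__":
        print(literate_to_markdown(sys.argv[1]))   # e.g. python lit.py example.py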

For a very well executed and interactive example check out

http://dave.kinkead.com.au/modelling-the-boundary-problem/

13
mixmastamyk 2 days ago 0 replies      
Sphinx or mkdocs:

http://www.sphinx-doc.org/en/latest/

http://www.mkdocs.org/

Which make it easy to create html, pdf, epub, latex formats, etc.

I like to create a user guide, developer guide, and ops guide for each large project.
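For reference, a Sphinx project needs little more than a conf.py like the sketch below (project name and theme values are placeholders); sphinx-quickstart generates one for you, and make html / make latexpdf then produce the different output formats.

    # conf.py -- minimal Sphinx configuration (values here are placeholders)
    project = "My Project"
    author = "Your Team"
    version = release = "1.0"

    master_doc = "index"               # top-level .rst document
    extensions = [
        "sphinx.ext.autodoc",          # pull docstrings out of Python code
        "sphinx.ext.viewcode",         # link to highlighted source
    ]
    exclude_patterns = ["_build"]
    html_theme = "alabaster"           # default theme shipped with Sphinx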

14
someguydave 2 days ago 1 reply      
Our APIs are documented with comments that Sphinx uses to generate HTML documents. Unfortunately, all of our other documentation is written in Microsoft products because "that's what people use"
15
BooneJS 10 hours ago 0 replies      
Adobe FrameMaker and Microsoft Visio, stored in Perforce.

Beautiful documents, but it takes a decent chunk of time to create. We do extract some docs via XML to generate code, somewhat backwards from how most engineers merge docs and code.

16
buremba 2 days ago 0 replies      
We use Swagger specification (automatically generated using annotations in Java) and generate Slate documentation from Swagger specification for API documentation. (https://api.rakam.io/). We also use markdown for generic (tutorials, technical etc.) documentation and render the markdown files fetched from Github in documentation page using JS. Since everything is dynamic, we don't need to worry about updating the documentation page, we just update README files of repositories, add documents to our documentation repository and the documentation page is always up-to-date. (https://rakam.io/doc/).
17
NearAP 2 days ago 1 reply      
We have technical writers who work in conjunction with developers to author the documentation. I don't know what tool they use. However, since you say you want to get better at writing docs, let me offer some perspectives as a user of documentation.

1) Write to all of your target audience. For example if your product is targeted at both technical and non-technical people, then write the documentation in such a way that non-technical folks can understand it. Don't just write for the technical people.

2) If possible, write documentation around several 'how do I do XYZ task'? My experience has been that people tend to turn to documentation when they want to execute a specific task and they tend to search for those phrases

3) As much as is possible, include examples. This tends to remove ambiguities.

18
kakwa_ 1 day ago 0 replies      
At work, I've seen a variety of solutions, depending on the teams I work with:

* MS doc(x) on a network folder with an excel spreadsheet to keep track of docs (and a lot of ugly macros).

* MS doc(x) in a badly organized subversion repository (side note here, docs comments and revision mode are heavily used in those contexts, which is really annoying)

* rst + sphinx documentation in a repository to generate various outputs (html, odt, pdf...) depending on the client.

In some cases we also use Mako (a Python template engine) before Sphinx to instantiate the documentation for a specific platform (ex: Windows, RedHat, Debian...), with just a few "if" conditions (sphinx could do it in theory, but it's quite buggy and limited).
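A tiny illustration of that Mako step (platform names and file names are made up): render one platform-specific .rst page per target, then feed the generated files to Sphinx as usual.

    from mako.template import Template   # pip install Mako

    PAGE = Template("""\
    Installation
    ============

    % if platform == "Debian":
    Run ``apt-get install mytool``.
    % elif platform == "RedHat":
    Run ``yum install mytool``.
    % else:
    Build from source.
    % endif
    """)

    # Instantiate the same source for each target platform, then run Sphinx on it.
    for platform in ("Debian", "RedHat", "Windows"):
        with open("install_%s.rst" % platform.lower(), "w") as f:
            f.write(PAGE.render(platform=platform))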

I've also put in place a continuous build system (just an ugly shell script) rebuilding the sphinx html version every commit (it's our "badly implemented readthedocs.org", but it's good enough for our needs).

In other cases we use specification tools like PowerAMC or Eclipse/EMF/CDO based solutions, the specification tool in that case works on a model, and can generate various outputs (docx, pdf, rtf, html...).

At home, for my personal projects, I use rst + sphinx + readthedocs, or if the documentation is simple, just a simple README.md at the root of my repository.

As a personal opinion, I like to keep the documentation close to the code, but not too close.

For example, I find it really annoying when the sole documentation is Doxygen (or equivalent); it's necessary to document each public method/attribute individually, but it's not sufficient: you need "bigger picture" documentation on how stuff works together (software and system architecture) in most cases.

On the other side, keeping the documentation away from the code (in a wiki or worse) doesn't work that well either; it's nearly a guarantee that the documentation will be out of date soon, if it isn't already.

I found having a doc directory in the source code repository a nice middle ground.

I found wikis annoying in most cases: rarely up to date, badly organized, and difficult to version coherently and properly (ex: having the version of the docs match the software version).

19
nahtnam 2 days ago 0 replies      
Elixir has a great documentation system built in. I use that.
20
drygh 2 days ago 0 replies      
At Ionic, we use Dgeni (https://github.com/angular/dgeni) for API docs. We have a few custom build tasks that allow us to version the API docs.

We also have higher level documentation, which is meant to serve as a sort of conceptual overview of the framework, as well as to show what the framework comes with out of the box. This section is written mostly in kramdown, which gets parsed by jekyll before it's turned into HTML.

21
Tharkun 2 days ago 0 replies      
Most of our documentation attention goes towards the user manual and the system operator manual.

We generate the bulk of those manuals based on our object model, which is liberally sprinkled with (text only) descriptions. We've created a simple XML-based authoring framework which allows us to create pretty tidy documentation. Including images, tables, code examples etc.

We convert that XML to Apache FOP. At the end of the process, we're left with a bunch of tidy PDF manuals in a variety of languages.

22
hooliganpete 1 day ago 0 replies      
I work at a very large company so you won't be surprised to hear we use a variety of tools and there's often overlap. Almost everything goes to Confluence (our program wiki) including tech specs and marketing documentation. The product team often uses something simple, such as Quip to store and collaborate on their docs. Marketing tends to migrate toward Drive. I think the best advice I can offer is to keep one "source of truth". This isn't too difficult when your team is small but as you start to grow, having one place devs, marketing, sales can go really helps streamline things.
23
gravypod 2 days ago 0 replies      
The thing that has always guided me right is that you need to a) split up functions, b) document method headers in every case with a short description of what it does, and finally c) come back one month later and rewrite any documentation that does not make sense.

This is the most important step. If you cannot remember it from a blank slate, then no one can. Keep doing that until you understand the code at first glance. Then your code will be easy for anyone to maintain.
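In Python, that per-method header can simply be a one-line docstring, which editors and doc generators pick up automatically (the functions below are only illustrative):

    def bucket_price(price_cents, bucket_size=100):
        """Round a price in cents down to the nearest bucket for reporting."""
        return (price_cents // bucket_size) * bucket_size


    def daily_report(rows):
        """Group (date, price_cents) rows into per-day summed, bucketed totals."""
        totals = {}
        for date, price_cents in rows:
            totals[date] = totals.get(date, 0) + bucket_price(price_cents)
        return totals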

24
scottlocklin 2 days ago 2 replies      
LaTeX. We have academic roots, it works with source control, and the output looks fantastico.
25
mixedCase 2 days ago 1 reply      
A markdown-based wiki under version control, plus code comments. Everything else likely isn't worth documenting or just merits direct person-to-person communication.
26
tamersalama 2 days ago 1 reply      
This question is on my mind too. My clients' documentation is usually a mix of MS Word & Visio. Lots of repetition and gunk in between.

Ideally, I'd love to find a mechanism that:

- Provides the OO principles in documents: Encapsulation, Abstraction, Polymorphism, Inheritance.
- Accessible & maintainable by non-techies.
- Allows scripting (I toyed with PlantUML, but it was a bit rigid).

27
afarrell 2 days ago 1 reply      
Not on a team, but I used mkdocs for this tutorial I built, then added a comment system that I built with react.js: https://amfarrell.com/saltstack-from-scratch/ The advantage of mkdocs is that it is markdown-based, so it is super easy to get started.
28
acesubido 2 days ago 1 reply      
Gitbook for Technical Documents, Google Drive for everything else.
29
davidjnelson 1 day ago 0 replies      
The most valuable docs for me are REST API contracts stored in Confluence. Easy to collaborate on. Also, getting started guides in Confluence for new hires, and architectural diagrams, again in Confluence, for cross-team collaboration / understanding / discussion.

As for code, auto generated docs from jsdoc etc. headers are fine but I never use them honestly. I find unit tests to be the ultimate documentation in terms of code level docs.

30
tmaly 1 day ago 0 replies      
This is a problem I am struggling with right now.

I have a CVS repository of PDF and Word docs.

The business side uses docx format, so using markdown and generating docx is not really feasible. I have run into issues of people changing the filename and it creating a new entry in the version control. I have an idea I plan to implement to fix this.

What I would really like is some linux system that would make it easy to pull the text out of docx and make it searchable. I would want something that could run on the command line that does not have a ton of dependencies.
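Since a .docx file is just a zip archive with the body text in word/document.xml, the Python standard library is enough to pull plain text out for searching; a rough sketch (the grep usage at the end is only an example):

    import re
    import sys
    import zipfile

    def docx_to_text(path):
        """Extract plain text from a .docx file using only the standard library."""
        with zipfile.ZipFile(path) as z:
            xml = z.read("word/document.xml").decode("utf-8")
        xml = xml.replace("</w:p>", "\n")     # paragraph ends become newlines
        return re.sub(r"<[^>]+>", "", xml)    # strip all remaining XML tags

    if __name__ == "__main__":
        # e.g. python docx2txt.py report.docx | grep -i "quarterly"
        print(docx_to_text(sys.argv[1]))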

31
girzel 2 days ago 0 replies      
The Texinfo format, using the in-Emacs Info browser. Yes, it means you only read your documentation inside Emacs, but it's hands-down the best doc-browsing experience I've ever had. Hell to write, butter to read.
32
darkFunction 2 days ago 1 reply      
Bitbucket's wiki on our project page (6-person startup). We document mostly application behaviour for technical users of the app (server team, content writers) and a little bit of architecture if the complexity warrants it.
33
ddasayon 2 days ago 1 reply      
We write the docs as markdown files and then use Doxygen to compile them to HTML and LaTeX for the traditional folks who MUST have a printable document. The markdown files are tracked in Git so that we can collaborate and track changes easily.
34
rusbus 2 days ago 1 reply      
Shameless plug: I'm working on a documentation solution for dev teams. You can sign up for the beta at http://docily.com/
35
DannoHung 2 days ago 2 replies      
Related: what's the right way to extract inline comments regarding function API stuff from source code?

This seems like something that is a really good idea, but it is hard to find any projects for it.
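For Python code at least, the standard library's ast module already does most of this; a minimal sketch that pulls module, class, and function docstrings out of a source file:

    import ast
    import sys

    def extract_docs(path):
        """Yield (name, docstring) pairs for a Python source file."""
        with open(path, encoding="utf-8") as f:
            tree = ast.parse(f.read(), filename=path)
        yield "<module>", ast.get_docstring(tree)
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
                yield node.name, ast.get_docstring(node)

    if __name__ == "__main__":
        for name, doc in extract_docs(sys.argv[1]):
            print(name, "->", (doc or "").split("\n")[0])   # first line only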

36
irixusr 1 day ago 0 replies      
I work for the government right now.

I'm trying to gather a community of git supporters to push for git.

However, after three months I still haven't gotten a computer suitable for my job.

37
mbrock 2 days ago 1 reply      
We barely have any documentation except some READMEs that are mostly terse and still poorly maintained... If you don't understand something, you ask someone.
38
quasiben 2 days ago 0 replies      
All of the folks at Continuum Analytics use Sphinx and Read the Docs
39
gault8121 2 days ago 0 replies      
HN, we are writing our high-level overviews as README.md files. Any ideas on how we could help condense this info for open source contributors?
40
adnanh 2 days ago 0 replies      
We wrote our custom documentation generator for Grape (ruby), something like Swagger, but less rigid.
41
arisAlexis 2 days ago 3 replies      
My boss decided to use Framemaker with DITA in 2016..
42
zolokar 2 days ago 0 replies      
A combination of Github Wikis and a Dozuki site.
43
barile 22 hours ago 0 replies      
swagger.io for the apis + README.md on each service's repo
44
adityar 1 day ago 0 replies      
doxygen
6
Ask HN: Best Object-Oriented Programming Book
7 points by Kinnard  14 hours ago   1 comment top
1
stevenspasbo 8 hours ago 0 replies      
Check out Head First Object-Oriented Analysis and Design. It's pretty basic, but would be a good intro if you're brand new to OO.
7
Ask HN: Someone is stealing things from my car. What security camera would help?
2 points by hoodoof  2 hours ago   1 comment top
1
davismwfl 8 minutes ago 0 replies      
The ones I can think of that would be good for in-car use would be pretty pricey to risk being stolen themselves. However, you might be able to find a small 5V camera (and use an adapter from the car's accessory port) that will capture stills on motion for a reasonable price; maybe try Tiger Direct or Monoprice, Newegg, etc. Then just hide it reasonably well in the car; the rear trunk deck or above the rearview mirror would be ideas -- use colored tape to make it blend into the interior. Transmitting the image as well as storing it locally is a little more involved. Outside of something like a Nest camera, you could use a Raspberry Pi with a camera and have it connected via wifi to your house so you could get the images. It would be a little tougher to conceal, but still very doable.

Obviously I think just locking your doors (if you don't already) is the smartest idea. And set up a camera outside; at least that way you can start to deter this behavior and possibly catch a glimpse if it does happen. I'd also bet if this is happening multiple times, it is someone you know, or a local teenager, etc. That's what happened to me when I found my car getting egged repeatedly one summer; I set up a system to catch them, which allowed me to have a little discussion with them to make it stop before it escalated.

A little fun too: if you only use the outside camera, put a sign in the car that says "don't look back" or something to that effect (to get them to look towards the camera). Most people, when they read that, immediately do the behavior you wrote about to see why, and they'll look right at the camera. Only the real criminals will just bolt and not do it.

8
What rack-mountable multiple-ARM servers are out there?
3 points by mikaelm  8 hours ago   2 comments top 2
1
CyberFonic 6 hours ago 0 replies      
Well you'd think that it wouldn't be that hard to create a blade server style system using Raspberry Pi Compute Modules. A 256 node, 1U server might just be possible. Of course, the power supply, cooling and LAN fabric would be an interesting challenge.

Rather than 64 bit and ECC RAM, you could have high redundancy on the module level. AFAIK Google do not use server grade systems, just lots of them in a failure tolerant configuration.

2
mikaelm 8 hours ago 0 replies      
Ah just to be clear, I would like the individual SoC units to have as little firmware as possible, for the security of the computation.

An important part of the goal is to get an as "closed computational environment" as possible, where risk for BIOS/firmware infection by hacker is minimal.

So just CPU, ECC RAM, ethernet, and microSD (or USB) to boot off.

9
Chrome says login.live.com is a Deceptive site
6 points by whizzkid  15 hours ago   discuss
10
Tell HN: HN and Slack Office Hours with YC Partners this Friday
39 points by kevin  2 days ago   7 comments top 2
1
kevin 2 days ago 1 reply      
Jared will also be doing open office hours on Slack from 2-4pm PT on Friday (Feb 26). If you'd like help with your startup, but want your questions answered in a private setting, sign up here by end of day on Feb 23:

https://apply.ycombinator.com/events/13.

2
minimaxir 2 days ago 1 reply      
"Tell HNs" no longer appear in the HN front page, which seems unintentional given announcements such as these.
11
Ask HN: So Marissa failed to revive Yahoo. What would YOU have done differently?
6 points by hoodoof  7 hours ago   4 comments top 4
1
rufusjones 1 hour ago 0 replies      
There's a scene in COAL MINER'S DAUGHTER where the producer halts the recording session after Loretta Lynn starts singing and says "We need to get some more pickers."

"What do you mean more?" her husband says. "I can't afford more."

"I mean more better," the producer replies.

The problem isn't so much WHAT Mayer did, but how badly she did it. She wanted to boost Yahoo's video division, for example, so she (a) paid Katie Couric (someone my Mom really liked) more money than God, (b) licensed the streaming rights to SATURDAY NIGHT LIVE (which needs to be put out of its misery) and (c) decided to fund a season of COMMUNITY (which nobody ever watched). Now THERE'S a compelling product offering. WhoTF would watch that?

She redesigned the site to make it mobile-friendly-- which made it almost unreadable for desktop users. Apparently no one told her that browsers send information about what type of device is reading the site. I liked the tech news, but it became so painful that I just gave up.

(She also killed a lot of the personalization features, which is what had attracted people.)

She bought a lot of tiny companies, most of which made products that were never integrated-- and whose employees left as soon as their employment agreements expired.

She bought Tumblr... then had absolutely no idea what to do with it, and ended up not doing anything. She shut down their sales force to integrate it with the Yahoo sales force (which didn't want to sell Tumblr)-- and then realized they'd have to restart the Tumblr sales force to monetize at all.

She inherited an email system that was a hot mess and made it a different type of hot mess-- with enough glitches that people couldn't rely on it for their email.

A lot of the ideas were reasonable, but the execution was so horrific that there was never any value-add.

2
danieltillett 6 hours ago 0 replies      
I would have done exactly what Ms Mayer did which is take the cash and run. What a scam.
3
codeonfire 4 hours ago 0 replies      
http://www.design.caltech.edu/erik/Misc/Prepare_3_Envelopes....

Why is Yahoo so politicized and who cares? I think Yahoo is no different from any other large corporation. It's just that the players are engaging and using the media a lot more for their stupid rich asshole power struggles that 99% of earth doesn't care about.

4
tdhz77 7 hours ago 0 replies      
Make the website minimalist. Change the colors, focus on journalism. Compete with Huffington Post and AOL.
12
Ask HN: What is your experience with running Hacklang on production?
4 points by andreygrehov  10 hours ago   discuss
13
Ask HN: Could browsers do the virtual DOM that React does?
2 points by lucio  13 hours ago   discuss
14
Ask HN: What companies have/had good engineering blogs?
22 points by ambertch  1 day ago   17 comments top 17
3
CiPHPerCoder 18 hours ago 0 replies      
If you're into PHP programming, application security, and/or cryptography:

https://paragonie.com/blog/category/security-engineering

4
sumodirjo 1 day ago 0 replies      
Collection of engineering blogs : https://github.com/sumodirjo/engineering-blogs/
6
DustinLessard 23 hours ago 0 replies      
Workiva Techblog https://techblog.workiva.com/ has become a favourite of mine recently.
9
whatismybrowser 1 day ago 0 replies      
Etsy's tech blog: https://codeascraft.com/ is excellent.

They got me on to monitoring EVERYTHING with statsd. Great stuff.

11
kaizensoze 1 day ago 0 replies      
ahem https://github.com/kilimchoi/engineering-blogs

Edit: Too bad you can't use asterisks in HN comments...

13
147 1 day ago 0 replies      
One of my favorite ones is Instagram's: http://instagram-engineering.tumblr.com/
14
kachhalimbu 1 day ago 0 replies      
Auth0 blog is pretty nice if you are into JavaScript and Security https://auth0.com/blog
16
perseusprime11 1 day ago 0 replies      
Netflix is a good one. But remember most of them won't get you anywhere if you want to learn about their architectures. They are mostly used as a recruiting tool.
15
Ask HN: Alpine Linux as a Desktop?
5 points by smoyer  21 hours ago   1 comment top
1
jfkw 7 hours ago 0 replies      
Somewhat off-topic:

Fellow longtime Gentoo and recent Alpine user here. I haven't encountered undue conflict burden from configuration file updates. Some projects do churn whitespace etc, in configuration defaults files, which is unfortunate but not specific to any distro.

If an application supports a conf.d style override, I use that, containing only settings which differ from default.

Is there something inherent about Alpine packaging that handles local config differently?

16
Ask HN: How to handle 50GB of transaction data each day? (200GB during peak)
119 points by NietTim  1 day ago   76 comments top 33
1
ecaroth 1 day ago 1 reply      
Not an answer to your question, but just a quick note: this is the first post in a long while on HN where I appreciate both the problem you are looking to solve and the honesty/sincerity you have in saying that you are not perfectly qualified to solve it but you know those here can help. From all of us in the community watching and lurking, thanks for your candor so we can all learn from this thread!
2
haddr 1 day ago 4 replies      
First of all, 50GB per day is easy. Now, maybe contrary to what they say below, do the following:

* Don't use queues. Use logs, such as Apache Kafka for example (see the producer sketch at the end of this comment). It is unlikely to lose any data, and in case of some failure, the log with transactions is still there for some time. Also Kafka guarantees the order of messages, which might be important (or not).

* Understand what is the nature of data and what are the queries that are made later. This is crucial for properly modeling the storage system.

* Be careful with the NoSQL Kool-Aid. If mature databases, such as PostgreSQL, can't handle the load, choose some NoSQL, but be careful. I would suggest HBase, but your mileage may vary.

* NoSQL DBs typically limit the queries that you might issue, so the modelling part is very important.

* Don't index data that you don't need to query later.

* If your schema is relational, consider de-normalization steps. Sometimes it is better to replicate some data than to keep a relational schema and make huge joins across tables.

* Don't use MongoDB

I hope it helps!
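A minimal sketch of the Kafka suggestion in the first point, using the kafka-python client (the broker address, topic name, and event fields are placeholders):

    import json
    from kafka import KafkaProducer   # pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",                  # placeholder broker
        value_serializer=lambda v: json.dumps(v).encode(),   # dicts -> JSON bytes
    )

    def record_transaction(txn):
        # Keying by user keeps one user's events ordered within a partition.
        producer.send("transactions", key=str(txn["user_id"]).encode(), value=txn)

    record_transaction({"user_id": 42, "amount_cents": 1999, "ts": "2016-02-26T12:00:00Z"})
    producer.flush()   # block until buffered messages are actually delivered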

3
mattbillenstein 1 day ago 5 replies      
First of all, ingest your data as .json.gz -- line delimited json that's gzipped -- chunk this by time range, perhaps hourly, on each box. Periodically upload these files to the cloud -- S3 or Google CloudStorage, or both for a backup. You can run this per-node, so it scales perfectly horizontally. And .json.gz is easy to work with -- looking for a particular event in the last hour? gunzip -c *.json.gz | grep '<id>' ...
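A rough sketch of what that ingest-side writer can look like, assuming events arrive as Python dicts (the file naming is just an example); closed hourly chunks are then shipped to S3/GCS by a separate job:

    import gzip
    import json
    import time

    def append_event(event, prefix="events"):
        """Append one event as a line of JSON to the current hour's .json.gz chunk."""
        hour = time.strftime("%Y%m%d-%H", time.gmtime())
        path = "%s-%s.json.gz" % (prefix, hour)   # e.g. events-20160226-14.json.gz
        # Re-opening per event keeps the sketch short; a real agent would keep
        # the file handle open and rotate it when the hour rolls over.
        with gzip.open(path, "at", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")

    append_event({"type": "impression", "campaign": "acme", "ts": time.time()})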

Most of the big data tools out there will work with data in this format -- BigQuery, Redshift, EMR. EMR can do batch processing against this data directly from s3 -- but may not be suitable for anything other than batch processing. BigQuery and/or Redshift are more targeted towards analytics workloads, but you could use them to saw the data into another system that you use for OLAP -- MySQL or Postgres probably.

BigQuery has a nice interface and it's a better hosted service than Redshift IMO. If you like that product, you can do streaming inserts in parallel to your gcs/s3 uploading process for more real-time access to the data. The web interface is not bad for casual exploration of terabytes of raw data. And the price isn't terrible either.

I've done some consulting in this space -- feel free to reach out if you'd like some free advice.

4
nunobrito 1 day ago 2 replies      
We need to handle data at a similar level to what you mention and also use plain text files as the only reliable medium to store data. A recent blog: http://nunobrito1981.blogspot.de/2016/02/how-big-was-triplec...

My advice is to step away from AWS (because of price, as you noted). Bare metal servers are a startup's best friend for large data in regards to performance and storage. This way you avoid virtualized CPUs or distributed file systems that are more of a bottleneck than an advantage.

Look for GorillaServers at https://www.gorillaservers.com/

You get 40TB of storage with 8-16 cores per server, along with 30TB of bandwidth included, for roughly 200 USD/month.

This should remove the IOPS limitation and provide enough working space to transform the data. Hope this helps.

5
harel 1 day ago 3 replies      
Here are a few suggestions based on 6+ years in adtech (which have just come to a close; never again, thank you):

* Use a queue. RabbitMQ is quite good. Instead of writing to files, generate data/tasks on the queue and have them consumed by more than one client. The clients should handle inserting the data into the database. You can control the pipe by the number of clients you have consuming tasks, and/or by rate limiting them. Break those queue-consuming clients into small pieces. It's OK to queue item B on the queue while processing item A.

* If your data is more fluid and changing all the time, and/or if it comes in a JSON-serializable format, consider switching to PostgreSQL ^9.4, and use JSONB columns to store this data (see the sketch at the end of this comment). You can index/query those columns, and performance-wise it's on par with (or surpasses) MongoDB.

* Avoid AWS at this stage. Like someone commented here, bare metal is a better friend to you. You'll also know exactly how much you're paying each month: no surprises. I can't recommend SoftLayer enough.

* Don't over complicate things. If you can think of a simple solution to something - its preferable than the complicated solution you might have had before.

* If you're going the queue route suggested above, you can pre-process the data while you take it in. If it's going to be placed into buckets, do it then; if it's normalised, do it then. The tasks on the queue should be atomic and idempotent. You can use something like memcached if you need your clients to communicate with each other (like checking whether a queue item is already being processed by another consumer and thus is locked).
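A small sketch of the JSONB point above (table and field names are invented), using psycopg2 against PostgreSQL 9.4+; the GIN index is what makes containment queries on the fluid fields fast:

    import json
    import psycopg2   # pip install psycopg2

    conn = psycopg2.connect("dbname=adtech")   # placeholder connection string
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            id      bigserial PRIMARY KEY,
            ts      timestamptz NOT NULL DEFAULT now(),
            payload jsonb NOT NULL
        )
    """)
    cur.execute("CREATE INDEX events_payload_gin ON events USING gin (payload)")

    cur.execute("INSERT INTO events (payload) VALUES (%s::jsonb)",
                [json.dumps({"campaign": "acme", "action": "click", "price_cents": 12})])

    # Containment query, served by the GIN index: all clicks for this campaign.
    cur.execute("SELECT count(*) FROM events WHERE payload @> %s::jsonb",
                [json.dumps({"campaign": "acme", "action": "click"})])
    print(cur.fetchone()[0])
    conn.commit()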

6
TheIronYuppie 1 day ago 3 replies      
Disclaimer: I work at Google.

Have you looked at Google at all? Cloud Bigtable runs the whole of Google's Advertising Business and could scale per your requirements.

https://cloud.google.com/bigtable/docs/

7
lazyjones 1 day ago 0 replies      
I'm not sure I understand precisely what kind of data you are processing and in what way, but it sounds like a PostgreSQL job on a beefy server (lots of RAM) with SSD storage. Postgres is very good at complex queries and concurrent write loads and if you need to scale quickly beyond single server setups, you can probably move your stuff to Amazon Redshift with little effort. Wouldn't recommend "big data" i.e distributed setups at that size yet unless your queries are extremely parallel workloads and you can pay the cost.

In my previous job we processed 100s of millions of row updates daily on a table with much contention and ~200G size and used a single PostgreSQL server with (now somewhat obsoleted by modern PCIe SSDs) TMS RamSAN storage, i.e. Fibre-Channel based Flash. We had some performance bottlenecks due to many indexes, triggers etc. but overall, live query performance was very good.

8
zengr 1 day ago 0 replies      
Doing real time query for report generation for data growing by 50gb per day is a hard problem.

Realistically, this is what I would do (I work on something very similar but not really in adtech space):

1. Load data in text form (assuming it sits in S3) inside hadoop (EMR/Spark)

2. Generate reports you need based on your data and cache them in mysql RDS.

3. Serve the pre-generated reports to your user. You can get creative here and generate bucketed reports where the user will feel it's more "interactive". This approach will take you a long way, and when you have time/money/people, maybe you can try getting fancier and better.

Getting fancy: If you truly want near-real-time querying capabilities, I would look at Apache Kylin or LinkedIn's Pinot. But I would stay away from those for now.

Bigtable: As someone pointed out, Bigtable is a good solution (although I haven't used it), but since you are on the AWS ecosystem, I would stick there.

9
wsh91 1 day ago 0 replies      
We're having a good time with Cassandra on AWS ingesting more than 200 GiB per day uncompressed. I don't know how you're running your IOPS numbers, but consider allocating large GP2 EBS volumes rather than PIOPS--you'll get a high baseline for not that much money. The provisos you'll see about knowing how you expect to read before you start writing are absolutely true. :)

(Hope that might be helpful! A bunch of us hang out on IRC at #cassandra if you're curious.)

10
yuanchuan 1 day ago 0 replies      
I once worked on a similar project. Each day, the amount of data coming in is about 5TB.

If your data are event data, e.g. user activity, clicks, etc., these are non-volatile data which should be preserved as-is; you'll want to enrich them later on for analysis.

You can store these flat files in S3 and use EMR (Hive, Spark) to process them and store the results in Redshift. If your files are character-delimited, you can easily create a table definition with Hive/Spark and query it as if it is an RDBMS. You can process your files in EMR using spot instances and it can be as cheap as less than a dollar per hour.

11
alexanderdaw 1 day ago 0 replies      
1. Stream your data into Kafka using flat JSON objects.
2. Consume your Kafka feeds using a Camus map-reduce job (a library from LinkedIn that will output HDFS directories with the data).
3. Transform the HDFS directories into usable folders for each vertical you're interested in; think of each output directory as an individual table or database.
4. Use Hive to create an "external table" that references the transformed directories. Ideally your transformation job will create merge-able hourly partition directories. Importantly, you will want to use the JSON SerDe for your Hive configuration.
5. Generate your reports using Hive queries.

This architecture will get you to massive, massive scale and is pretty resilient to spikes in traffic because of the Kafka buffer. I would avoid Mongo / MySQL like the plague in this case. A lot of designs focus on the real-time aspect for a lot of data like this, but if you take a hard look at what you really need, it's batch map-reduce on a massive scale and a dependable schedule with linear growth metrics. With an architecture like this deployed to AWS EMR (or even Kinesis / S3 / EMR) you could grow for years. Forget about the trendy systems, and go for the dependable tool chains for big data.

12
asolove 1 day ago 1 reply      
Read "Designing data intensive applications" (http://dataintensive.net/), which is an excellent introduction to various techniques for solving data problems. It won't specifically tell you what to do, but will quickly acclimate you to available approaches and how to think about their trade offs.
13
mindcrash 16 hours ago 0 replies      
You probably might want to read this (for free): http://book.mixu.net/distsys/single-page.html

And pay a little to read this book: http://www.amazon.com/Designing-Data-Intensive-Applications-...

And this one: http://www.amazon.com/Big-Data-Principles-practices-scalable...

Nathan Marz brought Apache Storm to the world, and Martin Kleppmann is pretty well known for his work on Kafka.

Both are very good books on building scalable data processing systems.

14
jamiequint 1 day ago 0 replies      
Consider using CitusData to scale out Postgres horizontally. You can shard by time and basically get linear speedup based on the number of shards. It's extremely fast and will be open source in early Q2, I think. You can then put your Postgres instances on boxes with SSDs instead of provisioned IOPS. Writes also scale mostly linearly.
15
pklausler 1 day ago 0 replies      
50GiB/day is less than a megabyte per second. Surely you wouldn't be bandwidth-limited on a real device, even consumer SSDs are in the 100-600 MiB/s range IIRC. Can you do anything to increase your bytes per IOP in your current environment if you're IOP-limited?
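The arithmetic, spelled out (including the asker's 200GB/day peak figure):

    GiB = 1024 ** 3
    seconds_per_day = 24 * 60 * 60
    print(50 * GiB / seconds_per_day / 2 ** 20)    # ~0.59 MiB/s sustained
    print(200 * GiB / seconds_per_day / 2 ** 20)   # ~2.4 MiB/s at peak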
16
mslot 1 day ago 0 replies      
disclaimer: I work for Citus Data

The bottleneck is usually not I/O, but computing aggregates over data that continuously gets updated. This is quite CPU intensive even for smaller data sizes.

You might want to consider PostgreSQL, with Citus to shard tables and parallelise queries across many PostgreSQL servers. There's another big advertising platform that I helped move from MySQL to PostgreSQL+Citus recently and they're pretty happy with it. They ingest several TB of data per day and a dashboard runs group-by queries, with 99.5% of queries taking under 1 second. The data are also rolled up into daily aggregates inside the database.

There are inherent limitations to any distributed database. That's why there are so many. In Citus, not every SQL query works on distributed tables, but since every server is PostgreSQL 9.5, you do have a lot of possibilities.

Looking at your username, are you based in the Netherlands by any chance? :)

Some pointers:

- How CloudFlare uses Citus: https://blog.cloudflare.com/scaling-out-postgresql-for-cloud...

- Overview of Citus: https://citus-conferences.s3.amazonaws.com/pgconf.ru-2016/Ci...

- Documentation: https://www.citusdata.com/documentation/citusdb-documentatio...

17
exacube 1 day ago 0 replies      
If your data is growing at this rate (and you plan to keep this data around), you'd want a distributed database that can scale to terabytes. But it might be overkill if you don't care about data consistency (i.e., you don't need to read it "right away" after you do a write):

If you just want reports (and are okay getting them in a matter of minutes), then you can continue storing them in flat files and using Apache Hive/Pig-equivalent software (or whatever equivalent is hot right now; I'm out of date on this class of software).

If you want a really good out-of-box solution for storage + data processing, google cloud products might be a really good bet.

18
agnivade 1 day ago 0 replies      
Lots of good suggestions here. I won't say anything new but just wanted to stress the data ingestion part.

DO NOT write to txt files and read them again. This is unnecessary disk IO and you will run into a lot of problems later on. Instead, have an agent which writes into Kafka (like everyone mentioned), preferably using protobuf.

Then have an aggregator which does the data extraction and analysis and puts them in some sort of storage. You can browse this thread to look for and decide what sort of storage is suitable for you.

19
lafay 1 day ago 0 replies      
We faced a very similar problem when we started Kentik two years ago, except in our case the "transactions" are network traffic telemetry that we collect from our customers' physical network infrastructure, and providing super-fast ad hoc queries over that data is our core service offering.

We looked at just about every open source and commercial platform that we might use as a backend, and decided that none were appropriate, for scale, maturity, or fairness / scheduling. So we ended up building, from scratch, something that looks a bit like Google's Dremel / BigQuery, but runs on our own bare metal infrastructure. And then we put postgres on top of that using Foreign Data Wrappers so we could write standard SQL queries against it.

Some blog posts about the nuts and bolts you might find interesting:

https://www.kentik.com/postgresql-foreign-data-wrappers/

https://www.kentik.com/metrics-for-microservices/

If we were starting today, we might consider Apache Drill, although I haven't looked at the maturity and stability of that project recently.

20
ermack 1 day ago 0 replies      
It's difficult to give an answer without understanding the data processing you want.

If you need to generate rich multi-dimensional reports, I recommend you create an ETL pipeline into a star-schema-like sharded database (a la OLAP).

Dimension normalization can sometimes dramatically reduce data volume; most of the dimensions can even fit into RAM.

Actually 200GB per day is not so much in terms of throughput; you can manage it pretty well on a PostgreSQL cluster (with the help of pg_proxy). I think MySQL will also work OK.

Dedicated hardware will be cheaper than AWS RDS.

21
foxbarrington 1 day ago 0 replies      
Here's what I've done for ~200GB/day. Let's pretend you have server logs with urls that tell you referrer and whether or not the visit action was an impression or a conversion and you want stats by "date", "referrer domain", "action":

* Logs are written to S3 (either ELB does this automatically, or you put them there)

* S3 can put a message into an SQS queue when a log file is added

* A "worker" (written in language of your choice running on EC2 or Lambda) pops the message off the queue, downloads the log, and "reduces" it into grouped counts. In this case a large log file would be "reduced" to lines where each line is [date, referrer domain, action, count] (e.g. [['2016-02-24', 'news.ycombinator.com', 'impression', 500], ['2016-02-24', 'news.ycombinator.com', 'conversion', 20], ...]

* The reduction can either be persisted in a db that can handle further analysis or you reduce further first.
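A condensed sketch of that worker using boto3; the queue URL, the S3-event message format, and the JSON-per-line log layout are all assumptions, and real ELB logs would need proper parsing:

    import gzip
    import json
    from collections import Counter

    import boto3   # pip install boto3

    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/log-events"  # placeholder

    def reduce_log(lines):
        """Collapse raw log lines into (date, referrer domain, action) -> count."""
        counts = Counter()
        for line in lines:
            rec = json.loads(line)                       # assumes JSON-per-line logs
            counts[(rec["date"], rec["referrer"], rec["action"])] += 1
        return counts

    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            note = json.loads(msg["Body"])               # S3 event notification
            rec = note["Records"][0]["s3"]
            obj = s3.get_object(Bucket=rec["bucket"]["name"], Key=rec["object"]["key"])
            body = gzip.decompress(obj["Body"].read()).decode("utf-8")
            for key, n in reduce_log(body.splitlines()).items():
                print(key, n)                            # or upsert into your reporting DB
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])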

22
stuartaxelowen 1 day ago 0 replies      
Check out LinkedIn's posts about log processing [0] and Apache Kafka. Handling data as streams of events lets you avoid spikey query-based processing, and helps you scale out horizontally. Partitioning lets you do joins, and you can still add databases as "materialized views" for query-ability. Add Secor to automatically write logs to S3 so you can feel secure in the face of data loss, and use replication of at least 3 in your Kafka topics. Also, start with instrumentation from DataDog or NewRelic from the start - it will show you the performance bottlenecks.

0: https://engineering.linkedin.com/distributed-systems/log-wha...

23
pentium10 1 day ago 0 replies      
Use BigQuery. Here is a nice presentation on how to get going, and some use cases that get you very familiar with the territory. I also offer consultation, so you can reach out. http://www.slideshare.net/martonkodok/complex-realtime-event...
24
bio4m 1 day ago 0 replies      
If you're on a tight budget and IO is your main bottleneck, it may be easier to purchase a number of decent-spec desktop PCs with multiple SSDs in them. SSDs have really come down in price while performance and capacity have improved greatly. Same goes for RAM. (The assumption here is that time is less of a concern than cost at the moment and you're not averse to doing some devops work. Also assuming that the processing you're talking about is some sort of batch processing and not realtime.) This way you can try a number of different strategies without blowing the bank on AWS instances (and worst case you have a spare workstation).
25
libx 1 day ago 0 replies      
I would consider Unicage for your demands.
https://www.youtube.com/watch?v=h_C5GBblkH8
https://www.bsdcan.org/2013/schedule/attachments/244_Unicage...

In a shell (modified for speed and ease of use) get, insert, update data in a simple way, without all the fat from other mainstream (Java) solutions.

26
batmansmk 1 day ago 0 replies      
We love those projects at my company (Inovia Team). Your load is not that big. You won't make any big mistake stack-wise; you just have to pick something you have already operated before in production at a smaller scale. MySQL, Postgres, MongoDB, or Redis will be totally fine. We have a training on how to insert 1M lines a second with off-the-shelf free open source tools (SQL and NoSQL). Ping us if you are interested in getting the slide deck.

Tip: focus on how to backup and restore first, the rest will be easy!

27
i_don_t_know 1 day ago 0 replies      
I don't know what I'm talking about or what you need, but I hear kdb is popular in the financial industry because supposedly it can handle large amounts of real-time financial information. http://kx.com
28
nickpeterson 1 day ago 1 reply      
Does the database grow 50GB or is that the size of the text files?
29
jacques_chester 1 day ago 2 replies      
Compare pricing on RDS, if doing it yourself is hurting.

AWS also has Kinesis, which is deliberately intended to be a sort of event drain. Under the hood it uses S3 and they publish an API and an agent that handles all the retry / checkpoint logic for you.

30
hoodoof 1 day ago 0 replies      
I'd start by asking if you are solving the right problem.

Does the business really need exactly this? What is their actual goal? Are they aware of the effort and resources required to get this report?

31
coryrobinson42 1 day ago 0 replies      
I would highly recommend looking into Elasticsearch. Clustering and scalability are its strong points and can help you with your quest.
32
ninjakeyboard 1 day ago 0 replies      
Look at your current solution and check the query plan of your SQL. If your data is indexed correctly it shouldn't be too bad to execute queries: 1M records is about 20 ops to search for a record by key.

If it's modelled in SQL, it's probably relational and normalized, so you'll be joining tables together. This balloons the complexity of querying the data pretty fast. Denormalizing data simplifies the problem, so see if you can get it into a K/V store instead of a relational database. Not saying relational isn't a fine solution - even if you keep it in MySQL, denormalizing will reduce the complexity of querying it.

Once you determine if you can denormalize, you can look at sharding the data so instead of having the data in one place, you have it in many places and the key of the record determines where to store and retrieve the data. Now you have the ability to scale your data horizontally across instances to divide the problem's complexity by n where n is the number of nodes.
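And a toy sketch of "the key determines where the record lives" (again my illustration, with hypothetical node names):

    import hashlib

    NODES = ["db-0.internal", "db-1.internal", "db-2.internal"]  # hypothetical shards

    def node_for_key(key):
        # Hash the record key and map it onto one of the n nodes.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return NODES[int(digest, 16) % len(NODES)]

    print(node_for_key("user:42"))  # the same key always routes to the same node

Note that simple modulo hashing reshuffles most keys whenever the node count changes; consistent hashing is the usual fix for that.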

Unfortunately the network is not reliable so you suddenly have to worry about CAP theorem and what happens when nodes become unavailable so you'll start looking at replication and consistency across nodes and need to figure out with your problem domain what you can tolerate. Eg bank accounts have different consistency requirements than social media data where a stale read isn't a big deal.

Aphyr's Call Me Maybe series has reviews of many datastores under pathological conditions, so read about your choice there before you go all in (assuming you do want to look at different stores). Dynamo-style DBs like Riak are what I think of immediately, but read around - this guy is a wizard. https://aphyr.com/tags/Jepsen

AWS has a notoriously unreliable network, so really think about those failure scenarios. Yes, it's hard, and the network isn't reliable. Dynamo-style DBs are cool though and fit the big problems you're looking at if you want to load and query the data.

If you want to work with the data, Apache Spark is worth looking at. You mention MapReduce, for instance. Spark is quick.

It's sort of hard because there isn't a lot of information about the problem domain, so I can only shoot in the dark. If you have strong consistency needs, or need to worry about concurrent state across the data, that's a different problem from processing one record at a time without needing a consistent view of the data as a whole. For the latter you can just process the data via workers.

But think sharding to divide the problem across nodes, and denormalization (e.g. via key/value lookup) for simple runtime complexity. Still, start where you are - look at your database and make sure it's very well tuned for the queries you're making.

Do you even need to load it into a db? You could distribute the load across clusters of workers if you have some source that you can stream the data from. Then you don't have to load and then query the data. It depends heavily on your problem domain. Good luck. I can email you to discuss if you want - I just don't want to post my email here. Data isn't so much where I hang out; processing lots of things concurrently in distributed systems is, so others who have gone through similar problems may have better ideas.

There are some cool papers, like the Amazon Dynamo paper, and I read the Google Spanner paper the other day (more globally oriented and focused on locking and consistency). You can see how some of the big companies are formalizing their thinking by reading some of the papers in that space. Then there are implementations you can actually use, but you need to understand them a bit first, I think.

http://www.allthingsdistributed.com/files/amazon-dynamo-sosp...
http://static.googleusercontent.com/media/research.google.co...

33
faizshah 1 day ago 0 replies      
Note: This is based on solutions I have been researching for a current project and I haven't used these in production.

Short answer: I think you're looking in the wrong direction. This problem isn't solved by a database but by a full data processing system like Hadoop, Spark, Flink (my pick), or Google Cloud's Dataflow. I don't know what kind of stack you guys are using (IMO this problem is best solved leveraging Java), but I would say that you could benefit a lot from either the Hadoop ecosystem or Google Cloud's ecosystem. Since you say that you are not experienced with that volume of data, I recommend you go with Google Cloud's ecosystem; specifically, look at Google Dataflow, which supports autoscaling.

Long answer: To answer your question more directly, you have a bunch of data arriving that needs to be processed and stored every X minutes and needs to be available for interactive analysis or later processing into a report. This is a common task and is exactly why the Hadoop ecosystem is so big right now.

The 'easy' way to solve this problem is Google Dataflow, a stream processing abstraction over Google Cloud that lets you set your X-minute window (or more complex windowing) and automatically scales your compute servers (you pay only for what you use, not what you reserve). For interactive queries they offer Google BigQuery, a robust SQL-based column store that lets you query your data in seconds and only charges you for the columns you actually query (if your data set is 1TB but the columns used in your query are only some short strings, they might only charge you for querying 5GB). As a replacement for your MySQL problems they also offer managed MySQL instances and their own Google Bigtable, which has many other useful features. Did I mention these services are integrated into an interactive IPython-notebook-style interface called Datalab and fully integrated with your Dataflow code?
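As a rough illustration (not something the commenter provided), querying BigQuery from Python with the google-cloud-bigquery client library looks roughly like this; the project, dataset, and table names are hypothetical:

    from google.cloud import bigquery

    client = bigquery.Client()  # picks up your configured GCP project/credentials

    # Only the columns referenced by the query count toward the bytes billed.
    query = """
        SELECT user_id, COUNT(*) AS events
        FROM `my_project.my_dataset.events`   -- hypothetical table
        GROUP BY user_id
        ORDER BY events DESC
        LIMIT 10
    """

    for row in client.query(query).result():
        print(row.user_id, row.events)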

This all might get a little expensive though (in terms of your cloud bill); the other solution is to do some harder work involving the Hadoop ecosystem. The problem of processing data every X minutes is called windowing in stream processing. Your problems are solved by using Apache Flink, a relatively easy and fast stream processing system that makes it easy to set up clusters as you scale your data processing. Flink will help you with your report generation and make it easy to handle this streaming data in a fast, robust, and fault-tolerant (that's a lot of buzzwords) fashion.
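The windowing idea itself is simple to sketch; here is a plain-Python toy (deliberately not Flink's API) that buckets events into fixed X-minute tumbling windows:

    from collections import defaultdict

    WINDOW_SECONDS = 5 * 60   # "every X minutes" -- 5 here, purely for illustration

    def tumbling_windows(events):
        # Group (unix_timestamp, value) pairs into fixed-size windows.
        windows = defaultdict(list)
        for ts, value in events:
            window_start = ts - (ts % WINDOW_SECONDS)
            windows[window_start].append(value)
        return windows

    events = [(1456488000, "a"), (1456488010, "b"), (1456488400, "c")]
    for start, values in sorted(tumbling_windows(events).items()):
        print(start, len(values))

A real stream processor layers event-time vs. processing-time semantics, watermarks, and fault tolerance on top of this basic bucketing.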

Please take a look at the Flink programming guide or the data Artisans training sessions on this topic. Note that the problem of doing SQL queries with Flink is not solved (yet); this feature is planned for release this year. However, Flink will solve all your data processing problems in terms of the cross-table reports and preprocessing for storage in a relational database or distributed filesystem.

For storing this data and making it available, you need something fast but just as robust as MySQL. The 'correct' solution at this time, if you are not using all the columns of your table, is a columnar store. From Google's cloud you have BigQuery; from the open source ecosystem you have Drill, Kudu, Parquet, Impala, and many more. You can also try Postgres or RethinkDB for a full relational solution, or HDFS/QFS + Ignite + Flink from the Hadoop ecosystem.

For interactively working with your data, try Apache Zeppelin (free; Scala required, I think) or Databricks (paid but with lots of features; Spark only, I think). Or take the results of your query from Flink or similar and analyze them interactively using Jupyter/IPython (the solution I use).

The short answer is: dust off your old Java textbooks. If you don't have a Java dev on your team and aren't planning on hiring one, the Google Dataflow solution is way easier and cheaper in terms of engineering. If you need help, I do need an internship ;)

If you want to look at all the possible solutions from the Hadoop ecosystem, look at: https://hadoopecosystemtable.github.io/

For the Google Cloud ecosystem, it's all there on their website.

Happy coding!

Oops, it seems I left out ingestion; I would use Kafka or Spring Reactor.

P.S. The Flink mailing list is very friendly; try asking this question there.

17
Ask HN: What is the simplest way to check if a similar article is on HN?
8 points by chirau  13 hours ago   3 comments top 2
1
ozten 13 hours ago 1 reply      
Use the search feature at the bottom of the page. It is powered by https://hn.algolia.com

Although it isn't foolproof, it's obviously better than not checking at all.

Search for your URL, as well as targeted keywords.
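For the curious, the Algolia-powered HN Search API can also be queried programmatically; a minimal sketch (the query string is just an example):

    import json
    import urllib.parse
    import urllib.request

    # Search HN for a URL or keywords via the Algolia-powered HN Search API.
    query = urllib.parse.quote("https://example.com/article")
    url = "https://hn.algolia.com/api/v1/search?query=" + query

    with urllib.request.urlopen(url) as resp:
        hits = json.loads(resp.read().decode("utf-8"))["hits"]

    for hit in hits[:5]:
        print(hit.get("title"), hit.get("url"))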

2
HoopleHead 4 hours ago 0 replies      
Searching to see if a story has already been posted?

What an incredible thought! Though, given the noise to signal ratio round here, one that I fear you may be alone in thinking.

18
Ask HN: Does HN move too fast for 'Ask HN'?
48 points by J-dawg  2 days ago   10 comments top 6
1
brudgers 2 days ago 0 replies      
My understanding is that "Ask HN" questions have a different "gravity" and sink more slowly. That said, I suspect that the average quality of an "Ask HN" question is not much better than the average non-spam submission...maybe worse since meta-discussions are fairly common and lead to dull comments like mine here.

Even non-meta questions can be rather lazy... I mean, a couple of throwaway sentences that don't provide much context suggest that it's probably not that important.

For example: https://news.ycombinator.com/item?id=11160872

Versus: https://news.ycombinator.com/item?id=11149361

While I don't think of "Ask HN" as StackOverflow, there's something to the response "What <code> have you tried?" and the idea that a two sentence question doesn't necessarily deserve a long detailed comprehensive answer.

2
monroepe 2 days ago 1 reply      
While I agree they do get lost quickly, there is an "ask" link in the header. I check there every so often, but maybe I am in the minority.
3
27182818284 2 days ago 0 replies      
The overall quality of Ask HN questions is pretty hit or miss compared with other submissions in the News section. Oftentimes there are Ask HN questions with no context, that border on spam, that aren't really asking a question, or that would be better served on Stack Overflow.

So I guess what I'm saying is that I'm not particularly surprised by its speed, because a lot of the stories submitted there deserve to decline quickly.

4
throwaway21816 2 days ago 1 reply      
Will this Ask HN become its own self fulfilling prophecy?
5
cremno 2 days ago 1 reply      
>One solution would be to give them their own separate 'new' page.

It's not exactly that but https://news.ycombinator.com/ask exists.

6
beamatronic 2 days ago 0 replies      
Yes, absolutely 100% yes.
19
Ask HN: Who will GitHub acquire?
15 points by curiousisgeorge  1 day ago   5 comments top 3
1
saysomethingnow 1 day ago 0 replies      
Probably a good place to start:

https://github.com/integrations

It's an incredibly sparse list though, given GitHub's popularity. Compare this to Atlassian's marketplace

https://marketplace.atlassian.com/plugins/app/bitbucket/popu...

which seems to include everybody, both big and small.

2
eugenekolo2 1 day ago 0 replies      
Bitbucket's (all of Atlassian's) UI designers are still living in their IBM days. GitLab is the major competitor I see.
3
csense 1 day ago 1 reply      
Maybe what's going on is they're focusing their resources toward paid offerings for large businesses and away from small and FOSS projects.

The most valuable thing GitHub has is the enormous portfolio of FOSS projects and all the free eyeballs that come with it. Yes, it's not paid, but because its public offerings are so popular, most developers are familiar with it, and that gives it a trusted inroad to basically every software organization in existence. After all, in a non-dysfunctional organization, the person who decides what products to buy to support their developers should give far more weight to what those developers want than to what is said by a vendor salesman whose job it is to convince them to buy a particular product.

Which means it's a mistake to neglect their FOSS users -- it seems self-evident that the biggest, best, most cost-effective way for Github to sell its paid features is word-of-mouth from its free users. It's very hard for a competitor to replicate their enormous network effect, which is also what makes it so effective. Their underwhelming response to the "dear github" letter suggests that their upper management is blind to this. I think there's a serious possibility that, in the next five to ten years, their position will be as marginalized as, say, Sourceforge is today -- long ago it was the "gold standard" in hosting services, but today it's only used by barely-maintained projects who can't even scrape together the resources needed to change their code hosting provider.

20
Ask HN: Alpine Linux as a Desktop
3 points by smoyer  1 day ago   discuss
21
Ask HN: What new inventions are you working on?
8 points by 8sigma  1 day ago   4 comments top 4
1
egraether 11 hours ago 0 replies      
We just launched a new developer tool named Coati, which is designed to navigate and understand source code: https://www.coati.io/

The idea was based on our experience that you spend too much time searching through code as a developer. Coati makes it much easier to see how the different parts of the software play together.

2
maxaf 1 day ago 0 replies      
IMHO the greatest inventions are those of the highest utility relative to cost of implementation. Such inventions usually arise from tenaciously scratching a particularly bothersome itch.

For example, I've seen many Scala programmers struggle with achieving maximal type safety due to getting bogged down in boilerplate. To attack this problem (as I see it anyway, but that's all which matters for now) I'm researching ways of utilizing Scala macros to cut down on such boilerplate while retaining type safety.

This has been huge for my own side work (which means the utility is definitely there), but I'm still uncertain as to the cost, i.e. what effects my approach might have on a "serious" commercial project. I'm just going to have to find out the old fashioned way.

3
kiloreux 18 hours ago 0 replies      
Although it's not just me, I have played the role of the engineer in the research currently going on in our lab to develop a more advanced brain-computer interface, to help people in need who can't move their bodies. We're doing pretty well; I will share the results soon.
4
miguelrochefort 20 hours ago 0 replies      
I'm building a new communication interface for humans and machines.
22
Ask HN: Why are there no glucose measurement sensors?
8 points by danielschonfeld  1 day ago   8 comments top 3
1
1123581321 1 day ago 1 reply      
There's a lot more activity around intercepting output from sensor-transmitter combos like Dexcom and building a better receiver, or looping the data into a pump. Take a look at this, for example: http://www.nightscout.info/

A DIY sensor needs to either be some kind of test strip or a needle. Both are a lot easier to just get through insurance than to mimic.

I'm sure you could perform the chemical reactions yourself, but you'll probably find whatever you make will need to be replaced frequently. Meanwhile, the software, modified Android phones, etc. last a lot longer, so that's where the action is.

2
HarryHirsch 1 day ago 1 reply      
What? Glucose test strips have been around since the 1960s, and they all use the same principle - coulometry using the glucose oxidase/horseradish peroxidase system.

The challenge for the homebrewer is to build something traceable. When you measure blood sugar today, the same sample should yield the same number - not only today but also ten years from now. Yes, Theranos is struggling with traceability, too.

3
Raed667 1 day ago 0 replies      
There are plenty of "connected" GLUCO-MONITORING systems. A simple Google search with those 3 keywords shows plenty of results.
23
Ask HN: Is spending a year learning from MOOCs a good plan?
11 points by karolisram  1 day ago   9 comments top 4
1
JoachimSchipper 1 day ago 1 reply      
If you want to work in the commercial world, get a job - you have a lot of (practical) skills to develop, and people will wonder why you needed another year of coursework before starting work. Consider aiming for a job at a company that's good at an interesting niche, and note that new grads can get hired into many niches. There is such a thing as a junior data scientist, junior security expert, ...! (And there's also such a thing as a web developer working at a robotics company - such a web developer could easily dip a toe into robotics.)

It's not necessarily a bad idea to play around with MOOCs to take a look at other skills while working at your job, but if a university course wasn't enough to make up your mind a MOOC won't be either.

(In general, a thesis or internship is a good way to look deeply into a specific niche. But if you're three months from graduating, I don't think that advice will help you...)

2
brudgers 1 day ago 1 reply      
{Random advice from the internet}

More schooling will teach more of the sort of things that are taught in school. A job will teach the sort of things that are taught on a job.

If the priority is learning more of the things that are taught in school, get a graduate degree. There is nothing to keep a person from spending time on MOOCs while working. A lot of people do.

Work is not like school, most new grads will learn a lot really fast because the learning is by doing alongside other people with more experience in a culture with a great deal of institutional experience in the thing that is being done.

To put it another way, if there's some really interesting MOOC, take it now.

Good luck.

3
kliao 1 day ago 1 reply      
Why not ask professors at your university for opportunities on campus to do research in the fields you're interested in? This would be infinitely better than only doing MOOCs on your own (which you can also do on the side) because you are 1) receiving guidance from professors and/or grad students, 2) possibly getting paid, 3) have something concrete to put on your resume.

The thing with doing MOOCs on your own is it seems easy to get lazy and end up wasting a lot of time with nothing to show on your resume.

4
runT1ME 1 day ago 1 reply      
Here's what you do. Find a couple companies you want to work for. Find folks who work there on twitter/irc/github.

Email them and say "I'd love to work at your company in a year, what can I do to prepare".

They'll tell you. You do want to get a job, be it contracting or something, so you have some professional experience.

But the main thing is you learn what you need to learn; you figure out whether you need some GitHub contributions, MOOCs, books to read, etc.

24
Ask HN: Does anyone else, other than myself, need a secure note app?
6 points by VuongN  1 day ago   7 comments top 4
1
kjksf 1 day ago 1 reply      
I also work on a note-taking application and my thoughts are: ability to have secure notes is somewhat important but it's not important enough to make it a primary benefit.

Your app would have to be great at non-secure notes AND have an option to add secure notes.

Evernote, btw, does support secure (encrypted) notes. They have a lousy UI for them but the option is there.

If your app isn't better in at least some ways than existing note-taking apps, then having secure notes will not make a difference.

2
vldx 1 day ago 1 reply      
Slightly different, but: I've been journaling daily for the past 4 years. It turned out to be a healthy habit. I'm writing to myself, and I would be really happy knowing that my thoughts are secure. There's nothing to hide, but knowing that probably helps you be more direct and upfront with yourself. As far as I know, Day One is planning to roll out encryption this year.
3
CiPHPerCoder 18 hours ago 0 replies      
I've always just used encrypted pastebins, e.g. https://defuse.ca/b/
4
duncan_bayne 1 day ago 1 reply      
Yes. But I rolled my own out of pre-existing software:

* Emacs (desktop / laptop editor)

* Orgzly (Android editor)

* org-mode (the note mechanism itself)

* Unison (for file sync)

* Ubuntu LTS + OpenSSH (on the file server)

Happy to provide more detail if you're interested.

25
Ask HN: Is there a free/freemium hash table in the cloud with simple HTTP access?
7 points by THRWAWA20160222  2 days ago   9 comments top 4
1
namtao 1 day ago 0 replies      
It does! I wanted exactly the same thing and couldn't find something simple enough, so I made it (last month):

Stord.io is a key/value store. This is often modelled as a hashmap or a dictionary in programming languages.

Under the hood, stord.io is powered by Redis, with a thin Python application wrapper based on Flask. Stord.io doesn't assume anything about your data - make whatever nested schema you want!

http://stord.io

Full disclosure: if this wasn't already clear, it's my project. I would LOVE feedback/feature suggestions.

2
bifrost 2 days ago 0 replies      
I have seen a couple variants, but none of them have stuck around for long since they ended up being CnC for botnets/malware/etc.

I think it would be safe to assume there are collision problems in unauthenticated ones as well...

3
xyzzy123 2 days ago 2 replies      
I'm not aware of any services with the simple API you're looking for (neat idea), but there are a lot of more complicated solutions.

What are the key/value durability requirements? (OK to drop values now and then, or does it need to keep them until the end of time?). Need backups? Do values expire, or do you have to expire them manually? Since you can't enumerate or search, how do you delete things? Allowed sizes of keys and values, between bytes and terabytes? How far should it scale? Shared namespace, or namespace per user? Do you need a latency guarantee? How low? Are you gonna use it for something important and need an SLA on the availability of the service as a whole?

A couple of "nearby" points in the solution space:

Amazon S3 is a KV store where the keys look like filenames and the values look like files. High durability, good scaling, pretty high latency. You could also obviously paper a KV store on top of ElastiCache or DynamoDB, which are going to have different properties.
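A minimal sketch of using S3 as a key/value store from Python with boto3 (the bucket name is hypothetical and must already exist):

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-kv-bucket"   # hypothetical bucket

    # "Set": the object key acts as the dictionary key.
    s3.put_object(Bucket=BUCKET, Key="user:42:prefs", Body=b'{"theme": "dark"}')

    # "Get" it back.
    value = s3.get_object(Bucket=BUCKET, Key="user:42:prefs")["Body"].read()
    print(value)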

Going low-level and implementing your own in say, golang would probably be the most fun though :p

Hard to say if we could use a SaaS KV store at work without a lot more technical detail on the solution. I'm having a hard time thinking of an app where you'd want a KV store but not need a database or NoSQL store which you could use instead.

4
mike255x 2 days ago 1 reply      
You can use any Redis service in the cloud. An example: https://redislabs.com/. If you are on Heroku, there are multiple Redis add-ons.
26
Submitting to DMOZ
4 points by mschenkel  1 day ago   4 comments top 3
1
flignats 19 hours ago 0 replies      
DMOZ doesn't really add new sites these days...

Unless you can find an editor for the category you're trying to submit to.

2
mhoad 1 day ago 1 reply      
I'm assuming you're doing this for SEO purposes, in which case I would honestly say don't bother. As someone who managed SEO campaigns for Fortune 500 types for years, I assure you that you're wasting your time with it.
3
blairanderson 1 day ago 0 replies      
27
Ask HN: Dealing with users who share child porn?
14 points by chejazi  2 days ago   10 comments top 5
1
NameNickHN 2 days ago 1 reply      
I run a couple of short-URL websites, and this kind of short URL, along with links to phishing and spam sites, is what you'll get on a very regular basis in this business.

Those short URLs are often used in mass mailings, guestbooks, comment forms, etc., and they'll get reported pretty quickly to URL blacklists. That gives you the chance to disable the shady short URLs before too many people come across them.

Here is a list of URL blacklist providers that you should check each URL against, both before accepting it into your database and again before you deliver it to the user:

multi.surbl.org, uribl.swinog.ch, dbl.spamhaus.org, url.rbl.jp, uribl.spameatingmonkey.net, iadb.isipp.com, dnsbl.sorbs.net

You can find out more about this stuff on http://www.surbl.org/
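For illustration, a minimal sketch of a domain-based blacklist lookup in Python (my example; the exact return codes and the handling of IP-based lists like dnsbl.sorbs.net are left out):

    import socket

    # Domain-based blacklists (e.g. multi.surbl.org, dbl.spamhaus.org): a listed
    # domain resolves to a 127.0.0.x answer; an unlisted one returns NXDOMAIN.
    BLACKLISTS = ["multi.surbl.org", "dbl.spamhaus.org"]

    def is_blacklisted(domain):
        for bl in BLACKLISTS:
            try:
                socket.gethostbyname("%s.%s" % (domain, bl))
                return True        # any answer means the domain is listed
            except socket.gaierror:
                continue           # NXDOMAIN means not listed on this list
        return False

    print(is_blacklisted("example.com"))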

Since I've implemented these checks I only receive complaint emails every other month and no contact from the authorities so far (in contrast to the disposable email services I run).

When you get a complaint about a URL, your software should allow you to disable all the URLs of the reported domain. There are times when the spammers have taken over a forum (or any site really) and created a large amount of spammy content.

Hope this helps.

2
bifrost 2 days ago 1 reply      
The EFF may be able to help you. I would prepare a document with the logs/info that you have, in addition to a copy of your privacy and data retention policy. You may also want to double-check whether the submitter IP is a Tor node; that'll make everyone's lives easier. I believe the FBI has a special task force for this, should you contact them.
3
brudgers 2 days ago 1 reply      
{Random advice from the internet}

My first thought is to wonder why someone would want to be in a space where an arms race against people devoted to child porn was both necessary and constant. Particularly when that space wasn't providing capital sufficient to retain a lawyer.

My gut tells me that as a matter of will, it's going to be hard to keep this class of activity from reoccurring regardless of how this incident works out and that scaling the service will scale the headaches of monitoring and policing user behavior along with it. Personally, I'd rather deal with users I liked. YMMV.

Good luck.

4
Spooky23 1 day ago 1 reply      
If you store photos, Microsoft has a free product called PhotoDNA that will match files against known illegal images.

Take a look at the National Center for Missing and Exploited Children and perhaps call them. They operate a tip line to report this kind of activity. Sad that we need to think about these things.

5
max_ 20 hours ago 0 replies      
You can use https://davidwalsh.name/nudejs to detect such images in their posts and then: ban them!!
28
Ask HN: What are your thoughts on pair programming?
10 points by aprdm  2 days ago   11 comments top 10
1
isaachawley 2 days ago 0 replies      
I worked at a place that had pairing stations at separate tables in the team area. We were supposed to pair when doing any coding.

In practice we paired when doing difficult stuff. It worked well.

Tips:

Don't pair at your desk. Use a pairing station.

Pair for a task, short if possible. If not, timebox.

Pair for short periods!

Train yourself to be productive while pairing. Don't relax. A short period of hard work.

Pick a consistent pairing time and make it a habit, such as after standup, after lunch, whatever. At times when you would not be productive solo, like after lunch, you can still be productive in a pair. Experiment.

Failure modes:

Long, unproductive, exhausting pairing periods. People will quit.

That one guy that nobody wants to pair with. Happens, sorry.

Sitting there twiddling your thumbs while your pair partner writes an email. Only pair for coding / debugging / testing! Don't pair at your desk.

2
colund 1 day ago 0 replies      
I used to work at a company where we did a lot of pair programming and enjoyed it a lot.

As a fast touch typist, I preferred to be the one at the keyboard. I sometimes like to sketch things out first to test my initial idea.

I think pair programming can be extremely effective and fun when done in an ambitious work environment with open-minded people who can work well with others. I much prefer doing pair programming to not doing it, since it's an opportunity to discuss requirements/design/solutions and it's also a kind of code review before the actual code review.

Coding on your own hides wrong assumptions, incorrect thinking, lack of teamwork/collaboration, misunderstandings, and code quality issues which would likely be found in pair programming sessions before a code review. There can be great synergy when trying to solve complex problems where two people balance out different aspects of problem solving.

3
brianwawok 2 days ago 0 replies      
It is cool for a really hard algorithm or for bringing a new guy up to speed. To do it 100% of the time? That is torture. I would not work in those conditions.
4
cableshaft 2 days ago 0 replies      
If someone has questions that are hard to answer without having the machine in front of them to point things out, sure, I or they will pull up a chair, and we'll both take a look at it.

But all day pair programming would drive me crazy.

5
globba22 2 days ago 1 reply      
I know a lot of great devs who think very highly of PP, but as for me, it doesn't work well at all.

I just haven't had a great experience with it, though I remain open to trying it again.

6
_RPM 2 days ago 0 replies      
Some people think pair programming consists of one person sitting with another person at their desk, taking over their workstation and having them just watch. This is actually horrible and hurts the body, because looking at the screen in a weird position for a long time aches. I had this happen at an internship of mine before. I loathed that experience.
7
hacknat 2 days ago 0 replies      
Sure, I'll do it... on a whiteboard to solve a problem, or when reviewing a spec.

Pair programming glorifies the task of programming as overly difficult. Solving problems can be difficult, programming them shouldn't be. Solving a tough problem in a code editor is not a good idea, IMO.

8
eecks 2 days ago 0 replies      
I wouldn't like mandated pair programming but I find it always productive to pair up with someone* (informally) when solving a new/difficult problem.

* someone I know and like.. I imagine being paired with a person you disliked would be awful

9
binarysoul 2 days ago 0 replies      
I think it is good in scenarios where the problems are sufficiently difficult.
10
distracted828 1 day ago 0 replies      
It is something I've wished I could do. In fact, I am considering only working at a company that does pair programming.
29
Ask HN: Has Google got rid of apps from Hangouts?
7 points by micheleb  2 days ago   1 comment top
1
LordDragonfang 2 days ago 0 replies      
Google has a somewhat annoying tendency to release "new" interface versions of its various products that are far from feature-complete, and to only slowly add back the features the old version had - or sometimes not add them at all. Presumably they'll be added eventually, but only time will tell.
30
Ask HN: Are there any research journals that are free for computer science?
2 points by gravypod  1 day ago   2 comments top 2
1
eivarv 1 day ago 0 replies      
While not open journals per se, many (new) papers will typically be available on academic homepages (preprints), via "All [X] versions" links in a Google Scholar [0] search, or on arXiv [1], depending on the subfield.

For instance, the machine learning group (Yoshua Bengio, Ian Goodfellow, et al.) at the University of Montreal (the people behind open source software like Theano [2], Pylearn2 [3], etc.) regularly makes papers available on arXiv, and is currently working on a book about deep learning, drafts of which are freely available [4].

[0]: http://scholar.google.com
[1]: http://arxiv.org
[2]: https://github.com/Theano/Theano
[3]: https://github.com/lisa-lab/pylearn2
[4]: http://www.deeplearningbook.org
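As a small illustration (my addition, not the commenter's), the arXiv API can be queried directly; the search query below is just an example:

    import urllib.parse
    import urllib.request

    # Query the arXiv API for recent database papers (category cs.DB).
    params = urllib.parse.urlencode({
        "search_query": "cat:cs.DB",
        "start": 0,
        "max_results": 5,
    })
    url = "http://export.arxiv.org/api/query?" + params

    with urllib.request.urlopen(url) as resp:
        atom_xml = resp.read().decode("utf-8")

    print(atom_xml[:500])   # the response is an Atom feed; parse with feedparser, etc.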

2
fundamental 1 day ago 0 replies      
I'm not familiar with the names of the particular CS journals as that's not my research area, but it's relatively easy to find the type of journals you're looking for. Just search for "open access computer science" and you'll find plenty of journals with free access to articles.